Bengali syllables <... 09BF 09BE> and <... 09BF 09C0>
Asmus Freytag (c)
asmusf at ix.netcom.com
Tue Feb 7 23:48:19 CST 2017
On 2/7/2017 9:38 PM, Manish Goregaokar wrote:
> > The very first one কিী (0995 09BF 09C0) had 1090 hits and shows up
> in a book of short stories:
>
> That's bad OCR: it's an apostrophe, a Ka, and an E, with the
> apostrophe being interpreted as a matra somehow.
Interesting suggestion. Would explain a lot.
A./
>
> I bet there are only a couple of OCR algorithms out there handling
> Bangla. Indic scripts aren't something you can OCR glyph by glyph in
> such a straightforward way due to ligatures, so these algorithms are
> probably noticing components of a character and producing it. It sees
> a preceding line and the curve above, and interprets that as an I. It
> also sees the following line and curve above, and interprets that as
> an EE. It then just puts the two together. It shouldn't, but it does.
>
> Given a small set of OCR algorithms I think it's reasonable to assume
> that such aberrations would be common across outputs -- so hundreds of
> hits for a typo doesn't sound out of the ordinary to me.
>
> > Tried a random one: ঘিা (0998 09BF 09BE)
>
> I went through the results for ঘিা (0998 09BF 09BE). Most occurrences
> are actually ঘন্টা (0998 09A8 09CD 099F 09BE), "ghanta" which can mean
> "hour" or "bell". Reasonably common word. These documents don't look
> scanned -- the text isn't garbled or anything, but it could be a
> cleaned up scanned document because I copied out some more of the text
> and there were similar aberrations all over the place. For example, in
> [1] the letter ব ("ba") is used frequently, but is written with a
> fancier script where it has an extra line through it. Many occurrences
> of it have been interpreted as sequences of vowel diacritics. The last
> line of the second-last stanza on page 5 has an absolutely ridiculous
> number of consecutive diacritics in the PDF text.
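One way to check extracted text for this kind of damage is to look for runs of two or more dependent vowel signs. The following is only a minimal sketch, not anything used in this thread: the function name `find_matra_runs` is hypothetical, and the character class is an assumption covering the assigned Bengali vowel signs in U+09BE..U+09CC. The text is normalized to NFC first so that the decomposed spellings of O and AU are not flagged.

```python
import re
import unicodedata

# Bengali dependent vowel signs (matras): the assigned code points
# in U+09BE..U+09CC. Assumption: nothing else should count as a matra here.
MATRA_RUN = re.compile(r"[\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC]{2,}")


def find_matra_runs(text):
    """Return (offset, run) for each run of two or more vowel signs.

    Offsets refer to the NFC-normalized text, not the raw input.
    """
    # NFC composes <09C7 09BE> into 09CB and <09C7 09D7> into 09CC, so the
    # decomposed forms of O and AU are not reported as suspicious runs.
    normalized = unicodedata.normalize("NFC", text)
    return [(m.start(), m.group()) for m in MATRA_RUN.finditer(normalized)]
```

A long list of results for a single page would be a hint that the text layer came from OCR or a broken font mapping rather than from typing.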
>
>
> [1]:
> http://yousigma.com/religionandphilosophy/poojasloka/Sri%20Hari%20Kathamruta%20Sara%20Datta%20Swatantrya%20Sandhi%20(Sri%20Jagannatha%20Vittala%20Dasaru)%20-%20Assamese.pdf
>
>
> -Manish
>
> On Tue, Feb 7, 2017 at 7:53 PM, Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
> On 2/7/2017 10:08 AM, Eric Muller wrote:
>> In looking at the wiki{pedia,book.source,tionary} corpus for
>> Bengali, I see a relatively large number of syllables with <...
>> 09BF 09BE> or <... 09BF 09C0>. I checked a couple of sources, and
>> I did not find them listed anywhere as being normally used.
>>
>> Are they in normal use or are those all typos?
> Tried a random one: ঘিা (0998 09BF 09BE) and got 385 hits on Google.
> Would surprise me if all of these were typos.
>
> The very first one কিী (0995 09BF 09C0) had 1090 hits and shows
> up in a book of short stories:
>
> where it starts a paragraph.
>
> A./
>
>>
>> I did not find any occurrence in the Assamese corpus.
>>
>> Thanks,
>> Eric.
>>
>> The syllables (o is the number of occurrences):
>>
>>
>> <string s='কিী' o='198'/>
>> <string s='ক্তিা' o='262'/>
>> <string s='ক্রিা' o='447'/>
>> <string s='ক্রিী' o='77'/>
>> <string s='ক্লিা' o='245'/>
>> <string s='ক্ষিী' o='161'/>
>> <string s='ক্সিা' o='138'/>
>> <string s='খিা' o='949'/>
>> <string s='গিা' o='2671'/>
>> <string s='গিী' o='250'/>
>> <string s='গ্নিা' o='57'/>
>> <string s='গ্নিী' o='110'/>
>> <string s='গ্রিা' o='143'/>
>> <string s='ঘিা' o='83'/>
>> <string s='ঙ্কিা' o='403'/>
>> <string s='ঙ্গিা' o='267'/>
>> <string s='ঙ্গিী' o='150'/>
>> <string s='চিা' o='905'/>
>> <string s='চিী' o='135'/>
>> <string s='চ্চিা' o='91'/>
>> <string s='চ্ছিা' o='323'/>
>> <string s='ছিা' o='712'/>
>> <string s='ছিী' o='61'/>
>> <string s='জিা' o='527'/>
>> <string s='জিী' o='140'/>
>> <string s='জ্জিা' o='56'/>
>> <string s='ঝিা' o='81'/>
>> <string s='ঞিা' o='71'/>
>> <string s='ঞ্চিা' o='175'/>
>> <string s='ঞ্জিা' o='270'/>
>> <string s='ঞ্জিী' o='316'/>
>> <string s='টিা' o='807'/>
>> <string s='টিী' o='586'/>
>> <string s='ঠিা' o='549'/>
>> <string s='ঠিী' o='89'/>
>> <string s='ড়িা' o='1361'/>
>> <string s='ড়িী' o='135'/>
>> <string s='ডিা' o='257'/>
>> <string s='ঢ়িা' o='71'/>
>> <string s='ণিা' o='354'/>
>> <string s='তিী' o='270'/>
>> <string s='তি্যু' o='75'/>
>> <string s='ত্তিা' o='143'/>
>> <string s='ত্তিী' o='144'/>
>> <string s='ত্ত্বিা' o='54'/>
>> <string s='ত্বিা' o='72'/>
>> <string s='ত্মিা' o='161'/>
>> <string s='ত্যিা' o='129'/>
>> <string s='ত্রিা' o='217'/>
>> <string s='ত্রিী' o='264'/>
>> <string s='ত্ৰিা' o='102'/>
>> <string s='থিা' o='290'/>
>> <string s='থিী' o='127'/>
>> <string s='দিী' o='514'/>
>> <string s='দ্ধিা' o='228'/>
>> <string s='দ্বিা' o='505'/>
>> <string s='দ্বিী' o='121'/>
>> <string s='দ্যিা' o='53'/>
>> <string s='ধিী' o='235'/>
>> <string s='নিী' o='551'/>
>> <string s='ন্তিা' o='100'/>
>> <string s='ন্ত্রিা' o='93'/>
>> <string s='ন্ত্রিী' o='171'/>
>> <string s='ন্দিা' o='102'/>
>> <string s='ন্দ্রিা' o='238'/>
>> <string s='ন্দ্রিী' o='79'/>
>> <string s='ন্ধিা' o='109'/>
>> <string s='ন্মিা' o='98'/>
>> <string s='পিা' o='1199'/>
>> <string s='প্তিা' o='67'/>
>> <string s='প্রিা' o='203'/>
>> <string s='ফিা' o='174'/>
>> <string s='ফ্রিা' o='60'/>
>> <string s='বিী' o='715'/>
>> <string s='ব্রিা' o='87'/>
>> <string s='ভিা' o='908'/>
>> <string s='ভিী' o='80'/>
>> <string s='মিী' o='373'/>
>> <string s='ম্পিা' o='55'/>
>> <string s='ম্বিা' o='117'/>
>> <string s='ম্মিা' o='67'/>
>> <string s='যিা' o='204'/>
>> <string s='রিা' o='4703'/>
>> <string s='র্ণিা' o='55'/>
>> <string s='র্তিী' o='56'/>
>> <string s='র্বিা' o='105'/>
>> <string s='র্মিা' o='68'/>
>> <string s='র্মিী' o='70'/>
>> <string s='র্ষিা' o='65'/>
>> <string s='লিী' o='419'/>
>> <string s='ল্পিী' o='113'/>
>> <string s='শিী' o='216'/>
>> <string s='শ্বিা' o='145'/>
>> <string s='ষিা' o='376'/>
>> <string s='ষ্টিা' o='269'/>
>> <string s='ষ্ট্যিা' o='75'/>
>> <string s='ষ্ঠিী' o='99'/>
>> <string s='সিা' o='760'/>
>> <string s='সিী' o='117'/>
>> <string s='স্কিা' o='106'/>
>> <string s='স্ট্রিী' o='157'/>
>> <string s='স্তিা' o='311'/>
>> <string s='স্তিী' o='50'/>
>> <string s='স্থিা' o='1946'/>
>> <string s='স্বিা' o='97'/>
>> <string s='স্মিা' o='74'/>
>> <string s='হিী' o='424'/>
>> <string s='হ্যিা' o='89'/>
>> <string s='ৰিী' o='204'/>
>> <string s='ৰ্ত্তিা' o='125'/>
>> <string s='ৰ্ত্তিী' o='118'/>
>> <string s='ৰ্ম্মিা' o='58'/>
>> <string s='ৱিা' o='264'/>
>>
>>
>>
>
>
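For anyone wanting to reproduce a tally like the one quoted above on their own corpus, here is a rough sketch. It is not the tool that produced the list: the names `count_double_matra_syllables` and `format_report` and the `min_occurrences` parameter are hypothetical, and the consonant class is an assumption (U+0995..U+09B9 plus RRA, RHA, YYA and the two Assamese letters, each optionally followed by nukta). It matches a consonant cluster followed by 09BF and then 09BE or 09C0.

```python
import re
from collections import Counter

# Assumed consonant set: KA..HA, RRA, RHA, YYA, Assamese RA and WA,
# each optionally followed by NUKTA (U+09BC).
CONSONANT = r"[\u0995-\u09B9\u09DC\u09DD\u09DF\u09F0\u09F1]\u09BC?"

# Zero or more consonant+virama pairs, a final consonant, then
# VOWEL SIGN I (09BF) followed by VOWEL SIGN AA (09BE) or II (09C0).
CLUSTER = re.compile(
    rf"(?:{CONSONANT}\u09CD)*{CONSONANT}\u09BF[\u09BE\u09C0]"
)


def count_double_matra_syllables(lines):
    """Count clusters of the form <... 09BF 09BE> or <... 09BF 09C0>."""
    counts = Counter()
    for line in lines:
        counts.update(CLUSTER.findall(line))
    return counts


def format_report(counts, min_occurrences=1):
    """Render counts in the same <string s=... o=.../> shape as the list above."""
    return [
        f"<string s='{s}' o='{n}'/>"
        for s, n in sorted(counts.items())
        if n >= min_occurrences
    ]
```

Usage would be along the lines of `format_report(count_double_matra_syllables(open(path, encoding="utf-8")))`, where `path` points at a UTF-8 text dump of the corpus.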