Bengali syllables <... 09BF 09BE> and <... 09BF 09C0>

Asmus Freytag (c) asmusf at ix.netcom.com
Tue Feb 7 23:48:19 CST 2017


On 2/7/2017 9:38 PM, Manish Goregaokar wrote:
> > The very first one কিী‎ (0995 09BF 09C0) had 1090 hits and shows up 
> in a book of short stories:
>
> That's bad OCR, that's an apostrophe, a Ka, and an E, with the 
> apostrophe being interpreted as a matra somehow.
Interesting suggestion. Would explain a lot.

A./
>
> I bet there are only a couple of OCR algorithms out there handling 
> Bangla. Indic scripts aren't something you can OCR glyph by glyph in 
> such a straightforward way due to ligatures, so these algorithms are 
> probably noticing components of a character and producing it. It sees 
> a preceding line and the curve above, and interprets that as an I. It 
> also sees the following line and curve above, and interprets that as 
> an EE. It then just puts the two together. It shouldn't, but it does.
>
> Given a small set of OCR algorithms I think it's reasonable to assume 
> that such aberrations would be common across outputs -- so hundreds of 
> hits for a typo don't sound out of the ordinary to me.
>
> > Tried a random one: ঘিা (0998 09BF 09BE)
>
> I went through the results for ঘিা (0998 09BF 09BE). Most occurrences 
> are actually ঘন্টা (0998 09A8 09CD 099F 09BE), "ghanta" which can mean 
> "hour" or "bell". Reasonably common word. These documents don't look 
> scanned -- the text isn't garbled or anything, but it could be a 
> cleaned up scanned document because I copied out some more of the text 
> and there were similar aberrations all over the place. For example, in 
> [1] the letter ব ("ba") is used frequently, but is written with a 
> fancier script where it has an extra line through it. Many occurrences 
> of it have been interpreted as sequences of vowel diacritics. The last 
> line of the second-last stanza on page 5 has an absolutely ridiculous 
> number of consecutive diacritics in the PDF text.
>
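
A minimal Python sketch (not from the thread) of one way to flag the
"consecutive diacritics" described above in text copied out of a PDF. The
function name find_matra_runs and the exact set of vowel signs checked are
assumptions made for illustration. The text is normalized to NFC first so
that the canonically equivalent spelling of O (09C7 09BE) is composed to
09CB and is not reported as a suspicious pair.

```python
import re
import unicodedata

# Bengali dependent vowel signs considered here (an illustrative choice):
# U+09BE..U+09C4, U+09C7, U+09C8, U+09CB, U+09CC.
MATRAS = "\u09BE-\u09C4\u09C7\u09C8\u09CB\u09CC"
MATRA_RUN = re.compile(f"[{MATRAS}]{{2,}}")


def find_matra_runs(text):
    """Return (offset, hex code points) for each run of 2+ vowel signs."""
    # NFC composes 09C7 09BE -> 09CB and 09C7 09D7 -> 09CC, so those
    # legitimate two-part spellings do not count as runs.
    text = unicodedata.normalize("NFC", text)
    return [
        (m.start(), " ".join(f"{ord(c):04X}" for c in m.group()))
        for m in MATRA_RUN.finditer(text)
    ]


if __name__ == "__main__":
    # The suspect syllable and the real word discussed above.
    print(find_matra_runs("\u0998\u09BF\u09BE"))
    print(find_matra_runs("\u0998\u09A8\u09CD\u099F\u09BE"))
```

A check like this only reports where such runs occur; whether a given run
is an OCR artifact or a typo still needs a look at the source document.
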
>
>  [1]: 
> http://yousigma.com/religionandphilosophy/poojasloka/Sri%20Hari%20Kathamruta%20Sara%20Datta%20Swatantrya%20Sandhi%20(Sri%20Jagannatha%20Vittala%20Dasaru)%20-%20Assamese.pdf 
>
>
> -Manish
>
> On Tue, Feb 7, 2017 at 7:53 PM, Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>     On 2/7/2017 10:08 AM, Eric Muller wrote:
>>     In looking at the wiki{pedia,book.source,tionary} corpus for
>>     Bengali, I see a relatively large number of syllables with <...
>>     09BF 09BE> or <... 09BF 09C0>. I checked a couple of sources, and
>>     I did not find them listed anywhere as being normally used.
>>
>>     Are they in normal use or are those all typos?
>     Tried a random one: ঘিা (0998 09BF 09BE) and got 385 hits in google.
>     Would surprise me if all of these were typos.
>
>     The very first one কিী‎ (0995 09BF 09C0) had 1090 hits and shows
>     up in a book of short stories:
>
>     where it starts a paragraph.
>
>     A./
>
>>
>>     I did not find any occurrence in the Assamese corpus.
>>
>>     Thanks,
>>     Eric.
>>
>>     The syllables (o is the number of occurrences):
>>
>>
>>     <string s='&#x0995;&#x09bf;&#x09c0;' o='198'/>
>>     <string s='&#x0995;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='262'/>
>>     <string s='&#x0995;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='447'/>
>>     <string s='&#x0995;&#x09cd;&#x09b0;&#x09bf;&#x09c0;' o='77'/>
>>     <string s='&#x0995;&#x09cd;&#x09b2;&#x09bf;&#x09be;' o='245'/>
>>     <string s='&#x0995;&#x09cd;&#x09b7;&#x09bf;&#x09c0;' o='161'/>
>>     <string s='&#x0995;&#x09cd;&#x09b8;&#x09bf;&#x09be;' o='138'/>
>>     <string s='&#x0996;&#x09bf;&#x09be;' o='949'/>
>>     <string s='&#x0997;&#x09bf;&#x09be;' o='2671'/>
>>     <string s='&#x0997;&#x09bf;&#x09c0;' o='250'/>
>>     <string s='&#x0997;&#x09cd;&#x09a8;&#x09bf;&#x09be;' o='57'/>
>>     <string s='&#x0997;&#x09cd;&#x09a8;&#x09bf;&#x09c0;' o='110'/>
>>     <string s='&#x0997;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='143'/>
>>     <string s='&#x0998;&#x09bf;&#x09be;' o='83'/>
>>     <string s='&#x0999;&#x09cd;&#x0995;&#x09bf;&#x09be;' o='403'/>
>>     <string s='&#x0999;&#x09cd;&#x0997;&#x09bf;&#x09be;' o='267'/>
>>     <string s='&#x0999;&#x09cd;&#x0997;&#x09bf;&#x09c0;' o='150'/>
>>     <string s='&#x099a;&#x09bf;&#x09be;' o='905'/>
>>     <string s='&#x099a;&#x09bf;&#x09c0;' o='135'/>
>>     <string s='&#x099a;&#x09cd;&#x099a;&#x09bf;&#x09be;' o='91'/>
>>     <string s='&#x099a;&#x09cd;&#x099b;&#x09bf;&#x09be;' o='323'/>
>>     <string s='&#x099b;&#x09bf;&#x09be;' o='712'/>
>>     <string s='&#x099b;&#x09bf;&#x09c0;' o='61'/>
>>     <string s='&#x099c;&#x09bf;&#x09be;' o='527'/>
>>     <string s='&#x099c;&#x09bf;&#x09c0;' o='140'/>
>>     <string s='&#x099c;&#x09cd;&#x099c;&#x09bf;&#x09be;' o='56'/>
>>     <string s='&#x099d;&#x09bf;&#x09be;' o='81'/>
>>     <string s='&#x099e;&#x09bf;&#x09be;' o='71'/>
>>     <string s='&#x099e;&#x09cd;&#x099a;&#x09bf;&#x09be;' o='175'/>
>>     <string s='&#x099e;&#x09cd;&#x099c;&#x09bf;&#x09be;' o='270'/>
>>     <string s='&#x099e;&#x09cd;&#x099c;&#x09bf;&#x09c0;' o='316'/>
>>     <string s='&#x099f;&#x09bf;&#x09be;' o='807'/>
>>     <string s='&#x099f;&#x09bf;&#x09c0;' o='586'/>
>>     <string s='&#x09a0;&#x09bf;&#x09be;' o='549'/>
>>     <string s='&#x09a0;&#x09bf;&#x09c0;' o='89'/>
>>     <string s='&#x09a1;&#x09bc;&#x09bf;&#x09be;' o='1361'/>
>>     <string s='&#x09a1;&#x09bc;&#x09bf;&#x09c0;' o='135'/>
>>     <string s='&#x09a1;&#x09bf;&#x09be;' o='257'/>
>>     <string s='&#x09a2;&#x09bc;&#x09bf;&#x09be;' o='71'/>
>>     <string s='&#x09a3;&#x09bf;&#x09be;' o='354'/>
>>     <string s='&#x09a4;&#x09bf;&#x09c0;' o='270'/>
>>     <string s='&#x09a4;&#x09bf;&#x09cd;&#x09af;&#x09c1;' o='75'/>
>>     <string s='&#x09a4;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='143'/>
>>     <string s='&#x09a4;&#x09cd;&#x09a4;&#x09bf;&#x09c0;' o='144'/>
>>     <string s='&#x09a4;&#x09cd;&#x09a4;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='54'/>
>>     <string s='&#x09a4;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='72'/>
>>     <string s='&#x09a4;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='161'/>
>>     <string s='&#x09a4;&#x09cd;&#x09af;&#x09bf;&#x09be;' o='129'/>
>>     <string s='&#x09a4;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='217'/>
>>     <string s='&#x09a4;&#x09cd;&#x09b0;&#x09bf;&#x09c0;' o='264'/>
>>     <string s='&#x09a4;&#x09cd;&#x09f0;&#x09bf;&#x09be;' o='102'/>
>>     <string s='&#x09a5;&#x09bf;&#x09be;' o='290'/>
>>     <string s='&#x09a5;&#x09bf;&#x09c0;' o='127'/>
>>     <string s='&#x09a6;&#x09bf;&#x09c0;' o='514'/>
>>     <string s='&#x09a6;&#x09cd;&#x09a7;&#x09bf;&#x09be;' o='228'/>
>>     <string s='&#x09a6;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='505'/>
>>     <string s='&#x09a6;&#x09cd;&#x09ac;&#x09bf;&#x09c0;' o='121'/>
>>     <string s='&#x09a6;&#x09cd;&#x09af;&#x09bf;&#x09be;' o='53'/>
>>     <string s='&#x09a7;&#x09bf;&#x09c0;' o='235'/>
>>     <string s='&#x09a8;&#x09bf;&#x09c0;' o='551'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='100'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a4;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='93'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a4;&#x09cd;&#x09b0;&#x09bf;&#x09c0;' o='171'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a6;&#x09bf;&#x09be;' o='102'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a6;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='238'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a6;&#x09cd;&#x09b0;&#x09bf;&#x09c0;' o='79'/>
>>     <string s='&#x09a8;&#x09cd;&#x09a7;&#x09bf;&#x09be;' o='109'/>
>>     <string s='&#x09a8;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='98'/>
>>     <string s='&#x09aa;&#x09bf;&#x09be;' o='1199'/>
>>     <string s='&#x09aa;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='67'/>
>>     <string s='&#x09aa;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='203'/>
>>     <string s='&#x09ab;&#x09bf;&#x09be;' o='174'/>
>>     <string s='&#x09ab;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='60'/>
>>     <string s='&#x09ac;&#x09bf;&#x09c0;' o='715'/>
>>     <string s='&#x09ac;&#x09cd;&#x09b0;&#x09bf;&#x09be;' o='87'/>
>>     <string s='&#x09ad;&#x09bf;&#x09be;' o='908'/>
>>     <string s='&#x09ad;&#x09bf;&#x09c0;' o='80'/>
>>     <string s='&#x09ae;&#x09bf;&#x09c0;' o='373'/>
>>     <string s='&#x09ae;&#x09cd;&#x09aa;&#x09bf;&#x09be;' o='55'/>
>>     <string s='&#x09ae;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='117'/>
>>     <string s='&#x09ae;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='67'/>
>>     <string s='&#x09af;&#x09bf;&#x09be;' o='204'/>
>>     <string s='&#x09b0;&#x09bf;&#x09be;' o='4703'/>
>>     <string s='&#x09b0;&#x09cd;&#x09a3;&#x09bf;&#x09be;' o='55'/>
>>     <string s='&#x09b0;&#x09cd;&#x09a4;&#x09bf;&#x09c0;' o='56'/>
>>     <string s='&#x09b0;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='105'/>
>>     <string s='&#x09b0;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='68'/>
>>     <string s='&#x09b0;&#x09cd;&#x09ae;&#x09bf;&#x09c0;' o='70'/>
>>     <string s='&#x09b0;&#x09cd;&#x09b7;&#x09bf;&#x09be;' o='65'/>
>>     <string s='&#x09b2;&#x09bf;&#x09c0;' o='419'/>
>>     <string s='&#x09b2;&#x09cd;&#x09aa;&#x09bf;&#x09c0;' o='113'/>
>>     <string s='&#x09b6;&#x09bf;&#x09c0;' o='216'/>
>>     <string s='&#x09b6;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='145'/>
>>     <string s='&#x09b7;&#x09bf;&#x09be;' o='376'/>
>>     <string s='&#x09b7;&#x09cd;&#x099f;&#x09bf;&#x09be;' o='269'/>
>>     <string s='&#x09b7;&#x09cd;&#x099f;&#x09cd;&#x09af;&#x09bf;&#x09be;' o='75'/>
>>     <string s='&#x09b7;&#x09cd;&#x09a0;&#x09bf;&#x09c0;' o='99'/>
>>     <string s='&#x09b8;&#x09bf;&#x09be;' o='760'/>
>>     <string s='&#x09b8;&#x09bf;&#x09c0;' o='117'/>
>>     <string s='&#x09b8;&#x09cd;&#x0995;&#x09bf;&#x09be;' o='106'/>
>>     <string s='&#x09b8;&#x09cd;&#x099f;&#x09cd;&#x09b0;&#x09bf;&#x09c0;' o='157'/>
>>     <string s='&#x09b8;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='311'/>
>>     <string s='&#x09b8;&#x09cd;&#x09a4;&#x09bf;&#x09c0;' o='50'/>
>>     <string s='&#x09b8;&#x09cd;&#x09a5;&#x09bf;&#x09be;' o='1946'/>
>>     <string s='&#x09b8;&#x09cd;&#x09ac;&#x09bf;&#x09be;' o='97'/>
>>     <string s='&#x09b8;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='74'/>
>>     <string s='&#x09b9;&#x09bf;&#x09c0;' o='424'/>
>>     <string s='&#x09b9;&#x09cd;&#x09af;&#x09bf;&#x09be;' o='89'/>
>>     <string s='&#x09f0;&#x09bf;&#x09c0;' o='204'/>
>>     <string s='&#x09f0;&#x09cd;&#x09a4;&#x09cd;&#x09a4;&#x09bf;&#x09be;' o='125'/>
>>     <string s='&#x09f0;&#x09cd;&#x09a4;&#x09cd;&#x09a4;&#x09bf;&#x09c0;' o='118'/>
>>     <string s='&#x09f0;&#x09cd;&#x09ae;&#x09cd;&#x09ae;&#x09bf;&#x09be;' o='58'/>
>>     <string s='&#x09f1;&#x09bf;&#x09be;' o='264'/>
>>
>>
>>
>
>
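
For anyone wanting to work with the list quoted above, here is a rough
Python sketch (not from the thread) that reads the <string s='...' o='...'/>
entries and totals the occurrence counts by the last two code points of each
syllable. The function name tally_endings is hypothetical, and the sketch
assumes the entries look exactly like the ones quoted here, including mail
quote markers and entries wrapped over several lines.

```python
import html
import re
from collections import Counter

# One entry: <string s='...' o='N'/>, possibly wrapped across lines.
ENTRY = re.compile(r"<string\s+s='([^']*)'\s+o='(\d+)'\s*/>")


def tally_endings(listing):
    """Sum the o= counts, keyed by the last two code points of each s=."""
    # Replace leading mail quote markers ('>') and indentation with a
    # space so that wrapped entries can be matched as one unit.
    cleaned = re.sub(r"(?m)^[>\s]+", " ", listing)
    totals = Counter()
    for refs, count in ENTRY.findall(cleaned):
        syllable = html.unescape(refs)  # decode the &#x....; references
        ending = " ".join(f"{ord(c):04X}" for c in syllable[-2:])
        totals[ending] += int(count)
    return totals


if __name__ == "__main__":
    sample = (
        ">>     <string s='&#x0998;&#x09bf;&#x09be;' o='83'/>\n"
        ">>     <string\n"
        ">>     s='&#x09a4;&#x09cd;&#x09a4;&#x09cd;&#x09ac;&#x09bf;&#x09be;'\n"
        ">>     o='54'/>\n"
        ">>     <string s='&#x0995;&#x09bf;&#x09c0;' o='198'/>\n"
    )
    for ending, total in tally_endings(sample).items():
        print(ending, total)
```

Any entry that does not end in 09BF 09BE or 09BF 09C0 shows up under its
own key, which makes stray entries in the list easy to spot.
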

