Bengali syllables <... 09BF 09BE> and <... 09BF 09C0>

Sat Feb 11 18:45:45 CST 2017

On Tue, 7 Feb 2017 21:38:26 -0800
Manish Goregaokar <manish at mozilla.com> wrote:

> I went through the results for ঘিা (0998 09BF 09BE). Most occurrences
> are actually ঘন্টা (0998 09A8 09CD 099F 09BE), "ghanta" which can
> mean "hour" or "bell". Reasonably common word. These documents don't
> look scanned -- the text isn't garbled or anything, but it could be a
> cleaned up scanned document because I copied out some more of the
> text and there were similar aberrations all over the place.

I think OCR problems aren't the only cause.  I had a detailed look at a
PDF generated using Version 5.90 of the TrueType-outline Vrinda font
(available with Windows 7, at least), found on the web at
http://www.bsci-intl.org/sites/default/files/Terms%20of%20Implementation%20for%20Business%20Partners-Producers_2014_BN.pdf .
Examining it involved decompressing the compressed streams in the file,
a step I haven't yet automated.  The font name is visible in an
uncompressed part of the PDF.
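
The decompression step could be sketched roughly as below.  This is only
an illustration, not the procedure used above: inflate_streams is a
hypothetical helper name, and the sketch assumes the streams of interest
use plain Flate (zlib) compression with no predictor or other filter
chained in.  It finds streams by pattern matching rather than by parsing
the PDF object structure, so it can be fooled by unusual files.

```python
import re
import zlib

# Assumption: stream bodies sit between "stream<EOL>" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the inflated bodies of all streams that zlib can decompress.

    Hypothetical helper: streams that are not zlib data are skipped.
    """
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that follow
            # the compressed data and precede "endstream".
            data = zlib.decompressobj().decompress(raw)
        except zlib.error:
            continue  # not Flate-compressed (or uses another filter)
        if data:
            inflated.append(data)
    return inflated
```

Applied to the bytes of the whole file, this returns the content streams
and any ToUnicode CMaps as plain bytes that can then be searched.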

There was very little in the way of 'ActualText', so it seems that the
actual text has to be deduced using ToUnicode entries.  I looked for
text allegedly matching কিী‎ <U+0995, U+09BF, U+09C0> according to the
Firefox (Version 51.0.1 as prepared for Ubuntu Xenial) 'preview' and
analysed the second occurrence.  It was on line 4 of p3, which I
identified by a position in the page of y=684.22.  The problem was that
the ToUnicode mapping (Object 283 within the PDF) said glyph 0x0107
(=263) was for U+09BF BENGALI VOWEL SIGN I, whereas according to the
cmap tables for the font, it is for U+09AE BENGALI LETTER MA, which is
what I saw in the text as displayed.
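
The comparison just described can be sketched as follows.  It is a
minimal illustration under stated assumptions: parse_bfchar and
mismatches are hypothetical helper names, only 'bfchar' sections of the
ToUnicode CMap are read ('bfrange' sections are ignored), the character
codes in the CMap are assumed to be the font's glyph IDs, and font_cmap
is assumed to be a glyph-to-character dictionary obtained separately
from the font's cmap table.

```python
import re

BFCHAR_BLOCK = re.compile(rb"beginbfchar(.*?)endbfchar", re.DOTALL)
BFCHAR_PAIR = re.compile(rb"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>")


def parse_bfchar(cmap_bytes):
    """Map each source code to its Unicode text (bfchar sections only)."""
    mapping = {}
    for block in BFCHAR_BLOCK.finditer(cmap_bytes):
        for src, dst in BFCHAR_PAIR.findall(block.group(1)):
            # Destination values are UTF-16BE code units written in hex.
            text = bytes.fromhex(dst.decode("ascii")).decode("utf-16-be")
            mapping[int(src, 16)] = text
    return mapping


def mismatches(to_unicode, font_cmap):
    """List (glyph, pdf_text, font_text) where the two sources disagree."""
    return [
        (glyph, text, font_cmap[glyph])
        for glyph, text in sorted(to_unicode.items())
        if glyph in font_cmap and font_cmap[glyph] != text
    ]


# The case reported above: the PDF says glyph 0x0107 is U+09BF,
# while the font's cmap says it is U+09AE.
sample_cmap = b"1 beginbfchar\n<0107> <09BF>\nendbfchar\n"
font_cmap = {0x0107: "\u09ae"}  # assumed to come from the font itself
report = mismatches(parse_bfchar(sample_cmap), font_cmap)
```

Each tuple in the report names a glyph whose ToUnicode entry disagrees
with the font, which is the kind of discrepancy found for glyph 263.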

Now, although the font's post table has glyph names, I'm not sure that
the semantics of names like "bn_ma" are defined.  I suspect something
went wrong when the PDF generator deduced corresponding characters from
the GSUB table, though it could be a subsetting problem.

Richard.