Dedotted I and dotlessi

Khaled Hosny dr.khaled.hosny at gmail.com
Mon Aug 17 18:33:52 CDT 2020



> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> On Mon, 17 Aug 2020 20:53:01 +0200
> Khaled Hosny via Unicode <unicode at unicode.org> wrote:
> 
>> Easier said than done. Even for tools that want to do this, the only
>> reliable way is tagging with /ActualText, but this has to be done per
>> grapheme cluster as PDF viewers can’t select or highlight parts of
>> text tagged with /ActualText, so Arabic is excluded since PDF stores
>> glyphs in visual order and you don’t want to tag full paragraphs.
> 
> That's a nasty bug.  Has it been established that negative
> (advance)widths are "inconsistent" with TrueType and CFF fonts?  I would have
> said that a PDF width of -573 was entirely consistent with a TrueType
> width of 573.

It is possible to store glyphs in logical order and adjust their positions so they appear in visual order, but this all breaks in PDF readers that expect Arabic to be in visual order (since this is what almost all PDF creators do) and try to reverse the Arabic string again to get the logical string (which is not always reliable, since there is no standard reverse BiDi algorithm).
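
A minimal sketch of why reversing a visually ordered string is unreliable. The sample string and the naive_reverse helper are made up for illustration and do not describe what any particular PDF reader does:

```python
# Stand-in data: an Arabic word followed by a number, in logical (typing) order.
word = "\u0633\u0646\u0629"  # three Arabic letters
logical = word + " 2020"

# Visual order, left to right, in a right-to-left paragraph: the number sits
# to the left of the word, its digits still run left to right, and the
# letters of the word appear reversed.
visual = "2020 " + word[::-1]


def naive_reverse(visual_text):
    """Hypothetical reader heuristic: reverse the whole run."""
    return visual_text[::-1]


recovered = naive_reverse(visual)
print(recovered == logical)  # False: the digits come back as "0202"
```

The letters come back in the right order but the number does not, so a reader has to guess which runs to leave alone, and different readers can guess differently.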

>> In
>> case of reordering, you will also need to tag the whole reordered
>> sequence as one unit since you can’t tell which glyphs belong to
>> which character any more. People will also complain about increased
>> file size, so you will have to do tagging selectively for cases that
>> can’t be handled in a different way.
> 
> I don't know if it's due to another feature (or even merely a bug), but
> I did notice that LibreOffice-exported PDFs swell enormously if one uses
> PDF/A to make Indic text extractable.  This was with a series of
> documents that were at least 90% English (in the Latin script).  Zipping
> was ineffective.

LibreOffice does exactly the selective handling I described: unique one-to-one and many-to-one mappings use the font’s /ToUnicode, everything else uses /ActualText tagging per cluster (HarfBuzz clusters, which are not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice.
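
The selection rule can be sketched roughly as follows. This is an illustration under assumptions, not the LibreOffice code: the function name, the (text, glyph_ids) input shape, and the glyph IDs are all hypothetical.

```python
from collections import defaultdict


def plan_text_extraction(clusters):
    """Split clusters between /ToUnicode and /ActualText (illustrative only).

    clusters: list of (text, glyph_ids) pairs, one per shaping cluster.
    Returns (to_unicode, actual_text), where to_unicode maps a glyph ID to
    its text and actual_text lists the indices of clusters that need a tag.
    """
    # Record every text string each glyph stands for when it is the only
    # glyph in its cluster (one-to-one, or many-to-one for a ligature).
    seen = defaultdict(set)
    for text, glyphs in clusters:
        if len(glyphs) == 1:
            seen[glyphs[0]].add(text)

    # A glyph qualifies for /ToUnicode only if that text is unique.
    to_unicode = {g: next(iter(t)) for g, t in seen.items() if len(t) == 1}

    # Everything else is tagged per cluster: multi-glyph clusters, and
    # single glyphs that stand for different text in different places.
    actual_text = [
        i for i, (text, glyphs) in enumerate(clusters)
        if not (len(glyphs) == 1 and glyphs[0] in to_unicode)
    ]
    return to_unicode, actual_text


clusters = [("a", [10]), ("fi", [20]), ("ki", [30, 31]),
            ("a", [10]), ("b", [40]), ("c", [40])]
to_unicode, actual_text = plan_text_extraction(clusters)
print(to_unicode)   # {10: 'a', 20: 'fi'}
print(actual_text)  # [2, 4, 5]
```

Only three of the six clusters end up tagged here, which is the point of being selective: the common cases cost nothing extra in file size.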


