Dedotted I and dotlessi

Richard Wordingham richard.wordingham at ntlworld.com
Mon Aug 17 17:15:50 CDT 2020


On Mon, 17 Aug 2020 20:53:01 +0200
Khaled Hosny via Unicode <unicode at unicode.org> wrote:

> Easier said than done. Even for tools that want to do this, the only
> reliable way is tagging with /ActualText, but this has to be done per
> grapheme cluster as PDF viewers can’t select or highlight parts of
> text tagged with /ActualText, so Arabic excluded since PDF stores
> glyphs in visual order and you don’t want to tag full paragraphs.

That's a nasty bug.  Has it been established that negative
(advance)widths are "inconsistent" with TrueType and CFF fonts?  I would
have said that a PDF width of -573 was entirely consistent with a
TrueType width of 573.
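
To make the two readings of "consistent" concrete, here is a minimal
Python sketch.  It assumes PDF widths are expressed in thousandths of an
em and scales the font's advance by unitsPerEm before comparing.  The
function names, the one-unit tolerance and the ignore_sign switch are
illustrative assumptions, not what any particular validator is known to
do.

```python
def font_units_to_pdf(advance: int, units_per_em: int) -> float:
    """Scale a font-unit advance width to PDF glyph space (1000 units per em)."""
    return advance * 1000.0 / units_per_em


def widths_consistent(pdf_width: float, advance: int, units_per_em: int,
                      tolerance: float = 1.0, ignore_sign: bool = False) -> bool:
    """Compare a PDF width entry with the font program's advance width.

    tolerance and ignore_sign are assumptions made for illustration only.
    With ignore_sign=True, only the magnitude of the PDF width is compared.
    """
    expected = font_units_to_pdf(advance, units_per_em)
    actual = abs(pdf_width) if ignore_sign else pdf_width
    return abs(actual - expected) <= tolerance


# Strict reading: -573 does not match an advance of 573.
strict = widths_consistent(-573, 573, 1000)
# Magnitude-only reading: -573 matches an advance of 573.
lenient = widths_consistent(-573, 573, 1000, ignore_sign=True)
```

Under the strict reading the -573 entry is rejected; under the
magnitude-only reading it passes, which is the interpretation I would
have expected.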

> In
> case of reordering, you will also need to tag the whole reordered
> sequence as one unit since you can’t tell which glyphs belong to
> which character any more. People will also complain about increased
> file size, so you will have to do tagging selectively for cases that
> can’t be handled in a different way.

I don't know if it's due to another feature (or even merely a bug), but
I did notice that LibreOffice-exported PDFs swell enormously if one uses
PDF/A to make Indic text extractable.  This was with a series of
documents that were at least 90% English (in the Latin script).  Zipping
was ineffective.
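
For a rough feel of the per-cluster cost of /ActualText tagging, here is
a minimal Python sketch that wraps one glyph run in a marked-content
span and counts the extra bytes.  It assumes a font addressed by
two-byte glyph codes (as with Identity-H) and writes the replacement
text as a UTF-16BE hex string with a byte-order mark.  The glyph IDs are
made up, and the helper names are hypothetical rather than taken from
any PDF library.

```python
from typing import Sequence


def actual_text_hex(text: str) -> str:
    """Encode text as a PDF hex string: UTF-16BE preceded by a byte-order mark."""
    return "<" + ("\ufeff" + text).encode("utf-16-be").hex().upper() + ">"


def show_glyphs(glyph_ids: Sequence[int]) -> str:
    """Untagged text-showing operator for a run of two-byte glyph codes."""
    return "<" + "".join(f"{gid:04X}" for gid in glyph_ids) + "> Tj"


def tagged_span(glyph_ids: Sequence[int], text: str) -> str:
    """Wrap the glyph run in a /Span marked-content sequence carrying /ActualText."""
    return (f"/Span <</ActualText {actual_text_hex(text)}>> BDC "
            f"{show_glyphs(glyph_ids)} EMC")


def tagging_overhead(glyph_ids: Sequence[int], text: str) -> int:
    """Extra characters in the uncompressed content stream due to tagging."""
    return len(tagged_span(glyph_ids, text)) - len(show_glyphs(glyph_ids))


# Devanagari KA + vowel sign I: the vowel glyph is drawn first, so the
# whole reordered cluster is tagged as one unit.  Glyph IDs are hypothetical.
cluster = "\u0915\u093F"
glyphs = [0x0052, 0x0031]
span = tagged_span(glyphs, cluster)
extra = tagging_overhead(glyphs, cluster)
```

The wrapper is several times the size of the glyph run it encloses, and
that is before content stream compression, so this is only an indication
of where the growth might come from, not a measurement of what
LibreOffice actually writes.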

Richard.


