Dedotted I and dotlessi

Khaled Hosny
Mon Aug 17 13:53:01 CDT 2020

Easier said than done. Even for tools that want to do this, the only reliable way is tagging with /ActualText, but this has to be done per grapheme cluster, since PDF viewers can’t select or highlight part of a text run tagged with /ActualText. That leaves Arabic excluded, since PDF stores glyphs in visual order and you don’t want to tag full paragraphs. In case of reordering, you will also need to tag the whole reordered sequence as one unit, since you can no longer tell which glyphs belong to which character. People will also complain about increased file size, so you will have to tag selectively, only for cases that can’t be handled in a different way.
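
To make the per-cluster tagging concrete, here is a rough sketch of what a generator might emit for a single cluster. It is only an illustration: the function name actual_text_span and the glyph IDs are made up, and it assumes a font addressed with 2-byte glyph codes (for example a CID font with Identity-H encoding). The /Span ... BDC ... EMC marked-content wrapper and the UTF-16BE hex string with a leading FEFF are standard PDF syntax.

```python
def actual_text_span(text: str, glyph_ids: list[int]) -> str:
    """Wrap one cluster's glyphs in a /Span marked-content sequence.

    Hypothetical helper: `text` is the original Unicode for the cluster,
    `glyph_ids` are the glyphs the shaper produced for it.
    """
    # /ActualText is a PDF text string; written here as a hex string
    # holding UTF-16BE preceded by the byte order mark FEFF.
    actual = (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()
    # Assumption: the font uses 2-byte glyph codes, so each glyph ID
    # is written as four hex digits inside one string shown with Tj.
    glyphs = "".join(f"{gid:04X}" for gid in glyph_ids)
    return f"/Span <</ActualText <{actual}>>> BDC <{glyphs}> Tj EMC"


# "i" + COMBINING ACUTE ACCENT, shaped as a dotless i glyph plus an
# accent glyph (the glyph IDs 0x0123 and 0x0456 are invented).
print(actual_text_span("i\u0301", [0x0123, 0x0456]))
```

Every tagged cluster carries the wrapper plus a second copy of its text, which is where the file size growth comes from.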

In short, text extraction from PDF is a mess. 

On Aug 17, 2020, at 8:00 PM, Markus Scherer wrote:
> PDFs *should* be generated with Unicode strings, so that copy-and-paste etc. need not try to map back from glyphs.
> Of course, that's optional, and some tools don't bother.
> markus

