Dedotted I and dotlessi

Mon Aug 17 18:39:10 CDT 2020

> On Aug 18, 2020, at 1:33 AM, Khaled Hosny <dr.khaled.hosny at gmail.com> wrote:
> 
> 
> 
>> On Aug 18, 2020, at 12:15 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>> 
>> On Mon, 17 Aug 2020 20:53:01 +0200 Khaled Hosny via Unicode <unicode at unicode.org> wrote:
>> 
>>> In case of reordering, you will also need to tag the whole reordered
>>> sequence as one unit since you can’t tell which glyphs belong to
>>> which character any more. People will also complain about increased
>>> file size, so you will have to do tagging selectively for cases that
>>> can’t be handled in a different way.
>> 
>> I don't know if it's due to another feature (or even merely a bug), but
>> I did notice that LibreOffice-exported PDFs swell enormously if one uses
>> PDF/A to make Indic text extractable.  This was with a series of
>> documents that were at least 90% English (in the Latin script).  Zipping
>> was ineffective.
> 
> LibreOffice does exactly the selective handling I described: unique one-to-one and many-to-one mappings use the font’s /ToUnicode, and everything else uses /ActualText tagging per cluster (HarfBuzz clusters, which are not always the same as grapheme clusters). As it happens, I wrote that code in LibreOffice.

The PDF/A issue is probably unrelated, since what I’m describing above happens with any PDF export profile.
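
To illustrate the selective handling described in the quoted reply, here is a minimal Python sketch. It is not the LibreOffice code. The function name plan_text_extraction and the (text, glyph_ids) cluster representation are assumptions made for the example. A glyph gets a /ToUnicode entry only if it always stands alone in its cluster for the same text. Every other cluster is marked for /ActualText.

```python
from collections import defaultdict


def plan_text_extraction(clusters):
    """Split clusters between /ToUnicode and /ActualText (illustrative only).

    clusters: list of (text, glyph_ids) pairs, one per shaping cluster.
    Returns (to_unicode, actual_text), where to_unicode maps a glyph id to
    its text and actual_text lists the indices of clusters needing tagging.
    """
    # Record every text that each glyph represents when it forms a cluster
    # on its own (one-to-one, or many characters to one glyph).
    texts_by_glyph = defaultdict(set)
    for text, glyphs in clusters:
        if len(glyphs) == 1:
            texts_by_glyph[glyphs[0]].add(text)

    # A glyph can go into /ToUnicode only if its mapping is unique.
    to_unicode = {
        glyph: next(iter(texts))
        for glyph, texts in texts_by_glyph.items()
        if len(texts) == 1
    }

    # Everything else (multi-glyph or reordered clusters, ambiguous glyphs)
    # would be wrapped in /ActualText, one span per cluster.
    actual_text = [
        index
        for index, (text, glyphs) in enumerate(clusters)
        if len(glyphs) != 1 or glyphs[0] not in to_unicode
    ]
    return to_unicode, actual_text


# Hypothetical glyph ids: "a" and the "fi" ligature map uniquely, the
# two-glyph Devanagari cluster and the ambiguous glyph 40 do not.
example = [("a", [10]), ("fi", [20]), ("कि", [31, 30]), ("x", [40]), ("y", [40])]
to_unicode, actual_text = plan_text_extraction(example)
```

In this example only glyphs 10 and 20 end up in /ToUnicode. Clusters 2, 3 and 4 would each carry their own /ActualText span, and those spans are where the extra file size comes from.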