Dedotted I and dotlessi

Mon Aug 17 09:58:40 CDT 2020

> On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> 
> There is a recommendation around that fonts should generate different
> glyph ID sequences for canonically inequivalent character sequences.
> Is this still a reasonable requirement?
> 
> The most obvious reason for this is that in simple scripts, the glyphs
> in the glyph stream follow the order of characters in the character
> stream, and therefore processes might hope to convert the glyph stream
> back to the character stream.  Now, <i, U+0302 COMBINING CIRCUMFLEX
> ACCENT> and <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302> should render
> the same, and one shaping trick is to convert both base characters to
> the same glyph, commonly called dotlessi.  Glyph stream to character
> stream conversions were used in the generation of PDFs and the logic
> for extracting text from them.
> 
> Is the recommendation still valid, or have things moved on?

For some PDF workflows, yes.
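
To make the ambiguity in the quoted question concrete, here is a minimal Python sketch. It checks that the two sequences are canonically inequivalent, then runs them through a toy shaper that maps both base characters to the same glyph before a combining mark above. The glyph names and the `shape` function are assumptions made for illustration, not any real font's or shaping engine's behaviour.

```python
import unicodedata

I_CIRC = "i\u0302"             # <i, U+0302 COMBINING CIRCUMFLEX ACCENT>
DOTLESS_CIRC = "\u0131\u0302"  # <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302>


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def shape(text: str) -> list[str]:
    """Toy shaper (hypothetical): emits one glyph name per character.

    Both "i" and U+0131 become the glyph "dotlessi" when followed by a
    combining mark of class 230 (above), mimicking the shaping trick.
    """
    glyphs = []
    for pos, ch in enumerate(text):
        mark_above_follows = (
            pos + 1 < len(text) and unicodedata.combining(text[pos + 1]) == 230
        )
        if ch == "\u0131" or (ch == "i" and mark_above_follows):
            glyphs.append("dotlessi")
        elif ch == "i":
            glyphs.append("i")
        elif ch == "\u0302":
            glyphs.append("circumflexcomb")
        else:
            glyphs.append(f"uni{ord(ch):04X}")
    return glyphs


if __name__ == "__main__":
    print(canonically_equivalent(I_CIRC, DOTLESS_CIRC))
    print(shape(I_CIRC), shape(DOTLESS_CIRC))
```

Because both inputs produce the same glyph sequence, a process that maps glyph IDs back to characters cannot tell which of the two inequivalent sequences was the original.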

>  Is the
> recommendation applicable to Indic scripts, where glyph stream to
> character stream conversion may be as complicated as the
> reverse direction and there is a natural tendency for distinctions to
> be lost?  (In Devanagari, the distinction between mandated and
> fallback half-forms is one example.)

The same workflows can’t handle one-to-many substitutions or reordering, so when I’m making fonts that need these I usually just give up on the “unique glyph per code point” requirement. I also ignore it when making Arabic fonts, because reliably extracting Arabic text from PDFs generated with such workflows is already a lost cause.

Regards,
Khaled