Dedotted I and dotlessi
richard.wordingham at ntlworld.com
Mon Aug 17 03:37:47 CDT 2020
There is a recommendation around that fonts should generate different
glyph ID sequences for canonically inequivalent character sequences.
Is this still a reasonable requirement?
The most obvious reason for this is that in simple scripts, the glyphs
in the glyph stream follow the order of characters in the character
stream, and therefore processes might hope to convert the glyph stream
back to the character stream. Now, <i, U+0302 COMBINING CIRCUMFLEX
ACCENT> and <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302> should render
the same, and one shaping trick is to convert both base characters to
the same glyph, commonly called dotlessi. The two sequences are
canonically inequivalent, yet this trick gives them identical glyph
ID sequences, so the glyph stream cannot be mapped back unambiguously.
Glyph stream to character stream conversion was used in the
generation of PDFs and in the logic for extracting text from them.
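
To make the clash concrete, here is a minimal sketch in Python. The
normalization check uses the standard unicodedata module. The shaper is
a toy: the glyph names, the CMAP table and the toy_shape function are
assumptions made up for illustration, not the behaviour of any real
font or shaping engine.

```python
import unicodedata

def canonically_equivalent(a, b):
    # Two strings are canonically equivalent iff their NFD forms match.
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Hypothetical character-to-glyph map; glyph names are illustrative only.
CMAP = {"i": "i", "\u0131": "dotlessi", "\u0302": "circumflexcomb"}

def toy_shape(text):
    # Nominal glyphs first, in character order.
    glyphs = [CMAP[ch] for ch in text]
    # The "shaping trick": a soft-dotted i followed by a mark above
    # (canonical combining class 230) is replaced by dotlessi.
    for k in range(len(glyphs) - 1):
        if glyphs[k] == "i" and unicodedata.combining(text[k + 1]) == 230:
            glyphs[k] = "dotlessi"
    return glyphs

dotted = "i\u0302"        # <i, U+0302>
dotless = "\u0131\u0302"  # <U+0131, U+0302>

print(canonically_equivalent(dotted, dotless))
print(toy_shape(dotted), toy_shape(dotless))
```

Under these assumptions the two strings compare as canonically
inequivalent, while the toy shaper emits the same glyph list for both,
so nothing in the glyph stream alone says which base character was
typed.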
Is the recommendation still valid, or have things moved on? Is the
recommendation applicable to Indic scripts, where glyph stream to
character stream conversion may be as complicated as the
reverse direction and there is a natural tendency for distinctions to
be lost? (In Devanagari, the distinction between mandated and
fallback half-forms is one example.)