Dedotted I and dotlessi
dr.khaled.hosny at gmail.com
Mon Aug 17 09:58:40 CDT 2020
> On Aug 17, 2020, at 10:37 AM, Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> There is a recommendation around that fonts should generate different
> glyph ID sequences for canonically inequivalent character sequences.
> Is this still a reasonable requirement?
> The most obvious reason for this is that in simple scripts, the glyphs
> in the glyph stream follow the order of characters in the character
> stream, and therefore processes might hope to convert the glyph stream
> back to the character stream. Now, <i, U+0302 COMBINING CIRCUMFLEX
> ACCENT> and <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302> should render
> the same, and one shaping trick is to convert both base characters to
> the same glyph, commonly called dotlessi. Glyph stream to character
> stream conversions were used in the generation of PDFs and the logic
> for extracting text from them.
> Is the recommendation still valid, or have things moved on?
For some PDF workflows, yes, it is still valid.
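A minimal sketch of the collision in question, in Python. The glyph names (dotlessi, i.dedotted, circumflexcomb) and the toy shape() function are assumptions for illustration only, not taken from any real font or shaping engine. It checks that the two sequences are canonically inequivalent, then compares a font that reuses dotlessi for the dedotted i with one that uses a separate glyph ID.

```python
import unicodedata

DOTTED = "i\u0302"        # <i, U+0302 COMBINING CIRCUMFLEX ACCENT>
DOTLESS = "\u0131\u0302"  # <U+0131 LATIN SMALL LETTER DOTLESS I, U+0302>


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def shape(text, dedotted_glyph):
    """Toy shaper (hypothetical): an 'i' directly before U+0302 becomes
    `dedotted_glyph`; every other character maps to its default glyph."""
    names = {"i": "i", "\u0131": "dotlessi", "\u0302": "circumflexcomb"}
    glyphs = []
    for pos, ch in enumerate(text):
        if ch == "i" and text[pos + 1:pos + 2] == "\u0302":
            glyphs.append(dedotted_glyph)
        else:
            glyphs.append(names[ch])
    return glyphs


# Font A reuses the dotlessi glyph; font B uses a separate glyph ID that
# could carry the same outline.
shared = (shape(DOTTED, "dotlessi"), shape(DOTLESS, "dotlessi"))
distinct = (shape(DOTTED, "i.dedotted"), shape(DOTLESS, "i.dedotted"))

print(canonically_equivalent(DOTTED, DOTLESS))
print(shared[0] == shared[1], distinct[0] == distinct[1])
```

With the shared glyph, a tool that sees only glyph IDs cannot tell which of the two character sequences produced the output; with the separate glyph ID it can.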
> Is the
> recommendation applicable to Indic scripts, where glyph stream to
> character stream conversion may be as complicated as the
> reverse direction and there is a natural tendency for distinctions to
> be lost. (In Devanagari, the distinction between mandated and
> fallback half-forms is one example.)
The same workflows can’t handle one-to-many substitution or reordering, so when I’m making fonts that need these I usually just give up on the “unique glyph per code point” requirement. I also forget about it when making Arabic fonts, because extracting Arabic text reliably from PDFs generated with such workflows is already a lost cause.
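To illustrate why a per-glyph reverse mapping breaks down in those cases, here is a rough Python sketch. The glyph names and the GLYPH_TO_CHAR table are hypothetical, and extract() only models a workflow that assigns one Unicode value to each glyph and reads the glyphs back in visual order; real PDF producers differ in the details.

```python
import unicodedata

# Hypothetical reverse table: exactly one code point per glyph.
GLYPH_TO_CHAR = {
    "nikhahit": "\u0e4d",  # THAI CHARACTER NIKHAHIT
    "saraaa": "\u0e32",    # THAI CHARACTER SARA AA
    "ka": "\u0915",        # DEVANAGARI LETTER KA
    "imatra": "\u093f",    # DEVANAGARI VOWEL SIGN I
}


def extract(glyphs):
    """Rebuild text glyph by glyph, in the order the glyphs were drawn."""
    return "".join(GLYPH_TO_CHAR[g] for g in glyphs)


# One-to-many substitution: U+0E33 THAI CHARACTER SARA AM drawn as two glyphs.
sara_am = "\u0e33"
sara_am_glyphs = ["nikhahit", "saraaa"]

# Reordering: <KA, VOWEL SIGN I> is drawn with the vowel sign first.
ki = "\u0915\u093f"
ki_glyphs = ["imatra", "ka"]

for text, glyphs in ((sara_am, sara_am_glyphs), (ki, ki_glyphs)):
    recovered = extract(glyphs)
    same = unicodedata.normalize("NFD", recovered) == unicodedata.normalize("NFD", text)
    print([f"U+{ord(c):04X}" for c in recovered], "canonically equivalent:", same)
```

In both cases the recovered string is not canonically equivalent to the input, no matter how carefully the glyph IDs were kept unique.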