Fonts and Canonical Equivalence

Sat Aug 10 02:26:34 CDT 2019

I've spun this question off from the issue of what the USE (Universal
Shaping Engine) should do when it is given the NFC canonical equivalent
of a string it would accept, but that equivalent fails to match its
regular expressions because they are applied to strings of characters
rather than to canonical equivalence classes of strings.

What sort of guidance is there on the character sequences that a font
should support with respect to canonical equivalence?  For example,
one might think it would suffice for a font to support NFD strings
only, but sometimes it seems that the only canonical equivalent that
needs to be supported is not a Unicode-defined normalisation form, but
a renderer-defined canonical form.
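
To make the size of an equivalence class concrete, here is a minimal
Python sketch using only the standard unicodedata module.  The choice of
Vietnamese "ệ" is purely illustrative and is my assumption, not
something taken from a particular font report.  It lists five
canonically equivalent spellings that a font-plus-renderer combination
could in principle be handed, and checks that they share one NFD and one
NFC form.

```python
import unicodedata

# Five canonically equivalent spellings of Vietnamese "ệ" (illustrative only).
spellings = [
    "\u1EC7",         # fully precomposed letter
    "\u1EB9\u0302",   # e with dot below + combining circumflex
    "\u00EA\u0323",   # e with circumflex + combining dot below
    "e\u0323\u0302",  # base, dot below (ccc=220), circumflex (ccc=230)
    "e\u0302\u0323",  # base, circumflex, dot below (not in canonical order)
]

# Normalising every spelling should collapse the list to a single string.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}
nfc_forms = {unicodedata.normalize("NFC", s) for s in spellings}

for s in spellings:
    print(" ".join(f"U+{ord(c):04X}" for c in s))
print("distinct NFD forms:", len(nfd_forms))
print("distinct NFC forms:", len(nfc_forms))
```

A font tested only against the NFD spelling would never see the other
four unless the renderer normalises to NFD before shaping, which is the
point of the question.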

For example, when a Tai Tham renderer supports subscripted final
consonants, should the font support both the sequences <tone, SAKOT,
consonant> and <SAKOT, tone, consonant>, or just the one chosen by the
rendering engine?  The latter is the canonically ordered sequence, as
SAKOT has ccc=9 and the tone marks have ccc=230.  The HarfBuzz SEA
engine would present the font with the former; font designers saw
rendering failures when Tai Tham text belatedly started being
canonically normalised.
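
The reordering can be checked with a short Python sketch, again using
only the standard unicodedata module.  The particular base consonant,
tone mark and final consonant are arbitrary picks of mine for
illustration, not a claim about a real word.

```python
import unicodedata

# Arbitrary illustrative characters (assumed, not from a real word).
HIGH_KA = "\u1A20"  # TAI THAM LETTER HIGH KA, the base consonant
TONE_1 = "\u1A75"   # TAI THAM SIGN TONE-1
SAKOT = "\u1A60"    # TAI THAM SIGN SAKOT
NGA = "\u1A26"      # TAI THAM LETTER NGA, the subscripted final

tone_first = HIGH_KA + TONE_1 + SAKOT + NGA
sakot_first = HIGH_KA + SAKOT + TONE_1 + NGA

# Canonical combining classes drive the reordering.
print("ccc(SAKOT)  =", unicodedata.combining(SAKOT))
print("ccc(TONE-1) =", unicodedata.combining(TONE_1))

# Normalisation moves the lower-class SAKOT ahead of the tone mark.
normalised = unicodedata.normalize("NFC", tone_first)
print("NFC(tone_first) == sakot_first:", normalised == sakot_first)
```

So text entered as <tone, SAKOT, consonant> reaches the renderer as
<SAKOT, tone, consonant> once anything upstream normalises it, and what
the font finally sees then depends on whether the shaping engine
reorders it again.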

There are similar issues with Tibetan; some fonts do not work properly
if a vowel below (ccc=132) is separated from the base of the
consonant stack by a vowel above (ccc=130), which is the order that
canonical normalisation produces.
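
A similar sketch for Tibetan, under the same assumptions (standard
unicodedata module only; the base letter and vowel signs are arbitrary
choices of mine, and describe is a hypothetical helper name).  It prints
the combining class of each character before and after normalisation.

```python
import unicodedata

def describe(text):
    """Return (code point, canonical combining class) pairs for text."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c)) for c in text]

KA = "\u0F40"       # TIBETAN LETTER KA, the base
VOWEL_U = "\u0F74"  # TIBETAN VOWEL SIGN U, a vowel below
VOWEL_I = "\u0F72"  # TIBETAN VOWEL SIGN I, a vowel above

# Vowel below placed next to the base, as some fonts expect.
below_first = KA + VOWEL_U + VOWEL_I
normalised = unicodedata.normalize("NFC", below_first)

print("as typed:  ", describe(below_first))
print("normalised:", describe(normalised))
```

After normalisation the vowel above sits between the base and the vowel
below, which is exactly the arrangement that such fonts mishandle.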

TUS (The Unicode Standard) sees a rendering engine plus a font file (or
a set of them) as a single entity, so I don't think it offers much
guidance here.  It seems tolerant of the loss of precision in placement
when a Latin character is rendered as base plus diacritic rather than
as a precomposed glyph.
One can also pedantically argue that a font is a data file rather than
a 'process'.  (Additionally, a lot of us get confused by the mens rea
aspect of Unicode compliance.)

Richard.