Multiple Preposed Marks

Philippe Verdy verdy_p at wanadoo.fr
Tue Nov 8 17:00:01 CST 2016


2016-11-08 9:30 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> TUS Section 2.11 says, "If the combining characters can interact
> typographically—for example, U+0304 combining macron and U+0308
> combining diaeresis—then the order of graphic display is
> determined by the order of coded characters (see Table 2-5).
> By default, the diacritics or other combining characters are
> positioned from the base character’s glyph outward".
>

The phrase "If the combining characters can interact typographically" is
better read as "If the combining characters have the same non-zero
combining class, or any one of them has a zero combining class".

Effectively the combining classes were historically intended to track these
possible graphic interactions, in order to allow or disable reordering and
detect canonical equivalences.

But now normalization is everywhere, and any pair that does not meet the
condition above is freely reordered (or decomposed and recomposed),
meaning that the encoding order of such a pair is NOT significant at all.
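
As a rough illustration of that condition, here is a small Python sketch
using the standard unicodedata module. The helper name order_is_significant
is mine, not anything from TUS, and it encodes only the reading proposed
above: a pair keeps its encoded order when the two classes are equal or one
of them is zero.

```python
import unicodedata

MACRON, DIAERESIS = "\u0304", "\u0308"  # both have combining class 230
CEDILLA, ACUTE = "\u0327", "\u0301"     # combining classes 202 and 230


def order_is_significant(a: str, b: str) -> bool:
    """Hypothetical helper: True if normalization keeps a and b in encoded order."""
    ca, cb = unicodedata.combining(a), unicodedata.combining(b)
    return ca == 0 or cb == 0 or ca == cb


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# Same class: both orders survive normalization and stay distinct.
same_class_kept = nfd("a" + MACRON + DIAERESIS) != nfd("a" + DIAERESIS + MACRON)

# Different non-zero classes: both orders collapse to one canonical order.
different_class_merged = nfd("a" + ACUTE + CEDILLA) == nfd("a" + CEDILLA + ACUTE)

print(order_is_significant(MACRON, DIAERESIS), same_class_kept)
print(order_is_significant(CEDILLA, ACUTE), different_class_merged)
```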

But it turned out that some diacritics may be positioned differently
according to their base character. Take the cedilla, which attaches below:
it is not supposed to interact with combining characters that attach
above, so reordering to a canonical equivalent is permitted, and in fact
happens automatically when documents are encoded or decoded. With some
Latin letters, however, these interactions do occur. The only way to block
the reordering then (if you don't want the positions inferred from the
encoding order of normalized strings) is to insert a joiner with a zero
combining class (CGJ) between the marks.
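
A minimal sketch of the blocking effect, again in Python with unicodedata.
The base letter "c" and the acute accent are only my choice for
illustration; the point is that U+034F COMBINING GRAPHEME JOINER has
combining class 0, so canonical reordering cannot move a mark across it.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0
ACUTE, CEDILLA = "\u0301", "\u0327"


def codepoints(s: str) -> list[str]:
    """Render a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(ch):04X}" for ch in s]


plain = "c" + ACUTE + CEDILLA          # acute encoded before cedilla
guarded = "c" + ACUTE + CGJ + CEDILLA  # same, with CGJ between the marks

plain_nfd = unicodedata.normalize("NFD", plain)
guarded_nfd = unicodedata.normalize("NFD", guarded)

print(codepoints(plain_nfd))    # the cedilla is moved in front of the acute
print(codepoints(guarded_nfd))  # the encoded order is preserved
```

The cost is an extra (invisible) character in the text, which searching
and comparison code then has to tolerate.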

This sentence in TUS should have been updated long ago, because TUS does
not really know how characters will be positioned, and Unicode permits
reordering of pairs of diacritics if they do not block each other for
normalization.

This is important for the cedilla, but even more important for Hebrew
diacritics, whose combining classes do not correctly track their relative
positioning (as discussed on this list years ago, and known as the
"Hebrew points bug"). This will never change: the combining classes are
assigned permanently. They continue to work for simple cases, but cause
problems with some pairs, which need a CGJ inserted.
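
To make the Hebrew case concrete, here is a sketch with one pair of points
that is often cited in this context: patah (U+05B7, class 17) followed by
hiriq (U+05B4, class 14) on a lamed. The choice of this particular pair is
my assumption for illustration; the classes themselves come from
unicodedata.

```python
import unicodedata

LAMED = "\u05DC"
PATAH = "\u05B7"   # combining class 17
HIRIQ = "\u05B4"   # combining class 14
CGJ = "\u034F"     # combining class 0

intended = LAMED + PATAH + HIRIQ           # patah deliberately first
protected = LAMED + PATAH + CGJ + HIRIQ    # CGJ keeps that order

intended_nfc = unicodedata.normalize("NFC", intended)
protected_nfc = unicodedata.normalize("NFC", protected)

# Without CGJ the lower class (hiriq) is sorted in front of patah.
print(intended_nfc == LAMED + HIRIQ + PATAH)
# With CGJ the encoded order survives normalization.
print(protected_nfc == protected)
```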

This is also important for several Indic scripts that have complex
positioning rules if you use combining characters with non-zero combining
classes (initially intended for simple cases in Latin/Greek/Cyrillic).
Thankfully, the most critical diacritics in Indic scripts for such complex
cases have a combining class set to zero (meaning that they block each
other and their relative encoding order is not affected by normalization),
but there are many cases where CGJ is needed.
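
For a quick look at how the classes are distributed, here is a sketch that
prints the combining class of a few Devanagari marks. Devanagari is just
my pick as a sample script; other Indic scripts would need to be checked
separately.

```python
import unicodedata

# A handful of Devanagari marks, chosen only as examples.
marks = {
    "VOWEL SIGN AA": "\u093E",
    "VOWEL SIGN U": "\u0941",
    "ANUSVARA": "\u0902",
    "NUKTA": "\u093C",
    "VIRAMA": "\u094D",
}

classes = {name: unicodedata.combining(ch) for name, ch in marks.items()}

for name, ccc in classes.items():
    kind = "blocks reordering" if ccc == 0 else "can be reordered"
    print(f"DEVANAGARI {name}: ccc={ccc} ({kind})")
```

The vowel signs and the anusvara come out with class 0, while the nukta
and the virama carry non-zero classes and so take part in canonical
reordering.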