Collation Grapheme Clusters and Canonical Equivalence
Richard Wordingham via Unicode
unicode at unicode.org
Fri Oct 18 06:21:20 CDT 2019
On Thu, 17 Oct 2019 23:11:55 +0100
Richard Wordingham via Unicode <unicode at unicode.org> wrote:
> There seems to be a Unicode non-compliance (C6) issue in the
> definition of collation grapheme clusters (defined in UTS#10 Section
> 9.9). Using the DUCET collation, the canonically equivalent strings
> รู้ <U+0E23 THAI CHARACTER RO RUA, U+0E39 THAI CHARACTER SARA UU,
> U+0E49 THAI CHARACTER MAI THO> and รู้ <U+0E23, U+0E49, U+0E39>
> decompose into collation grapheme clusters in two different ways.
> The first decomposes into <U+0E23> and <U+0E39, U+0E49> and the
> second decomposes into <U+0E23, U+0E49> and <U+0E39>.
One has to take the collating elements in NFD order, so in the second
string the tone mark (secondary weight) and the vowel (primary weight)
also form a single cluster, and the division into clusters is
<U+0E23>, <U+0E49, U+0E39>. This split respects canonical equivalence.
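
To make the point concrete, here is a rough Python sketch. It is not the
UTS#10 algorithm: PRIMARY is a hand-written assumption standing in for
"has a non-zero primary weight in DUCET", and naive_clusters is a
hypothetical helper that simply starts a cluster at each such character
without looking at NFD order. It reproduces the two different splits in
the quoted message, and shows that normalising to NFD first is one way
to make the split agree for both strings.

```python
import unicodedata

# Assumption for illustration only: characters treated as having a non-zero
# primary weight. U+0E49 (the tone mark) is treated as secondary-only.
PRIMARY = {"\u0e23", "\u0e39"}

def naive_clusters(text):
    """Start a new cluster at each primary-weighted character and attach
    every other character to the cluster before it (toy model)."""
    clusters = []
    for ch in text:
        if ch in PRIMARY or not clusters:
            clusters.append(ch)
        else:
            clusters[-1] += ch
    return clusters

def nfd(text):
    """Canonical decomposition with canonical reordering."""
    return unicodedata.normalize("NFD", text)

def show(clusters):
    """Render clusters as lists of code points for printing."""
    return [" ".join(f"U+{ord(c):04X}" for c in cl) for cl in clusters]

first = "\u0e23\u0e39\u0e49"   # RO RUA, SARA UU, MAI THO
second = "\u0e23\u0e49\u0e39"  # RO RUA, MAI THO, SARA UU

print(nfd(first) == nfd(second))          # canonical equivalence check
print(show(naive_clusters(first)))        # split of the first string
print(show(naive_clusters(second)))       # different split of the second
print(show(naive_clusters(nfd(second))))  # same split once normalised
```

The toy splitter only handles contiguous characters, so it says nothing
about how an implementation should report boundaries in the original,
unnormalised string.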
Now, one form of typo one may see in Thai is where the
vowel is typed twice. Thai fonts often lack mark-to-mark positioning
for sequences that should not occur, so the two copies of the vowel may
be overlaid. Proof-reading will not spot the mistake if the font or
layout engine does not assist.
Thus we can get <U+0E23, U+0E39, U+0E39, U+0E49> (417,000 raw Google
hits, the first 10 all good). That splits into *three* collation
grapheme clusters: <U+0E23>, <U+0E39> and <U+0E39, U+0E49>. Its
canonical equivalent <U+0E23, U+0E49, U+0E39, U+0E39> splits into only
two collation grapheme clusters, because forming a sequence of
collating elements without skipping, starting at the U+0E49, requires
taking all three characters. So for that string we end up with *two*
collation grapheme clusters, <U+0E23> and <U+0E49, U+0E39, U+0E39>.
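
The canonical equivalence of the two doubled-vowel strings can be
checked with Python's standard unicodedata module. This is only a
sketch: canonically_equivalent is a hypothetical helper name, and the
check says nothing about how any collation implementation draws cluster
boundaries.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

vowel_first = "\u0e23\u0e39\u0e39\u0e49"  # RO RUA, SARA UU, SARA UU, MAI THO
tone_first = "\u0e23\u0e49\u0e39\u0e39"   # RO RUA, MAI THO, SARA UU, SARA UU

# Canonical combining classes decide the NFD order of the marks: a mark
# with a lower non-zero class is sorted before one with a higher class.
for ch in "\u0e23\u0e39\u0e49":
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

print(canonically_equivalent(vowel_first, tone_first))

# The tone-first spelling is reordered to the vowel-first spelling by NFD.
print(unicodedata.normalize("NFD", tone_first) == vowel_first)
```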
> Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
> requirement, an implementation shall provide for collation grapheme
> clusters matches based on a locale's collation order", requires
> canonically equivalent sequences to be interpreted differently.