Collation Grapheme Clusters and Canonical Equivalence

Thu Oct 17 17:11:55 CDT 2019

There seems to be a Unicode non-compliance issue (conformance
requirement C6) in the definition of collation grapheme clusters
(UTS#10 Section 9.9).  Under the DUCET collation, the canonically
equivalent strings รู้ <U+0E23 THAI CHARACTER RO RUA, U+0E39 THAI
CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and รู้ <U+0E23,
U+0E49, U+0E39> split into collation grapheme clusters in two
different ways.  The first splits into <U+0E23> and <U+0E39, U+0E49>;
the second splits into <U+0E23, U+0E49> and <U+0E39>.
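
The equivalence and the two splits can be reproduced with a short
Python sketch.  The canonical-equivalence check uses the standard
unicodedata module.  The segmentation part is only a toy: the set
PRIMARY_WEIGHTED and the function toy_clusters are hypothetical
stand-ins for DUCET data, chosen to match the splits described above,
not an implementation of UTS#10.

```python
import unicodedata

first = "\u0E23\u0E39\u0E49"   # RO RUA, SARA UU, MAI THO
second = "\u0E23\u0E49\u0E39"  # RO RUA, MAI THO, SARA UU

# Both strings share one NFD form, so they are canonically equivalent.
equivalent = (unicodedata.normalize("NFD", first)
              == unicodedata.normalize("NFD", second))

# ASSUMPTION: toy data covering only these three characters.  RO RUA and
# SARA UU are treated as primary-weighted; MAI THO as primary-ignorable.
PRIMARY_WEIGHTED = {"\u0E23", "\u0E39"}

def toy_clusters(text):
    """Toy rule: each primary-weighted character starts a new cluster."""
    clusters = []
    for ch in text:
        if ch in PRIMARY_WEIGHTED or not clusters:
            clusters.append(ch)
        else:
            clusters[-1] += ch  # ignorable mark joins the preceding cluster
    return clusters

def show(clusters):
    """Render clusters as lists of U+XXXX code points for readability."""
    return [[f"U+{ord(c):04X}" for c in cl] for cl in clusters]

print(equivalent)
print(show(toy_clusters(first)))
print(show(toy_clusters(second)))
```

Under these assumptions the two canonically equivalent inputs yield
different cluster boundaries, because the segmentation looks only at
the stored order of the code points.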

Thus UTS#18 RL3.2 'Tailored Grapheme Clusters', namely "To meet this
requirement, an implementation shall provide for collation grapheme
clusters matches based on a locale's collation order", requires
canonically equivalent sequences to be interpreted differently.

Is this a known issue?

Should I report it against UTS#10 or UTS#18?

Is the phrase 'collation order' intended to preclude the use of search
collations?  Search collations allow one to find a collation grapheme
cluster starting with U+0E15 THAI CHARACTER TO TAO in its exemplifying
word เต่า <U+0E40 THAI CHARACTER SARA E, U+0E15, U+0E48 THAI CHARACTER
MAI EK, U+0E32 THAI CHARACTER SARA AA>.  DUCET splits it into <U+0E40,
U+0E15, U+0E48>, <U+0E32>, but most (all?) CLDR search collations split
it into <U+0E40>, <U+0E15, U+0E48>, <U+0E32>, matching the division
into grapheme clusters.
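
The difference between the two splits can be sketched in Python, on
the assumption that the DUCET split comes from a contraction of the
prevowel with the following consonant and that the search collations
lack that contraction.  All tables and names below (PRIMARY_WEIGHTED,
DUCET_LIKE, SEARCH_LIKE, toy_clusters) are hypothetical toy data, not
real collation tables.

```python
# ASSUMPTION: toy data for the four characters of the example word only.
# SARA E, TO TAO and SARA AA are treated as primary-weighted;
# MAI EK (U+0E48) as primary-ignorable.
PRIMARY_WEIGHTED = {"\u0E40", "\u0E15", "\u0E32"}

DUCET_LIKE = {"\u0E40\u0E15"}  # prevowel + consonant as one mapping unit
SEARCH_LIKE = set()            # no such contraction

def toy_clusters(text, contractions):
    """Toy rule: a contraction or a primary-weighted character starts a
    new cluster; anything else joins the preceding cluster."""
    clusters = []
    i = 0
    while i < len(text):
        # Prefer the longest contraction that matches at position i.
        unit = next((c for c in sorted(contractions, key=len, reverse=True)
                     if text.startswith(c, i)), None)
        if unit is not None:
            starts_cluster = True
        else:
            unit = text[i]
            starts_cluster = unit in PRIMARY_WEIGHTED or not clusters
        if starts_cluster:
            clusters.append(unit)
        else:
            clusters[-1] += unit
        i += len(unit)
    return clusters

word = "\u0E40\u0E15\u0E48\u0E32"  # SARA E, TO TAO, MAI EK, SARA AA

ducet_split = toy_clusters(word, DUCET_LIKE)
search_split = toy_clusters(word, SEARCH_LIKE)
print([len(c) for c in ducet_split])   # cluster lengths in code points
print([len(c) for c in search_split])
```

Only in the second split does a cluster begin at U+0E15, which is what
a search for TO TAO needs.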

If we accept that in the Latin script Vietnamese tone marks have
primary weights (this only shows up with strings more than one
syllable long), I can produce more egregious examples based on the
various sequences canonically equivalent to U+1EAD LATIN SMALL LETTER A
WITH CIRCUMFLEX AND DOT BELOW or to U+1EDB LATIN SMALL LETTER O WITH
HORN AND ACUTE.
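
For reference, the sequences canonically equivalent to each of these
two characters can be enumerated and checked with Python's unicodedata
module.  This is a sketch; the names A_VARIANTS, O_VARIANTS and
nfd_forms are illustrative only, and it verifies the equivalence, not
any collation behaviour.

```python
import unicodedata

# Spellings of U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW.
A_VARIANTS = [
    "\u1EAD",         # fully precomposed
    "\u1EA1\u0302",   # a with dot below + combining circumflex
    "\u00E2\u0323",   # a with circumflex + combining dot below
    "a\u0323\u0302",  # a + dot below + circumflex (canonical order)
    "a\u0302\u0323",  # a + circumflex + dot below
]

# Spellings of U+1EDB LATIN SMALL LETTER O WITH HORN AND ACUTE.
O_VARIANTS = [
    "\u1EDB",         # fully precomposed
    "\u01A1\u0301",   # o with horn + combining acute
    "\u00F3\u031B",   # o with acute + combining horn
    "o\u031B\u0301",  # o + horn + acute (canonical order)
    "o\u0301\u031B",  # o + acute + horn
]

def nfd_forms(variants):
    """Return the set of distinct NFD forms among the given strings."""
    return {unicodedata.normalize("NFD", v) for v in variants}

a_forms = nfd_forms(A_VARIANTS)
o_forms = nfd_forms(O_VARIANTS)

# One NFD form per list means every spelling is canonically equivalent.
print(len(a_forms), len(o_forms))
```

Each list collapses to a single NFD form, so a C6-conformant process
must not assume that the five spellings in a list are interpreted
differently.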

The root of the problem is the desire to match only contiguous
substrings.  This does not play nicely with canonical equivalence,
because canonical reordering can move a mark from one side of a
cluster boundary to the other.

Richard.