UCA question / Produce Collation Element Arrays

Kip Cole via CLDR-Users cldr-users at unicode.org
Sat Dec 2 06:32:55 CST 2017


Markus and co, probably another dumb question, but I’m making progress. In section 7.2 of TR10, the algorithm for producing a CE array says:

> S2.1 Find the longest initial substring S at each point that has a match in the collation element table.
> 
> S2.1.1 If there are any non-starters following S, process each non-starter C.
> 
> S2.1.2 If C is an unblocked non-starter with respect to S, find if S + C has a match in the collation element table.
> 
> Note: This condition is specific to non-starters, and is not precisely the same as the concept of blocking in normalization, since it is dealing with look ahead for a discontiguous match, rather than with normalization forms. Hangul jamos and other starters are only supported with contiguous matches.
> 
> S2.1.3 If there is a match, replace S by S + C, and remove C. 
> 

For S2.1.1 I’m trying to confirm what “process each non-starter C” means. As best I understand it so far, it means “ignore” or “skip” every C that is a non-starter. Is that the correct interpretation? It would seem to be consistent with the annotation:

> Steps 2.1.1 “process each non-starter C” and 2.1.2 “find if S + C has a match in the table”, where one or more intermediate non-starters may be skipped (making it discontiguous), extends a contraction match by one code point at a time to find the next match. In particular, if C is a non-starter and if the table had a mapping for ABC but not one for AB, then a discontiguous-contraction match on text ABMC (with M being a skippable non-starter) would never be found. Well-formedness condition 5 requires the presence of the prefix contraction AB.
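
To make the question concrete, here is a minimal Python sketch of one possible reading of S2.1 through S2.1.3, in which "process" means "try each following non-starter in turn" rather than "skip them all". It is an illustration under stated assumptions, not the reference algorithm: the names segment and longest_contiguous are hypothetical, the collation element table is reduced to a plain set of keys (no weights), the input is assumed to be in NFD already, and a character with no table entry simply falls back to itself. The blocking test follows the UTS #10 definition (a character B between S and C blocks C if ccc(B) = 0 or ccc(B) >= ccc(C)). In the demo, A = "a", B = "b", M = U+0323 (ccc 220) and C = U+0301 (ccc 230).

```python
import unicodedata


def ccc(ch):
    """Canonical combining class; 0 means the character is a starter."""
    return unicodedata.combining(ch)


def longest_contiguous(chars, i, keys):
    """S2.1: longest substring starting at i that is a key in the table."""
    for end in range(len(chars), i, -1):
        s = "".join(chars[i:end])
        if s in keys:
            return s, end
    # Assumption: no entry at all, fall back to the single character.
    return chars[i], i + 1


def segment(text, keys):
    """Split text (assumed NFD) into the table keys that would be matched."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        s, j = longest_contiguous(chars, i, keys)
        max_skipped = 0  # highest ccc among non-starters skipped so far
        k = j
        # S2.1.1: look at each non-starter following S; stop at a starter.
        while k < len(chars) and ccc(chars[k]) != 0:
            c = chars[k]
            # S2.1.2: C is unblocked if no skipped character has ccc >= ccc(C).
            if max_skipped < ccc(c) and (s + c) in keys:
                # S2.1.3: extend S and remove C from the text.
                s += c
                del chars[k]
            else:
                max_skipped = max(max_skipped, ccc(c))
                k += 1
        out.append(s)
        i = j  # skipped non-starters are processed next, on their own
    return out


M, C = "\u0323", "\u0301"  # dot below (ccc 220), acute (ccc 230)
with_prefix = {"a", "b", "ab", "ab" + C, M, C}
without_prefix = {"a", "b", "ab" + C, M, C}

print(segment("ab" + M + C, with_prefix) == ["ab" + C, M])        # True
print(segment("ab" + M + C, without_prefix) == ["a", "b", M, C])  # True
```

With the prefix "ab" in the table, the text ABMC yields the discontiguous contraction ABC followed by M. Without it, S never grows beyond "a", so ABC is never found, which matches the situation the annotation describes.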

