Minimal Implementation of Unicode Collation Algorithm

Richard Wordingham via Unicode unicode at unicode.org
Mon Dec 4 19:02:22 CST 2017


On Mon, 4 Dec 2017 12:48:11 -0800
Markus Scherer via Unicode <unicode at unicode.org> wrote:

> On Mon, Dec 4, 2017 at 5:30 AM, Richard Wordingham via Unicode <
> unicode at unicode.org> wrote:  

> > Would an implementation that supported no characters be compliant?
 
> I guess so. I assume that would mean that the CET maps nothing, and
> that the implementation does implement the implicit weighting of Han
> characters and unassigned (here: unmapped) code points. It would also
> have to do NFD first.

I am extrapolating from the comment on UTS10-C1 in UTS#10, "In
particular,  a conformant implementation must be able to compare any
two canonical-equivalent strings as being equal, for all Unicode
characters supported by that implementation."  There is now nothing
that forces the implementation to support any Unicode characters!

Possibly this results from an attempt to allow an implementation to
conform to Version x.y.z of the UCA while supporting normalisation
for some other set of characters, or while choosing not to support
characters with non-zero canonical combining class, which, while not
eliminating the need to address canonical equivalence, goes a long way
towards doing so.

I am not aware of any general requirement that a CET be a tailoring of
DUCET or of the CLDR root collation, so the implicit weights would be
irrelevant in this case.  The implicit weights are part of DUCET.
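
For reference, the implicit weights in question can be sketched as
below.  This is only the generic case for unassigned code points,
using the base FBC0 and the formula given in UTS#10; the different
bases used for Han characters are omitted, and the function name is
hypothetical.

```python
def implicit_ces(cp, base=0xFBC0):
    # Sketch of the UTS#10 implicit weight derivation for one code point.
    # base=0xFBC0 is the value for unassigned code points; the Han
    # ranges use other bases, which this sketch does not cover.
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    # Two collation elements: [.AAAA.0020.0002][.BBBB.0000.0000]
    return [(aaaa, 0x0020, 0x0002), (bbbb, 0x0000, 0x0000)]
```

A CET that is not derived from DUCET is free to assign weights some
other way, which is why this derivation would not be binding on it.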

If no characters are supported, performing NFD will be a rather obvious
trivial transformation of the null string to itself.

> 
> > It used to be that for an implementation to be claimed as compliant,
> > it also had to pass a specific conformance test.  This requirement has
> > now been abandoned, perhaps because the Default Unicode Collation
> > Element Table (DUCET) is incompatible with the CLDR Collation
> > Algorithm.
> 
> The DUCET is missing some things that are needed by the CLDR Collation
> Algorithm, but that has nothing to do with UCA compliance.

An implementation that only implements the CLDR collation algorithm
cannot be tailored to support DUCET, because DUCET (at Version 10.0.0)
has the ordering U+FFF8 < U+FFFE < U+1004E, which is incompatible with
UTS#35 Part 5 Section 1.1.1 - "U+FFFE maps to a CE with a minimal,
unique primary weight".

Therefore one could only apply the published UCA conformance test if it
deliberately avoided strings containing U+FFFE.
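
A minimal sketch of such a filter follows.  It assumes each test line
is a sequence of space-separated hexadecimal code points, optionally
followed by ';' or '#' and a comment; that format is an assumption
here and should be checked against the actual test files.  The
function names are hypothetical.

```python
def parse_test_line(line):
    # Return the string encoded by one test line, or None if the line
    # is blank or holds only a comment.
    data = line.split("#", 1)[0].split(";", 1)[0].strip()
    if not data:
        return None
    return "".join(chr(int(cp, 16)) for cp in data.split())

def lines_without_fffe(lines):
    # Yield only the test strings that do not contain U+FFFE.
    for line in lines:
        s = parse_test_line(line)
        if s is not None and "\ufffe" not in s:
            yield s
```

The remaining strings could then be checked for non-descending order
in the usual way, with the U+FFFE cases simply left untested.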

> The simple fact is that tailorings are common, and it has to be
> possible to conform to the algorithm without forbidding tailorings.

It's the CLDR collation algorithm that prohibits DUCET.  Thankfully, the
CLDR root collation can be interpreted to be compatible with the UCA.
(Tailorings may be incompatible, or at least, incompatible with the
concept of a finite CET.) 

Richard.

