UCA unnecessary collation weight 0000

Richard Wordingham via Unicode unicode at unicode.org
Fri Nov 2 20:34:58 CDT 2018


On Fri, 2 Nov 2018 14:27:37 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:

> > UTR#10 still does not explicitly state that its use of "0000" does
> > not mean it is a valid "weight", it's a notation only  
> 
> No, it is explicitly a valid weight. And it is explicitly and 
> normatively referred to in the specification of the algorithm. See 
> UTS10-D8 (and subsequent definitions), which explicitly depend on a 
> definition of "A collation weight whose value is zero." The entire 
> statement of what are primary, secondary, tertiary, etc. collation 
> elements depends on that definition. And see the tables in Section
> 3.2, which also depend on those definitions.

The definition is defective in that it doesn't handle 'large weight
values' well.  There is the anomaly that a mapping of a collating element
to [1234.0000.0000][0200.020.002] may be compatible with WF1, but the
exactly equivalent mapping to [1234.020.002][0200.0000.0000] makes the
table ill-formed.  The fractional weight definitions for UCA eliminate
this '0000' notion quite well, and I once expected the UCA to move to
the CLDRCA (CLDR Collation Algorithm) fractional weight definition.
The definition of the CLDRCA does a much better job of explaining
'large weight values'.  It turns them from something exceptional to a
normal part of its functioning.  
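
To make the anomaly concrete, here is a rough Python sketch (mine, not
text from UTS #10).  It assumes a collation element can be modelled as
a (primary, secondary, tertiary) tuple; the function names are
hypothetical.  The check applies the plain WF1 rule (no zero weight at
level N alongside a non-zero weight at level N-1) and deliberately does
not model the Section 9.2 'large weight values' exception.  The sort
key follows the usual scheme of appending the non-zero weights level by
level, with a zero separator between levels.

```python
# Illustrative sketch only: the tuple representation and the function
# names are assumptions, not anything defined by UTS #10 or CLDR.

def violates_naive_wf1(ce):
    """Return True if some level is zero while the previous level is non-zero.

    This is the bare WF1 rule; the Section 9.2 'large weight values'
    exception is intentionally NOT modelled here.
    """
    return any(prev != 0 and cur == 0 for prev, cur in zip(ce, ce[1:]))


def sort_key(ces, levels=3):
    """Build a sort key: per level, append non-zero weights; 0 separates levels."""
    key = []
    for level in range(levels):
        if level:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in ces if ce[level] != 0)
    return key


# The two mappings from the paragraph above.
mapping_a = [(0x1234, 0x0000, 0x0000), (0x0200, 0x020, 0x002)]
mapping_b = [(0x1234, 0x020, 0x002), (0x0200, 0x0000, 0x0000)]

flags_a = [violates_naive_wf1(ce) for ce in mapping_a]
flags_b = [violates_naive_wf1(ce) for ce in mapping_b]
same_key = sort_key(mapping_a) == sort_key(mapping_b)
```

Under these assumptions the two mappings yield identical sort keys, and
each contains exactly one element that the bare rule would reject, so
only the special-case exception tells them apart.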

> > (but the notation is used for TWO distinct purposes: one is for 
> > presenting the notation format used in the DUCET  
> 
> It is *not* just a notation format used in the DUCET -- it is part of 
> the normative definitional structure of the algorithm, which then 
> percolates down into further definitions and rules and the steps of
> the algorithm.

It's not needed for the CLDRCA!  The statement of the UCA does depend
on its notation, but it can be recast to avoid these zero weights.

Richard.

