Non-primary Weights of U+FFFE

Richard Wordingham richard.wordingham at ntlworld.com
Fri Apr 4 14:36:42 CDT 2014


On Thu, 3 Apr 2014 21:17:10 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

> > Now, in full strength
> > comparisons, we have, whatever the alternate setting,
> >
> > "op" < "ôp"
> > "o p" < "op"
> >
> > Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for
> > alternate=non-ignorable. However, if the quaternary level weight of
> > \uFFFE was calculated by the Unicode Collation Algorithm using
> > allkeys_CLDR.txt as its collation element table, we would have
> >
> > "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=non-ignorable

Sorry, I meant to write
"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=shifted

> > To get the same ordering for these strings as for
> > alternate=non-ignorable, one needs U+FFFE to have a minimal
> > quaternary weight.  I don't see a test for this in
> > CollationTest_CLDR_SHIFTED.txt.

The problem here is that the collation test is passed whether one uses
the UCA or the CLDR collation algorithm, whereas these currently define
different orders for these three strings with alternate=shifted.  
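
To make the three orderings concrete, here is a minimal Python sketch of
the variable-weighting step over a toy collation element table.  The
weights are hypothetical (only their relative order matters), the table
has no ignorables, so the "ignorable after a variable" rule is omitted,
and the names sort_key, order and low_as_l1 are mine, not from the UCA
or CLDR.

```python
# Toy table: char -> (primary, secondary, tertiary, is_variable).
# All weights are hypothetical; U+FFFE is non-variable but sits below
# the variable range, as in the CLDR root table.
TABLE = {
    "\uFFFE": (0x0001, 0x20, 0x02, False),
    " ":      (0x0209, 0x20, 0x02, True),
    "o":      (0x1000, 0x20, 0x02, False),
    "p":      (0x1001, 0x20, 0x02, False),
}
MIN_VARIABLE = min(p for p, _, _, var in TABLE.values() if var)


def sort_key(s, alternate, low_as_l1=False):
    """Return a 4-level key; zero weights are dropped from each level."""
    levels = ([], [], [], [])
    for ch in s:
        p, sec, ter, variable = TABLE[ch]
        if alternate == "shifted":
            if variable:
                # Variable: ignorable at L1-L3, old primary moves to L4.
                p, sec, ter, q = 0, 0, 0, p
            elif low_as_l1 and p < MIN_VARIABLE:
                # Suggested change: L4 'as L1' below the variables.
                q = p
            else:
                # UCA as written: every other non-ignorable gets FFFF.
                q = 0xFFFF
        else:
            q = 0  # non-ignorable: no quaternary weights in this sketch
        for level, weight in zip(levels, (p, sec, ter, q)):
            if weight:
                level.append(weight)
    return tuple(tuple(level) for level in levels)


def order(strings, alternate, low_as_l1=False):
    return sorted(strings, key=lambda s: sort_key(s, alternate, low_as_l1))


STRINGS = ["o \uFFFEp", "o\uFFFE p", "o\uFFFEp"]
non_ignorable = order(STRINGS, "non-ignorable")
shifted_uca = order(STRINGS, "shifted")
shifted_low = order(STRINGS, "shifted", low_as_l1=True)
```

With these toy weights, shifted_uca differs from non_ignorable, while
shifted_low (U+FFFE given a minimal quaternary weight) reproduces the
non-ignorable order.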

> > It seems that the UCA should be adjusted (in Section 3.6, variable
> > weighting) so that the L4 weight of a collation element whose L1
> > weight is non-variable but less than the variable weights is 'as
> > L1', rather than FFFF.  If I formally report this, should it be via
> > a CLDR ticket or through the general Unicode mechanism?
 
> I am not sure what you mean. The special mapping and behavior exist
> in CLDR but not in the UCA, so none of this applies to UTS #10.

Non-variable primary weights less than variable primary weights are
permitted by the UCA, and allkeys_CLDR.txt contains them.  It so happens
that there aren't any such weights in *DUCET* - just as there aren't any
tertiary collation elements.

Returning to the LDML specification, Markus pointed out that in the
account of U+FFFE,
> "all levels" includes quaternary and identical.

The concept of a collation element does not really apply at the
identical level - its formation does not respect the division of a
string into collating elements. For example, <U+0443 CYRILLIC SMALL
LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE OVERLAY>
has collating elements <U+0443, U+0308> and <U+0334>, but the identical
level contribution to the sort key is 0443, 0334, 0308.  Now the
concept of U+FFFE requires that at the 'identical' level,
"a\u0000\uFFFE" sort after "a\uFFFE".  At its simplest, this requires
that U+FFFE be transformed to a negative scalar value!
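
Both points can be checked with Python's standard unicodedata module.
This is only a sketch: it assumes the identical level is the sequence of
scalar values of the NFD form of the string, and identical_level is my
own helper name.

```python
import unicodedata

# U+0334 has canonical combining class 1 and U+0308 has 230, so NFD
# moves the overlay in front of the diaeresis.
s = "\u0443\u0308\u0334"
ccc = [unicodedata.combining(c) for c in s]
nfd_hex = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", s)]


def identical_level(text):
    """Scalar values of the NFD form (assumed identical-level key)."""
    return [ord(c) for c in unicodedata.normalize("NFD", text)]


# Plain scalar-value order puts U+0000 before U+FFFE, so the string with
# the extra U+0000 sorts first, the opposite of what U+FFFE needs.
nul_first = identical_level("a\u0000\uFFFE") < identical_level("a\uFFFE")
```

Here nfd_hex comes out as 0443, 0334, 0308, and nul_first is true, which
is why U+FFFE would need a value below that of U+0000.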

Now, as I understand it, the identical level is not intended to address
any cultural concepts of ordering, but simply as a convenience in
handling inequivalent strings, so that (a) distinct strings need not
compare as equal, and (b) canonically equivalent strings are ordered
together.  However, there are cases where changing the ordering of
indecomposable codepoints might have benefits - non-spacing Hebrew
accents (all ignorable) and kashida (U+0640 ARABIC TATWEEL) come to
mind.  The simplest mechanism I can see is for the UCA to allow a
tailoring to permute scalar values for the purposes of the identical
level.  Thus, for CLDR root, we would have the permutation (U+0000 ..
U+FFFE), and for CLDR we would require that U+FFFE be permuted to
U+0000.  (For collation, a permutation of all scalar values is
equivalent to a permutation of all indecomposable scalar values, and
allowing a formal permutation of all scalar values is simpler.)  It is
not necessary for CLDR to support any other permutations - it has no
mechanisms for tailoring casing for collation and only limited
mechanisms for creating extra levels.
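
As a rough illustration of that permutation, the following Python sketch
applies the cycle (U+0000 .. U+FFFE) to the scalar values of the NFD
form.  The function names are hypothetical, and treating the identical
level as plain NFD scalar values is an assumption of the sketch.

```python
import unicodedata


def permute(cp):
    """Cycle (U+0000 .. U+FFFE): U+FFFE -> U+0000, others move up one."""
    if cp == 0xFFFE:
        return 0x0000
    if cp < 0xFFFE:
        # Skip the surrogate block so the result is still a scalar value.
        return 0xE000 if cp == 0xD7FF else cp + 1
    return cp  # U+FFFF and everything above are left alone


def identical_key(text):
    """Identical-level key using the permuted scalar values."""
    return [permute(ord(c)) for c in unicodedata.normalize("NFD", text)]


# With the permutation, "a\uFFFE" now sorts before "a\u0000\uFFFE".
fffe_first = identical_key("a\uFFFE") < identical_key("a\u0000\uFFFE")
```

Only U+FFFE really has to move; shifting the values below it up by one is
just what keeps the mapping a permutation of the scalar values.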

Richard.


