Non-primary Weights of U+FFFE

Markus Scherer markus.icu at gmail.com
Fri Apr 4 18:55:42 CDT 2014


On Fri, Apr 4, 2014 at 12:36 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Non-variable primary weights less than variable primary weights exist
> in the UCA, and are established by allkeys_CLDR.txt.


Only for U+FFFE.

> Returning to the LDML specification, Markus pointed out that in the
> account of U+FFFE,
> > "all levels" includes quaternary and identical.
>
> The concept of a collation element does not really apply at the
> identical level - its formation does not respect the division of a
> string into collating elements. For example,  <U+0443 CYRILLIC SMALL
> LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE> has
> collating elements <U+0443, U+0334> and <U+0308>, but the identical
> level contribution to the sort key is 0443, 0308, 0334.  Now the
> concept of U+FFFE requires that at the 'identical' level,
> "a\u0000\uFFFE" sort after "a\uFFFE".


Right. With ICU 53:

<1 a\uFFFE
    29 02 , 05 02 , 05 02 , 02 , 92 02 .
<i a\u0000\uFFFE
    29 02 , 05 02 , 05 02 , 02 , 92 31 02 .
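
To see where these two keys diverge, the dumps can be compared byte by
byte. This is only a sketch, not ICU code: it assumes that in the dump
notation "," stands for the level separator byte 01 and "." for the
terminator byte 00. The helper names parse_key_dump and levels are made up
for illustration.

```python
def parse_key_dump(dump: str) -> bytes:
    """Turn a dump like '29 02 , 05 02 .' into raw sort key bytes.

    Assumption: ',' is the level separator (01), '.' is the terminator (00).
    """
    out = bytearray()
    for token in dump.split():
        if token == ",":
            out.append(0x01)
        elif token == ".":
            out.append(0x00)
        else:
            out.append(int(token, 16))
    return bytes(out)


def levels(key: bytes) -> list:
    """Split a sort key into its per-level byte strings."""
    return key.rstrip(b"\x00").split(b"\x01")


# Dumps copied from the ICU 53 output above.
key_fffe = parse_key_dump("29 02 , 05 02 , 05 02 , 02 , 92 02 .")
key_nul_fffe = parse_key_dump("29 02 , 05 02 , 05 02 , 02 , 92 31 02 .")

# The first four levels are identical; only the last (identical) level differs.
same_upper_levels = levels(key_fffe)[:4] == levels(key_nul_fffe)[:4]

# Plain binary comparison of the full keys decides the order.
fffe_sorts_first = key_fffe < key_nul_fffe
```

Under that reading of the notation, the keys agree through the quaternary
level, and the identical level decides the order because byte 02 (for
U+FFFE) is lower than byte 31.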

> At its simplest, this requires
> that U+FFFE be transformed to a negative scalar value!
>

That depends on how you encode the identical level. In the UCA as written,
you could do a transformation like this:
FFFE->0000
0000->0001 0001
0001->0001 0002
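
As a rough check of this mapping, here is a minimal sketch that applies it
to a string and compares the resulting sequences. It assumes the input is
already normalized and that every other code point maps to itself; the
function name identical_level is made up for illustration and is not part
of the UCA or ICU.

```python
from typing import List


def identical_level(s: str) -> List[int]:
    """Map a string to identical-level weights using the transformation above.

    Assumptions: the input is already normalized, and all code points other
    than U+FFFE, U+0000 and U+0001 map to themselves.
    """
    out: List[int] = []
    for ch in s:
        cp = ord(ch)
        if cp == 0xFFFE:
            out.append(0x0000)             # FFFE -> 0000 (lowest weight)
        elif cp == 0x0000:
            out.extend((0x0001, 0x0001))   # 0000 -> 0001 0001
        elif cp == 0x0001:
            out.extend((0x0001, 0x0002))   # 0001 -> 0001 0002
        else:
            out.append(cp)
    return out


short = identical_level("a\uFFFE")        # [0x61, 0x0000]
long_ = identical_level("a\u0000\uFFFE")  # [0x61, 0x0001, 0x0001, 0x0000]

# List comparison is lexicographic, like comparing weight sequences.
fffe_first = short < long_
```

With this mapping no negative value is needed: U+FFFE takes the lowest
weight, and U+0000 and U+0001 are escaped so that they still sort below
U+0002.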

In ICU, we use a simple "compression" scheme (a delta encoding) that
preserves binary order, and we reserved the byte values 00 (terminator), 01
(level separator), and 02 (for U+FFFE).

> Now, as I understand it, the identical level is not intended to address
> any cultural concepts of ordering, but simply as a convenience in
> handling inequivalent strings, so that (a) distinct strings need not
> compare as equal, and (b) canonically equivalent strings are ordered
> together.


Yes. It's mostly a semi-arbitrary tie-breaker, except that in the CLDR
Japanese tailoring it provides the distinctions of JIS X 4061 level 5
(compatibility forms of Japanese characters sort after their regular forms).

markus

More information about the CLDR-Users mailing list