Non-primary Weights of U+FFFE
Markus Scherer
markus.icu at gmail.com
Fri Apr 4 18:55:42 CDT 2014
On Fri, Apr 4, 2014 at 12:36 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:
> Non-variable primary weights less than variable primary weights exist
> in the UCA, and are established by allkeys_CLDR.txt.
Only for U+FFFE.
> Returning to the LDML specification, Markus pointed out that in the
> account of U+FFFE,
> > "all levels" includes quaternary and identical.
>
> The concept of a collation element does not really apply at the
> identical level - its formation does not respect the division of a
> string into collating elements. For example, <U+0443 CYRILLIC SMALL
> LETTER U, U+0308 COMBINING DIAERESIS, U+0334 COMBINING TILDE OVERLAY> has
> collating elements <U+0443, U+0334> and <U+0308>, but the identical
> level contribution to the sort key is 0443, 0308, 0334. Now the
> concept of U+FFFE requires that at the 'identical' level,
> "a\u0000\uFFFE" sort after "a\uFFFE".
Right. With ICU 53:
<1 a\uFFFE
29 02 , 05 02 , 05 02 , 02 , 92 02 .
<i a\u0000\uFFFE
29 02 , 05 02 , 05 02 , 02 , 92 31 02 .
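To make the comparison explicit, here is a minimal Python sketch that splits the two dumps above into levels and reports the first level at which they differ. It assumes that "," in the dump marks a level boundary and "." marks the end of the key; the helper names (parse_levels, first_difference) are made up for this illustration and are not ICU API.

```python
LEVELS = ["primary", "secondary", "tertiary", "quaternary", "identical"]

# The two ICU 53 sort key dumps quoted above.
KEY_A_FFFE = "29 02 , 05 02 , 05 02 , 02 , 92 02 ."
KEY_A_NUL_FFFE = "29 02 , 05 02 , 05 02 , 02 , 92 31 02 ."

def parse_levels(dump):
    """Split a dump into one bytes object per level (assumed ',' / '.' notation)."""
    body = dump.strip().rstrip(".")
    return [bytes(int(b, 16) for b in level.split()) for level in body.split(",")]

def first_difference(a, b):
    """Return (level name, a_sorts_first) for the first differing level, else None."""
    for name, x, y in zip(LEVELS, parse_levels(a), parse_levels(b)):
        if x != y:
            return name, x < y
    return None

print(first_difference(KEY_A_FFFE, KEY_A_NUL_FFFE))
```

The first four levels are byte-for-byte equal, and at the identical level 02 compares below 31, which is why "a\uFFFE" sorts first.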
> At its simplest, this requires
> that U+FFFE be transformed to a negative scalar value!
>
That depends on how you encode the identical level. In the UCA as written,
you could do a transformation like this:
FFFE -> 0000
0000 -> 0001 0001
0001 -> 0001 0002
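As a rough sketch of that transformation (not how ICU does it), the three mappings can be applied to the code points of the string and the resulting sequences compared directly. Normalizing to NFD first is my assumption about how the identical level is formed, and the names ESCAPES and identical_level are invented for this example.

```python
import unicodedata

# The three mappings listed above; every other code point maps to itself.
ESCAPES = {
    0xFFFE: [0x0000],          # U+FFFE becomes the lowest value
    0x0000: [0x0001, 0x0001],  # U+0000 is escaped so it stays above U+FFFE
    0x0001: [0x0001, 0x0002],  # U+0001 is escaped so it stays above U+0000
}

def identical_level(s):
    """Return the transformed code point sequence for the identical level."""
    out = []
    for ch in unicodedata.normalize("NFD", s):  # NFD is an assumption here
        cp = ord(ch)
        out.extend(ESCAPES.get(cp, [cp]))
    return out

a = identical_level("a\uFFFE")        # [0x61, 0x0000]
b = identical_level("a\u0000\uFFFE")  # [0x61, 0x0001, 0x0001, 0x0000]
print(a < b)
```

With this escaping, "a\uFFFE" compares below "a\u0000\uFFFE" without needing a negative scalar value, and U+0000, U+0001 and U+0002 keep their relative order.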
In ICU, we use a simple "compression" scheme (a delta encoding) that
preserves binary order, and we reserved byte values 00 (terminator), 01
(level separator), 02 (for U+FFFE).
> Now, as I understand it, the identical level is not intended to address
> any cultural concepts of ordering, but simply as a convenience in
> handling inequivalent strings, so that (a) distinct strings need not
> compare as equal, and (b) canonically equivalent strings are ordered
> together.
Yes. It's mostly a semi-arbitrary tie-breaker, except that in the CLDR
Japanese tailoring it provides the distinctions of JIS X 4061 level 5
(compatibility forms of Japanese characters sort after their regular forms).
markus