Non-primary Weights of U+FFFE

Richard Wordingham richard.wordingham at ntlworld.com
Thu Apr 3 16:30:36 CDT 2014


On Sun, 30 Mar 2014 09:17:44 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Sun, Mar 30, 2014 at 5:24 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > Is there any reason that a CLDR-compliant collation algorithm should
> > particularly care about the non-primary weights of U+FFFE?  So long
> > as they satisfy the well-formedness conditions, all I can see is
> > that having unique values *may* simplify sort key formation for
> > reversed levels.
> >
> 
> The non-primary weights need to be greater than the level
> separator(s)

Guaranteed by WF1 and S3.2

> and less than the weights of CEs that are ignorable on
> previous levels.

Guaranteed by WF2 plus case-related rules, even if U+FFFE is not
treated as a special case.

> It is also important to generate the special weights
> on primary to tertiary levels for shifted CEs, so that
> alternate=shifted works properly.

Can you expand on this, because I don't see any such need at the
primary to tertiary levels.

From your comment on ICU below, I can now see that you are specifying
a behaviour for the quaternary level.  Now, in full strength
comparisons, we have, whatever the alternate setting,

"op" < "ôp"
"o p" < "op"

Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable.
However, if the quaternary level weight of \uFFFE was calculated by the
Unicode Collation Algorithm using allkeys_CLDR.txt as its collation
element table, we would have

"o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=shifted

To get the same ordering for these strings as for
alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary
weight.  I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.
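
As a rough illustration of the point above, here is a toy sort key
builder in Python.  It is only a sketch: the weights, the table and the
names sort_key and fffe_l4_as_l1 are invented for this example and are
not taken from allkeys_CLDR.txt or from any implementation.  Only the
relative order of the primaries matters (U+FFFE below the variable
space, which is below the letters).

```python
# Toy collation element table: (primary, secondary, tertiary).
# All weights are hypothetical; only their relative order matters.
FFFE = "\uFFFE"
TABLE = {
    FFFE: (0x0002, 0x20, 0x02),  # non-variable, but below every variable primary
    " ":  (0x0209, 0x20, 0x02),  # variable
    "o":  (0x1000, 0x20, 0x02),
    "p":  (0x1001, 0x20, 0x02),
}
VARIABLE = {" "}


def sort_key(s, alternate, fffe_l4_as_l1=False):
    """Build a four-level sort key as a flat list of integer weights.

    alternate is "non-ignorable" or "shifted".  With fffe_l4_as_l1=True,
    U+FFFE gets its own (minimal) primary as its L4 weight instead of FFFF.
    """
    levels = [[], [], [], []]
    for ch in s:
        primary, secondary, tertiary = TABLE[ch]
        if alternate == "shifted" and ch in VARIABLE:
            # Shifted variable CE: L1-L3 become zero (omitted), L4 = old L1.
            levels[3].append(primary)
            continue
        levels[0].append(primary)
        levels[1].append(secondary)
        levels[2].append(tertiary)
        if alternate == "shifted":
            if ch == FFFE and fffe_l4_as_l1:
                levels[3].append(primary)
            else:
                levels[3].append(0xFFFF)
    key = []
    for level in levels:
        key.extend(level)
        key.append(0)  # level separator, lower than any weight
    return key


def order(strings, **options):
    return sorted(strings, key=lambda s: sort_key(s, **options))


a, b, c = "o\uFFFE p", "o\uFFFEp", "o \uFFFEp"
print(order([a, b, c], alternate="non-ignorable") == [a, b, c])
print(order([a, b, c], alternate="shifted") == [c, a, b])
print(order([a, b, c], alternate="shifted", fffe_l4_as_l1=True) == [a, b, c])
```

In this toy model, giving U+FFFE an L4 weight of FFFF under
alternate=shifted puts "o \uFFFEp" first, while giving it a minimal L4
weight restores the non-ignorable ordering.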

It seems that the UCA should be adjusted (in Section 3.6, variable
weighting) so that the L4 weight of a CE whose L1 weight is
non-variable but less than any variable weight is 'as L1', rather than
FFFF.  If I formally report
this, should it be via a CLDR ticket or through the general Unicode
mechanism?

> In ICU, we have test code that expects the same sort keys generated
> from concatenating two strings with U+FFFE vs. calling
> ucol_mergeSortkeys() on the two separate sort keys. The latter merges
> sort keys by copying each level (separated by byte 01) from each sort
> key and inserting a byte 02 between the bytes from different sort
> keys. (see
> ucol.h<http://www.icu-project.org/apiref/icu4c/ucol_8h.html> )

So is the reason for unique weights at the secondary to tertiary levels
simply that you don't want to have to unpick ICU's run-length
compression for your test?
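
To make sure I have understood the merge you describe, here is a
minimal Python sketch of it.  It is not ICU's code: the function name
merge_sort_keys, the treatment of a missing level as empty, and the
byte values in the example are my assumptions.  It only follows the
description above (levels separated by byte 01, byte 02 inserted
between the bytes from different sort keys) and assumes each sort key
ends with a 00 terminator.

```python
def merge_sort_keys(key1: bytes, key2: bytes) -> bytes:
    """Merge two sort keys level by level, as described in the quoted text."""
    # Drop the trailing 00 terminator, then split into levels at byte 01.
    levels1 = key1.rstrip(b"\x00").split(b"\x01")
    levels2 = key2.rstrip(b"\x00").split(b"\x01")
    # Assumption: a key with fewer levels contributes empty levels.
    count = max(len(levels1), len(levels2))
    levels1 += [b""] * (count - len(levels1))
    levels2 += [b""] * (count - len(levels2))
    # Within each level, byte 02 separates the two keys' contributions.
    merged = [x + b"\x02" + y for x, y in zip(levels1, levels2)]
    return b"\x01".join(merged) + b"\x00"


# Made-up three-level keys, purely for illustration.
k1 = b"\x2a\x2c\x01\x05\x05\x01\x05\x05\x00"
k2 = b"\x2e\x01\x05\x01\x05\x00"
print(merge_sort_keys(k1, k2).hex(" "))
```

If the weights were run-length compressed within a level, the bytes on
either side of the 02 would not be separable this simply, which is what
prompts my question.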

Richard.


