Non-primary Weights of U+FFFE

Markus Scherer markus.icu at gmail.com
Thu Apr 3 23:17:10 CDT 2014


On Thu, Apr 3, 2014 at 2:30 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> > It is also important to generate the special weights
> > on primary to tertiary levels for shifted CEs, so that
> > alternate=shifted works properly.
>
> Can you expand on this, because I don't see any such need at the
> primary to tertiary levels.
>

I think I confused myself. Please ignore this sentence and instead read
what I put into the spec:

1.1.1 U+FFFE <http://www.unicode.org/reports/tr35/tr35-collation.html#Algorithm_FFFE>

U+FFFE maps to a CE with special minimal weights on all levels, including
case, quaternary and identical levels — which may require special code for
those levels. Its primary weight is not "variable": U+FFFE must not become
ignorable in alternate handling.

> From your comment on ICU below, I can now see that you are specifying
> a behaviour for the quaternary level.


"all levels" includes quaternary and identical.

> Now, in full strength
> comparisons, we have, whatever the alternate setting,
>
> "op" < "ôp"
> "o p" < "op"
>
> Now, "o\uFFFE p" < "o\uFFFEp" < "o \uFFFEp" for alternate=non-ignorable.
> However, if the quaternary level weight of \uFFFE was calculated by the
> Unicode Collation Algorithm using allkeys_CLDR.txt as its collation
> element table, we would have
>
> "o \uFFFEp" < "o\uFFFE p" < "o\uFFFEp" for alternate=shifted
>
> To get the same ordering for these strings as for
> alternate=non-ignorable, one needs U+FFFE to have a minimal quaternary
> weight.  I don't see a test for this in CollationTest_CLDR_SHIFTED.txt.
>
> It seems that the UCA should be adjusted (in Section 3.6, variable
> weighting) so that the L4 weight for a CE whose L1 weight is non-variable
> but lower than the variable weights is 'as L1', rather than FFFF.  If I
> formally report this, should it be via a CLDR ticket or through the
> general Unicode mechanism?
>
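
The two orderings quoted above can be reproduced with a small model of the shifted quaternary level. This is only a sketch: the numeric weights are hypothetical placeholders (only their relative order matters), the secondary and tertiary levels are omitted because they are identical for all three strings, and the `fffe_minimal` flag is an invented switch between the plain UCA rule (FFFF for any non-variable, non-ignorable CE) and a minimal L4 weight for U+FFFE.

```python
# Hypothetical primary weights; only their relative order is meaningful.
PRIMARIES = {"\uFFFE": 0x0002, " ": 0x0209, "o": 0x2000, "p": 0x2100}
VARIABLE = {" "}          # characters treated as variable in this sketch
MINIMAL_L4 = 0x0001       # assumed minimal quaternary weight for U+FFFE


def l4_weights(s, fffe_minimal):
    """Quaternary weights under alternate=shifted for this toy table."""
    out = []
    for ch in s:
        if ch in VARIABLE:
            out.append(PRIMARIES[ch])   # shifted CE: L4 is the old primary
        elif ch == "\uFFFE" and fffe_minimal:
            out.append(MINIMAL_L4)      # special handling for U+FFFE
        else:
            out.append(0xFFFF)          # UCA rule for non-variable CEs
    return out


def shifted_key(s, fffe_minimal):
    """(L1, L4) pair; variable characters are ignored on L1."""
    l1 = [PRIMARIES[ch] for ch in s if ch not in VARIABLE]
    return (l1, l4_weights(s, fffe_minimal))


strings = ["o\uFFFE p", "o\uFFFEp", "o \uFFFEp"]
plain_uca = sorted(strings, key=lambda s: shifted_key(s, False))
with_minimal = sorted(strings, key=lambda s: shifted_key(s, True))
```

With the plain rule the space sorts the string that has it earliest to the front; with a minimal L4 weight for U+FFFE the order matches the alternate=non-ignorable order.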

I am not sure what you mean. The special mapping and behavior exist in CLDR
but not in the UCA, so none of this applies to UTS #10.
With ICU 53, which implements this, I get
<1 o\uFFFE p
    45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C .
<4 o\uFFFEp
    45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C .
<4 o \uFFFEp
    45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C .

(http://demo.icu-project.org/icu-bin/collation.html with
strength=quaternary, alternate=shifted, sort keys=on, and your input
strings)
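
As a quick check of the output above, the sort keys can be compared as plain byte strings. The sketch below assumes that the commas in the demo output stand for the 01 level separator and that the final period stands for the key terminator (written here as 00); the byte values themselves are copied from the output.

```python
def demo_key(text):
    """Turn demo-style output ('45 02 47 , 05 02 05 , ...') into bytes.

    Assumption: ',' is the 01 level separator and the key ends in 00.
    """
    levels = [bytes.fromhex(level) for level in text.rstrip(" .").split(",")]
    return b"\x01".join(levels) + b"\x00"


keys = {
    "o\\uFFFE p": demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 02 04 1C ."),
    "o\\uFFFEp": demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 02 1C ."),
    "o \\uFFFEp": demo_key("45 02 47 , 05 02 05 , 05 02 05 , 1C 04 02 1C ."),
}

# Sort the labels by their sort keys, compared byte by byte.
ordered = sorted(keys, key=keys.get)
```

The first three levels are identical, so the quaternary bytes alone decide the order, and 02 (for U+FFFE) sorting below 04 (for the space) is what keeps it the same as for alternate=non-ignorable.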

> > In ICU, we have test code that expects the same sort keys generated
> > from concatenating two strings with U+FFFE vs. calling
> > ucol_mergeSortkeys() on the two separate sort keys. The latter merges
> > sort keys by copying each level (separated by byte 01) from each sort
> > key and inserting a byte 02 between the bytes from different sort
> > keys. (See ucol.h
> > <http://www.icu-project.org/apiref/icu4c/ucol_8h.html>.)
>
> So is the reason for unique weights at the secondary to tertiary levels
> simply that you don't want to have to unpick ICU's run-length
> compression for your test?
>

For ICU, we use weights and code to make U+FFFE behave exactly like the
function that works on finished sort keys. It makes it easy to test that it
works right.
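
The level-by-level merge described in the quoted text can be sketched roughly as follows. This is not ICU's implementation of ucol_mergeSortkeys(); the function name `merge_sort_keys` and the handling of keys with different numbers of levels are assumptions made for illustration, and the example keys for "o" and "p" are inferred from the demo output above rather than taken from ICU.

```python
def merge_sort_keys(a: bytes, b: bytes) -> bytes:
    """Merge two 00-terminated sort keys level by level.

    Levels are separated by 01; a 02 byte is inserted between the
    bytes that come from the two different keys on each level.
    """
    levels_a = a.rstrip(b"\x00").split(b"\x01")
    levels_b = b.rstrip(b"\x00").split(b"\x01")
    # Assumption: a key with fewer levels contributes empty levels.
    n = max(len(levels_a), len(levels_b))
    levels_a += [b""] * (n - len(levels_a))
    levels_b += [b""] * (n - len(levels_b))
    merged = [x + b"\x02" + y for x, y in zip(levels_a, levels_b)]
    return b"\x01".join(merged) + b"\x00"


# Hypothetical tertiary-strength keys for "o" and "p".
key_o = bytes.fromhex("45 01 05 01 05 00")
key_p = bytes.fromhex("47 01 05 01 05 00")
merged = merge_sort_keys(key_o, key_p)
```

Under these assumptions the merged key has the levels 45 02 47, 05 02 05 and 05 02 05, which is the same shape as the first three levels shown for "o\uFFFEp".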

This behavior might not otherwise be necessary. It might even work if you
give U+FFFE "common" non-primary weights and apply the run-length
compression across it. At least I can't find a reason why it would not
work. If this is true, then we could weaken the spec and turn some of the
current requirement into a recommendation.

markus