Collation / Fractional UCA / Implicit Weights Questions

Sun Nov 26 12:28:13 CST 2017

On Sat, Nov 25, 2017 at 5:07 PM, Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:

> As part of my efforts to implement CLDR support for the Elixir language
> I’ve now started work on collations and working my way through TR10 and the
> relevant parts of TR35.
>

Have you considered calling an existing library (e.g., ICU) from your
language runtime, rather than doing this from scratch?

> I have some questions on implicit weight calculation I’m unable to resolve
> and would appreciate any help or pointers on:
>
> (1) Unified Ideograph vs Radical
>
> Is there a preferred or intended strategy - to use the Unified Ideograph
> or radical definitions?
>

Either way, this is only a default, to be used when we don't know the
language or desired sort order. When one of the CJK languages is selected,
the tailoring provides a specific Han character order.

As such, you have a choice between the DUCET order, which can be
implemented with very minimal data, or the radical-stroke order, which is a
bit more meaningful but large (because it's a permutation of all of the Han
characters).

Each Han allocation block in Unicode, including the original one which has
almost all of the commonly used characters, is intended to have its share
of Han characters in radical-stroke order (although the allocation is
fixed, so mistakes cannot be corrected). That is, for most of the common
Han characters (those in the original part of the original block), there
should be little difference in the order. However, for characters outside
the original Unihan block, the DUCET order is not useful.

> (2) Calculating implicit weights for radical definitions
>
> TR10/TR35 seem quiet on the topic - my working assumption is to use
> the [fixed first implicit byte E0] and [fixed last implicit byte E4] in
> FractionalUCA.txt to generate implicit weights that respect the radical
> order (left to right, top to bottom).  Is that a reasonable working
> principle?
>

Yes, the radical-stroke data is intended to provide an order as listed.

We kept the E0..E4 lead byte range in FractionalUCA.txt as is for
stability. You can use more or fewer lead bytes. For ICU, I move the
implicit-weight lead bytes much higher, to make more room for large Han
tailorings. You can choose your implicit-weight allocation freely because I
changed the primary weights of Han compatibility characters to refer to the
Han code points rather than hardcode their weights. (This is also why the
Han radical-stroke data comes first -- you can use a single-pass parser,
establish the Han order, and then look up their weights by code point.) You
just have to also move one or two "high" primary weights accordingly, such
as for U+FFFD.
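
As a rough illustration of that freedom, the Python sketch below assigns
3-byte primary weights to Han characters in a given order within a chosen
lead-byte range, then resolves a compatibility character by looking up its
Han code point. The function name `allocate_han_primaries`, the trail-byte
range 0x04..0xFE, and the four-character order are assumptions made for the
example; they are not the values FractionalUCA.txt or ICU use.

```python
from typing import Dict, Iterable

# Assumption: trail bytes stay within 0x04..0xFE. This range is illustrative
# only; it is not taken from FractionalUCA.txt or from ICU.
TRAIL_MIN, TRAIL_MAX = 0x04, 0xFE
TRAIL_COUNT = TRAIL_MAX - TRAIL_MIN + 1


def allocate_han_primaries(
    han_order: Iterable[int], first_lead: int = 0xE0, last_lead: int = 0xE4
) -> Dict[int, bytes]:
    """Give each code point a 3-byte primary that preserves the input order."""
    per_lead = TRAIL_COUNT * TRAIL_COUNT
    capacity = (last_lead - first_lead + 1) * per_lead
    weights: Dict[int, bytes] = {}
    for index, cp in enumerate(han_order):
        if index >= capacity:
            raise ValueError("lead-byte range too small for this Han order")
        lead, rest = divmod(index, per_lead)
        second, third = divmod(rest, TRAIL_COUNT)
        weights[cp] = bytes(
            [first_lead + lead, TRAIL_MIN + second, TRAIL_MIN + third]
        )
    return weights


# Made-up excerpt of a Han order, standing in for the real radical-stroke data.
han_order = [0x4E00, 0x3400, 0x4E28, 0x8C48]
weights = allocate_han_primaries(han_order)

# A compatibility ideograph takes the primary of the Han character it refers
# to (U+F900 decomposes to U+8C48), so it is resolved by code point lookup.
compat_to_han = {0xF900: 0x8C48}
compat_primary = weights[compat_to_han[0xF900]]

# Moving the whole allocation to other lead bytes changes the keys, not the order.
moved = allocate_han_primaries(han_order, first_lead=0xF0, last_lead=0xF7)
```

Because compatibility characters are resolved by code point rather than by a
hardcoded weight, changing `first_lead` and `last_lead` needs no other change
to the Han data; only the few "high" primaries above the range have to move.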

> (3) Implicit weight calculations in general
>
> TR10 at https://www.unicode.org/reports/tr10/#Implicit_Weights will
> generate weights with a top byte of 0xFB which would seem in conflict with
> the [fixed first implicit byte E0] and [fixed last implicit byte E4]
> indicators.  My working assumption is to use the algorithm in TR10 to
> calculate implicit weights except for radical definitions which would use
> the [fixed first] and [fixed last]
>

No, careful. The DUCET is published with 16-bit primary weights (and some
weights are pairs of 16-bit values). CLDR FractionalUCA.txt uses primary
weights of 1, 2, or 3 *bytes*. (ICU uses 4-byte weights for
unassigned-implicit weights, and in tailorings if needed.) They are
unrelated values, although they provide the same sort order (except for the
intentional CLDR reshufflings of some numerical symbols and such).

Conformance to the algorithms requires you to get the same order, but does
not require you to get the same sort keys.
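
For reference, the TR10 computation mentioned in the question can be
sketched in Python as below. It produces a pair of 16-bit DUCET-style
weights (the 0xFB.. values), which are a different encoding from the
fractional lead bytes in FractionalUCA.txt. The classification is
deliberately simplified and is an assumption of this sketch: it uses block
ranges for only the main Han blocks instead of the Unified_Ideograph
property, and it omits the separate bases TR10 defines for some siniform
scripts such as Tangut. The function names are hypothetical.

```python
from typing import Tuple

CORE_HAN_BASE = 0xFB40    # Unified_Ideograph in the URO or compatibility block
OTHER_HAN_BASE = 0xFB80   # Unified_Ideograph elsewhere (extension blocks)
UNASSIGNED_BASE = 0xFBC0  # any other code point without an explicit weight

# The twelve unified ideographs inside the CJK Compatibility Ideographs block.
COMPAT_BLOCK_UNIFIED = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}

# Assumption: whole-block ranges for Extension A and B only. A real
# implementation should use property data for its Unicode version.
OTHER_HAN_RANGES = [(0x3400, 0x4DBF), (0x20000, 0x2A6DF)]


def implicit_base(cp: int) -> int:
    """Pick the base for a code point (simplified classification)."""
    if 0x4E00 <= cp <= 0x9FFF or cp in COMPAT_BLOCK_UNIFIED:
        return CORE_HAN_BASE
    if any(lo <= cp <= hi for lo, hi in OTHER_HAN_RANGES):
        return OTHER_HAN_BASE
    return UNASSIGNED_BASE


def ducet_implicit_weights(cp: int) -> Tuple[int, int]:
    """Return the two 16-bit primary weights (AAAA, BBBB) for a code point."""
    aaaa = implicit_base(cp) + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


print([hex(w) for w in ducet_implicit_weights(0x4E00)])
```

Any allocation that sorts code points in the same relative order as these
pairs is equally conformant, since only the resulting order is required to
match.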

> This would seem to align with TR35 which says:
>
> "Note: The particular primary lead bytes for Hani vs. IMPLICIT vs.
> TRAILING are only an example", suggesting that Hani is calculated with
> leading bytes 0xFB per TR10 and the [fixed first implicit] can be used to
> generate weights for radicals (and other non-specified code points)
>

No, it refers to your freedom to choose the lead-byte range and the
bit-distribution algorithm, as I described for ICU above.

> Thanks in advance, —Kip
>
Best regards,
markus