Collation / Fractional UCA / Implicit Weights Questions

Sun Nov 26 20:02:04 CST 2017

Markus, thank you for your response. I will admit that a large part of my motivation is to learn more about collation. The peculiarities of the Erlang VM (upon which Elixir runs) make access to native libs challenging but not impossible. Of course leveraging your work is the canonical approach and it may be where I end up.

So now I understand better about the application of the radical data and I need to decide where to place them. You note: "For ICU, I move the implicit-weight lead bytes much higher, to make more room for large Han tailorings. You can choose your implicit-weight allocation freely"

Where do you place them? (I know, I should read the code and I will but the learning curve is steep!)

Regards, —Kip

> On 27 Nov 2017, at 5:28 am, Markus Scherer <markus.icu at gmail.com> wrote:
> 
> On Sat, Nov 25, 2017 at 5:07 PM, Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:
> As part of my efforts to implement CLDR support for the Elixir language I've now started work on collations and am working my way through TR10 and the relevant parts of TR35.
> 
> Have you considered calling an existing library (e.g., ICU) from your language runtime, rather than do this from scratch?
> 
> I have some questions on implicit weight calculation I'm unable to resolve and would appreciate any help or pointers on:
> 
> (1) Unified Ideograph vs Radical
> 
> Is there a preferred or intended strategy - to use the Unified Ideograph or radical definitions?
> 
> This is a default, to be used when we don't know the language or desired sort order. When one of the CJK languages is selected, the tailoring provides a specific Han character order.
> 
> As such, you have a choice between the DUCET order, which can be implemented with very minimal data, or the radical-stroke order, which is a bit more meaningful but large (because it's a permutation of all of the Han characters).
> 
> Each Han allocation block in Unicode, including the original one which has almost all of the commonly used characters, is intended to have its share of Han characters in radical-stroke order (although the allocation is fixed, so mistakes cannot be corrected). That is, for most of the common Han characters (those in the original part of the original block), there should be little difference in the order. However, for characters outside the original Unihan block, the DUCET order is not useful.
> 
> (2) Calculating implicit weights for radical definitions
> 
> TR10/TR35 seem quiet on the topic - my working assumption is to use the [fixed first implicit byte E0] and [fixed last implicit byte E4] in FractionalUCA.txt to generate implicit weights that respect the radical order (left to right, top to bottom). Is that a reasonable working principle?
> 
> Yes, the radical-stroke data is intended to provide an order as listed.
> 
> We kept the E0..E4 lead byte range in FractionalUCA.txt as is for stability. You can use more or fewer lead bytes. For ICU, I move the implicit-weight lead bytes much higher, to make more room for large Han tailorings. You can choose your implicit-weight allocation freely because I changed the primary weights of Han compatibility characters to refer to the Han code points rather than hardcode their weights. (This is also why the Han radical-stroke data comes first -- you can use a single-pass parser, establish the Han order, and then look up their weights by code point.) You just have to also move one or two "high" primary weights accordingly, such as for U+FFFD.
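
A minimal sketch of the single-pass idea described above, assuming the radical-stroke order has already been parsed into a list of code points. The function name, the three-byte layout, and the trailing-byte range 0x02..0xFF are assumptions made for illustration; this is not ICU's actual allocation scheme.

```python
def allocate_han_primaries(ordered_code_points, first_lead=0xE0, last_lead=0xE4,
                           min_trail=0x02, max_trail=0xFF):
    """Map each code point to a 3-byte primary weight that grows with list position.

    The lead-byte range defaults to E0..E4 as in FractionalUCA.txt, but any
    range can be passed in. The trailing-byte range is an assumption of this sketch.
    """
    span = max_trail - min_trail + 1
    capacity = (last_lead - first_lead + 1) * span * span
    if len(ordered_code_points) > capacity:
        raise ValueError("lead-byte range too small for this many characters")
    weights = {}
    for index, cp in enumerate(ordered_code_points):
        # Split the position into lead / second / third byte offsets.
        lead, rest = divmod(index, span * span)
        second, third = divmod(rest, span)
        weights[cp] = bytes([first_lead + lead, min_trail + second, min_trail + third])
    return weights


# Tiny sample order: the first three Kangxi radicals as unified ideographs.
sample_order = [0x4E00, 0x4E28, 0x4E36]
han_weights = allocate_han_primaries(sample_order)
```

Once the table exists, a Han compatibility character can be given the primary weight of the unified ideograph it refers to by looking that code point up in `han_weights`, which is why the order has to be established first.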
> 
> (3) Implicit weight calculations in general
> 
> TR10 at https://www.unicode.org/reports/tr10/#Implicit_Weights generates weights with a top byte of 0xFB, which would seem to conflict with the [fixed first implicit byte E0] and [fixed last implicit byte E4] indicators. My working assumption is to use the algorithm in TR10 to calculate implicit weights, except for radical definitions, which would use the [fixed first] and [fixed last].
> 
> No, careful. The DUCET is published with 16-bit primary weights (and some weights are pairs of 16-bit values). CLDR FractionalUCA.txt uses primary weights of 1, 2, or 3 *bytes*. (ICU uses 4-byte weights for unassigned-implicit weights, and in tailorings if needed.) They are unrelated values, although they provide the same sort order (except for the intentional CLDR reshufflings of some numerical symbols and such).
> 
> Conformance to the algorithms requires you to get the same order, but does not require you to get the same sort keys.
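
For reference, the TR10 computation that produces the 0xFB.. values can be sketched as below. It yields a pair of 16-bit primary weights in the DUCET number space, which is unrelated to the byte-based weights in FractionalUCA.txt. The function name is hypothetical, and the base is left to the caller because it depends on character properties: TR10 lists 0xFB40 for unified ideographs in the core CJK blocks, 0xFB80 for other unified ideographs, and 0xFBC0 for other unassigned code points, among others.

```python
def tr10_implicit_primary(cp, base):
    """Return the (AAAA, BBBB) pair of 16-bit primary weights for a code point.

    AAAA carries the base plus the bits of the code point above bit 14;
    BBBB carries the low 15 bits with the top bit set so it is never zero.
    """
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb


# U+4E00 in the core block, and U+2A700 in an extension block.
core = tr10_implicit_primary(0x4E00, 0xFB40)
extension = tr10_implicit_primary(0x2A700, 0xFB80)
```

An implementation that uses a different allocation is still conformant as long as comparing its sort keys gives the same order as comparing these pairs.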
> 
> This would seem to align with TR35 which says:
> 
> "Note: The particular primary lead bytes for Hani vs. IMPLICIT vs. TRAILING are only an example", suggesting that Hani is calculated with a lead byte of 0xFB per TR10 and that the [fixed first implicit] can be used to generate weights for radicals (and other unspecified code points).
> 
> No, it refers to your freedom of choice of range and bit-distribution algorithm, as for ICU as I said above.
> 
> Thanks in advance, —Kip 
> 
> Best regards,
> markus
