Sorting notation

Sun Feb 23 17:04:34 CST 2014

On Sun, Feb 23, 2014 at 2:13 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-02-23 22:32 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:
>
>> On Sun, 23 Feb 2014 20:49:24 +0100
>> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>
>> At least, referring to Version 24 of the LDML specification, I assume
>> Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9,
>> which purports to define the meaning of "&[before 2]..<<".  It's
>> conceivable that I am wrong, and the meaning of "&[before 2]á << ạ" is
>> undefined.
>>
>
No, it's well-defined, and I believe that part of the spec is fairly
complete since CLDR 24.

> This looks like a cryptic notation anyway. If we assume that there's an
> implicit reset at start of a collation rule, and that collation does not
> define any relative order for the empty string, you could simply write this
> reset at level 2 as:
>   << á << ạ
>

It might have made sense 15 years ago to permit relations without an
initial reset, because at the time the rules were applied on a blank slate.
Ever since ICU/CLDR collation rules were redefined to apply on top of DUCET
(and later on top of the CLDR root collation), you really need to reset to
something for the result to make sense. CLDR 24 forbids rules without
initial reset, and ICU 53 will follow suit.

> instead of the mysterious notation (and in fact verbose and probably
> inconsistent in the way the same level 2 is further used with "<<"):
>   &[before 2]á << ạ
>

It is true that the "2" and the strength of the operator are redundant, but
the notation is now well-defined.
I don't know your criteria for "mysterious" :-)
It does help to know the root collation mappings, or at least how they are
generally constructed; for example, that á maps to two collation elements.
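
To illustrate that last point: the canonical decomposition already shows the
two pieces involved. The Python sketch below uses only the standard
unicodedata module. The helper name decompose is made up for this example,
and it lists code points only; the actual collation elements and their
weights come from the root collation data and are not reproduced here.

```python
import unicodedata

def decompose(ch):
    """Return the canonical (NFD) decomposition of ch as readable labels."""
    return [
        f"U+{ord(c):04X} {unicodedata.name(c)}"
        for c in unicodedata.normalize("NFD", ch)
    ]

# U+00E1 is a-acute, U+1EA1 is a-with-dot-below.
for ch in ("\u00e1", "\u1ea1"):
    print(ch, decompose(ch))
```

Each of the two characters decomposes into a base letter plus one combining
mark, which is why a reset such as "&[before 2]á" has to deal with more than
one collation element.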

> I don't think the "&" is necessary except as a separator between separate
> rules (where all rules must implicitly start by a reset at some level).
>

See above.

> The "monster" you describe belongs to the ICU implementation (which is not part
> of any standard but is now integrated in various products that have abandoned
> the idea of implementing (unstable and complex) collations themselves).
>

I think Richard refers to the "monster" because it is very, very tricky to
get one's head around the interaction of all of the pieces of the UCA,
Unicode normalization, and the CLDR additions. At least when it comes to
the heads of Richard, Mark, Ken, and my own...

Also, the implementation of UCA is easy if you don't care about data size
or speed of string comparisons. Once you care about size and speed and want
additional functionality (like in ICU), it's a major chunk of code. In the
case of ICU, that code had accreted functionality and changed with changing
specs and had gotten buggy and hard to maintain, so I am in the process of
reimplementing it, with hopes of getting it into ICU 53 in March. The code
and data actually got smaller, but it's still large.

> My opinion is that this part of ICU should be detached from it in a
> completely separated project, to help simplify it,
>

It's complex for reasons stated above, and it benefits from many
lower-level parts of ICU (Unicode properties, normalization, data loading,
data structures, ...).

> It is notable that after so many years, collation is still not
> implemented in Javascript, and still does not have a standardized API in
> Javascript/ECMAscript minimum support for strings
>

Collation was added to the ECMAScript standard in 2012, with several
browsers implementing it.
PyICU makes it available in Python.

If someone wanted to port code to JavaScript or Python, and wanted it to be
fast, the new (upcoming) ICU Java code might be a reasonable start.

> When performance of applications on the client side is a problem (for
> client-side applications needing to perform dynamic collations), full
> collators are not implemented at all, and these applications use a much
> simpler model (even if they don't work very well with lots of languages).
>

Right. If the client code need not collate newly typed strings, then one
good technique is to have the server send the corresponding sort keys. By
the way, ICU makes a strong effort to write very short sort keys.
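
A minimal sketch of that technique in Python, under loud assumptions: the key
format here is a toy (base letters first, then combining marks), not ICU's
sort key format and not UCA-conformant, and the names toy_sort_key,
server_payload and client_sort are hypothetical. The point is only that the
client orders strings by comparing opaque bytes.

```python
import unicodedata

def toy_sort_key(s: str) -> bytes:
    """Toy two-level key: lowercased base characters, then combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    primary = [c.lower() for c in decomposed if not unicodedata.combining(c)]
    secondary = [c for c in decomposed if unicodedata.combining(c)]
    # U+0000 separates the levels and sorts below any real character.
    return ("".join(primary) + "\x00" + "".join(secondary)).encode("utf-8")

def server_payload(strings):
    """Server side: pair each string with its key, hex-encoded for transport."""
    return [(s, toy_sort_key(s).hex()) for s in strings]

def client_sort(payload):
    """Client side: order by binary key comparison only; no collator needed."""
    ordered = sorted(payload, key=lambda item: bytes.fromhex(item[1]))
    return [s for s, _key_hex in ordered]

# \u00e9 is e-acute, \u00c9 is E-acute.
words = ["zebra", "\u00e9clair", "apple", "eclair", "\u00c9tude"]
print(client_sort(server_payload(words)))
```

With these inputs the client ends up with apple, eclair, éclair, Étude, zebra,
whereas a plain code point sort would put the accented words after zebra. A
real deployment would have the server compute keys with an actual collator
instead of this toy function.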

Best regards,
markus
-- 
Google Internationalization Engineering