Sorting notation

Philippe Verdy verdy_p at wanadoo.fr
Sun Feb 23 16:13:53 CST 2014


2014-02-23 22:32 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> On Sun, 23 Feb 2014 20:49:24 +0100
> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> At least, referring to Version 24 of the LDML specification, I assume
> Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9,
> which purports to define the meaning of "&[before 2]..<<".  It's
> conceivable that I am wrong, and the meaning of "&[before 2]á << ạ" is
> undefined.
>

This looks like a cryptic notation anyway. If we assume that there's an
implicit reset at start of a collation rule, and that collation does not
define any relative order for the empty string, you could simply write this
reset at level 2 as:
  << á << ạ
instead of the mysterious notation (which is in fact verbose, and probably
inconsistent in the way the same level 2 is then used again with "<<"):
  &[before 2]á << ạ

I don't think the "&" is necessary except as a separator between separate
rules (where all rules must implicitly start with a reset at some level).

The "monster" you describe belongs to the ICU implementation (which is not
part of any standard, but is now integrated in various products that have
abandoned the idea of implementing (unstable and complex) collations
themselves).

My opinion is that this part of ICU should be detached from it into a
completely separate project, to help simplify it, because all the rest
of ICU has viable competing implementations (that are also more easily
ported to other languages without having to create possibly unsafe binary
bindings to native C/C++ code or Java).

It is notable that after so many years, collation is still not
implemented in JavaScript, and still does not have a standardized API in
the minimum JavaScript/ECMAScript support for strings. (There is an
implementation in Lua, though, based on internal bindings to the native
C/C++ code in its library; there are also some attempts to emulate it in
Python; in C#/J# the implementation is performed by binding the native
C/C++ code. But this still causes deployment problems for distributed
applications that need to deliver code on the client side of web services:
only Java works for now, not JavaScript, except by using server-side helpers
with really _slow_ remote APIs.)

When performance of applications on the client side is a problem (for
client-side applications needing to perform dynamic collations), full
collators are not implemented at all, and these applications use a much
simpler model (even if they don't work very well with lots of languages).
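
To illustrate what such a "much simpler model" can look like, here is a
minimal sketch in Python. It is not taken from any real application or from
the CLDR data: the function name simple_sort_key and the two-level scheme
(base letters first, accents only as a tie-breaker) are assumptions made for
this example. It uses only the standard unicodedata module.

```python
import unicodedata

def simple_sort_key(text):
    """Hypothetical two-level key: base letters first, accents as tie-breaker."""
    # Decompose so that accents become separate combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # First level: drop combining marks and fold case.
    base = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).casefold()
    # Second level: the decomposed string itself, compared by code point.
    return (base, decomposed)

words = ["côte", "cote", "coté", "Cote", "côté"]
sorted_words = sorted(words, key=simple_sort_key)
print(sorted_words)
```

Such a key has no notion of language-specific tailoring, contractions or
rule resets, which is exactly why this kind of model does not work well for
many languages.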

And the existing CLDR data about collation is simply not portable at all
outside contexts where ICU can be used. Instead, each application supports
its own (more or less limited) model implementing some unspecified part of
the CLDR collation data (which is then insufficiently reused and corrected
for handling real cases, even for the most frequently needed ones).