richard.wordingham at ntlworld.com
Sun Feb 23 15:32:45 CST 2014
On Sun, 23 Feb 2014 20:49:24 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> It seems surprising that Michael Everson asks the question, when he
> already knows so much about Unicode algorithms (but maybe less about
> notations used in CLDR data)
> The CLDR also has several competing notations for specifying
> collations so that may be the purpose of his question.
I have no confidence that his question has been understood. Collation
is a monster, and it is unsafe to assume that one understands it. The
ICU notation and implementation for an abstract definition of collation
turned out to be full of traps, and won't catch up with CLDR
definitions until Markus Scherer's raft of collation amendments goes in.
(Or have I missed the announcement?) Rigorous definitions have had to
address collation elements (i.e. sets of weights, one at each level
with 0 a special value), which is not as abstract as the ICU notation
was meant to be.
As an example of the treachery of collation definitions, one might
naïvely think that adding &a<<ạ to the default collation would result
in ạ << á holding, but it doesn't, for á has two collation elements,
not one. CLDR has now* redefined the notation so that &[before 2]á <<
ạ will give the ordering relationships a << ạ << á << à without having
to reorder U+0323 COMBINING DOT BELOW. In the default collations,
secondary differences are implemented by adding collation elements with
zero primary weights, while tertiary differences are implemented as
different tertiary weights in collation elements with non-zero primary
weights. I doubt that using both methods at the same level works well.
Fortunately, collation generally only needs to work well when
restricted to valid words. For some languages, the task of placing an
arbitrary string of the language's characters in the correct place by
alphabetical order is meaningless.
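To make the &a<<ạ trap concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the weights are invented (they are not DUCET or CLDR root values), the table names DEFAULT, NAIVE and BEFORE and the helpers elements() and sort_key() are hypothetical, and the two-element mapping in BEFORE is only one conceivable way of getting the ordering that &[before 2]á << ạ is meant to give, not a description of what ICU or CLDR actually does.

```python
import unicodedata

# A collation element is (primary, secondary, tertiary); 0 means ignorable
# at that level. All weights are toy values chosen for this sketch only.
COMMON = 0x20      # toy "common" secondary weight
A_PRIMARY = 0x100  # toy primary weight for the letter a
T = 0x02           # toy tertiary weight

DEFAULT = {
    "a": [(A_PRIMARY, COMMON, T)],
    "\u0301": [(0, 0x24, T)],  # COMBINING ACUTE ACCENT
    "\u0300": [(0, 0x25, T)],  # COMBINING GRAVE ACCENT
    "\u0323": [(0, 0x40, T)],  # COMBINING DOT BELOW
}

# Naive reading of &a<<ạ: one element, secondary weight just above a's.
NAIVE = dict(DEFAULT)
NAIVE["a\u0323"] = [(A_PRIMARY, COMMON + 1, T)]

# Assumed model of the intended result of &[before 2]á << ạ: keep two
# elements and give the second a secondary weight just below the acute's.
BEFORE = dict(DEFAULT)
BEFORE["a\u0323"] = [(A_PRIMARY, COMMON, T), (0, 0x23, T)]


def elements(text, table):
    """Map NFD text to collation elements, preferring two-character entries."""
    nfd = unicodedata.normalize("NFD", text)
    result = []
    i = 0
    while i < len(nfd):
        for length in (2, 1):
            chunk = nfd[i:i + length]
            if len(chunk) == length and chunk in table:
                result.extend(table[chunk])
                i += length
                break
        else:
            raise KeyError(nfd[i])
    return result


def sort_key(text, table):
    """Build a three-level key: all primaries, then secondaries, then tertiaries."""
    ces = elements(text, table)
    return tuple(
        tuple(ce[level] for ce in ces if ce[level] != 0)
        for level in range(3)
    )


WORDS = ["a", "a\u0300", "a\u0301", "a\u0323"]  # a, à, á, ạ


def order(table):
    return sorted(WORDS, key=lambda w: sort_key(w, table))
```

With these toy weights, order(NAIVE) puts ạ after both á and à, because the secondary key of á begins with a's own common weight and ạ's single element already exceeds it. order(BEFORE) gives a, ạ, á, à, since ạ then differs from á only in the second, primary-ignorable element.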
*At least, referring to Version 24 of the LDML specification, I assume
Part 5 Section 3.5, which defines "&..<<", also applies to Section 3.9,
which purports to define the meaning of "&[before 2]..<<". It's
conceivable that I am wrong, and the meaning of "&[before 2]á << ạ" is