Sorting notation

Richard Wordingham richard.wordingham at ntlworld.com
Tue Feb 25 18:08:27 CST 2014


On Tue, 25 Feb 2014 21:02:47 +0100
Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> 2014-02-24 20:38 GMT+01:00 Richard Wordingham <
> richard.wordingham at ntlworld.com>:

The immediately following text of mine is entirely concerned
with the interpretation of the LDML specification "&[before 2]á << ạ".

> > My understanding of the meaning of the notation is that:
> >
> > 1) ạ is to have the same number and type of collation elements as á
> > currently has;
> > 2) The last collation element of ạ that has a positive weight at
> > level 2 is to be immediately before the corresponding collation
> > element of á at the secondary level;
> > 3) No collation element is to be ordered between these two collation
> > elements; and
> > 4) Their other collation elements are to be the same.

The terms collation element and weight as I use them are intended to be
used as in the Unicode Collation Algorithm.  It is conceivable that I
have missed some subtlety in the difference between the extended
weights of DUCET and the fractional weights preferred for the
expression of the CLDR default collation.
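
To make the four points concrete, here is a minimal sketch in Python of how I read the effect of "&[before 2]á << ạ" on collation elements. Everything in it is an assumption for illustration: the function name, the three-level tuples, and the weights themselves (they are not DUCET or CLDR values). It also assumes integer weights with a free slot one step below the base weight, so point (3) is taken on trust rather than checked.

```python
from typing import List, Tuple

# A collation element here is a (primary, secondary, tertiary) weight triple.
CE = Tuple[int, int, int]


def before_secondary(base_ces: List[CE], step: int = 1) -> List[CE]:
    """Derive collation elements for the string tailored by
    '&[before 2]base << new' (hypothetical helper, not an ICU/CLDR API)."""
    # Point 2: locate the LAST element with a positive secondary weight.
    for i in range(len(base_ces) - 1, -1, -1):
        primary, secondary, tertiary = base_ces[i]
        if secondary > 0:
            if secondary - step <= 0:
                raise ValueError("no room below the base secondary weight")
            # Points 1 and 4: same number of elements, all others copied.
            new_ces = list(base_ces)
            # Point 2: lower only the secondary weight of that element.
            # Point 3 (nothing ordered in between) is assumed, not verified.
            new_ces[i] = (primary, secondary - step, tertiary)
            return new_ces
    # No positive secondary weight anywhere: the reset has nothing to act on.
    raise ValueError("base has no positive secondary weight")


# Illustrative weights only: a base letter followed by an accent element.
a_acute = [(0x15EF, 0x0020, 0x0002), (0x0000, 0x0024, 0x0002)]
a_dot_below = before_secondary(a_acute)
```

With these made-up weights, only the secondary weight of the second element changes (0x0024 becomes 0x0023); the primary and tertiary weights are copied, which is the "copying at the lower levels" I recommend below.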

> I disagree with your point (1).

> * The number of levels does not matter, the notation just indicates
> that the relation does not specify any starting weight for levels
> lower than the one indicated by the reset.

It does seem that what happens below the level of the reset is
irrelevant.  I couldn't construct a counter-example to show that it
can matter. I'd still recommend copying at the lower levels just in case
there is a subtle effect.

> * And the effective number of collation elements does not matter: we
> should assume that if one of the items does not have enough collation
> elements, there's for each level a zero weight for each missing
> level. In practice this only affects the first element, except in
> case of contractions.

This makes no sense to me.  The collation elements for ạ before the
application of the rule do not matter.  The requirements I gave on the
collation elements of ạ are for its collation elements *immediately
after* the rule has been applied.  This incomprehension also applies to
your comments on points (2) to (4).

> I disagree as well on point (2). The starting element (at the reset)
> may have a null weight at that level, so that we can still order
> other elements with the same null weight at that level, notably if
> they have non null weights for higher levels.

> I agree on your point (3) EXCEPT when the first item of a pair is a
> "reset" (i.e. an empty string).
> 
> The point (4) is completely wrong. The other collation elements in
> the first pair may be arbitrary (also possibly with distinct weights,
> but at higher levels) !!!

The specification "&[before 2]á << ạ" has to be invalid if á has no
non-zero secondary weights.  The LDML specification doesn't mention this
input error.

<snip>
> That's why I think that "&[before2] xxx" makes sense (even alone) and
> is in fact the same as "& << xxx"  or even just "<< xxx" if you
> consider that every rule starts by an implicit reset in order to
> create a valid pair (in the first pair, the 1st item of the pair is
> the reset itself, i.e. an empty string, the second item is the first
> non-empty string indicated after it; and the pair itself has a
> numeric property specifying its level, here 2).

This has nothing to do with the LDML notation.  As far as I can tell,
you are interpreting "<< xxx" to assign xxx a collation element with
zero primary weight.
 
> The form "&a < b < c < d ..." is a compressed form of these rules:
> "<a"; "a<b", "b<c" (where the "<" is any kind of comparator for some
> level). The "reset" is then automatically the first item of each pair.

> So my own syntax never needs any explicit reset, it just orders
> collection elements with simple rules, in which I can also add
> optional statistics (used only for the generation of collation keys,
> but not needed at all for comparing two strings).

No.  This is similar to the fallacy that a collation is defined by the
relative ordering (and degree of difference) of the collating
elements.  Are you relying on deferred binding?  And please try not to
use 'collation element' (a sequence of weights, one per level) when you
mean 'collating element' (either a string of characters or the ordered
pair of a string and its corresponding sequence of collation elements).
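
A small sketch of the two terms as data types may help keep them apart. The class names, the three-level layout and the weights are my own assumptions for illustration, and only the "ordered pair" sense of collating element is modelled.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CollationElement:
    """A sequence of weights, one per level (three levels assumed here)."""
    primary: int
    secondary: int
    tertiary: int


@dataclass(frozen=True)
class CollatingElement:
    """A string of characters paired with its sequence of collation elements."""
    text: str
    elements: Tuple[CollationElement, ...]


# Hypothetical entry: a two-character string mapped to one collation element.
ch = CollatingElement("ch", (CollationElement(0x1700, 0x0020, 0x0002),))
```

A tailoring rule names collating elements (strings); what it changes is the collation elements attached to them.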

> And I still don't handle some of the preprocessing needed
> for some Indic scripts (including Thai),...

Are you aware that Thai can be handled by contractions?  Compared with
how it might have been, Thai collation is extremely computer friendly.
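
As a rough sketch of what I mean by contractions, assume each sequence of a preposed vowel plus consonant is entered as a contraction whose collation elements are those of the consonant followed by those of the vowel. The table, the weights and the function name below are hypothetical, only two consonants and two vowels are covered, and only primary weights are compared.

```python
from typing import Dict, List, Tuple

CE = Tuple[int, int, int]  # (primary, secondary, tertiary)

KO_KAI, KHO_KHAI = "\u0E01", "\u0E02"   # consonants
SARA_AA, SARA_E = "\u0E32", "\u0E40"    # SARA E is written before its consonant

# Illustrative weights only: consonants sort before vowels.
TABLE: Dict[str, List[CE]] = {
    KO_KAI: [(0x100, 0x20, 0x02)],
    KHO_KHAI: [(0x101, 0x20, 0x02)],
    SARA_AA: [(0x200, 0x20, 0x02)],
    SARA_E: [(0x201, 0x20, 0x02)],
}

# Contractions: prevowel + consonant gets consonant weights first.
for consonant in (KO_KAI, KHO_KHAI):
    TABLE[SARA_E + consonant] = TABLE[consonant] + TABLE[SARA_E]


def primary_key(text: str) -> List[int]:
    """Primary-level sort key using longest-match lookup in TABLE."""
    key: List[int] = []
    i = 0
    while i < len(text):
        for length in (2, 1):  # try a two-character contraction first
            chunk = text[i:i + length]
            if len(chunk) == length and chunk in TABLE:
                key.extend(p for (p, _s, _t) in TABLE[chunk] if p)
                i += length
                break
        else:
            raise KeyError(f"no table entry for {text[i]!r}")
    return key


words = [KHO_KHAI + SARA_AA, SARA_E + KO_KAI]
ordered = sorted(words, key=primary_key)
```

No reordering pass over the text is needed: the contraction alone makes the word spelt SARA E + KO KAI sort under KO KAI, ahead of the word beginning with KHO KHAI.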

Richard.



