Sorting notation
Philippe Verdy
verdy_p at wanadoo.fr
Tue Feb 25 14:02:47 CST 2014
2014-02-24 20:38 GMT+01:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:
> My understanding of the meaning of the notation is that:
>
> 1) ạ is to have the same number and type of collation elements as á
> currently has;
> 2) The last collation element of ạ that has a positive weight at level
> 2 is to be immediately before the corresponding collation element of
> á at the secondary level;
> 3) No collation element is to be ordered between these two collation
> elements; and
> 4) Their other collation elements are to be the same.
>
I disagree with point your point (1).
* The number of levels does not matter, the notation just indicates that
the relation does not specify any starting weight for levels lower than the
one indicated by the reset.
* And the effective number of collation elements does not matter: we should
assume that if one of the item has not enough collation elements, there's
for each level a zero weight for each missing level. In practive this only
affects the first element, except in case of contractions.
I disagree as well on point (2). The starting element (at the reset) may
have a null weight at that level, so that we can still order other elements
with the same null weight at that level, notably if they have non null
weights for higher levels.
I agree on your point (3) EXCEPT when the first item of a pair is a "reset"
(i.e. an empty string).
The point (4) is completely wrong. The other collaction elements in the
first pair may be arbitrary (also possibly with distinct weights, but at
higher levels) !!!
In fact the two items listed after the reset do not matter at all. All what
is important is the item for the reset itself, and the first non-empty item
ordered after it.
That's why I think that "&[before2] xxx" makes sense (even alone) and is in
fact the same as "& << xxx" or even just "<< xxx" if you consider that
evey rule starts by an implicit reset in order to create a valid pair (in
the first pair, the 1st item of the pair is the reset itself, i.e. an empty
string, the second item is the first non-empty string indicated after it;
and the pair itself has a numeric property specifying its level, here 2).
The form "&a < b < c < d ..." is a compressed form of these rules: "<a";
"a<b", "b<c" (where the "<" is any king of comparator for some level). The
"reset" is then automatically the first item of each pair.
Internally in my implementation I do not really use pairs, but triplets of
the form (item1, weight, item2), where weight is a numeric value: initialy
it is set to an integer matchin the level number, but I may set it to a
fraction between two integers, to define frequencies (by adding a fraction
betwen 0 included and 1 excluded to indicate the probability of having
other collation elements with a lower weight at that level).
The fraction is then used to generate extremely compact collation keys
(using atithmetic coding), even shorter than what we get in ICU or with
Huffmann coding or other context-free prefix coding.
So my own syntax never needs any explicit reset, it just order collection
elements with simple rules, in which I can also add optional statistics
(used only for the generation of collation keys, but not needed at all for
comparing two strings).
My own syntax is even simpler: I don"t use at all multiple operators "<",
"<<", "<<<", I just specfy rules for each level separately, and for that I
do not even need any operator, byt just a common separator, so that I can
create the list of successive pairs (starting with the pair containing the
impliciit "reset" for that level). In the middle of the list however I may
set some %-tage value to indicate some relative frequency for the second
element of each pair (the 1st element of a pair with the reset always stars
at 0%, so the %tage is onlu indocated for the second element of each pair,
and this %age can never be 100% or more. Unless I specifiy a êrcentage, the
list of pairs uses implicit %-tage values distributed lineaility in the
given list (internally the precision of a double is sufficient for
practical cases, though doubles are immediately converted to 64-bit
integers, which may be needed if the statistics given are extremely skewed;
if statistics are not used to generate collation keys, I could even use
32-bit integers only for building the lookup tables for comparing strings;
so a tuning parameter can simply be used to ignore/discard the statistics
provided in rules and the construction of lookup tables is significantly
faster and the binary lookup tables are 50% more compact).
Such specification is much simpler to edit; and in many cases I don't need
to specify everything because I also perform the normalization closures or
the rules. The result is that rules are much shorter (just to represent for
the DUCET itself, via a script parsing it and generating the rules, I get
strings converted to short JSON data that can be parsed easily (the
normalization closure being still the longest part of the algorithm when
instantiating a new collator). I also never need and use the collation
weights suggested in the DUCET except for their relative values to order
the elements in much more compact rules.
Building the effective table of weights is then extremely trivial.
performing the final encoding of collation keys is not much more complex
than a traditionnal arithmeric coder (it is a bit simpler if you use
Huffman, however it is *slower* when instanciating the collator, because
you need .to precompute the bit pattern, something absolutely no needed
with artihmetic coding which uses the statastics dirctly without needed to
prebuild an helper lookup table, also because of the complex bit stuffing
which is simpler to do with the arithmetic coder).
I already spoke about this several months ago on this list. However I've
not experimented it with lots of locales needing complex collation rules
(so I did not investigate all the data already specified in CLDR for some
locales).
My code remains experimental mainly because I do not handle specific rules
needing custom work breakers (e.g. for South-East Asian scripts). I've not
experimented it a lot also with CJK Sinograms (and the complex lookup they
require if we need to lookup for the supplementary data such as
translitteration mappings for Pinyin or similar). And I still don't handle
some of the preprocessing needed for some Indic scripts (includng Thai), or
ignorable "stop words" (that may be needed even for English or French for
some kinds of collators), and the initial code to handle parsing and
preprocessing numbers is broken in my "pipelined" implementation that
attempts to avoid most internal bufferings and internal copies of
transformed strings (to avoid stressing memory allocators and garbage
collectors).
The code is not written in C/C++ but in Java (easier to test, I have little
experience of Python, and don't like PHP, I could have used Lua, also good
for experimenting designs, but it still lacks a JIT compiler to native code
and its performances are too poor compared to current cimplementations of
Javascript). I'd like to have a version can be be deployed on the client
side using a scripted language that is easily deployed (this also excludes
C/C++ implementations like ICU4C that I consider simply not available if
it's only supported by specific framework libraries; but my initial attemps
with using Javascript also has shown that it was still too slow even with
JIT and still stressed the memory allocator too much; that code is also
very unsafe, errorprone and a nightmare to debug).
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20140225/8822c213/attachment.html>
More information about the Unicode
mailing list