UCA unnecessary collation weight 0000
Philippe Verdy via Unicode
unicode at unicode.org
Thu Nov 1 18:38:08 CDT 2018
As well the step 2 of the algorithm speaks about a single "array" of
collation elements. Actually it's best to create one separate array per
level, and append weights for each level in the relevant array for that
level.
The steps S2.2 to S2.4 can do this, including for derived collation
elements in section 10.1, or variable weighting in section 4.
This also means that for fast string compares, the primary weights can be
processed on the fly (without needing any buffering) is the primary weights
are different between the two strings (including when one or both of the
two strings ends, and the secondary weights or tertiary weights detected
until then have not found any weight higher than the minimum weight value
for each level).
Otherwise:
- the first secondary weight higher that the minimum secondary weght value,
and all subsequent secondary weights must be buffered in a secondary
buffer .
- the first tertiary weight higher that the minimum secondary weght value,
and all subsequent secondary weights must be buffered in a tertiary buffer.
- and so on for higher levels (each buffer just needs to keep a counter,
when it's first used, indicating how many weights were not buffered while
processing and counting the primary weights, because all these weights were
all equal to the minimum value for the relevant level)
- these secondary/tertiary/etc. buffers will only be used once you reach
the end of the two strings when processing the primary level and no
difference was found: you'll start by comparing the initial counters in
these buffers and the buffer that has the largest counter value is
necessarily for the smaller compared string. If both counters are equal,
then you start comparing the weights stored in each buffer, until one of
the buffers ends before another (the shorter buffer is for the smaller
compared string). If both weight buffers reach the end, you use the next
pair of buffers built for the next level and process them with the same
algorithm.
Nowhere you'll ever need to consider any [.0000] weight which is just a
notation in the format of the DUCET intended only to be readable by humans
but never needed in any machine implementation.
Now if you want to create sort keys this is similar except that you don"t
have two strings to process and compare, all you want is to create separate
arrays of weights for each level: each level can be encoded separately, the
encoding must be made so that when you'll concatenate the encoded arrays,
the first few encoded *bits* in the secondary or tertiary encodings cannot
be larger or equal to the bits used by the encoding of the primary weights
(this only limits how you'll encode the 1st weight in each array as its
first encoding *bits* must be lower than the first bits used to encode any
weight in previous levels).
Nowhere you are required to encode weights exactly like their logical
weight, this encoding is fully reversible and can use any suitable
compression technics if needed. As long as you can safely detect when an
encoding ends, because it encounters some bits (with lower values) used to
start the encoding of one of the higher levels, the compression is safe.
For each level, you can reserve only a single code used to "mark" the start
of another higher level followed by some bits to indicate which level it
is, then followed by the compressed code for the level made so that each
weight is encoded by a code not starting by the reserved mark. That
encoding "mark" is not necessarily a 0000, it may be a nul byte, or a '!'
(if the encoding must be readable as ASCII or UTF-8-based, and must not use
any control or SPACE or isolated surrogate) and codes used to encode each
weight must not start by a byte lower or equal to this mark. The binary or
ASCII code units used to encode each weight must just be comparable, so
that comparing codes is equivalent to compare weights represented by each
code.
As well, you are not required to store multiple "marks". This is just one
of the possibilities to encode in the sort key which level is encoded after
each "mark", and the marks are not necessarily the same before each level
(their length may also vary depending on the level they are starting):
these marks may be completely removed from the final encoding if the
encoding/compression used allows discriminating the level used by all
weights, encoded in separate sets of values.
Typical compression technics are for example differencial, notably in
secondary or higher levels, and run-legth encoded to skip sequences of
weights all equal to the minimum weight.
The code units used by the weigh encoding for each level may also need to
avoid some forbidden values if needed (e.g. when encoding the weights to
UTF-8 or UTF16, or BOCU-1, or SCSU, you cannot use isolate code units
reserved for or representing an isolate surrogate in U+D800..U+DFFF as this
would create a string not conforming to any standard UTF).
Once again this means that the sequence of logical weight will can sefely
become a readable string, even suitable to be transmitted as plain-text
using any UTF, and that compression is also possible in that case: you can
create and store lot of sort keys even for very long texts
However it is generally better to just encode sort keys only for a
reasonnably discriminant part of the text, e.g. no sort key longer than 255
bytes (created from the start of the original texts): if you compare two
sort keys and find that they are equal, and if both sort keys have this
length of 255 bytes, then you'll compare the full original texts using the
fast-compare algorithm: you don't need to store full sort keys in addition
to the original texts. This can save lot of storage, provided that original
texts are sufficiently discriminated by their start, and that cases where
the sort keys were truncated to the limit of 255 bytes are exceptionnal.
For short texts however, truncated sortkeys may save time at the price of a
reasonnable storage cost (but sortkeys can be also encoded with roughly the
same size as the original text: compression is modest for the encoded
primary level. But compression is frequently very effective for higher
levels where their smaller weight also have less possible variations of
value, in a smaller set.
Notably for the secondary level used to encode case differences, only 3
bits are enough per weight, and you just need to reserve the 3-bit value
"000" as the "mark" for indicating the start of another higher level, while
encoding secondary weights as "001" to "111".
(This means that primary levels have to be encoded so that none of their
encoded primary weights are starting with "000" marking the start of the
secondary level. So primary weights can be encoded in patterns starting by
"0001", "001", "01", or "1" and followed by other bits: this allows
encoding them as readable UTF-8 if these characters are all different at
primary level, excluding only the 16 first C0 controls which need to be
preprocessed into escape sequences using the first permitted C0 control as
an escape, and escaping that C0 control itself).
The third level, started by the mark "00" and followed by the encoded
weights indicating this is a tertiary level and not an higher level, will
also be used to encode a small set of weights (in most locales, this is not
more than 8 or 16, so you need only 3 or 4 bits to encode weights (using
differential coding on 3-bits, you reserve "000" as the "mark" for the next
higher level, then use "001" to "111" to encode differencial weights, the
differencial weights being initially based on the minimum tertiary weight,
you'll use the bit pattern "001" to encode the most frequent minimum
tertiary weight, and patterns "01" to "11" plus additional bits to encode
other positive or negative differences of tertiary weights, or to use
run-length compression). Here also it is possible to map the patterns so
that the encoded secondary weight will be readable valid UTF-8.
The fourth level, started by the mark "000" can use the pattern "001" to
encode the most frequent minimum quaternary weight, and patterns "010" to
"011" followed by other bits to differentially encode the quaternary
weights. Here again it is possible to create an encoding for quaternary
weights that can use some run-length compression and can also be readable
valid UTF-8!
And so on.
Le jeu. 1 nov. 2018 à 22:04, Philippe Verdy <verdy_p at wanadoo.fr> a écrit :
> So it should be clear in the UCA algorithm and in the DUCET datatable that
> "0000" is NOT a valid weight
> It is just a notational placeholder used as ".0000", only indicating in
> the DUCET format that there's NO weight assigned at the indicated level,
> because the collation element is ALWAYS ignorable at this level.
> The DUCET could have as well used the notation ".none", or just dropped
> every ".0000" in its file (provided it contains a data entry specifying
> what is the minimum weight used for each level). This notation is only
> intended to be read by humans editing the file, so they don't need to
> wonder what is the level of the first indicated weight or remember what is
> the minimum weight for that level.
> But the DUCET table is actually generated by a machine and processed by
> machines.
>
>
>
> Le jeu. 1 nov. 2018 à 21:57, Philippe Verdy <verdy_p at wanadoo.fr> a écrit :
>
>> In summary, this step given in the algorithm is completely unneeded and
>> can be dropped completely:
>>
>> *S3.2 <http://unicode.org/reports/tr10/#S3.2> *If L is not 1, append a *level
>> separator*
>>
>> *Note:*The level separator is zero (0000), which is guaranteed to be
>> lower than any weight in the resulting sort key. This guarantees that when
>> two strings of unequal length are compared, where the shorter string is a
>> prefix of the longer string, the longer string is always sorted after the
>> shorter—in the absence of special features like contractions. For example:
>> "abc" < "abcX" where "X" can be any character(s).
>>
>> Remove any reference to the "level separator" from the UCA. You never
>> need it.
>>
>> As well this paragraph
>>
>> 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3>
>>
>> *Step 3.* Construct a sort key for each collation element array by
>> successively appending all non-zero weights from the collation element
>> array. Figure 2 gives an example of the application of this step to one
>> collation element array.
>>
>> Figure 2. Collation Element Array to Sort Key
>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>> Collation Element ArraySort Key
>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002] 0706
>> 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>>
>> can be written with this figure:
>>
>> Figure 2. Collation Element Array to Sort Key
>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>> Collation Element ArraySort Key
>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002] 0706
>> 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>>
>> The parentheses mark the collation weights 0020 and 0002 that can be
>> safely removed if they are respectively the minimum secondary weight and
>> minimum tertiary weight.
>> But note that 0020 is kept in two places as they are followed by a higher
>> weight 0021. This is general for any tailored collation (not just the
>> DUCET).
>>
>> Le jeu. 1 nov. 2018 à 21:42, Philippe Verdy <verdy_p at wanadoo.fr> a
>> écrit :
>>
>>> The 0000 is there in the UCA only because the DUCET is published in a
>>> format that uses it, but here also this format is useless: you never need
>>> any [.0000], or [.0000.0000] in the DUCET table as well. Instead the DUCET
>>> just needs to indicate what is the minimum weight assigned for every level
>>> (except the highest level where it is "implicitly" 0001, and not 0000).
>>>
>>>
>>> Le jeu. 1 nov. 2018 à 21:08, Markus Scherer <markus.icu at gmail.com> a
>>> écrit :
>>>
>>>> There are lots of ways to implement the UCA.
>>>>
>>>> When you want fast string comparison, the zero weights are useful for
>>>> processing -- and you don't actually assemble a sort key.
>>>>
>>>> People who want sort keys usually want them to be short, so you spend
>>>> time on compression. You probably also build sort keys as byte vectors not
>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>>> collation data file remunges all weights into fractional byte sequences,
>>>> and leaves gaps for tailoring.
>>>>
>>>> markus
>>>>
>>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181102/9fe85101/attachment-0001.html>
More information about the Unicode
mailing list