UCA unnecessary collation weight 0000

Philippe Verdy via Unicode unicode at unicode.org
Fri Nov 2 08:54:19 CDT 2018


It's not just a question of whether I like it or not. The standard makes
the presence of 0000 required in some steps, and that requirement is in
fact wrong: a 0000 weight is NEVER required to create an equivalent
collation order. These steps are completely unnecessary and should be
removed.

On Fri, Nov 2, 2018 at 14:03, Mark Davis ☕️ <mark at macchiato.com> wrote:

> You may not like the format of the data, but you are not bound to it. If
> you don't like the data format (e.g. you want [.0021.0002] instead of
> [.0000.0021.0002]), you can transform it however you want as long as you
> get the same answer, as it says here:
>
> http://unicode.org/reports/tr10/#Conformance
> “The Unicode Collation Algorithm is a logical specification.
> Implementations are free to change any part of the algorithm as long as any
> two strings compared by the implementation are ordered the same as they
> would be by the algorithm as specified. Implementations may also use a
> different format for the data in the Default Unicode Collation Element
> Table. The sort key is a logical intermediate object: if an implementation
> produces the same results in comparison of strings, the sort keys can
> differ in format from what is specified in this document. (See Section 9,
> Implementation Notes.)”
>
>
> That is what is done, for example, in ICU's implementation. See
> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> collation elements" and "sort keys" to see the transformed collation
> elements (from the DUCET + CLDR) and the resulting sort keys.
>
> a =>[29,05,_05] => 29 , 05 , 05 .
> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> à => <same>
> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> À => <same>
>
> Mark
>
>
> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
> unicode at unicode.org> wrote:
>
>> Likewise, step 2 of the algorithm speaks of a single "array" of
>> collation elements. It is actually better to create one separate array
>> per level, appending the weights for each level to the array for that
>> level.
>> Steps S2.2 to S2.4 can do this, including for the derived collation
>> elements in section 10.1 and the variable weighting in section 4.
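>>
>> For illustration, a minimal Python sketch of this per-level layout
>> (assuming collation elements are already available as tuples of
>> (primary, secondary, tertiary) weights):
>>
>>     def weight_arrays(collation_elements, levels=3):
>>         # One array per level; zero weights are simply skipped,
>>         # so no 0000 placeholder is ever stored.
>>         arrays = [[] for _ in range(levels)]
>>         for ce in collation_elements:
>>             for level, weight in enumerate(ce[:levels]):
>>                 if weight != 0:
>>                     arrays[level].append(weight)
>>         return arrays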
>>
>> This also means that for fast string compares, the primary weights can
>> be processed on the fly (without any buffering) if the primary weights
>> differ between the two strings (including when one or both strings end
>> while the secondary and tertiary weights seen so far never exceeded the
>> minimum weight for their level).
>> Otherwise:
>> - the first secondary weight higher than the minimum secondary weight
>> value, and all subsequent secondary weights, must be buffered in a
>> secondary buffer;
>> - the first tertiary weight higher than the minimum tertiary weight
>> value, and all subsequent tertiary weights, must be buffered in a
>> tertiary buffer;
>> - and so on for higher levels (each buffer just needs to keep a
>> counter, set when the buffer is first used, of how many weights were
>> left unbuffered while processing and counting the primary weights,
>> because those weights were all equal to the minimum value for the
>> relevant level);
>> - these secondary/tertiary/etc. buffers are only consulted once you
>> reach the end of both strings at the primary level without finding a
>> difference: you start by comparing the initial counters of the two
>> buffers, and the buffer with the larger counter value is for the
>> smaller compared string, unless the other buffer is empty (that
>> string's weights at this level ended, so it is a prefix and sorts
>> first). If both counters are equal, you compare the weights stored in
>> each buffer until one buffer ends before the other (the shorter buffer
>> is for the smaller compared string). If both weight buffers reach the
>> end, you move to the pair of buffers built for the next level and
>> process them with the same algorithm, as sketched below.
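>>
>> Here is that comparison as a minimal Python sketch (the per-level
>> minimum weights are illustrative values, not normative; collation
>> elements are assumed to be (primary, secondary, tertiary) tuples as
>> above):
>>
>>     MIN_WEIGHT = {1: 0x0020, 2: 0x0002}  # illustrative minima
>>
>>     def level_key(ces, level):
>>         # The leading run of minimum weights is kept only as a count;
>>         # everything from the first non-minimum weight is buffered.
>>         w = [ce[level] for ce in ces if ce[level] != 0]
>>         n = 0
>>         while n < len(w) and w[n] == MIN_WEIGHT[level]:
>>             n += 1
>>         return n, w[n:]
>>
>>     def compare_level(key_a, key_b):
>>         (ca, ba), (cb, bb) = key_a, key_b
>>         if ca != cb:
>>             # The longer run of minima sorts first against a
>>             # non-minimum weight, but an exhausted level (empty
>>             # buffer) is a prefix and sorts first.
>>             if ca > cb:
>>                 return 1 if not bb else -1
>>             return -1 if not ba else 1
>>         return (ba > bb) - (ba < bb)  # lexicographic tie-break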
>>
>> Nowhere will you ever need to consider any [.0000] weight: it is just a
>> notation in the DUCET format, intended only to be readable by humans
>> and never needed in any machine implementation.
>>
>> Now if you want to create sort keys, this is similar, except that you
>> don't have two strings to process and compare: all you want is to build
>> a separate array of weights for each level. Each level can be encoded
>> separately, but the encoding must be chosen so that, once the encoded
>> arrays are concatenated, the first encoded *bits* of the secondary or
>> tertiary section compare lower than the bits that start any encoded
>> primary weight (this only constrains how you encode the first weight of
>> each array: its leading *bits* must be lower than the leading bits used
>> to encode any weight at the previous levels).
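>>
>> A minimal sketch of one such encoding (one possibility among many):
>> each non-primary level is opened by a byte 0x01, and every weight byte
>> is kept >= 0x02, so the level boundary always compares lower than any
>> weight:
>>
>>     def sort_key(arrays):
>>         # Concatenate the levels; 0x01 opens each non-primary level
>>         # and is lower than every weight byte, so no 0000 is needed.
>>         out = bytearray()
>>         for level, weights in enumerate(arrays):
>>             if level:
>>                 out.append(0x01)
>>             for w in weights:
>>                 # three order-preserving 6-bit groups, offset by 2
>>                 out += bytes(((w >> 12) + 2,
>>                               ((w >> 6) & 0x3F) + 2,
>>                               (w & 0x3F) + 2))
>>         return bytes(out)
>>
>> With this layout, the key for "abc" diverges from the key for "abcX"
>> where a 0x01 mark meets a weight byte >= 0x02, so "abc" < "abcX" holds
>> without any 0000 separator.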
>>
>> Nowhere are you required to encode weights exactly as their logical
>> values: the encoding is fully reversible and can use any suitable
>> compression technique if needed. As long as you can safely detect where
>> an encoded level ends, because it runs into bits (with lower values)
>> that start the encoding of one of the higher levels, the compression is
>> safe.
>>
>> For each level, you can reserve a single code used to "mark" the start
>> of the next higher level, followed by some bits indicating which level
>> it is, then by the compressed code for that level, built so that no
>> weight's code starts with the reserved mark. That "mark" is not
>> necessarily a 0000: it may be a NUL byte, or a '!' (if the encoding
>> must be readable as ASCII or UTF-8 and must not use any control, SPACE,
>> or isolated surrogate), and the code for each weight must not start
>> with a byte lower than or equal to this mark. The binary or ASCII code
>> units used to encode each weight merely have to be comparable, so that
>> comparing codes is equivalent to comparing the weights they represent.
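>>
>> For example, a sketch of an ASCII-readable variant using '!' as the
>> mark (the fixed three base-92 digits per weight are an arbitrary
>> illustrative choice):
>>
>>     def ascii_key(arrays):
>>         # '!' (0x21) opens each higher level; weight bytes are drawn
>>         # from '"'..'}', all strictly greater than the mark.
>>         out = bytearray()
>>         for level, weights in enumerate(arrays):
>>             if level:
>>                 out += b"!"
>>             for w in weights:
>>                 # three order-preserving base-92 digits per weight
>>                 for d in (w // 8464, (w // 92) % 92, w % 92):
>>                     out.append(0x22 + d)
>>         return bytes(out)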
>>
>> Nor are you required to store multiple "marks". This is just one way to
>> encode in the sort key which level follows each "mark"; the marks are
>> not necessarily the same before each level (their length may also vary
>> with the level they introduce), and they can be removed from the final
>> encoding entirely if the encoding/compression used allows the level of
>> every weight to be inferred, because each level's weights are encoded
>> in disjoint sets of values.
>>
>> Typical compression techniques are, for example, differential coding
>> (notably at the secondary and higher levels) and run-length coding to
>> skip sequences of weights that are all equal to the minimum weight.
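>>
>> A sketch of the idea only (the token stream below is illustrative; a
>> real encoder must still map these tokens to codes that compare in
>> weight order):
>>
>>     def compress_level(weights, min_w):
>>         # Run-length encode runs of the minimum weight; other
>>         # weights are coded as differences from the minimum.
>>         out, run = [], 0
>>         for w in weights:
>>             if w == min_w:
>>                 run += 1
>>                 continue
>>             if run:
>>                 out.append(("run", run))
>>                 run = 0
>>             out.append(("diff", w - min_w))
>>         if run:
>>             out.append(("run", run))
>>         return out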
>>
>> The code units used by the weight encoding for each level may also need
>> to avoid some forbidden values (e.g. when encoding the weights in
>> UTF-8, UTF-16, BOCU-1, or SCSU, you cannot use code units reserved for
>> or representing an isolated surrogate in U+D800..U+DFFF, as this would
>> create a string not conforming to any standard UTF).
>>
>> Once again, this means that the sequence of logical weights can safely
>> become a readable string, even one suitable for transmission as plain
>> text in any UTF, and compression is still possible in that case: you
>> can create and store a lot of sort keys even for very long texts.
>>
>> However, it is generally better to encode sort keys only for a
>> reasonably discriminating part of the text, e.g. no sort key longer
>> than 255 bytes (created from the start of the original texts): if you
>> compare two sort keys and find that they are equal, and both sort keys
>> have this length of 255 bytes, you then compare the full original texts
>> using the fast-compare algorithm. You don't need to store full sort
>> keys in addition to the original texts. This can save a lot of storage,
>> provided that the original texts are sufficiently discriminated by
>> their start and that cases where sort keys were truncated at the
>> 255-byte limit are exceptional.
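>>
>> A sketch of that fallback logic (collate_compare is an assumed helper
>> standing for the full fast-compare described earlier, not a standard
>> API):
>>
>>     MAX_KEY = 255  # truncation limit discussed above
>>
>>     def sorts_before(a, b, key_a, key_b):
>>         # Truncated keys decide most comparisons; only ties at the
>>         # truncation limit fall back to the full original texts.
>>         if key_a != key_b:
>>             return key_a < key_b
>>         if len(key_a) == MAX_KEY:
>>             # both keys were truncated: compare the original texts
>>             # with the assumed full UCA comparison helper
>>             return collate_compare(a, b) < 0
>>         return False  # keys complete and equal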
>>
>> For short texts, however, truncated sort keys may save time at the
>> price of a reasonable storage cost (sort keys can also be encoded in
>> roughly the same size as the original text: compression is modest for
>> the encoded primary level, but frequently very effective for the higher
>> levels, whose smaller weights also have fewer possible values, drawn
>> from a smaller set).
>>
>> Notably, for the secondary level (used in the DUCET to encode accent
>> differences), 3 bits per weight are enough in many tailorings: you just
>> reserve the 3-bit value "000" as the "mark" indicating the start of the
>> next higher level, and encode the secondary weights as "001" to "111".
>>
>> (This means that the primary level has to be encoded so that none of
>> its encoded primary weights starts with the bit pattern marking the
>> start of the secondary level. If that mark occupies a leading "0000"
>> nibble, primary weights can be encoded in patterns starting with
>> "0001", "001", "01", or "1", followed by other bits: this even allows
>> encoding them as readable UTF-8 if the characters are all distinct at
>> the primary level, excluding only the first 16 C0 controls, which need
>> to be preprocessed into escape sequences using the first permitted C0
>> control as an escape, with that C0 control itself also escaped.)
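>>
>> A sketch of the 3-bit packing for such a secondary level (the
>> remapping into 1..7 assumes at most 7 distinct secondary weights in the
>> tailoring; the minimum value is illustrative):
>>
>>     def pack3(fields):
>>         # Pack 3-bit fields MSB-first into bytes (zero-padded).
>>         bits = "".join(format(f, "03b") for f in fields)
>>         bits += "0" * (-len(bits) % 8)
>>         return bytes(int(bits[i:i+8], 2)
>>                      for i in range(0, len(bits), 8))
>>
>>     MIN_SECONDARY = 0x0020  # illustrative DUCET-like minimum
>>     weights = [0x0020, 0x0021, 0x0020]
>>     fields = [w - MIN_SECONDARY + 1 for w in weights]  # 1..7
>>     packed = pack3(fields + [0])  # append the "000" mark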
>>
>> The third level, started by its mark and followed by the bits
>> indicating that this is the tertiary level and not a higher one, will
>> also be used to encode a small set of weights (in most locales no more
>> than 8 or 16, so you only need 3 or 4 bits per weight; using
>> differential coding on 3 bits, you reserve "000" as the "mark" for the
>> next higher level, then use "001" to "111" to encode differential
>> weights. The differences are initially based on the minimum tertiary
>> weight: the bit pattern "001" encodes the most frequent, minimum
>> tertiary weight, and patterns "01" to "11" plus additional bits encode
>> other positive or negative differences of tertiary weights, or serve
>> for run-length compression). Here also it is possible to map the
>> patterns so that the encoded tertiary weights are readable, valid
>> UTF-8.
>>
>> The fourth level, started by its own "000" mark, can use the pattern
>> "001" to encode the most frequent, minimum quaternary weight, and
>> patterns "010" to "011" followed by other bits to differentially encode
>> the other quaternary weights. Here again it is possible to build an
>> encoding for quaternary weights that uses some run-length compression
>> and is also readable, valid UTF-8!
>>
>> And so on.
>>
>>
>> On Thu, Nov 1, 2018 at 22:04, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>
>>> So it should be clear in the UCA algorithm and in the DUCET data table
>>> that "0000" is NOT a valid weight.
>>> It is just a notational placeholder, written as ".0000", indicating in
>>> the DUCET format that NO weight is assigned at the indicated level,
>>> because the collation element is ALWAYS ignorable at this level.
>>> The DUCET could just as well have used the notation ".none", or simply
>>> dropped every ".0000" from its file (provided it contains a data entry
>>> specifying the minimum weight used for each level). This notation is
>>> only intended to be read by humans editing the file, so they don't
>>> need to wonder which level the first indicated weight belongs to, or
>>> remember the minimum weight for that level.
>>> But the DUCET table is actually generated by a machine and processed
>>> by machines.
>>>
>>>
>>>
>>> On Thu, Nov 1, 2018 at 21:57, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>
>>>> In summary, this step given in the algorithm is unnecessary and can
>>>> be dropped completely:
>>>>
>>>> *S3.2 <http://unicode.org/reports/tr10/#S3.2>* If L is not 1, append
>>>> a *level separator*
>>>>
>>>> *Note:* The level separator is zero (0000), which is guaranteed to be
>>>> lower than any weight in the resulting sort key. This guarantees that
>>>> when two strings of unequal length are compared, where the shorter
>>>> string is a prefix of the longer string, the longer string is always
>>>> sorted after the shorter—in the absence of special features like
>>>> contractions. For example: "abc" < "abcX" where "X" can be any
>>>> character(s).
>>>>
>>>> Remove any reference to the "level separator" from the UCA. You never
>>>> need it.
>>>>
>>>> Likewise, this paragraph:
>>>>
>>>> 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3>
>>>>
>>>> *Step 3.* Construct a sort key for each collation element array by
>>>> successively appending all non-zero weights from the collation element
>>>> array. Figure 2 gives an example of the application of this step to one
>>>> collation element array.
>>>>
>>>> Figure 2. Collation Element Array to Sort Key
>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>
>>>> Collation Element Array:
>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
>>>> Sort Key:
>>>> 0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>>>>
>>>> can be rewritten as this figure:
>>>>
>>>> Figure 2. Collation Element Array to Sort Key
>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>
>>>> Collation Element Array:
>>>> [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
>>>> Sort Key:
>>>> 0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>>>>
>>>> The parentheses mark the collation weights 0020 and 0002 that can be
>>>> safely removed, given that they are respectively the minimum
>>>> secondary weight and the minimum tertiary weight.
>>>> But note that 0020 is kept in two places because those occurrences
>>>> are followed by the higher weight 0021. This holds for any tailored
>>>> collation (not just the DUCET).
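>>>>
>>>> A sketch of that trimming rule, using the figure's minimum secondary
>>>> weight 0020:
>>>>
>>>>     def trim_level(weights, min_w):
>>>>         # Drop only the trailing run of minimum weights; earlier
>>>>         # minima stay because a higher weight follows them.
>>>>         end = len(weights)
>>>>         while end and weights[end - 1] == min_w:
>>>>             end -= 1
>>>>         return weights[:end]
>>>>
>>>>     # Secondary weights from the figure above:
>>>>     assert trim_level([0x20, 0x20, 0x21, 0x20], 0x20) == [0x20, 0x20, 0x21]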
>>>>
>>>> On Thu, Nov 1, 2018 at 21:42, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>>
>>>>> The 0000 is there in the UCA only because the DUCET is published in
>>>>> a format that uses it, but that aspect of the format is also
>>>>> useless: you never need any [.0000] or [.0000.0000] in the DUCET
>>>>> table either. Instead, the DUCET just needs to indicate the minimum
>>>>> weight assigned at every level (except the highest level, where it
>>>>> is "implicitly" 0001, not 0000).
>>>>>
>>>>>
>>>>> On Thu, Nov 1, 2018 at 21:08, Markus Scherer <markus.icu at gmail.com> wrote:
>>>>>
>>>>>> There are lots of ways to implement the UCA.
>>>>>>
>>>>>> When you want fast string comparison, the zero weights are useful for
>>>>>> processing -- and you don't actually assemble a sort key.
>>>>>>
>>>>>> People who want sort keys usually want them to be short, so you spend
>>>>>> time on compression. You probably also build sort keys as byte vectors not
>>>>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>>>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>>>>> collation data file remunges all weights into fractional byte sequences,
>>>>>> and leaves gaps for tailoring.
>>>>>>
>>>>>> markus
>>>>>>
>>>>>