UCA unnecessary collation weight 0000

Mark Davis ☕️ via Unicode unicode at unicode.org
Fri Nov 2 09:23:39 CDT 2018


The table is the way it is because it is easier to process (and comprehend)
when the first field is always the primary weight, second is always the
secondary, etc.

Go ahead and transform the input DUCET files as you see fit. The "should be
removed" is your personal preference. Unless we hear strong demand
otherwise from major implementers, people have better things to do than
change their parsers to suit your preference.

Mark


On Fri, Nov 2, 2018 at 2:54 PM Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> It's not just a question of whether I like it or not. The fact is that the
> standard makes the presence of 0000 required in some steps, and that
> requirement is wrong: it is in fact NEVER required to create an equivalent
> collation order. These steps are completely unnecessary and should be
> removed.
>
> On Fri, Nov 2, 2018 at 2:03 PM Mark Davis ☕️ <mark at macchiato.com> wrote:
>
>> You may not like the format of the data, but you are not bound to it. If
>> you don't like the data format (eg you want [.0021.0002] instead of
>> [.0000.0021.0002]), you can transform it however you want as long as you
>> get the same answer, as it says here:
>>
>> http://unicode.org/reports/tr10/#Conformance
>> “The Unicode Collation Algorithm is a logical specification.
>> Implementations are free to change any part of the algorithm as long as any
>> two strings compared by the implementation are ordered the same as they
>> would be by the algorithm as specified. Implementations may also use a
>> different format for the data in the Default Unicode Collation Element
>> Table. The sort key is a logical intermediate object: if an implementation
>> produces the same results in comparison of strings, the sort keys can
>> differ in format from what is specified in this document. (See Section 9,
>> Implementation Notes.)”
>>
>>
>> That is what is done, for example, in ICU's implementation. See
>> http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
>> collation elements" and "sort keys" to see the transformed collation
>> elements (from the DUCET + CLDR) and the resulting sort keys.
>>
>> a =>[29,05,_05] => 29 , 05 , 05 .
>> a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
>> à => <same>
>> A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
>> À => <same>
>>
>> Mark
>>
>>
>> On Fri, Nov 2, 2018 at 12:42 AM Philippe Verdy via Unicode <
>> unicode at unicode.org> wrote:
>>
>>> Also, step 2 of the algorithm speaks of a single "array" of collation
>>> elements. It is actually better to create one separate array per level,
>>> and to append the weights for each level to the array for that level.
>>> Steps S2.2 to S2.4 can do this, including for derived collation elements
>>> (section 10.1) and variable weighting (section 4).
>>>
>>> This also means that for fast string compares, the primary weights can
>>> be processed on the fly (without needing any buffering) if the primary
>>> weights differ between the two strings (including when one or both of the
>>> two strings ends and the secondary or tertiary weights seen so far have
>>> never exceeded the minimum weight value for their level).
>>> Otherwise:
>>> - the first secondary weight higher than the minimum secondary weight
>>> value, and all subsequent secondary weights, must be buffered in a
>>> secondary buffer;
>>> - the first tertiary weight higher than the minimum tertiary weight
>>> value, and all subsequent tertiary weights, must be buffered in a tertiary
>>> buffer;
>>> - and so on for higher levels (each buffer just needs to keep a counter,
>>> set when it is first used, indicating how many weights were not buffered
>>> while processing and counting the primary weights, because those weights
>>> were all equal to the minimum value for the relevant level);
>>> - these secondary/tertiary/etc. buffers will only be used once you reach
>>> the end of the two strings while processing the primary level and no
>>> difference has been found: you start by comparing the initial counters in
>>> these buffers, and the buffer that has the larger counter value is
>>> necessarily for the smaller compared string. If both counters are equal,
>>> then you compare the weights stored in each buffer, until one of the
>>> buffers ends before the other (the shorter buffer is for the smaller
>>> compared string). If both weight buffers reach the end, you use the next
>>> pair of buffers built for the next level and process them with the same
>>> algorithm.
>>>
>>> Nowhere will you ever need to consider any [.0000] weight: it is just a
>>> notation in the DUCET format, intended only to be readable by humans, and
>>> never needed in any machine implementation.
>>>
>>> Now if you want to create sort keys, this is similar, except that you
>>> don't have two strings to process and compare; all you want is to create
>>> separate arrays of weights for each level. Each level can be encoded
>>> separately; the encoding must be made so that, when you concatenate the
>>> encoded arrays, the first few encoded *bits* of the secondary or tertiary
>>> encodings cannot be larger than or equal to the bits used by the encoding
>>> of the primary weights (this only constrains how you encode the first
>>> weight in each array, as its first encoding *bits* must be lower than the
>>> first bits used to encode any weight in previous levels).
>>>
>>> Nowhere are you required to encode weights exactly as their logical
>>> values; the encoding just has to be fully reversible and can use any
>>> suitable compression technique if needed. As long as you can safely detect
>>> where an encoding ends, because it encounters some bits (with lower
>>> values) used to start the encoding of one of the higher levels, the
>>> compression is safe.
>>>
>>> For each level, you can reserve a single code used to "mark" the start of
>>> the next higher level, followed by some bits indicating which level it is,
>>> and then by the compressed codes for that level, built so that no weight's
>>> code starts with the reserved mark. That "mark" is not necessarily 0000:
>>> it may be a null byte, or a '!' (if the encoding must be readable as ASCII
>>> or UTF-8 and must not use any control, SPACE, or isolated surrogate), and
>>> the codes used for each weight must not start with a byte lower than or
>>> equal to this mark. The binary or ASCII code units used to encode each
>>> weight just have to compare in the same order as the weights they
>>> represent, so that comparing codes is equivalent to comparing weights.
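>>>
>>> A minimal sketch of that reserved-mark layout (hypothetical names; it also
>>> assumes the weights have already been remapped so that every byte of an
>>> encoded weight is greater than the mark):
>>>
>>> #include <cassert>
>>> #include <cstdint>
>>> #include <vector>
>>>
>>> const uint8_t LEVEL_MARK = 0x01;   // any value below every possible weight byte
>>>
>>> void appendLevel(std::vector<uint8_t>& key,
>>>                  const std::vector<uint16_t>& weights, bool firstLevel) {
>>>     if (!firstLevel) key.push_back(LEVEL_MARK);        // opens the next level
>>>     for (uint16_t w : weights) {
>>>         uint8_t hi = static_cast<uint8_t>(w >> 8), lo = static_cast<uint8_t>(w);
>>>         assert(hi > LEVEL_MARK && lo > LEVEL_MARK);    // remapped-weight precondition
>>>         key.push_back(hi);
>>>         key.push_back(lo);
>>>     }
>>> }
>>>
>>> std::vector<uint8_t> makeSortKey(const std::vector<std::vector<uint16_t>>& levels) {
>>>     std::vector<uint8_t> key;   // plain bytewise comparison of two such keys orders strings
>>>     for (size_t i = 0; i < levels.size(); ++i) appendLevel(key, levels[i], i == 0);
>>>     return key;
>>> }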
>>>
>>> You are also not required to store multiple "marks". This is just one
>>> possible way of encoding, in the sort key, which level follows each
>>> "mark"; the marks are not necessarily the same before each level (their
>>> length may also vary depending on the level they start), and they may be
>>> removed entirely from the final encoding if the encoding/compression used
>>> lets the level of every weight be identified because each level's weights
>>> are encoded in disjoint sets of values.
>>>
>>> Typical compression techniques are, for example, differential coding,
>>> notably at the secondary and higher levels, and run-length encoding to
>>> skip sequences of weights that are all equal to the minimum weight.
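>>>
>>> As an illustration only (the token layout below is invented and not
>>> directly byte-comparable; it just shows that such a transformation of one
>>> level's weight stream is reversible):
>>>
>>> #include <cstdint>
>>> #include <vector>
>>>
>>> struct Token { bool isRun; uint16_t runLength; int32_t delta; };
>>>
>>> std::vector<Token> compressLevel(const std::vector<uint16_t>& w, uint16_t minWeight) {
>>>     std::vector<Token> out;
>>>     uint16_t prev = minWeight;
>>>     for (size_t i = 0; i < w.size(); ) {
>>>         if (w[i] == minWeight) {                       // run-length: collapse minimums
>>>             uint16_t run = 0;
>>>             while (i < w.size() && w[i] == minWeight) { ++run; ++i; }
>>>             out.push_back({true, run, 0});
>>>         } else {                                       // differential: delta from previous
>>>             out.push_back({false, 0, static_cast<int32_t>(w[i]) - prev});
>>>             prev = w[i];
>>>             ++i;
>>>         }
>>>     }
>>>     return out;
>>> }
>>>
>>> std::vector<uint16_t> decompressLevel(const std::vector<Token>& toks, uint16_t minWeight) {
>>>     std::vector<uint16_t> w;
>>>     uint16_t prev = minWeight;
>>>     for (const Token& t : toks) {
>>>         if (t.isRun) w.insert(w.end(), t.runLength, minWeight);
>>>         else { prev = static_cast<uint16_t>(prev + t.delta); w.push_back(prev); }
>>>     }
>>>     return w;
>>> }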
>>>
>>> The code units used by the weight encoding for each level may also need
>>> to avoid some forbidden values (e.g. when encoding the weights to UTF-8,
>>> UTF-16, BOCU-1, or SCSU, you cannot use isolated code units reserved for
>>> or representing a surrogate in U+D800..U+DFFF, as this would create a
>>> string not conforming to any standard UTF).
>>>
>>> Once again this means that the sequence of logical weights can safely
>>> become a readable string, even suitable to be transmitted as plain text
>>> using any UTF, and that compression is also possible in that case: you can
>>> create and store lots of sort keys, even for very long texts.
>>>
>>> However, it is generally better to encode sort keys only for a reasonably
>>> discriminating part of the text, e.g. no sort key longer than 255 bytes
>>> (created from the start of the original texts): if you compare two sort
>>> keys and find that they are equal, and both sort keys have the full length
>>> of 255 bytes, then you compare the full original texts using the
>>> fast-compare algorithm. You don't need to store full sort keys in addition
>>> to the original texts. This can save a lot of storage, provided that the
>>> original texts are sufficiently discriminated by their start, and that
>>> cases where the sort keys were truncated to the 255-byte limit are
>>> exceptional.
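>>>
>>> A sketch of that truncation strategy, reusing the hypothetical
>>> fullSortKey() and collate() helpers from the sketches above and an
>>> arbitrary 255-byte cap:
>>>
>>> #include <cstdint>
>>> #include <string>
>>> #include <vector>
>>>
>>> const size_t MAX_KEY = 255;
>>>
>>> // hypothetical: complete sort key for a string (all levels)
>>> std::vector<uint8_t> fullSortKey(const std::u16string& s);
>>> // hypothetical: multi-level compare of the original strings (see above)
>>> int collate(const std::u16string& a, const std::u16string& b, int levels);
>>>
>>> struct IndexedText {
>>>     std::u16string text;
>>>     std::vector<uint8_t> key;   // sort key truncated to MAX_KEY bytes
>>> };
>>>
>>> IndexedText makeIndexed(const std::u16string& s) {
>>>     std::vector<uint8_t> k = fullSortKey(s);
>>>     if (k.size() > MAX_KEY) k.resize(MAX_KEY);
>>>     return {s, k};
>>> }
>>>
>>> int compareIndexed(const IndexedText& a, const IndexedText& b) {
>>>     if (a.key != b.key) return a.key < b.key ? -1 : 1;   // bytewise, like memcmp
>>>     // Equal truncated keys are only ambiguous if both were actually truncated.
>>>     if (a.key.size() == MAX_KEY && b.key.size() == MAX_KEY)
>>>         return collate(a.text, b.text, 3);               // fall back to full compare
>>>     return 0;
>>> }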
>>>
>>> For short texts, however, truncated sort keys may save time at the price
>>> of a reasonable storage cost (sort keys can also be encoded with roughly
>>> the same size as the original text: compression is modest for the encoded
>>> primary level, but it is frequently very effective at higher levels, where
>>> the smaller weights also have fewer possible values, drawn from a smaller
>>> set).
>>>
>>> Notably, for the secondary level used to encode case differences, only 3
>>> bits per weight are enough: you just need to reserve the 3-bit value "000"
>>> as the "mark" indicating the start of the next higher level, while
>>> encoding secondary weights as "001" to "111".
>>>
>>> (This means that primary levels have to be encoded so that none of their
>>> encoded primary weights start with "000", the mark for the start of the
>>> secondary level. So primary weights can be encoded with patterns starting
>>> with "0001", "001", "01", or "1", followed by other bits: this allows
>>> encoding them as readable UTF-8 if these characters are all different at
>>> the primary level, excluding only the first 16 C0 controls, which need to
>>> be preprocessed into escape sequences using the first permitted C0 control
>>> as an escape, and escaping that C0 control itself.)
>>>
>>> The third level, started by the mark "00" followed by the encoded bits
>>> indicating that this is the tertiary level and not a higher one, will also
>>> be used to encode a small set of weights (in most locales, no more than 8
>>> or 16, so you only need 3 or 4 bits per weight; using differential coding
>>> on 3 bits, you reserve "000" as the "mark" for the next higher level, then
>>> use "001" to "111" to encode differential weights, the differences being
>>> initially based on the minimum tertiary weight: you use the bit pattern
>>> "001" to encode the most frequent, minimum tertiary weight, and patterns
>>> "01" to "11" plus additional bits to encode other positive or negative
>>> differences of tertiary weights, or to use run-length compression). Here
>>> also it is possible to map the patterns so that the encoded tertiary
>>> weights will be readable, valid UTF-8.
>>>
>>> The fourth level, started by the mark "000", can use the pattern "001" to
>>> encode the most frequent, minimum quaternary weight, and patterns "010" to
>>> "011" followed by other bits to differentially encode the other quaternary
>>> weights. Here again it is possible to create an encoding for quaternary
>>> weights that can use some run-length compression and can also be readable,
>>> valid UTF-8!
>>>
>>> And so on.
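>>>
>>> A toy sketch of that 3-bit packing for one of the higher levels (only the
>>> bit packing itself is shown; the weight values are assumed to fit the
>>> range 1..7 after subtracting the minimum, and the padding and
>>> level-tagging rules above are left out; all names are invented):
>>>
>>> #include <cstdint>
>>> #include <vector>
>>>
>>> struct BitWriter {
>>>     std::vector<uint8_t> bytes;
>>>     int bitsUsed = 0;                       // bits already used in the last byte
>>>     void put(uint32_t value, int nbits) {   // append `nbits` bits, MSB first
>>>         for (int i = nbits - 1; i >= 0; --i) {
>>>             if (bitsUsed == 0) bytes.push_back(0);
>>>             int shift = 7 - bitsUsed;
>>>             bytes.back() |= static_cast<uint8_t>(((value >> i) & 1u) << shift);
>>>             bitsUsed = (bitsUsed + 1) & 7;
>>>         }
>>>     }
>>> };
>>>
>>> // Append one level's weights as 3-bit codes 1..7 (weight - minWeight + 1),
>>> // then the reserved 000 mark announcing the start of the next level.
>>> void appendSmallLevel(BitWriter& out, const std::vector<uint16_t>& weights,
>>>                       uint16_t minWeight) {
>>>     for (uint16_t w : weights)
>>>         out.put(static_cast<uint32_t>(w - minWeight + 1), 3);  // assumed to be 1..7
>>>     out.put(0, 3);                          // 000 = mark: this level is finished
>>> }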
>>>
>>> On Thu, Nov 1, 2018 at 10:04 PM Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>
>>>> So it should be clear in the UCA algorithm and in the DUCET data table
>>>> that "0000" is NOT a valid weight.
>>>> It is just a notational placeholder, written as ".0000", indicating in
>>>> the DUCET format that there is NO weight assigned at the indicated level,
>>>> because the collation element is ALWAYS ignorable at this level.
>>>> The DUCET could just as well have used the notation ".none", or simply
>>>> dropped every ".0000" from its file (provided it contains a data entry
>>>> specifying the minimum weight used for each level). This notation is only
>>>> intended to be read by humans editing the file, so they don't need to
>>>> wonder what the level of the first indicated weight is, or remember the
>>>> minimum weight for that level.
>>>> But the DUCET table is actually generated by a machine and processed by
>>>> machines.
>>>>
>>>>
>>>>
>>>> On Thu, Nov 1, 2018 at 9:57 PM Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>>
>>>>> In summary, this step of the algorithm is completely unneeded and can
>>>>> be dropped:
>>>>>
>>>>> *S3.2* <http://unicode.org/reports/tr10/#S3.2> If L is not 1, append a
>>>>> *level separator*
>>>>>
>>>>> *Note:* The level separator is zero (0000), which is guaranteed to be
>>>>> lower than any weight in the resulting sort key. This guarantees that when
>>>>> two strings of unequal length are compared, where the shorter string is a
>>>>> prefix of the longer string, the longer string is always sorted after the
>>>>> shorter—in the absence of special features like contractions. For example:
>>>>> "abc" < "abcX" where "X" can be any character(s).
>>>>>
>>>>> Remove any reference to the "level separator" from the UCA. You never
>>>>> need it.
>>>>>
>>>>> Likewise, this paragraph:
>>>>>
>>>>> 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3>
>>>>>
>>>>> *Step 3.* Construct a sort key for each collation element array by
>>>>> successively appending all non-zero weights from the collation element
>>>>> array. Figure 2 gives an example of the application of this step to one
>>>>> collation element array.
>>>>>
>>>>> Figure 2. Collation Element Array to Sort Key
>>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>>
>>>>> Collation Element Array:
>>>>>   [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
>>>>> Sort Key:
>>>>>   0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>>>>>
>>>>> can be rewritten as this figure:
>>>>>
>>>>> Figure 2. Collation Element Array to Sort Key
>>>>> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
>>>>>
>>>>> Collation Element Array:
>>>>>   [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
>>>>> Sort Key:
>>>>>   0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>>>>>
>>>>> The parentheses mark the collation weights 0020 and 0002 that can be
>>>>> safely removed, given that they are respectively the minimum secondary
>>>>> weight and the minimum tertiary weight.
>>>>> But note that 0020 is kept in two places because those occurrences are
>>>>> followed by the higher weight 0021. This holds for any tailored
>>>>> collation (not just the DUCET).
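>>>>>
>>>>> A small sketch of building such a sort key from the collation element
>>>>> array (helper names invented; with the figure's array and minimum
>>>>> weights 0001, 0020 and 0002 it yields exactly the unparenthesized part
>>>>> of the rewritten figure):
>>>>>
>>>>> #include <cstdint>
>>>>> #include <vector>
>>>>>
>>>>> // One collation element: one weight per level, 0 meaning "ignorable here".
>>>>> using CollationElement = std::vector<uint16_t>;
>>>>>
>>>>> std::vector<uint16_t> buildSortKey(const std::vector<CollationElement>& ces,
>>>>>                                    const std::vector<uint16_t>& minWeight,
>>>>>                                    size_t levels) {
>>>>>     std::vector<uint16_t> key;
>>>>>     for (size_t level = 0; level < levels; ++level) {
>>>>>         std::vector<uint16_t> w;
>>>>>         for (const CollationElement& ce : ces)
>>>>>             if (ce[level] != 0) w.push_back(ce[level]);      // never emit any 0000
>>>>>         while (!w.empty() && w.back() == minWeight[level])   // drop trailing minimums
>>>>>             w.pop_back();
>>>>>         key.insert(key.end(), w.begin(), w.end());           // no level separator either
>>>>>     }
>>>>>     return key;
>>>>> }
>>>>> // With the figure's four collation elements and minimum weights
>>>>> // {0001, 0020, 0002}, this produces: 0706 06D9 06EE 0020 0020 0021.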
>>>>>
>>>>> On Thu, Nov 1, 2018 at 9:42 PM Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>>>>>
>>>>>> The 0000 is there in the UCA only because the DUCET is published in a
>>>>>> format that uses it, but here too this format is unnecessary: you never
>>>>>> need any [.0000] or [.0000.0000] in the DUCET table either. Instead,
>>>>>> the DUCET just needs to indicate the minimum weight assigned for every
>>>>>> level (except the highest level, where it is "implicitly" 0001, and not
>>>>>> 0000).
>>>>>>
>>>>>>
>>>>>> On Thu, Nov 1, 2018 at 9:08 PM Markus Scherer <markus.icu at gmail.com> wrote:
>>>>>>
>>>>>>> There are lots of ways to implement the UCA.
>>>>>>>
>>>>>>> When you want fast string comparison, the zero weights are useful
>>>>>>> for processing -- and you don't actually assemble a sort key.
>>>>>>>
>>>>>>> People who want sort keys usually want them to be short, so you
>>>>>>> spend time on compression. You probably also build sort keys as byte
>>>>>>> vectors not uint16 vectors (because byte vectors fit into more APIs and
>>>>>>> tend to be shorter), like ICU does using the CLDR collation data file. The
>>>>>>> CLDR root collation data file remunges all weights into fractional byte
>>>>>>> sequences, and leaves gaps for tailoring.
>>>>>>>
>>>>>>> markus
>>>>>>>
>>>>>>