UCA unnecessary collation weight 0000
Philippe Verdy via Unicode
unicode at unicode.org
Thu Nov 1 16:04:40 CDT 2018
So it should be clear in the UCA algorithm and in the DUCET data table that
"0000" is NOT a valid weight.
It is just a notational placeholder, written ".0000", which only indicates in
the DUCET format that NO weight is assigned at the indicated level,
because the collation element is ALWAYS ignorable at this level.
The DUCET could just as well have used the notation ".none", or simply dropped
every ".0000" in its file (provided it contains a data entry specifying
the minimum weight used for each level). This notation is only
intended to be read by humans editing the file, so they don't need to
wonder which level the first indicated weight belongs to, or remember
the minimum weight for that level.
But the DUCET table is actually generated by a machine and processed by
machines.
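
To make the point concrete, here is a minimal Python sketch (not part of
the UCA or of any implementation) that reads collation elements in the
DUCET notation, treats ".0000" as "no weight at this level", and builds
the two sort keys discussed in the Figure 2 example quoted further down:
one with the 0000 level separators, one without them and with trailing
minimum weights dropped. The function names are made up for
illustration, and the minimum weights 0020 and 0002 are assumptions
taken from that example. It only reproduces the figure; it does not show
that the shorter keys order correctly for every tailoring.

```python
# Assumed minimum weight per level (None = nothing is trimmed at level 1).
MIN_WEIGHTS = (None, 0x0020, 0x0002)


def parse_element(text, levels=3):
    """Parse '[.0706.0020.0002]' or the shortened '[.0021.0002]'.

    Returns one entry per level; None means "ignorable at this level".
    A literal 0000 is read as None, i.e. as a placeholder, not a weight.
    """
    fields = text.strip().strip("[]").lstrip(".*").split(".")
    weights = [int(field, 16) for field in fields]
    # Shortened elements omit their leading levels, so pad on the left.
    weights = [0] * (levels - len(weights)) + weights
    return [w if w != 0 else None for w in weights]


def key_with_separators(elements, levels=3):
    """Sort key in the style of UCA Figure 2: levels joined by 0000."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(0x0000)  # the level separator
        key.extend(e[level] for e in elements if e[level] is not None)
    return key


def key_trimmed(elements, min_weights=MIN_WEIGHTS):
    """Sort key without separators, trailing minimum weights dropped."""
    key = []
    for level, minimum in enumerate(min_weights):
        weights = [e[level] for e in elements if e[level] is not None]
        while weights and weights[-1] == minimum:
            weights.pop()
        key.extend(weights)
    return key


def show(key):
    """Format a key as space-separated four-digit hex weights."""
    return " ".join(f"{w:04X}" for w in key)


if __name__ == "__main__":
    array = ["[.0706.0020.0002]", "[.06D9.0020.0002]",
             "[.0000.0021.0002]", "[.06EE.0020.0002]"]
    elements = [parse_element(t) for t in array]
    print(show(key_with_separators(elements)))
    print(show(key_trimmed(elements)))
```

With the array from Figure 2, the first line printed is the 13-weight key
shown in the UCA, and the second is 0706 06D9 06EE 0020 0020 0021, i.e.
the proposed key with the parenthesized weights removed. Note that
parse_element gives the same result for "[.0021.0002]" and
"[.0000.0021.0002]".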
On Thu, Nov 1, 2018 at 21:57, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> In summary, this step given in the algorithm is completely unneeded and
> can be dropped completely:
>
> *S3.2 <http://unicode.org/reports/tr10/#S3.2> *If L is not 1, append a *level
> separator*
>
> *Note:* The level separator is zero (0000), which is guaranteed to be
> lower than any weight in the resulting sort key. This guarantees that when
> two strings of unequal length are compared, where the shorter string is a
> prefix of the longer string, the longer string is always sorted after the
> shorter—in the absence of special features like contractions. For example:
> "abc" < "abcX" where "X" can be any character(s).
>
> Remove any reference to the "level separator" from the UCA. You never need
> it.
>
> Likewise, this paragraph
>
> 7.3 Form Sort Keys <http://unicode.org/reports/tr10/#Step_3>
>
> *Step 3.* Construct a sort key for each collation element array by
> successively appending all non-zero weights from the collation element
> array. Figure 2 gives an example of the application of this step to one
> collation element array.
>
> Figure 2. Collation Element Array to Sort Key
> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
> Collation Element Array:
>   [.0706.0020.0002], [.06D9.0020.0002], [.0000.0021.0002], [.06EE.0020.0002]
> Sort Key:
>   0706 06D9 06EE 0000 0020 0020 0021 0020 0000 0002 0002 0002 0002
>
> can be written with this figure:
>
> Figure 2. Collation Element Array to Sort Key
> <http://unicode.org/reports/tr10/#Array_To_Sort_Key_Table>
> Collation Element Array:
>   [.0706.0020.0002], [.06D9.0020.0002], [.0021.0002], [.06EE.0020.0002]
> Sort Key:
>   0706 06D9 06EE 0020 0020 0021 (0020) (0002 0002 0002 0002)
>
> The parentheses mark the collation weights 0020 and 0002 that can be
> safely removed if they are respectively the minimum secondary weight and
> minimum tertiary weight.
> But note that 0020 is kept in two places, because both occurrences are
> followed by a higher weight, 0021. This holds for any tailored collation
> (not just the DUCET).
>
> On Thu, Nov 1, 2018 at 21:42, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
>> The 0000 is there in the UCA only because the DUCET is published in a
>> format that uses it, but even there the notation is useless: you never
>> need any [.0000] or [.0000.0000] in the DUCET table. Instead the DUCET
>> just needs to indicate the minimum weight assigned for every level
>> (except the highest level where it is "implicitly" 0001, and not 0000).
>>
>>
>> On Thu, Nov 1, 2018 at 21:08, Markus Scherer <markus.icu at gmail.com>
>> wrote:
>>
>>> There are lots of ways to implement the UCA.
>>>
>>> When you want fast string comparison, the zero weights are useful for
>>> processing -- and you don't actually assemble a sort key.
>>>
>>> People who want sort keys usually want them to be short, so you spend
>>> time on compression. You probably also build sort keys as byte vectors not
>>> uint16 vectors (because byte vectors fit into more APIs and tend to be
>>> shorter), like ICU does using the CLDR collation data file. The CLDR root
>>> collation data file remunges all weights into fractional byte sequences,
>>> and leaves gaps for tailoring.
>>>
>>> markus
>>>
>>