UCA unnecessary collation weight 0000

Philippe Verdy via Unicode unicode at unicode.org
Thu Nov 1 15:42:05 CDT 2018


The 0000 weight appears in the UCA only because the DUCET is published in a
format that uses it, and even there it is unnecessary: the DUCET table never
needs any [.0000] or [.0000.0000]. Instead, the DUCET only needs to state the
minimum weight assigned at each level (except the highest level, where it is
"implicitly" 0001, not 0000).
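
As a rough illustration of why the padding zeros carry no information, here
is a minimal sketch of sort key formation in the style of UTS #10, where only
nonzero weights are appended at each level. The weight values, the function
names and the "sparse" dict representation are all made up for this example;
they are not taken from the DUCET.

```python
# Hypothetical collation elements; the weights are illustrative only.
LEVELS = 3

def sort_key_padded(elements):
    """Form a sort key from fixed-width (primary, secondary, tertiary) tuples.

    A weight of 0 means "ignorable at this level" and is never appended.
    """
    key = []
    for level in range(LEVELS):
        if level > 0:
            key.append(0x0000)  # level separator between levels
        for ce in elements:
            if ce[level] != 0:
                key.append(ce[level])
    return key

def sort_key_sparse(elements):
    """Same key, but each element is a dict {level: weight} with no zeros stored."""
    key = []
    for level in range(1, LEVELS + 1):
        if level > 1:
            key.append(0x0000)  # level separator between levels
        for ce in elements:
            if level in ce:  # an absent level means ignorable at that level
                key.append(ce[level])
    return key

# A base letter, a primary-ignorable mark, and another base letter.
padded = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002), (0x1C60, 0x0020, 0x0002)]
sparse = [{1: 0x1C47, 2: 0x0020, 3: 0x0002}, {2: 0x0021, 3: 0x0002},
          {1: 0x1C60, 2: 0x0020, 3: 0x0002}]

print(sort_key_padded(padded) == sort_key_sparse(sparse))
```

Both representations give the same key here, because the zeros in the padded
form are dropped before anything reaches the key.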


On Thu, Nov 1, 2018 at 21:08, Markus Scherer <markus.icu at gmail.com> wrote:

> There are lots of ways to implement the UCA.
>
> When you want fast string comparison, the zero weights are useful for
> processing -- and you don't actually assemble a sort key.
>
> People who want sort keys usually want them to be short, so you spend time
> on compression. You probably also build sort keys as byte vectors not
> uint16 vectors (because byte vectors fit into more APIs and tend to be
> shorter), like ICU does using the CLDR collation data file. The CLDR root
> collation data file remunges all weights into fractional byte sequences,
> and leaves gaps for tailoring.
>
> markus
>
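
For reference, the kind of comparison described in the quoted message (no
sort key is assembled, and fixed-width collation elements with zero weights
are scanned directly) could be sketched roughly as follows. This is only an
illustration under assumed data: the tuple layout, the function names and the
weight values are hypothetical, and it does not reflect how ICU actually
implements comparison.

```python
from itertools import zip_longest

def nonzero_weights(elements, level):
    """Lazily yield the nonzero weights of one level (0 = primary)."""
    return (ce[level] for ce in elements if ce[level] != 0)

def compare(a, b, levels=3):
    """Compare two lists of (primary, secondary, tertiary) tuples level by level.

    Returns -1, 0 or 1. No sort key is built; zero weights are skipped on the fly.
    """
    for level in range(levels):
        pairs = zip_longest(nonzero_weights(a, level),
                            nonzero_weights(b, level),
                            fillvalue=0)  # the shorter side sorts first
        for wa, wb in pairs:
            if wa != wb:
                return -1 if wa < wb else 1
    return 0

# Illustrative weights only: a letter, and the same letter plus an ignorable mark.
plain = [(0x1C47, 0x0020, 0x0002)]
marked = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002)]

print(compare(plain, marked))
```

The two inputs tie at the primary level, so the difference is only found at
the secondary level, where the extra mark contributes a weight.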