UCA unnecessary collation weight 0000

Ken Whistler via Unicode unicode at unicode.org
Fri Nov 2 16:27:37 CDT 2018


On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
> I was replying not about the notational representation of the DUCET 
> data table (using [.0000...] unnecessarily) but about the text of 
> UTR#10 itself. Which remains highly confusive, and contains completely 
> unnecessary steps, and just complicates things with absolutely no 
> benefit at all by introducing confusion about these "0000". 

Sorry, Philippe, but the confusion that I am seeing introduced is what 
you are introducing to the unicode list in the course of this discussion.


> UTR#10 still does not explicitly state that its use of "0000" does not 
> mean it is a valid "weight", it's a notation only

No, it is explicitly a valid weight. And it is explicitly and 
normatively referred to in the specification of the algorithm. See 
UTS10-D8 (and subsequent definitions), which explicitly depend on a 
definition of "A collation weight whose value is zero." The entire 
statement of what are primary, secondary, tertiary, etc. collation 
elements depends on that definition. And see the tables in Section 3.2, 
which also depend on those definitions.
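
As a rough illustration of how those definitions hang off the zero weight, here is a minimal Python sketch. It is not text from UTS #10: the function name `classify` and the sample weights are made up for the example, and it assumes a collation element is simply a tuple of per-level weights.

```python
# Hypothetical helper: classify a collation element by its first
# non-zero weight. A zero weight is treated as the "ignorable weight".
LEVEL_NAMES = ["primary", "secondary", "tertiary", "quaternary"]

def classify(ce):
    """ce is a tuple of weights, e.g. (primary, secondary, tertiary)."""
    for level, weight in enumerate(ce):
        if weight != 0:
            return LEVEL_NAMES[level] + " collation element"
    # Every level carries the zero weight.
    return "completely ignorable collation element"

# Sample weights are illustrative only, not copied from the DUCET.
print(classify((0x1C47, 0x0020, 0x0002)))
print(classify((0x0000, 0x0021, 0x0002)))
print(classify((0x0000, 0x0000, 0x0000)))
```

Without a zero weight to test against, there is nothing for the loop to distinguish, which is why the classification depends on that one definition.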


> (but the notation is used for TWO distinct purposes: one is for 
> presenting the notation format used in the DUCET

It is *not* just a notation format used in the DUCET -- it is part of 
the normative definitional structure of the algorithm, which then 
percolates down into further definitions and rules and the steps of the 
algorithm.

> itself to present how collation elements are structured, the other one 
> is for marking the presence of a possible, but not always required, 
> encoding of an explicit level separator for encoding sort keys).

That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It 
is not part of the *notation* for collation elements, but instead is a 
magic value chosen for the level separator precisely because zero values 
from the collation elements are removed during sort key construction, so 
that zero is then guaranteed to be a lower value than any remaining 
weight added to the sort key under construction. This part of the 
algorithm is not rocket science, by the way!
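
For readers who want to see it spelled out, the following is a minimal Python sketch of that sort key construction, under simplifying assumptions: three levels, forward ordering at every level, and collation elements given as plain tuples. The name `form_sort_key` and the sample weights are invented for the example.

```python
LEVEL_SEPARATOR = 0  # the zero value placed between levels

def form_sort_key(collation_elements, levels=3):
    """Build a sort key as a list of integer weights."""
    key = []
    for level in range(levels):
        if level > 0:
            key.append(LEVEL_SEPARATOR)
        for ce in collation_elements:
            weight = ce[level]
            if weight != 0:  # zero weights are dropped, never copied
                key.append(weight)
    return key

# Illustrative weights only; the middle element has a zero primary weight.
ces = [(0x1C47, 0x20, 0x02), (0x0000, 0x21, 0x02), (0x1C60, 0x20, 0x02)]
print([format(w, "04X") for w in form_sort_key(ces)])
```

Because no zero weight from a collation element ever reaches the key, the only zeros in the result are the separators, and each one compares lower than any weight that could appear in the same position of another key.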
>
> UTR#10 is still needlessly confusive.

O.k., if you think so, you then know what to do:

https://www.unicode.org/review/pri385/

and

https://www.unicode.org/reporting.html

> Even the example tables can be made without using these "0000" (for 
> example in tables showing how to build sort keys, it can present the 
> list of weights split in separate columns, one column per level, 
> without any "0000"). The implementation does not necessarily have to 
> create a buffer containing all weight values in a row, when separate 
> buffers for each level are far superior (and even more efficient as it 
> can save space in memory).

The UCA doesn't *require* you to do anything particular in your own 
implementation, other than come up with the same results for string 
comparisons. That is clearly stated in the conformance clause of UTS #10.

https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance

> The step "S3.2" in the UCA algorithm should not even be there (it is 
> made in favor of a specific implementation which is not even efficient 
> or optimal),

That is a false statement. Step S3.2 is there to provide a clear 
statement of the algorithm, to guarantee correct results for string 
comparison. Section 9 of UTS #10 provides a whole lunch buffet of 
techniques that implementations can choose from to increase the 
efficiency of their implementations, as they deem appropriate. You are 
free to implement as you choose -- including techniques that do not 
require any level separators. You are, however, duly warned in:

https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators

that "While this technique is relatively easy to implement, it can 
interfere with other compression methods."

> it complicates the algorithm with absolutely no benefit at all); you 
> can ALWAYS remove it completely and this still generates equivalent 
> results.

No, you cannot ALWAYS remove it completely. Whether or not your 
implementation can do so depends on what other techniques you may be 
using to increase performance, store shorter keys, or whatever else may 
be at stake in your optimization.
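
To make the dependency concrete, here is a small Python sketch comparing three strategies on deliberately contrived weights. All function names and weights are hypothetical. The data is chosen so that a secondary weight (0x20) is larger than one of the primary weights (0x12); a weight table that keeps each level's range below the previous level's range would not show this failure.

```python
def per_level(ces, levels=3):
    """One buffer of non-zero weights per level; no separators needed."""
    return [[ce[l] for ce in ces if ce[l] != 0] for l in range(levels)]

def with_separators(ces, levels=3):
    """Single flat key with a zero separator between levels."""
    key = []
    for i, buf in enumerate(per_level(ces, levels)):
        if i > 0:
            key.append(0)
        key.extend(buf)
    return key

def naive_concat(ces, levels=3):
    """Single flat key with the separators simply left out."""
    return [w for buf in per_level(ces, levels) for w in buf]

# String a is a primary-level prefix of string b, so a must sort first.
a = [(0x10, 0x20, 0x02), (0x11, 0x20, 0x02)]
b = [(0x10, 0x20, 0x02), (0x11, 0x20, 0x02), (0x12, 0x20, 0x02)]

print(with_separators(a) < with_separators(b))  # flat key, separators kept
print(per_level(a) < per_level(b))              # separate buffers per level
print(naive_concat(a) < naive_concat(b))        # separators dropped
```

With this data the first two comparisons agree that a sorts before b, while the naive concatenation reverses the order, because a's first secondary weight lands opposite b's third primary weight. Separate per-level buffers are a legitimate way to avoid separators; dropping them from a single flat key is only safe when the weight ranges have been arranged for it.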

If you don't like zeroes in collation, be my guest, and ignore them 
completely. Take them out of your tables, and don't use level 
separators. Just make sure you end up with conformant results for 
comparison of strings when you are done. And in the meantime, if you 
want to complain about the text of the specification of UTS #10, then 
provide carefully worded alternatives as suggestions for improvement to 
the text, rather than just endlessly ranting about how the standard is 
confusive because the collation weight 0000 is "unnecessary".

--Ken

