UCA unnecessary collation weight 0000
Ken Whistler via Unicode
unicode at unicode.org
Fri Nov 2 16:27:37 CDT 2018
On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
> I was replying not about the notational representation of the DUCET
> data table (using [.0000...] unnecessarily) but about the text of
> UTR#10 itself. Which remains highly confusive, and contains completely
> unnecessary steps, and just complicates things with absolutely no
> benefit at all by introducing confusion about these "0000".
Sorry, Philippe, but the confusion that I am seeing introduced is what
you are introducing to the unicode list in the course of this discussion.
> UTR#10 still does not explicitly state that its use of "0000" does not
> mean it is a valid "weight", it's a notation only
No, it is explicitly a valid weight. And it is explicitly and
normatively referred to in the specification of the algorithm. See
UTS10-D8 (and subsequent definitions), which explicitly depend on a
definition of "A collation weight whose value is zero." The entire
statement of what are primary, secondary, tertiary, etc. collation
elements depends on that definition. And see the tables in Section 3.2,
which also depend on those definitions.
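To make the dependency concrete, here is a minimal sketch of how the element classes fall out of the zero-weight definition. It assumes three-level collation elements written as (primary, secondary, tertiary) tuples; the function name `classify` and the weight values are illustrative, not taken from the DUCET, and the class names paraphrase the definitions that follow UTS10-D8.

```python
from typing import Tuple

CollationElement = Tuple[int, int, int]  # (primary, secondary, tertiary)


def classify(ce: CollationElement) -> str:
    """Name the class of a collation element by its first non-zero weight.

    A weight of zero is an ignorable weight; the class is determined by
    the first level at which the weight is not zero.
    """
    primary, secondary, tertiary = ce
    if primary != 0:
        return "primary"
    if secondary != 0:
        return "secondary"
    if tertiary != 0:
        return "tertiary"
    return "completely ignorable"


# Illustrative weights only (hypothetical values, not DUCET entries).
base_letter = (0x1C47, 0x0020, 0x0002)
combining_mark = (0x0000, 0x0021, 0x0002)
ignorable = (0x0000, 0x0000, 0x0000)
```

Without a zero weight there is nothing for "secondary" or "tertiary" to be defined against, which is why the definitions treat it as a real value rather than a notational placeholder.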
> (but the notation is used for TWO distinct purposes: one is for
> presenting the notation format used in the DUCET
It is *not* just a notation format used in the DUCET -- it is part of
the normative definitional structure of the algorithm, which then
percolates down into further definitions and rules and the steps of the
algorithm.
> itself to present how collation elements are structured, the other one
> is for marking the presence of a possible, but not always required,
> encoding of an explicit level separator for encoding sort keys).
That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
is not part of the *notation* for collation elements, but instead is a
magic value chosen for the level separator precisely because zero values
from the collation elements are removed during sort key construction, so
that zero is then guaranteed to be a lower value than any remaining
weight added to the sort key under construction. This part of the
algorithm is not rocket science, by the way!
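A minimal sketch of that sort key construction, assuming three levels, forward processing at every level, and collation elements given as (primary, secondary, tertiary) tuples. The name `form_sort_key` and the weights are illustrative; backward levels and variable weighting are left out.

```python
from typing import List, Sequence, Tuple

LEVEL_SEPARATOR = 0  # lower than any weight that survives in the key


def form_sort_key(ces: Sequence[Tuple[int, ...]], levels: int = 3) -> List[int]:
    """Build a sort key level by level, dropping zero weights."""
    key: List[int] = []
    for level in range(levels):
        if level != 0:
            # Separator between levels (the step the thread calls S3.2).
            key.append(LEVEL_SEPARATOR)
        for ce in ces:
            weight = ce[level]
            if weight != 0:  # zero weights never enter the key
                key.append(weight)
    return key


# Hypothetical collation element array; the middle element has a zero primary.
example_ces = [(0x10, 0x20, 0x02), (0x00, 0x21, 0x02), (0x11, 0x20, 0x02)]
example_key = form_sort_key(example_ces)
# example_key == [0x10, 0x11, 0, 0x20, 0x21, 0x20, 0, 0x02, 0x02, 0x02]
```

Because every zero weight is skipped, the only zeros left in the key are the separators, so a shorter run of weights at one level always sorts before a longer run sharing the same prefix.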
>
> UTR#10 is still needlessly confusive.
O.k., if you think so, you then know what to do:
https://www.unicode.org/review/pri385/
and
https://www.unicode.org/reporting.html
> Even the example tables can be made without using these "0000" (for
> example in tables showing how to build sort keys, it can present the
> list of weights split in separate columns, one column per level,
> without any "0000"). The implementation does not necessarily have to
> create a buffer containing all weight values in a row, when separate
> buffers for each level are far superior (and even more efficient as it
> can save space in memory).
The UCA doesn't *require* you to do anything particular in your own
implementation, other than come up with the same results for string
comparisons. That is clearly stated in the conformance clause of UTS #10.
https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
> The step "S3.2" in the UCA algorithm should not even be there (it is
> made in favor of a specific implementation which is not even efficient
> or optimal),
That is a false statement. Step S3.2 is there to provide a clear
statement of the algorithm, to guarantee correct results for string
comparison. Section 9 of UTS #10 provides a whole lunch buffet of
techniques that implementations can choose from to increase the
efficiency of their implementations, as they deem appropriate. You are
free to implement as you choose -- including techniques that do not
require any level separators. You are, however, duly warned in:
https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators
that "While this technique is relatively easy to implement, it can
interfere with other compression methods."
> it complicates the algorithm with absolutely no benefit at all); you
> can ALWAYS remove it completely and this still generates equivalent
> results.
No, you cannot ALWAYS remove it completely. Whether or not your
implementation can do so depends on what other techniques you may be
using to increase performance, store shorter keys, or whatever else may
be at stake in your optimization.
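As a rough illustration of both halves of that point, the sketch below compares three approaches on a pair of hypothetical collation element arrays: a key with zero separators, separate per-level buffers compared level by level, and a single key with the separators simply dropped. All names and weights here are assumptions made for the example; the weights are deliberately chosen so that the primary and secondary ranges overlap, which may not hold for any particular real table.

```python
from typing import List, Sequence, Tuple

CE = Tuple[int, int, int]  # (primary, secondary, tertiary)


def level_buffers(ces: Sequence[CE], levels: int = 3) -> List[List[int]]:
    """One buffer of non-zero weights per level; no separators needed."""
    return [[ce[lv] for ce in ces if ce[lv] != 0] for lv in range(levels)]


def key_with_separators(ces: Sequence[CE]) -> List[int]:
    """Single key with a zero between levels."""
    key: List[int] = []
    for i, buf in enumerate(level_buffers(ces)):
        if i != 0:
            key.append(0)
        key.extend(buf)
    return key


def key_without_separators(ces: Sequence[CE]) -> List[int]:
    """Single key with the separators naively dropped."""
    return [w for buf in level_buffers(ces) for w in buf]


# Hypothetical arrays: b extends a with an element whose primary (0x05)
# is smaller than a's secondary weight (0x20).
a = [(0x10, 0x20, 0x02)]
b = [(0x10, 0x20, 0x02), (0x05, 0x20, 0x02)]

with_sep = key_with_separators(a) < key_with_separators(b)
per_level = level_buffers(a) < level_buffers(b)
naive = key_without_separators(a) < key_without_separators(b)
# with_sep and per_level agree (a sorts before b); naive gives the opposite.
```

Per-level buffers give the same answer as the separator-based key, so an implementation is free to use them. Dropping the separator from a single concatenated key is only safe when the weight assignment or the rest of the key encoding rules out this kind of overlap.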
If you don't like zeroes in collation, be my guest, and ignore them
completely. Take them out of your tables, and don't use level
separators. Just make sure you end up with conformant results for
comparison of strings when you are done. And in the meantime, if you
want to complain about the text of the specification of UTS #10, then
provide carefully worded alternatives as suggestions for improvement to
the text, rather than just endlessly ranting about how the standard is
confusive because the collation weight 0000 is "unnecessary".
--Ken