UCA unnecessary collation weight 0000

Sun Nov 4 15:51:24 CST 2018

So you finally admit that I was right... And that the specs include
requirements that are not even needed to make UCA work, and that not even
used by wellknown implementations. These are old artefacts which are now
really confusive (instructing programmers to adopt the old deprecated
behavior, before realizing that this was a bad advice which jut complicated
their task). UCA can be implemented **conformingly** without these, even
for the simplest implementations (where using complex packages like ICU is
not an option and rewriting it is not one as well for much simpler goals)
where these incorrect requirements are in fact suggesting to be more
inefficient than really needed.
There's not a lot of work to edit and to fix the specs without these
polluting 0000 "pseudo-weights".

Le dim. 4 nov. 2018 à 09:27, Mark Davis ☕️ <mark at macchiato.com> a écrit :

> Philippe, I agree that we could have structured the UCA differently. It
> does make sense, for example, to have the weights be simply decimal values
> instead of integers. But nobody is going to go through the substantial
> work of restructuring the UCA spec and data file unless there is a very
> strong reason to do so. It takes far more time and effort than people
> realize to change in the algorithm/data while making sure that everything
> lines up without inadvertent changes being introduced.
>
> It is just not worth the effort. There are so, so, many things we can do
> in Unicode (encoding, properties, algorithms, CLDR, ICU) that have a higher
> benefit.
>
> You can continue flogging this horse all you want, but I'm muting this
> thread (and I suspect I'm not the only one).
>
> Mark
>
>
> On Sun, Nov 4, 2018 at 2:37 AM Philippe Verdy via Unicode <
> unicode at unicode.org> wrote:
>
>> Le ven. 2 nov. 2018 à 22:27, Ken Whistler <kenwhistler at att.net> a écrit :
>>
>>>
>>> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>>>
>>> I was replying not about the notational repreentation of the DUCET data
>>> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself.
>>> Which remains highly confusive, and contains completely unnecesary steps,
>>> and just complicates things with absoiluytely no benefit at all by
>>> introducing confusion about these "0000".
>>>
>>> Sorry, Philippe, but the confusion that I am seeing introduced is what
>>> you are introducing to the unicode list in the course of this discussion.
>>>
>>>
>>> UTR#10 still does not explicitly state that its use of "0000" does not
>>> mean it is a valid "weight", it's a notation only
>>>
>>> No, it is explicitly a valid weight. And it is explicitly and
>>> normatively referred to in the specification of the algorithm. See UTS10-D8
>>> (and subsequent definitions), which explicitly depend on a definition of "A
>>> collation weight whose value is zero." The entire statement of what are
>>> primary, secondary, tertiary, etc. collation elements depends on that
>>> definition. And see the tables in Section 3.2, which also depend on those
>>> definitions.
>>>
>>> (but the notation is used for TWO distinct purposes: one is for
>>> presenting the notation format used in the DUCET
>>>
>>> It is *not* just a notation format used in the DUCET -- it is part of
>>> the normative definitional structure of the algorithm, which then
>>> percolates down into further definitions and rules and the steps of the
>>> algorithm.
>>>
>>
>> I insist that this is NOT NEEDED at all for the definition, it is
>> absolutely NOT structural. The algorithm still guarantees the SAME result.
>>
>> It is ONLY used to explain the format of the DUCET and the fact the this
>> format does NOT use 0000 as a valid weight, ans os can use it as a notation
>> (in fact only a presentational feature).
>>
>>
>>> itself to present how collation elements are structured, the other one
>>> is for marking the presence of a possible, but not always required,
>>> encoding of an explicit level separator for encoding sort keys).
>>>
>>> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
>>> is not part of the *notation* for collation elements, but instead is a
>>> magic value chosen for the level separator precisely because zero values
>>> from the collation elements are removed during sort key construction, so
>>> that zero is then guaranteed to be a lower value than any remaining weight
>>> added to the sort key under construction. This part of the algorithm is not
>>> rocket science, by the way!
>>>
>>
>> Here again you make a confusion: a sort key MAY use them as separators if
>> it wants to compress keys by reencoding weights per level: that's the only
>> case where you may want to introduce an encoding pattern starting with 0,
>> while the rest of the encoding for weights in that level must using
>> patterns not starting by this 0 (the number of bits to encode this 0 does
>> not matter: it is only part of the encoding used on this level which does
>> not necessarily have to use 16-bit code units per weight.
>>
>>>
>>> Even the example tables can be made without using these "0000" (for
>>> example in tables showing how to build sort keys, it can present the list
>>> of weights splitted in separate columns, one column per level, without any
>>> "0000". The implementation does not necessarily have to create a buffer
>>> containing all weight values in a row, when separate buffers for each level
>>> is far superior (and even more efficient as it can save space in memory).
>>>
>>> The UCA doesn't *require* you to do anything particular in your own
>>> implementation, other than come up with the same results for string
>>> comparisons.
>>>
>> Yes I know, but the algorithm also does not require me to use these
>> invalid 0000 pseudo-weights, that the algorithm itself will always discard
>> (in a completely needless step)!
>>
>>
>>> That is clearly stated in the conformance clause of UTS #10.
>>>
>>> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>>>
>>> The step "S3.2" in the UCA algorithm should not even be there (it is
>>> made in favor an specific implementation which is not even efficient or
>>> optimal),
>>>
>>> That is a false statement. Step S3.2 is there to provide a clear
>>> statement of the algorithm, to guarantee correct results for string
>>> comparison.
>>>
>>
>> You're wrong, this statement is completely useless in all cases. There is
>> still the correct results for string comparison without them: a string
>> comparison can only compare valid weights for each level, it will not
>> compare any weight past the end of the text in any one of the two compared
>> strings, nowhere it will compare weights with one of them being 0, unless
>> this 0 is used as a "guard value" for the end of text and your compare loop
>> still continues scanning the longer string when the other string has
>> already ended (this case should be detected much earlier before
>> determineing the next collection boundary in the string and then computing
>> its weights for each level.
>>
>>> Section 9 of UTS #10 provides a whole lunch buffet of techniques that
>>> implementations can choose from to increase the efficiency of their
>>> implementations, as they deem appropriate. You are free to implement as you
>>> choose -- including techniques that do not require any level separators.
>>> You are, however, duly warned in:
>>>
>>>
>>> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators
>>>
>>> that "While this technique is relatively easy to implement, it can
>>> interfere with other compression methods."
>>>
>>> it complicates the algorithm with absoluytely no benefit at all); you
>>> can ALWAYS remove it completely and this still generates equivalent results.
>>>
>>> No you cannot ALWAYS remove it completely. Whether or not your
>>> implementation can do so, depends on what other techniques you may be using
>>> to increase performance, store shorter keys, or whatever else may be at
>>> stake in your optimization
>>>
>> I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However
>> you MAY ADD them ONLY when generating and encoding the sort keys, if the
>> encoding used really does compress the weights into smaller values: this is
>> the only case where you want to ADD a separator, internally only in the
>> binary key encoder, but but as part of the algorithm itself.
>>
>> If your key generation does not use any compression (in the simplest
>> implementations), then it can simply an directly concatenate all weights
>> with the same code units size (16-bit in the DUCET), without inserting any
>> additional 0000 code unit to separate them: your resulting sort key will
>> still not contain any 0000 code unit in any part for any level because the
>> algorithm already has excluded them. Finally this means that sort keys can
>> be stored in C-strings (terminated by null code units, instead of being
>> delimited by a separately encoded length property, but for C-strings where
>> code units are 8-bit, i.e. "char" in C, you still need an encoder to
>> convert the 16-bit binary weights into sequences of bytes not containing
>> any 00 byte: if this encoder is used, still you don't need any 00 separator
>> between encoded levels!).
>>
>> As all these 0000 weigths are unnecessary, then the current UCA algorithm
>> trying to introduce them needlessly is REALLY introducing unnecessary
>> confusion: values of weights NEVER need to be restricted.
>>
>> The only conditions that matter is that:
>> - all weights are *comparable* (sign does not even matter, they are not
>> even restricted to be numbers or even just integers) and that
>> - they are **fully ordered**, and that the fully ordered set of weights
>> (not necessarily an enumerable set or a discrete set, as this can the
>> continuous set of real numbers)
>> - and that the full set of weights is **fully partitioned** into
>> distinct, intervals (with no intersection between intervals, so intervals
>> are also comparable)
>> - that the highest interval will be used by weights in the primary level:
>> each partition is numbered (by the level: a positive integer between 1 and
>> L): you can compare the level numbers assigned to the partition in which
>> the weight is a member: if level(weight1) > level(weight2) (this is an
>> comparison of positive integers), then necessarily you may have weight1 <
>> weight2 (this is only comparing weights encoded arbitrarily and which can
>> still use a 0 value if you wish to use it to encode a valid weight for a
>> valid collation element at any level 1 to N; this is also the only
>> condition needed to respect rule WF2 in UCA).
>>
>> ---
>> Notes about encodings for weights in sort keys:
>>
>> If weights are chosen to be rational numbers, e.g any rational numbers in
>> the (0.0, 1.0) open interval, and because your collation algorithm will
>> only recognize a finite set of distinct collation elements with necessarily
>> a finite number N of distinct weights w(i), for i in 0..(N-1), allows the
>> collation weights to be represented by choosing them **arbitrarily** within
>> this open interval:
>> - this can be done simply by partitionning the (0.0 1.0) into N half-open
>> intervals [w(i), w(i+1));
>> - and then encoding a weight w(i) by any **arbitrarily chosen rational**
>> inside one of these intervals (for example this can be done for using
>> compression with arithmetic coding).
>>
>> A weight encoding using a finite discrete set (of binary integers between
>> 0 and M-1) is what you need to use classic Huffman coding: this is
>> equivalent to multiplying the previous rationals and truncating them to the
>> nearest floor integer, but as this limits the choice of rational numbers
>> above so that distinct weights remain distinct with the binary encoding,
>> you need to keep more significant bits with Huffman coding than with
>> Arithmetic coding (i.e. you need a higher value of M; where M is typically
>> a power of 2 using 1-bit code units, or power of 256 for the simpler
>> encodings using 8-bit code units, or a power of 65536 for an uncompressed
>> encoding of 16-bit weight values).
>>
>> Arithmetic coding is in fact equivalent to Huffman coding, except that M
>> is not necessarily a positive integer but can be any positive rational and
>> can then represent each weigh value with a rational number of bits on
>> average, instead of a static integer number of bits. You can say as well
>> that Huffman coding is a restriction of Arithmetic coding where M must be
>> an integer, or that Arithmetic coding is a generalization of Huffman coding.
>>
>> Both the Huffman and Arithmetic codings are wellknown examples of "prefix
>> coding" (the latter offering a bit more compression, for the same
>> statistical distribution of encoded values). The open interval (0.0, w(0))
>> is still not used at all to encode weights, but can still have a statistic
>> distribution, usable with the prefix encoding to represent the end of
>> string. But here again this does not represent the artificial 0000 weight
>> which is NEVER encoded anywhere.
>>
>> ---
>>
>> Ask to a mathematician you trust, he will confirm that these rules
>> speaking about the pseudo-weight 0000 in UCA are completely unnecessary
>> (i.e. removing them from the algorithm does not change the result for
>> comparing strings, or for generating sort keys)
>> And as a conclusion, attempting to introduce them in the standard creates
>> more confusion than it helps (in fact it is most probably a relict of a
>> former bogous *implementation*, that still relied on them because other
>> well-formness conditions were not satistified, or not well defined in the
>> earlier attempts to define the UCA...). That this is not even needed for
>> computing "composite weights" (which is not defining new weights, but an
>> attempt to encode them in a larger space: this can be done completely
>> outside the standard algorithm itself: just allow weights to be rational
>> numbers, it is then easy to extend the number of encodable weights as a
>> single number without increasing the numeric range in which they are
>> defined; then leave the encoder of the sort key generator store them with a
>> convenient "prefix coding", using one or more code units of arbitrary
>> length).
>>
>> Philippe.
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181104/76c841d3/attachment.html>