UCA unnecessary collation weight 0000

Philippe Verdy via Unicode unicode at unicode.org
Sat Nov 3 20:33:32 CDT 2018


Le ven. 2 nov. 2018 à 22:27, Ken Whistler <kenwhistler at att.net> a écrit :

>
> On 11/2/2018 10:02 AM, Philippe Verdy via Unicode wrote:
>
> I was replying not about the notational repreentation of the DUCET data
> table (using [.0000...] unnecessarily) but about the text of UTR#10 itself.
> Which remains highly confusive, and contains completely unnecesary steps,
> and just complicates things with absoiluytely no benefit at all by
> introducing confusion about these "0000".
>
> Sorry, Philippe, but the confusion that I am seeing introduced is what you
> are introducing to the unicode list in the course of this discussion.
>
>
> UTR#10 still does not explicitly state that its use of "0000" does not
> mean it is a valid "weight", it's a notation only
>
> No, it is explicitly a valid weight. And it is explicitly and normatively
> referred to in the specification of the algorithm. See UTS10-D8 (and
> subsequent definitions), which explicitly depend on a definition of "A
> collation weight whose value is zero." The entire statement of what are
> primary, secondary, tertiary, etc. collation elements depends on that
> definition. And see the tables in Section 3.2, which also depend on those
> definitions.
>
> (but the notation is used for TWO distinct purposes: one is for presenting
> the notation format used in the DUCET
>
> It is *not* just a notation format used in the DUCET -- it is part of the
> normative definitional structure of the algorithm, which then percolates
> down into further definitions and rules and the steps of the algorithm.
>

I insist that this is NOT NEEDED at all for the definition, it is
absolutely NOT structural. The algorithm still guarantees the SAME result.

It is ONLY used to explain the format of the DUCET and the fact the this
format does NOT use 0000 as a valid weight, ans os can use it as a notation
(in fact only a presentational feature).


> itself to present how collation elements are structured, the other one is
> for marking the presence of a possible, but not always required, encoding
> of an explicit level separator for encoding sort keys).
>
> That is a numeric value of zero, used in Section 7.3, Form Sort Keys. It
> is not part of the *notation* for collation elements, but instead is a
> magic value chosen for the level separator precisely because zero values
> from the collation elements are removed during sort key construction, so
> that zero is then guaranteed to be a lower value than any remaining weight
> added to the sort key under construction. This part of the algorithm is not
> rocket science, by the way!
>

Here again you make a confusion: a sort key MAY use them as separators if
it wants to compress keys by reencoding weights per level: that's the only
case where you may want to introduce an encoding pattern starting with 0,
while the rest of the encoding for weights in that level must using
patterns not starting by this 0 (the number of bits to encode this 0 does
not matter: it is only part of the encoding used on this level which does
not necessarily have to use 16-bit code units per weight.

>
> Even the example tables can be made without using these "0000" (for
> example in tables showing how to build sort keys, it can present the list
> of weights splitted in separate columns, one column per level, without any
> "0000". The implementation does not necessarily have to create a buffer
> containing all weight values in a row, when separate buffers for each level
> is far superior (and even more efficient as it can save space in memory).
>
> The UCA doesn't *require* you to do anything particular in your own
> implementation, other than come up with the same results for string
> comparisons.
>
Yes I know, but the algorithm also does not require me to use these invalid
0000 pseudo-weights, that the algorithm itself will always discard (in a
completely needless step)!


> That is clearly stated in the conformance clause of UTS #10.
>
> https://www.unicode.org/reports/tr10/tr10-39.html#Basic_Conformance
>
> The step "S3.2" in the UCA algorithm should not even be there (it is made
> in favor an specific implementation which is not even efficient or optimal),
>
> That is a false statement. Step S3.2 is there to provide a clear statement
> of the algorithm, to guarantee correct results for string comparison.
>

You're wrong, this statement is completely useless in all cases. There is
still the correct results for string comparison without them: a string
comparison can only compare valid weights for each level, it will not
compare any weight past the end of the text in any one of the two compared
strings, nowhere it will compare weights with one of them being 0, unless
this 0 is used as a "guard value" for the end of text and your compare loop
still continues scanning the longer string when the other string has
already ended (this case should be detected much earlier before
determineing the next collection boundary in the string and then computing
its weights for each level.

> Section 9 of UTS #10 provides a whole lunch buffet of techniques that
> implementations can choose from to increase the efficiency of their
> implementations, as they deem appropriate. You are free to implement as you
> choose -- including techniques that do not require any level separators.
> You are, however, duly warned in:
>
>
> https://www.unicode.org/reports/tr10/tr10-39.html#Eliminating_level_separators
>
> that "While this technique is relatively easy to implement, it can
> interfere with other compression methods."
>
> it complicates the algorithm with absoluytely no benefit at all); you can
> ALWAYS remove it completely and this still generates equivalent results.
>
> No you cannot ALWAYS remove it completely. Whether or not your
> implementation can do so, depends on what other techniques you may be using
> to increase performance, store shorter keys, or whatever else may be at
> stake in your optimization
>
I maintain: you can ALWAYS REMOVE it compeltely of the algorithm. However
you MAY ADD them ONLY when generating and encoding the sort keys, if the
encoding used really does compress the weights into smaller values: this is
the only case where you want to ADD a separator, internally only in the
binary key encoder, but but as part of the algorithm itself.

If your key generation does not use any compression (in the simplest
implementations), then it can simply an directly concatenate all weights
with the same code units size (16-bit in the DUCET), without inserting any
additional 0000 code unit to separate them: your resulting sort key will
still not contain any 0000 code unit in any part for any level because the
algorithm already has excluded them. Finally this means that sort keys can
be stored in C-strings (terminated by null code units, instead of being
delimited by a separately encoded length property, but for C-strings where
code units are 8-bit, i.e. "char" in C, you still need an encoder to
convert the 16-bit binary weights into sequences of bytes not containing
any 00 byte: if this encoder is used, still you don't need any 00 separator
between encoded levels!).

As all these 0000 weigths are unnecessary, then the current UCA algorithm
trying to introduce them needlessly is REALLY introducing unnecessary
confusion: values of weights NEVER need to be restricted.

The only conditions that matter is that:
- all weights are *comparable* (sign does not even matter, they are not
even restricted to be numbers or even just integers) and that
- they are **fully ordered**, and that the fully ordered set of weights
(not necessarily an enumerable set or a discrete set, as this can the
continuous set of real numbers)
- and that the full set of weights is **fully partitioned** into distinct,
intervals (with no intersection between intervals, so intervals are also
comparable)
- that the highest interval will be used by weights in the primary level:
each partition is numbered (by the level: a positive integer between 1 and
L): you can compare the level numbers assigned to the partition in which
the weight is a member: if level(weight1) > level(weight2) (this is an
comparison of positive integers), then necessarily you may have weight1 <
weight2 (this is only comparing weights encoded arbitrarily and which can
still use a 0 value if you wish to use it to encode a valid weight for a
valid collation element at any level 1 to N; this is also the only
condition needed to respect rule WF2 in UCA).

---
Notes about encodings for weights in sort keys:

If weights are chosen to be rational numbers, e.g any rational numbers in
the (0.0, 1.0) open interval, and because your collation algorithm will
only recognize a finite set of distinct collation elements with necessarily
a finite number N of distinct weights w(i), for i in 0..(N-1), allows the
collation weights to be represented by choosing them **arbitrarily** within
this open interval:
- this can be done simply by partitionning the (0.0 1.0) into N half-open
intervals [w(i), w(i+1));
- and then encoding a weight w(i) by any **arbitrarily chosen rational**
inside one of these intervals (for example this can be done for using
compression with arithmetic coding).

A weight encoding using a finite discrete set (of binary integers between 0
and M-1) is what you need to use classic Huffman coding: this is equivalent
to multiplying the previous rationals and truncating them to the nearest
floor integer, but as this limits the choice of rational numbers above so
that distinct weights remain distinct with the binary encoding, you need to
keep more significant bits with Huffman coding than with Arithmetic coding
(i.e. you need a higher value of M; where M is typically a power of 2 using
1-bit code units, or power of 256 for the simpler encodings using 8-bit
code units, or a power of 65536 for an uncompressed encoding of 16-bit
weight values).

Arithmetic coding is in fact equivalent to Huffman coding, except that M is
not necessarily a positive integer but can be any positive rational and can
then represent each weigh value with a rational number of bits on average,
instead of a static integer number of bits. You can say as well that
Huffman coding is a restriction of Arithmetic coding where M must be an
integer, or that Arithmetic coding is a generalization of Huffman coding.

Both the Huffman and Arithmetic codings are wellknown examples of "prefix
coding" (the latter offering a bit more compression, for the same
statistical distribution of encoded values). The open interval (0.0, w(0))
is still not used at all to encode weights, but can still have a statistic
distribution, usable with the prefix encoding to represent the end of
string. But here again this does not represent the artificial 0000 weight
which is NEVER encoded anywhere.

---

Ask to a mathematician you trust, he will confirm that these rules speaking
about the pseudo-weight 0000 in UCA are completely unnecessary (i.e.
removing them from the algorithm does not change the result for comparing
strings, or for generating sort keys)
And as a conclusion, attempting to introduce them in the standard creates
more confusion than it helps (in fact it is most probably a relict of a
former bogous *implementation*, that still relied on them because other
well-formness conditions were not satistified, or not well defined in the
earlier attempts to define the UCA...). That this is not even needed for
computing "composite weights" (which is not defining new weights, but an
attempt to encode them in a larger space: this can be done completely
outside the standard algorithm itself: just allow weights to be rational
numbers, it is then easy to extend the number of encodable weights as a
single number without increasing the numeric range in which they are
defined; then leave the encoder of the sort key generator store them with a
convenient "prefix coding", using one or more code units of arbitrary
length).

Philippe.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20181104/76d8c879/attachment.html>


More information about the Unicode mailing list