UCA unnecessary collation weight 0000

Richard Wordingham via Unicode unicode at unicode.org
Fri Nov 2 09:39:49 CDT 2018


On Fri, 2 Nov 2018 14:54:19 +0100
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> It's not just a question of "I like it or not", but of the fact that
> the standard makes the presence of 0000 required in some steps, and
> the requirement is in fact wrong: it is NEVER required to create an
> equivalent collation order. These steps are completely unnecessary
> and should be removed.
> 
> On Fri, 2 Nov 2018 at 14:03, Mark Davis ☕️ <mark at macchiato.com>
> wrote:
> 
> > You may not like the format of the data, but you are not bound to
> > it. If you don't like the data format (e.g. you want [.0021.0002]
> > instead of [.0000.0021.0002]), you can transform it however you
> > want as long as you get the same answer, as it says here:
> >
> > http://unicode.org/reports/tr10/#Conformance
> > “The Unicode Collation Algorithm is a logical specification.
> > Implementations are free to change any part of the algorithm as
> > long as any two strings compared by the implementation are ordered
> > the same as they would be by the algorithm as specified.
> > Implementations may also use a different format for the data in the
> > Default Unicode Collation Element Table. The sort key is a logical
> > intermediate object: if an implementation produces the same results
> > in comparison of strings, the sort keys can differ in format from
> > what is specified in this document. (See Section 9, Implementation
> > Notes.)”

Given the above paragraph, how does the standard force you to use a
special 0000?  Perhaps the wording of the standard can be changed to
prevent your unhappy interpretation.
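
To make the point concrete, here is a minimal sketch (not taken from
the standard's text or from any implementation) of sort key formation
over both data formats. It assumes that a compact element with fewer
than three weights is read right-aligned, so [.0021.0002] stands for
[.0000.0021.0002]. The names pad and sort_key are hypothetical, and the
weights for the base letter are made up for illustration; they are not
DUCET values. Zero weights are skipped when the key is built, as in the
UCA's sort key formation, so both formats give the same key.

```python
from typing import List, Sequence, Tuple

LEVELS = 3  # primary, secondary, tertiary


def pad(ce: Sequence[int]) -> Tuple[int, ...]:
    """Expand a compact element to three weights.

    Assumption: missing leading levels are ignorable, so
    (0x0021, 0x0002) is read as (0x0000, 0x0021, 0x0002).
    """
    return (0,) * (LEVELS - len(ce)) + tuple(ce)


def sort_key(elements: Sequence[Sequence[int]]) -> List[int]:
    """Build a sort key level by level, skipping zero weights.

    A 0000 level separator is written between levels.
    """
    key: List[int] = []
    for level in range(LEVELS):
        if level > 0:
            key.append(0x0000)  # level separator
        for ce in elements:
            weight = pad(ce)[level]
            if weight != 0:  # ignorable at this level: contributes nothing
                key.append(weight)
    return key


# Hypothetical weights: a base letter followed by a combining mark.
full = [(0x1C47, 0x0020, 0x0002), (0x0000, 0x0021, 0x0002)]
compact = [(0x1C47, 0x0020, 0x0002), (0x0021, 0x0002)]

print(sort_key(full) == sort_key(compact))
```

Since the zero primary never reaches the sort key, storing it or not is
purely a matter of data format, which is what the conformance clause
quoted above permits.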

> > That is what is done, for example, in ICU's implementation. See
> > http://demo.icu-project.org/icu-bin/collation.html and turn on "raw
> > collation elements" and "sort keys" to see the transformed collation
> > elements (from the DUCET + CLDR) and the resulting sort keys.
> >
> > a => [29,05,_05] => 29 , 05 , 05 .
> > a\u0300 => [29,05,_05][,8A,_05] => 29 , 45 8A , 06 .
> > à => <same>
> > A\u0300 => [29,05,u1C][,8A,_05] => 29 , 45 8A , DC 05 .
> > À => <same>

As you can see, Mark does not come to the same conclusion as you, and
nor do I.

Richard.


