ISO 14651/14652 vs Unicode sorting

Ken Whistler kenwhistler at sonic.net
Thu May 28 09:52:03 CDT 2020


Ilya,

On this topic, see the extended discussion of variable weighting in UTS #10:

https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples

On 5/28/2020 1:42 AM, Ilya Zakharevich via Unicode wrote:
> I have been informed that according to the tables distributed with ISO
> 14651/14652, the following strings should be sorted in this order:
>
>>    foobar
>>    foo baz
> Moreover, this is how glibc (and, as a corollary, all utilities) do
> this in European locales on contemporary Linuxes.
ISO 14651 recommends the "Shifted" handling of variables. In this 
particular case, your concern is with the handling of U+0020 SPACE, but 
that choice also affects all punctuation and symbols, unless otherwise 
tailored.
>
> I checked COBUILD, American Heritage, and Le Petit Robert II — and it
> seems that they do indeed use this (brain damaged?) order.  (Although
> not, apparently, Le Petit Robert I — which SEEMS TO HAVE compound
> words tackled at the end of the main record.)
Precisely what happens in various dictionaries is a bit beside the 
point, because they often follow somewhat special rules that may not 
always directly match the results of just taking all the headwords and 
sorting the strings according to a particular collation setting. They 
may require special tailoring.
>
> However, this definitely contradicts what
>    https://icu4c-demos-7hxm2n5zgq-uc.a.run.app/icu-bin/collation.html
> does with the default locale, and with `en´.

In that demo, the collation *defaults* to "Non-ignorable". Again, see 
the discussion of variable weighting cited above. In a "Non-ignorable" 
collation, the primary weights of the variables (space included) *are* 
used at the primary level of sortkey construction, instead of being 
shifted to only make a difference following any tertiary weight 
differences. So you get the results you see in the demo, where the space 
character "makes a difference" -- namely, it is weighted as 
significantly as a full letter.

However, if you switch options in that demo to "Shifted" -- see the 
seventh line of the radio buttons, labeled "alternate" -- then you get the 
Shifted weighting, which will then mirror the results you see for glibc.
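
As a rough illustration of the two settings, here is a toy sketch in 
Python. It is not the UCA or ISO 14651 implementation: the weights, the 
VARIABLE table, and the names toy_element and toy_sort_key are all 
invented for this example, and it handles only lowercase ASCII letters, 
space, and hyphen. It only shows how moving the variable's primary weight 
to a fourth level changes the order of the two strings in question.

```python
# Toy model of variable weighting. All weights are invented for
# illustration; they are NOT taken from DUCET or the ISO 14651 table.
VARIABLE = {" ": 0x0209, "-": 0x020D}  # hypothetical primaries for variables


def toy_element(ch):
    """Return a made-up (primary, secondary, tertiary) collation element."""
    if ch in VARIABLE:
        return (VARIABLE[ch], 0x20, 0x02)
    # Lowercase ASCII letters only; primaries sit above every variable primary.
    return (0x1000 + ord(ch) - ord("a"), 0x20, 0x02)


def toy_sort_key(text, alternate="non-ignorable"):
    """Build a level-by-level sort key as a tuple of integer weights."""
    elements = []
    for ch in text:
        p, s, t = toy_element(ch)
        if alternate == "shifted":
            if ch in VARIABLE:
                # Variable: ignorable at levels 1-3, old primary moves to level 4.
                elements.append((0, 0, 0, p))
            else:
                # Non-variable: keeps its weights, gets a high level-4 weight.
                elements.append((p, s, t, 0xFFFF))
        else:
            # Non-ignorable: variables keep their primary like any letter.
            elements.append((p, s, t))
    levels = 4 if alternate == "shifted" else 3
    key = []
    for level in range(levels):
        if level:
            key.append(0)  # level separator, lower than any real weight
        key.extend(e[level] for e in elements if e[level] != 0)
    return tuple(key)


words = ["foobar", "foo baz"]
non_ignorable = sorted(words, key=toy_sort_key)
shifted = sorted(words, key=lambda w: toy_sort_key(w, "shifted"))
print(non_ignorable)  # ['foo baz', 'foobar']
print(shifted)        # ['foobar', 'foo baz']
```

With the non-ignorable key, the space is compared against "b" at the 
primary level and sorts first. With the shifted key, the primary level 
sees only "foobar" versus "foobaz", so the space matters only as a final 
tie-breaker (for example, between "foo bar" and "foobar").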

>
> So what is the intended behavior: of ICU, or of ISO?!

There is no "right answer" here. The Unicode Collation Algorithm comes 
with built-in alternative parametric settings, and, of course, the 
option to tailor the collation rules indefinitely, to meet the 
requirements of particular languages and/or particular dictionary 
orderings or other special purposes. ISO 14651 also allows different 
settings (although not as completely spelled out as in UCA) and 
tailorings. What glibc has done is pick the default, out-of-the-box 
shifted handling of variables implied by ISO 14651, but that is simply 
an implementation choice.
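
If you want to check what your own C library does, something like the 
following Python sketch may help. It relies only on the standard locale 
module, which calls the platform's strxfrm(). The locale name 
"en_US.UTF-8" and the helper name sort_with_libc_locale are assumptions 
for this example; the locale may not be installed on your system, and the 
resulting order depends on the C library and its version.

```python
import locale


def sort_with_libc_locale(words, name="en_US.UTF-8"):
    """Sort words with the C library's collation for `name`.

    Returns None if the named locale is not available on this system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, name)
    except locale.Error:
        return None
    try:
        # strxfrm() maps each string to a key that compares in locale order.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)  # restore previous setting


print(sort_with_libc_locale(["foo baz", "foobar"]))
```

No expected output is shown, because the order is whatever the installed 
locale data produces.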

--Ken

>
> Thanks,
> Ilya
>

