ISO 14651/14652 vs Unicode sorting
Ken Whistler
kenwhistler at sonic.net
Thu May 28 09:52:03 CDT 2020
Ilya,
On this topic, see the extended discussion of variable weighting in UTS #10:
https://www.unicode.org/reports/tr10/#Variable_Weighting_Examples
On 5/28/2020 1:42 AM, Ilya Zakharevich via Unicode wrote:
> I have been informed that according to the tables distributed with ISO
> 14651/14652, the following strings should be sorted in this order:
>
>> foobar
>> foo baz
> Moreover, this is how glibc (and, as a corollary, all utilities) do
> this in European locales on contemporary Linuxes.
ISO 14651 recommends the "Shifted" handling of variables. In this
particular case, your concern is with the handling of U+0020 SPACE, but
that choice also affects all punctuation and symbols, unless otherwise
tailored.
>
> I checked COBUILT, American Heritage, and Le Petit Robert II — and it
> seems that they do indeed use this (brain damaged?) order. (Although
> not, apparently, Le Petit Robert I — which SEEMS TO HAVE compound
> words tackled at the end of the main record.)
Precisely what happens in various dictionaries is a bit beside the
point, because they often follow somewhat special rules that may not
always directly match the results of just taking all the headwords and
sorting the strings according to a particular collation setting. They
may require special tailoring.
>
> However, this definitely contradicts what
> https://icu4c-demos-7hxm2n5zgq-uc.a.run.app/icu-bin/collation.html
> does with the default locale, and with `en´.
In that demo, the collation *defaults* to "Non-ignorable". Again, see
the discussion of variable weighting cited above. In a "Non-ignorable"
collation, the primary weights of the variables (space included) *are*
used at the primary level of sortkey construction, instead of being
shifted to only make a difference following any tertiary weight
differences. So in the demo you see results where the space character
"makes a difference" -- namely, it is weighted as significantly as a
full letter.
However, if you switch options in that demo to "Shifted" -- see the
seventh line of the radio buttons, labeled "alternate" -- then you get the
Shifted weighting, which will then mirror the results you see for glibc.
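The difference between the two settings can be sketched with a toy model.
This is only an illustration under stated assumptions: the weights below are
invented (they are not the DUCET or ISO 14651 table values), the names
primary() and sort_key() are hypothetical, input is limited to lowercase a-z
plus space, and the secondary and tertiary levels are left out because they
tie for these two strings.

```python
# Toy model of the two "alternate" settings. The weights are invented for
# illustration and are NOT the DUCET / ISO 14651 table values.
SPACE_PRIMARY = 1   # hypothetical: the variable (space) sorts below letters
MAX_L4 = 0xFFFF     # level-4 weight given to non-variable characters


def primary(ch):
    """Hypothetical primary weight for a space or a letter a-z."""
    return SPACE_PRIMARY if ch == " " else 10 + ord(ch) - ord("a")


def sort_key(s, alternate):
    """Build a simplified sort key (levels 2 and 3 omitted)."""
    if alternate == "non-ignorable":
        # The space keeps its primary weight and competes with letters.
        return ([primary(c) for c in s],)
    # Shifted: the space is ignored at level 1; its old primary weight is
    # moved to a fourth level, consulted only if earlier levels tie.
    level1 = [primary(c) for c in s if c != " "]
    level4 = [SPACE_PRIMARY if c == " " else MAX_L4 for c in s]
    return (level1, level4)


words = ["foobar", "foo baz"]
non_ignorable = sorted(words, key=lambda s: sort_key(s, "non-ignorable"))
shifted = sorted(words, key=lambda s: sort_key(s, "shifted"))
print(non_ignorable)  # ['foo baz', 'foobar']
print(shifted)        # ['foobar', 'foo baz']
```

In the non-ignorable key the space is compared against "b" at the fourth
position and wins; in the shifted key the comparison is effectively between
"foobar" and "foobaz", which gives the order reported for glibc.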
>
> So what is the intended behavior: of ICU, or of ISO?!
There is no "right answer" here. The Unicode Collation Algorithm comes
with built-in alternative parametric settings, and, of course, the
option to tailor the collation rules indefinitely, to meet the
requirements of particular languages and/or particular dictionary
orderings or other special purposes. ISO 14651 also allows different
settings (although not as completely spelled out as in UCA) and
tailorings. What glibc has done is pick the default, out-of-the-box
shifted handling of variables implied by ISO 14651, but that is simply
an implementation choice.
--Ken
>
> Thanks,
> Ilya
>