CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale

Markus Scherer at
Sat Apr 5 12:12:10 CDT 2014

On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
richard.wordingham at> wrote:

> > In CLDR and ICU, the rules specify the set of characters that need
> > dictionary support. (It's triggered by script, not by language.)
> In CLDR, which rules are these?

I think it's
    <variable id="$SA">\p{Line_Break=Complex_Context}</variable>
which you can find in the line-break rules in

Also, as far as I know, the ICU rule syntax is different enough from the
CLDR syntax that the conversion is manual. The ICU dictionary support might
need a manual addition.

(Others know a lot more about segmentation than I do.)

> Turning to collation, is the way to provide defaulting for collation
> tag in collation/root.xml to list all languages as valid sublocales?

The validSubLocales data was removed from CLDR. Instead, we have some empty
base-language collation files to document that the root order is known to
be appropriate; as opposed to the absence of a base-language collation file
which basically means "don't know".

> I am a bit confused as to the point of having the file collation/en.xml.
> What does it achieve?  Does it exist purely for the sake of its comment?


In addition, in the current ICU implementation (I am not sure about the
LDML spec), an empty base-language file means that the lookup finds a
bundle for the language and does not go through the default locale. Once we
agree that collation should fall back directly to root, rather than to the
default locale, we could remove the empty resource bundles from ICU
(although they are very small). We would keep the empty CLDR files for
documentation.
