CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale

Richard Wordingham richard.wordingham at
Mon Apr 7 18:39:32 CDT 2014

On Sat, 5 Apr 2014 10:12:10 -0700
Markus Scherer < at> wrote:

> On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
> richard.wordingham at> wrote:
> > > In CLDR and ICU, the rules specify the set of characters that need
> > > dictionary support. (It's triggered by script, not by language.)
> >
> > In CLDR, which rules are these?

> I think it's
>     <variable id="$SA">\p{Line_Break=Complex_Context}</variable>
> which you can find in the line-break rules in

If the dictionary is chosen only by script and not by language, then
the design of ICU is currently broken as far as minority languages are
concerned.  I can't see how a Thai dictionary and a Northern or NE Thai
dictionary can co-exist.  (The usual script for writing these languages
is the Thai script, despite attempts to reinvigorate old regional

Going back to the CLDR level, there's another complexity.  Good Thai
typography inserts a space before U+0E46 THAI CHARACTER MAIYAMOK and
does not break lines before U+0E46.  It may be possible to fix the
line breaking with a rule something like "× \u0e46".  The sequence
<U+0020, U+0E46> should usually be considered the end of a word, so
whether Line_Break=Complex_Context holds can vary within a word.  (There
are a few dictionary entries where <U+0020, U+0E46> occurs within a
non-compound lexical item - U+0E46 is then also followed by a space.)
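
To make the effect of a rule like "× \u0e46" concrete, here is a
minimal Python sketch.  It is not ICU or CLDR rule syntax, and the
function name, the sample string and the naive segmenter are all
hypothetical.  It takes break opportunities produced by some other
segmenter and drops any that fall immediately before U+0E46, including
the one after a preceding space.

```python
MAIYAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK

def suppress_breaks_before_maiyamok(text, candidate_breaks):
    """Drop break opportunities that fall directly before U+0E46.

    candidate_breaks holds offsets i meaning "a line may break before
    text[i]".  This imitates a tailoring rule of the form "× \\u0e46";
    it is an illustration only, not ICU's implementation.
    """
    return [i for i in candidate_breaks
            if not (i < len(text) and text[i] == MAIYAMOK)]

# Hypothetical input: a word, a space, mai yamok, a space, another word.
sample = "\u0e15\u0e48\u0e32\u0e07 \u0e46 \u0e01\u0e31\u0e19"
# A naive segmenter (an assumption) that allows a break after every space.
naive = [i + 1 for i, ch in enumerate(sample) if ch == " "]
kept = suppress_breaks_before_maiyamok(sample, naive)
```

With this input the naive segmenter offers a break before mai yamok and
one before the following word; only the latter survives the filter.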

I haven't yet experimented with these rules in ICU.  Might these tweaks
work?  Would tailoring Thai characters not to be
Line_Break=Complex_Context succeed in disabling the use of the Thai
dictionary for a locale?  The following rule in root.xml diminishes

	<variable id="$AL">[$AI $AL $XX $SA $SG]</variable>

In all the examples of Pali I've seen in the Thai script, words are
separated by spaces. 

I think U+0E46 should be Line_Break=Exclamation.

Now some people get round the problem by omitting the space and
instead building leading white space into the glyph of mai yamok.  ICU
likewise has no preceding space character in words that end in mai
yamok.  When
looking at serials in Thai magazines, I've noticed that spaces are
omitted before question and exclamation marks when there is a risk of
justification moving them onto the next line.  I suspect the rule
"× EX" is often not implemented.  It is possible that changing the line
break property of mai yamok could inconvenience these people -
removing <space, mai yamok> from the end of a word in the (Thai) Royal
Institute Dictionary does not always yield a word.

The immediate consequence of all this is that changing the inheritance
rules for segmentation would only be depriving certain people of a
benefit they probably don't yet have.

> In addition, in the current ICU implementation (I am not sure about
> the LDML spec), an empty base-language file means we find something
> and don't go through the default locale.

Formally, that looks like a non-compliance!
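
The difference under discussion can be sketched in Python.  This is a
toy model, not ICU's lookup code: AVAILABLE, parents, lookup_chain and
the locale IDs are hypothetical.  It contrasts a chain that detours
through the default locale when nothing is found for the requested
locale with one that goes straight to root, and shows why an empty
base-language file keeps the default locale out of the chain.

```python
# Hypothetical set of locales that have a data file (possibly empty).
# root is always consulted last, so it is not listed here.
AVAILABLE = {"en", "en_US", "th"}

def parents(locale):
    """Yield the locale, then its truncation parents, e.g. th_TH -> th."""
    parts = locale.split("_")
    while parts:
        yield "_".join(parts)
        parts.pop()

def lookup_chain(requested, default, via_default=True):
    """Return the files consulted, in order, always ending with root."""
    chain = [loc for loc in parents(requested) if loc in AVAILABLE]
    if not chain and via_default:
        # Nothing found for the requested locale: detour through the
        # default locale before reaching root.
        chain = [loc for loc in parents(default) if loc in AVAILABLE]
    return chain + ["root"]

# A Thai-script minority language with no data, on an en_US system.
with_detour = lookup_chain("tts_TH", "en_US")
straight_to_root = lookup_chain("tts_TH", "en_US", via_default=False)
```

In this model, adding "tts" to AVAILABLE (standing in for an empty
base-language file) makes the first call return tts followed by root,
with no English data in between.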

