CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale

Mon Apr 7 18:39:32 CDT 2014

On Sat, 5 Apr 2014 10:12:10 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Sat, Apr 5, 2014 at 9:30 AM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
> 
> > > In CLDR and ICU, the rules specify the set of characters that need
> > > dictionary support. (It's triggered by script, not by language.)
> >
> > In CLDR, which rules are these?

> I think it's
>     <variable id="$SA">\p{Line_Break=Complex_Context}</variable>
> which you can find in the line-break rules in
>     http://unicode.org/cldr/trac/browser/trunk/common/segments/root.xml

If the dictionary is chosen only by script and not by language, then
the design of ICU is currently broken as far as minority languages are
concerned.  I can't see how a Thai dictionary and a Northern or
Northeastern Thai dictionary can co-exist.  (The usual script for
writing these languages is the Thai script, despite attempts to
reinvigorate old regional scripts.)

Going back to the CLDR level, there's another complexity.  Good Thai
typography inserts a space before U+0E46 THAI CHARACTER MAIYAMOK, and
does not break lines before the U+0E46.  It may be possible to fix the
line breaking with a rule something like "× \u0e46".  The sequence
<U+0020, U+0E46> should usually be considered the end of a word, so
whether Line_Break=Complex_Context holds can vary from character to
character within a word.  (There are a few dictionary entries where
<U+0020, U+0E46> occurs within a non-compound lexical item; U+0E46 is
then also followed by a space.)
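
To make the intent of a rule like "× \u0e46" concrete, here is a toy
sketch in Python.  It is not ICU or CLDR code: the function name and the
keep_maiyamok flag are made up for illustration, and the only break
opportunities it models are those after a run of spaces.  The tailored
rule is checked first and vetoes a break before mai yamok.

```python
MAIYAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK


def break_opportunities(text, keep_maiyamok=True):
    """Return indices i where a line break is allowed before text[i].

    Toy model: a break is allowed only after a run of spaces.  With
    keep_maiyamok=True, a hypothetical rule "× U+0E46" is applied first
    and forbids a break before mai yamok, even after a space.
    """
    breaks = []
    for i in range(1, len(text)):
        if text[i - 1] != " " or text[i] == " ":
            continue  # not at the end of a run of spaces
        if keep_maiyamok and text[i] == MAIYAMOK:
            continue  # "× U+0E46": keep mai yamok on the same line
        breaks.append(i)
    return breaks


# <to tang, SPACE, MAIYAMOK, SPACE, kan>, written with escapes
sample = "\u0e15\u0e48\u0e32\u0e07 \u0e46 \u0e01\u0e31\u0e19"
print(break_opportunities(sample, keep_maiyamok=False))  # [5, 7]
print(break_opportunities(sample))                       # [7]
```

Without the extra rule the space before mai yamok yields a break
opportunity at index 5; with it, only the break at index 7 remains.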

I haven't yet experimented with these rules in ICU.  Might these tweaks
work?  Would tailoring Thai characters not to be
Line_Break=Complex_Context succeed in disabling the use of the Thai
dictionary for a locale?  The following rule in root.xml diminishes
hope:

	<variable id="$AL">[$AI $AL $XX $SA $SG]</variable>

In all the examples of Pali I've seen in the Thai script, words are
separated by spaces.

I think U+0E46 should be Line_Break=Exclamation.

Now some people get round the problem by omitting the space but
starting the glyph of mai yamok with a space.  ICU does this with words
that end in mai yamok - there is no preceding space character.  When
looking at serials in Thai magazines, I've noticed that spaces are
omitted before question and exclamation marks when there is a risk of
justification moving them onto the next line.  I suspect the rule
"× EX" is often not implemented.  It is possible that changing the line
break property of mai yamok could inconvenience these people -
removing <space, mai yamok> from the end of a word in the (Thai) Royal
Institute Dictionary does not always yield a word.

The immediate consequence of all this is that changing the inheritance
rules for segmentation would only be depriving certain people of a
benefit they probably don't yet have.

> In addition, in the current ICU implementation (I am not sure about
> the LDML spec), an empty base-language file means we find something
> and don't go through the default locale.

Formally, that looks like a non-compliance!
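
For anyone following the thread, the two lookup orders under discussion
can be sketched as below.  This is only my reading of the discussion,
not ICU or LDML code; the function names, the set of available bundles
and the default locale "en_US" are assumptions for illustration.

```python
def parents(locale):
    """Yield the locale, then its truncations: 'th_TH' -> 'th_TH', 'th'."""
    parts = locale.split("_")
    for n in range(len(parts), 0, -1):
        yield "_".join(parts[:n])


def lookup_chain(requested, default=None):
    """Requested chain, then (optionally) the default locale chain, then root."""
    chain = list(parents(requested))
    if default is not None:
        chain += [loc for loc in parents(default) if loc not in chain]
    chain.append("root")
    return chain


def find_bundle(requested, available, default=None):
    """Return the first locale in the chain that has a file, even an empty one."""
    for loc in lookup_chain(requested, default):
        if loc in available:
            return loc
    return None


available = {"root", "en", "th"}  # hypothetical; "th" may be an empty file

# No file for Northern Thai (nod): the default locale is consulted first.
print(find_bundle("nod_TH", available, default="en_US"))  # en
# The proposal: skip the default locale and fall back straight to root.
print(find_bundle("nod_TH", available))                   # root
# An existing (even empty) base-language file stops the search early.
print(find_bundle("th_TH", available, default="en_US"))   # th
```

The last call is the case Markus describes: once a base-language file
exists, the default locale is never reached.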

Richard.