CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale

Sat Apr 5 11:30:31 CDT 2014

On Thu, 3 Apr 2014 20:01:40 -0700
Markus Scherer <markus.icu at gmail.com> wrote:

> On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:

>> How are break iteration rules meant to interact with
>> dictionary-based word and line-breakers?

> In CLDR and ICU, the rules specify the set of characters that need
> dictionary support. (It's triggered by script, not by language.)

In CLDR, which rules are these?  I can't find them.  All I can find are
statements outside CLDR such as "For Thai, Lao, Khmer, Myanmar, and
other scripts that do not typically use spaces between words, a good
implementation should not depend on the default word boundary
specification" in UAX#29 'Unicode Text Segmentation'.

Now, some minority languages in these scripts use spaces between words,
as can be seen in the Northern Khmer Bible (e.g. at
http://www.amazon.com/Bible-Northern-Khmer-Black-Cover/dp/9749141083).
While Thai might be a good fallback language for kxm-Thai-TH (there is
some usage of kxm-Khmr-TH), a Thai dictionary-based break iterator would
be a disaster.  On the other hand, I would hope for tolerable breaking
performance from a Thai dictionary-based break iterator for
North-Eastern Thai (tts-Thai-TH), which does not separate words.  By
contrast, I would describe the performance for phonetically written
Northern Thai, as revealed by the Thai spell-checker in LibreOffice, as
unsurprisingly poor.

> I expect that there will generally be data for language-specific
> exceptions, overrides and such for more languages than character-level
> segmentation rules. Those low-level rules should always fall back to
> root when there is no language-specific data. I think the higher-level
> exceptions should probably also avoid going through some default
> language.
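
To check that I am reading this correctly, here is a minimal sketch of
the lookup as I understand the proposal.  The function names and the
data table are hypothetical, and the chain is built by simple
truncation of the locale ID, which is a simplification and not the
actual CLDR or ICU inheritance logic.  The point is only that the
default locale never appears in the chain.

```python
def fallback_chain(locale_id):
    """Return the lookup order for a hyphen-separated locale ID.

    Assumption: plain truncation, ending at "root".  The default
    locale is deliberately never consulted.
    """
    parts = locale_id.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def lookup(locale_id, data):
    """Return (locale, value) for the first locale in the chain with data.

    `data` is a hypothetical mapping from locale ID to segmentation data.
    """
    for candidate in fallback_chain(locale_id):
        if candidate in data:
            return candidate, data[candidate]
    raise KeyError("no data for " + locale_id + ", not even in root")


if __name__ == "__main__":
    print(fallback_chain("kxm-Thai-TH"))
```

With a table holding entries only for "th" and "root", a request for
kxm-Thai-TH would end at root and never touch the Thai data.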

If breakers just ignore the segmentation rules, then it should always
help to define rough-and-ready segmentation rules for every language
that uses a mainland SE Asian script as identified by Line_Break=SA.
Syllable breaking is generally a good approximation to word and
line-breaking, and in the visually ordered scripts, the preposed vowels
start syllables.  One needs a good reason to default the segmentation
rules to root for such languages.

Turning to collation, is the way to provide defaulting for the collation
tag in collation/root.xml to list all languages as valid sublocales?  I
am a bit confused as to the point of having the file collation/en.xml.
What does it achieve?  Does it exist purely for the sake of its comment?

Richard.