CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
richard.wordingham at ntlworld.com
Sat Apr 5 11:30:31 CDT 2014
On Thu, 3 Apr 2014 20:01:40 -0700
Markus Scherer <markus.icu at gmail.com> wrote:
> On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>> How are break iteration rules meant to interact with
>> dictionary-based word and line-breakers?
> In CLDR and ICU, the rules specify the set of characters that need
> dictionary support. (It's triggered by script, not by language.)
In CLDR, which rules are these? I can't find them. All I can find are
statements outside CLDR such as "For Thai, Lao, Khmer, Myanmar, and
other scripts that do not typically use spaces between words, a good
implementation should not depend on the default word boundary
specification" in UAX #29 'Unicode Text Segmentation'.
Now, some minority languages in these scripts use spaces between words,
as can be seen in the Northern Khmer bible (e.g. at
While Thai might be a good fallback language for kxm-Thai-TH (there is
some usage of kxm-Khmr-TH), a Thai dictionary-based break iterator would
be a disaster. On the other hand, I would hope for tolerable breaking
performance from a Thai dictionary-based break iterator for
North-Eastern Thai (tts-Thai-TH), which does not separate words. By
contrast, I would describe the performance for phonetically written
Northern Thai, as revealed by the Thai spell-checker in LibreOffice, as
> I expect that there will generally be data for language-specific
> exceptions, overrides and such for more languages than character-level
> segmentation rules. Those low-level rules should always fall back to
> root when there is no language-specific data. I think the higher-level
> exceptions should probably also avoid going through some default
If breakers just ignore the segmentation rules, then it should always
help to define rough-and-ready segmentation rules for every language
that uses a mainland SE Asian script as identified by Line_Break=SA.
Syllable breaking is generally a good approximation to word and
line-breaking, and in the visually ordered scripts, the preposed vowels
start syllables. One needs a good reason to default the segmentation
rules to root for such languages.
Turning to collation, is the way to provide defaulting for collation
tag in collation/root.xml to list all languages as valid sublocales? I
am a bit confused as to the point of having the file collation/en.xml.
What does it achieve? Does it exist purely for the sake of its comment?