CLDR proposal: Unicode algorithms should fall back to root, not to unrelated default locale
Richard Wordingham
richard.wordingham at ntlworld.com
Sat Apr 5 11:30:31 CDT 2014
On Thu, 3 Apr 2014 20:01:40 -0700
Markus Scherer <markus.icu at gmail.com> wrote:
> On Thu, Apr 3, 2014 at 1:21 PM, Richard Wordingham <
> richard.wordingham at ntlworld.com> wrote:
>> How are break iteration rules meant to interact with
>> dictionary-based word and line-breakers?
> In CLDR and ICU, the rules specify the set of characters that need
> dictionary support. (It's triggered by script, not by language.)
In CLDR, which rules are these? I can't find them. All I can find are
statements outside CLDR such as "For Thai, Lao, Khmer, Myanmar, and
other scripts that do not typically use spaces between words, a good
implementation should not depend on the default word boundary
specification" in UAX#29 'Unicode Text Segmentation'.
Now, some minority languages in these scripts use spaces between words,
as can be seen in the Northern Khmer bible (e.g. at
http://www.amazon.com/Bible-Northern-Khmer-Black-Cover/dp/9749141083).
While Thai might be a good fallback language for kxm-Thai-TH (there is
some usage of kxm-Khmr-TH), a Thai dictionary-based break iterator would
be a disaster for it. On the other hand, I would hope for tolerable
breaking performance from a Thai dictionary-based break iterator for
North-Eastern Thai (tts-Thai-TH), which does not separate words. By
contrast, I would describe the performance for phonetically written
Northern Thai, as revealed by the Thai spell-checker in LibreOffice, as
unsurprisingly poor.
> I expect that there will generally be data for language-specific
> exceptions, overrides and such for more languages than character-level
> segmentation rules. Those low-level rules should always fall back to
> root when there is no language-specific data. I think the higher-level
> exceptions should probably also avoid going through some default
> language.
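For concreteness, the fallback behaviour described in the quoted paragraph can be sketched as below. This is only an illustration under simplifying assumptions: the names fallback_chain and lookup are hypothetical, the data table is made up, and the chain is plain subtag truncation, which ignores any explicit parent-locale data that real CLDR inheritance may use.

```python
def fallback_chain(locale_id):
    """Return the truncation chain for a hyphen-separated locale ID.

    The chain always ends in 'root' and never passes through a system
    default locale. (Simplified: plain subtag truncation only.)
    """
    parts = locale_id.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    chain.append("root")
    return chain


def lookup(locale_id, data):
    """Return (locale, value) for the first locale in the chain with data."""
    for candidate in fallback_chain(locale_id):
        if candidate in data:
            return candidate, data[candidate]
    raise KeyError("no data found, not even for root")


# Hypothetical data table: only root and Thai have segmentation data.
SEGMENTATION_DATA = {"root": "root rules", "th": "Thai rules"}

# kxm-Thai-TH has no data of its own, so it resolves to root, not to "th".
print(lookup("kxm-Thai-TH", SEGMENTATION_DATA))
```

With this scheme a request for kxm-Thai-TH tries kxm-Thai-TH, kxm-Thai, kxm and then root, so it never picks up Thai-specific data merely because Thai happens to be the default locale.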
If breakers just ignore the segmentation rules, then it should always
help to define rough and ready segmentation rules for every language
that uses a mainland SE Asian script as identified by Line_Break=SA.
Syllable breaking is generally a good approximation to word and
line-breaking, and in the visually ordered scripts, the preposed vowels
start syllables. One needs a good reason to default the segmentation
rules to root for such languages.
Turning to collation, is the way to provide defaulting for the collation
tag in collation/root.xml to list all languages as valid sublocales? I
am a bit confused as to the point of having the file collation/en.xml.
What does it achieve? Does it exist purely for the sake of its comment?
Richard.
More information about the CLDR-Users
mailing list