propose th-u-lb-nodict

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Fri May 26 02:52:44 CDT 2017


On Thu, 25 May 2017 16:55:29 -0700
Peter Edberg via CLDR-Users <cldr-users at unicode.org> wrote:

> > On May 25, 2017, at 4:30 PM, Richard Wordingham via CLDR-Users
> > <cldr-users at unicode.org> wrote:
> > 
> > On Thu, 25 May 2017 14:39:58 -0700
> > Peter Edberg via CLDR-Users <cldr-users at unicode.org> wrote:
> >   
> >>> -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum)
> >>> -u-ld-thai0-sanskrit  
> > 
> > I'm not sure why there should be a line-breaking 'dictionary' for
> > Pali in Thai script, 
> >   
> >>> Perhaps the -nodict should also be by script, e.g.
> >>> -u-ld-thai0-nodict
> >>> still allows dictionary use for CJK, just none for Thai script.  
> > 
> > Most dictionaries should be identified by language, not script.  The
> > problem being addressed is the use of a Siamese dictionary for
> > breaking text in other languages.   
> 
> The issue is that libraries that implement this spec, such as ICU,
> would typically choose a dictionary to use based on script range.

That's a fault.  They should first consider the language.

Now, there is a related issue of whether a locale should be
able to specify the language of stretches in an unexpected script.
Word processors often partition scripts three ways, into simple,
complex, and CJK, and use that partition to select the font and
sometimes (usually?) the language, though the corresponding standards
fail to define the three categories.  This works well for most
multi-script paragraphs once the tripartition has stabilised.

> So
> one needs to be able to specify, e.g.
> - For Thai script, use xxx dictionary.
> - For Khmer script, use yyy dictionary.
> 
> The xxx and yyy would specify language, but you still need to
> associate them with a script.

I believe that for Northeastern Thai one needs a preference list -
prefer a NE Thai dictionary, allow fall back to a Siamese dictionary.
Now, in the Lao script (e.g. for Tai Noi) that gets more complicated if
one wants to cater for modern language rather than just transcribing
old manuscripts.  Systematic omission of tone marks could confuse a
line-breaker that looks for word boundaries between correctly spelt
words.

For Northern Thai, one may find it better to prefer a Northern Thai
dictionary but refuse to use a Siamese dictionary.
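
To make the preference-list idea concrete, here is a minimal sketch in
Python.  It is only an illustration, not how ICU or CLDR work: the
table, the function name choose_dictionary, and the use of the codes
'th', 'tts' (Northeastern Thai) and 'nod' (Northern Thai) as dictionary
names are all assumptions made for the example.  It also assumes that a
language with no entry gets no dictionary at all, rather than a default
picked by script.

```python
from typing import Optional, Set

# Hypothetical preference lists, keyed by (language, script).
# Each value is an ordered tuple of acceptable dictionaries; a
# dictionary that is not listed is refused outright.
PREFERENCES = {
    ("th", "Thai"): ("th",),          # Siamese
    ("tts", "Thai"): ("tts", "th"),   # NE Thai: prefer own, allow Siamese
    ("nod", "Thai"): ("nod",),        # Northern Thai: refuse Siamese
}


def choose_dictionary(language: str, script: str,
                      installed: Set[str]) -> Optional[str]:
    """Return the first acceptable installed dictionary, or None.

    The language is consulted first; the script only narrows the
    lookup.  None means "break without a dictionary".
    """
    # Assumption: an unlisted (language, script) pair has no acceptable
    # dictionary, so the lookup yields an empty preference list.
    for name in PREFERENCES.get((language, script), ()):
        if name in installed:
            return name
    return None
```

With only a Siamese dictionary installed, choose_dictionary("tts",
"Thai", {"th"}) falls back to "th", whereas choose_dictionary("nod",
"Thai", {"th"}) returns None, so Northern Thai text is never broken
with the Siamese dictionary.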

Richard.

