propose th-u-lb-nodict

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Thu May 25 15:00:56 CDT 2017


On Thu, 25 May 2017 08:38:19 -0700
Markus Scherer via CLDR-Users <cldr-users at unicode.org> wrote:

> So this would be not just "no dictionary", it would be "no breaks at
> all in any script that uses no spaces between words".

No, the behaviour would be to treat SA as AL.  While this can cause
major problems for newspaper columns, the effect for wider text such as
memoranda would rather be numerous extents of white space.  I presume
books would get at least some type-setting treatment, i.e. line-break
opportunities would be inserted manually.

Treating all SA as AL is not entirely appropriate.  For example,
treating U+0E46 THAI CHARACTER MAI YAMOK as 'Exclamation' (EX) would be
better; <U+0020, U+0E46> should not be split from the alphabetical
characters preceding it.  
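
The remapping described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the function name
resolve_sa is hypothetical, the caller is assumed to supply the
character's original line-break class (Python's standard library does
not expose it), and the Mn/Mc to CM step follows what is understood to
be the default resolution in UAX #14 rule LB1 rather than anything
stated in this thread.

```python
import unicodedata

MAI_YAMOK = "\u0e46"  # THAI CHARACTER MAI YAMOK


def resolve_sa(ch: str, lb_class: str) -> str:
    """Resolve a line-break class when no dictionary is to be used.

    Hypothetical helper: lb_class is the character's original class,
    looked up elsewhere by the caller.
    """
    if lb_class != "SA":
        return lb_class          # only class SA is remapped
    if ch == MAI_YAMOK:
        return "EX"              # suggested exception: keep it with what precedes
    if unicodedata.category(ch) in ("Mn", "Mc"):
        return "CM"              # assumed default for combining marks (LB1)
    return "AL"                  # everything else in SA behaves as alphabetic
```

With this, resolve_sa("\u0e01", "SA") gives "AL", while
resolve_sa("\u0e46", "SA") gives "EX".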

> It would be nice to come up with a 5-8 letter abbreviation for what it
> does, rather than what it doesn't do.
> 
> Also, is it more useful to have no breaks in, say, Thai strings, at
> all (gross under-segmentation) -- or to have breaks between
> orthographic syllables (over-segmentation)?
> (That would be a yet different subtag.)

Do not believe Indian claims about the primacy of orthographic
syllables.  The natural division for within-word line-breaking in the
Thai and Lao scripts is the phonetic syllable.  Indeed, Lao line-breaking
tends to happen at syllable boundaries.

There are, of course, several levels of line-breaking.  Artificial
breaks are more at the level of hyphenation.  If you want a suggestion
for simple emergency breaks in Thai and Lao, the best place is before
preposed vowels.  The next obvious place is after the visargas, though
the Thai language (which, of course, is not the subject of the
suggestion) does have some exceptions such as silenced consonants
following U+0E30. 
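
A rough sketch of the emergency-break suggestion above, in Python.  The
names emergency_breaks, PREPOSED_VOWELS and VISARGAS are hypothetical,
and the character sets are an assumption: the preposed vowels are taken
to be U+0E40..U+0E44 (Thai) and U+0EC0..U+0EC4 (Lao), and the visargas
U+0E30 and U+0EB0.  The Thai exceptions mentioned above (silenced
consonants following U+0E30) are not handled.

```python
# Assumed character sets; see the caveats in the text above.
PREPOSED_VOWELS = ({chr(c) for c in range(0x0E40, 0x0E45)}
                   | {chr(c) for c in range(0x0EC0, 0x0EC5)})
VISARGAS = {"\u0e30", "\u0eb0"}


def emergency_breaks(text, after_visarga=False):
    """Return sorted offsets at which an emergency break may be placed.

    First tier: before every preposed vowel (except at offset 0).
    Second tier (optional): after every visarga (except at end of text).
    """
    offsets = set()
    for i, ch in enumerate(text):
        if ch in PREPOSED_VOWELS and i > 0:
            offsets.add(i)
        if after_visarga and ch in VISARGAS and i + 1 < len(text):
            offsets.add(i + 1)
    return sorted(offsets)
```

Keeping the visarga tier optional reflects the ordering given above:
breaks before preposed vowels are tried first, and breaks after visargas
only if more opportunities are needed.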

The locale example given is, of course, almost oxymoronic.  In general,
of course, a *Thai language* dictionary should not be used for another
language.  Unfortunately, I am struggling to think of a good example
of a scriptio continua language for which a Thai dictionary is clearly
completely useless.  (Pali, Pattani Malay and Northern Khmer in the
Thai script are *not* scriptio continua.)  For a Tai language like
Northern Thai, a Thai dictionary is not completely useless.  However,
this raises the next point.

For Northern Thai, nod_TH (or more precisely, nod-Thai_TH), one would
normally want to use it with a Northern Thai dictionary.  The intention
behind nod-u-lb-nodict should be to rule out a Thai dictionary for
line-breaking, not to rule out a Northern Thai dictionary.

A better modifier would be "-u-lb-noth", meaning "Do not fall back to
a Thai dictionary".  Contrariwise, "-u-lb-th" could authorise fallback
to a Thai dictionary.  Perhaps "-u-lb-la" should authorise
dictionary-based line-breaking of scriptio continua Latin.  With these
ideas, "pi-u-lb-noth" should let me type Pali without worrying about
spurious line-breaks in the middle of words.  (Of course, I still have
to watch out for spurious line-breaks in Thai.)
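
The intended fallback behaviour could be sketched as follows, in Python.
Everything here is hypothetical: "noth" is only the value proposed
above, not a registered one; the function names lb_value and
choose_dictionary are invented for illustration; and the default
fallback to a Thai dictionary is an assumption about what an
implementation would otherwise do.  The tag parsing is deliberately
naive and is not a full BCP 47 parser.

```python
def lb_value(tag):
    """Return the subtag following the 'lb' key in the -u- extension, if any."""
    subtags = tag.replace("_", "-").lower().split("-")
    if "u" not in subtags:
        return None
    ext = subtags[subtags.index("u") + 1:]
    if "lb" in ext:
        i = ext.index("lb")
        if i + 1 < len(ext):
            return ext[i + 1]
    return None


def choose_dictionary(tag, available, fallback="th"):
    """Pick a line-breaking dictionary for the locale, or None for no dictionary.

    available is the set of language codes that have a dictionary.
    """
    language = tag.replace("_", "-").lower().split("-")[0]
    if language in available:
        return language          # the language's own dictionary is always fine
    if lb_value(tag) == "noth":
        return None              # proposed value: do not fall back to Thai
    return fallback if fallback in available else None
```

So choose_dictionary("nod-u-lb-noth", {"th", "nod"}) still selects the
Northern Thai dictionary, while choose_dictionary("pi-u-lb-noth", {"th"})
selects none.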

Richard.




