Martin Hosken via CLDR-Users
cldr-users at unicode.org
Fri May 26 03:55:48 CDT 2017
> > So this would be not just "no dictionary", it would be "no breaks at
> > all in any script that uses no spaces between words".
> No, the behaviour would be to treat SA as AL. While this can cause
> major problems for newspaper columns, the effect for wider text such as
> memoranda would rather be numerous extents of white space. I presume
> books would get at least some type-setting treatment, i.e. line-break
> opportunities would be inserted manually.
Correct. The assumption is that the text has been appropriately broken using ZWSP. One wouldn't select this for text that was not in that state. Notice this isn't some fallback behaviour. This is specifically chosen for a run of text by the document creator, in full knowledge of its potential impact.
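To make the intended behaviour concrete, here is a minimal sketch in Python. It is not ICU code and not a full UAX #14 implementation: it assumes that every character other than SPACE and ZWSP behaves like AL, so the only break opportunities are the ones the author put in. The function name break_opportunities is made up for this illustration.

```python
ZWSP = "\u200b"   # ZERO WIDTH SPACE, inserted by the document author
SPACE = " "

def break_opportunities(text):
    """Return each index i where a line may break before text[i].

    Simplifying assumption: everything except SPACE and ZWSP is treated
    like an ordinary alphabetic character (AL), so no dictionary is used.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        # Never break before a space or a ZWSP; the opportunity comes
        # after the run of separators ends.
        if cur in (SPACE, ZWSP):
            continue
        if prev in (SPACE, ZWSP):
            breaks.append(i)
    return breaks

# Thai text the author has already segmented with ZWSP, then a Latin word.
sample = "\u0e20\u0e32\u0e29\u0e32\u200b\u0e44\u0e17\u0e22 abc"
print(break_opportunities(sample))
```

Text with no ZWSP and no spaces yields no opportunities at all, which is exactly the "numerous extents of white space" outcome described above, and why one would only select this for text that has been prepared.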
> Treating all SA as AL is not entirely appropriate. For example,
> treating U+0E46 THAI CHARACTER MAI YAMOK as 'Exclamation' (EX) would be
> better; <U+0020, U+0E46> should not be split from the alphabetical
> characters preceding it.
The good news is that since it is just another set of break iterator rules, we can do things like that, so long as we make it clear what we are doing and why.
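As a rough illustration of that kind of tailoring, the Python sketch below suppresses the break between a space and U+0E46 MAI YAMOK, which is the effect the EX suggestion is after. It is an assumption-laden toy, not real break iterator rules: the function name and the no_break_before parameter are invented here, and every character other than SPACE and ZWSP is otherwise treated like AL.

```python
ZWSP = "\u200b"       # ZERO WIDTH SPACE
MAI_YAMOK = "\u0e46"  # THAI CHARACTER MAI YAMOK

def break_opportunities(text, no_break_before=(MAI_YAMOK,)):
    """Return each index i where a line may break before text[i].

    Breaks are allowed only after SPACE or ZWSP. Characters listed in
    no_break_before (hypothetical tailoring, EX-like) are kept with the
    preceding text even when a space comes before them.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if cur in (" ", ZWSP):
            continue          # never break before a separator
        if prev not in (" ", ZWSP):
            continue          # no separator here, so no opportunity
        # Skip back over spaces to see whether a ZWSP started the run;
        # an explicit ZWSP from the author always wins.
        j = i - 1
        while j >= 0 and text[j] == " ":
            j -= 1
        after_zwsp = j >= 0 and text[j] == ZWSP
        if after_zwsp or cur not in no_break_before:
            breaks.append(i)
    return breaks

sample = "\u0e14\u0e35 \u0e46 \u0e21\u0e32\u0e01"
print(break_opportunities(sample))                      # tailored
print(break_opportunities(sample, no_break_before=()))  # untailored
```

Keeping the tailoring as an explicit, named list is one way of making it clear what we are doing and why.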
> > It would be nice to come up with a 5-8 letter abbreviation for what it
> > does, rather than what it doesn't do.
> > Also, is it more useful to have no breaks in, say, Thai strings, at
> > all (gross under-segmentation) -- or to have breaks between
> > orthographic syllables (over-segmentation)?
> > (That would be a yet different subtag.)
What makes Thai hard is that you can't analyse a text into orthographic syllables without knowledge of the language.
> Do not believe Indian claims about the primacy of orthographic
> syllables. The natural division within-word line-breaking in the Thai
> and Lao scripts is the phonetic syllable. Indeed, Lao line-breaking
> tends to happen at syllable boundaries.
> There are, of course, several levels of line-breaking. Artificial
> breaks are more at the level of hyphenation. If you want a suggestion
> for simple emergency breaks in Thai and Lao, the best place is before
> preposed vowels. The next obvious place is after the visargas, though
> the Thai language (which, of course, is not the subject of the
> suggestion) does have some exceptions such as silenced consonants
> following U+0E30.
We would need to be careful about adding emergency breaks. For example, in polysyllabic words, we wouldn't want to break even between two syllables. So my proposal really would be: only break at places that other languages would break, with no recourse to a dictionary.
> The locale example given is, of course, almost oxymoronic. In general,
> of course, a *Thai language* dictionary should not be used for another
> language. Unfortunately, I am trying to think of a good example
> of a scriptio continua language for which a Thai dictionary is clearly
> completely useless. (Pali, Pattani Malay and Northern Khmer in the
> Thai script are *not* scriptio continua.) For a Tai language like
> Northern Thai, a Thai dictionary is not completely useless. However,
> this raises the next point.
One example I have is So (Bruic-Katuic-Mon Khmer), but there are plenty of other languages that aren't Tai but that use Thai script. And again, nobody *has* to use this thing. You only turn it on if you want to say: yes I have broken this thing into words myself, please don't break it up any more through the use of a dictionary. That's all it's saying. It's not trying to be clever. It's not trying to make anyone's life easier. It's saying: *stop* trying to be clever and think you know better than the document author. Just break where I say you can break and be done with it.
> A better modifier would be "-u-lb-noth", meaning "Do not fall back to
> a Thai dictionary". Contrariwise, "-u-lb-th" could authorise fallback
> to a Thai dictionary. Perhaps "u-lb-la" should authorise
> dictionary-based line-breaking of scriptio continua Latin. With these
> ideas, "pi-u-lb-noth" should let me type Pali without worrying about
> spurious line-breaks in the middle of words. (Of course, I still have
> to watch out for spurious line-breaks in Thai.)
Let's not get carried away. If you want Thai-based breaking, you just use lang="th", or do nothing, since the default analysis will say: oh, Thai script, unknown language, assume Thai. Which is a good and helpful thing to do.