propose th-u-lb-nodict

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Fri May 26 14:13:20 CDT 2017


On Fri, 26 May 2017 09:55:48 +0100
Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:

> > > Also, is it more useful to have no breaks in, say, Thai strings,
> > > at all (gross under-segmentation) -- or to have breaks between
> > > orthographic syllables (over-segmentation)?
> > > (That would be a yet different subtag.)    

> What makes Thai hard is that you can't analyse a text into
> orthographic syllables without knowledge of the language.

That is solved by assuming one orthographic syllable unless it is
obvious that there are two, or possibly three.  I'm not sure how many
orthographic syllables there are in /tua/ 'body' ตัว <TO TAO, MAI
HAN-AKAT, WO WAEN>; I think there are two.

> > There are, of course, several levels of line-breaking.  Artificial
> > breaks are more at the level of hyphenation.  If you want a
> > suggestion for simple emergency breaks in Thai and Lao, the best
> > place is before preposed vowels.  The next obvious place is after
> > the visargas, though the Thai language (which, of course, is not
> > the subject of the suggestion) does have some exceptions such as
> > silenced consonants following U+0E30.   
> 
> We would need to be careful about adding emergency breaks. For
> example, in polysyllabic words, we wouldn't want to break even
> between two syllables. So my proposal really would be: only break at
> places that other languages would break, with no recourse to a
> dictionary.

Emergency breaks belong to the domain of hyphenation, which I believe
is beyond the scope of CLDR.  If a word won't fit in a row of text, it
usually needs breaking - that even happens with English.
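
For concreteness, the "break before preposed vowels" suggestion quoted
above could be sketched roughly as follows.  This is only an
illustration in Python, not anything taken from ICU or CLDR; the
function name is made up, and the only data assumed is that the
preposed vowels are U+0E40..U+0E44 in Thai and U+0EC0..U+0EC4 in Lao.

```python
# Preposed vowels: Thai SARA E..SARA AI MAIMALAI (U+0E40..U+0E44)
# and their Lao counterparts (U+0EC0..U+0EC4).
PREPOSED_VOWELS = (
    {chr(cp) for cp in range(0x0E40, 0x0E45)}
    | {chr(cp) for cp in range(0x0EC0, 0x0EC5)}
)


def emergency_break_offsets(text):
    """Return the offsets at which an emergency break could be placed.

    Hypothetical helper: a break is allowed immediately before every
    preposed vowel, except at the very start of the text.  No
    dictionary and no knowledge of the language is used.
    """
    return [
        i for i, ch in enumerate(text)
        if i > 0 and ch in PREPOSED_VOWELS
    ]


if __name__ == "__main__":
    sample = "\u0e44\u0e1b\u0e40\u0e17\u0e35\u0e48\u0e22\u0e27\u0e44\u0e2b\u0e19"
    print(emergency_break_offsets(sample))
```

It deliberately ignores the second tier (breaks after U+0E30), because
of the exceptions mentioned in the quoted text.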

> > The locale example given is, of course, almost oxymoronic.  In
> > general, of course, a *Thai language* dictionary should not be used
> > for another language.  Unfortunately, I am trying to think of a
> > good example of a scriptio continua language for which a Thai
> > dictionary is clearly completely useless.  (Pali, Pattani Malay and
> > Northern Khmer in the Thai script are *not* scriptio continua.)
> > For a Tai language like Northern Thai, a Thai dictionary is not
> > completely useless. However, this raised the next point.  
> 
> One example I have is So (Bruic-Katuic-Mon Khmer), but there are
> plenty of other languages that aren't Tai but that use Thai script.

My problem was that the examples I could find were either Tai languages
or separated words with spaces.  I must say it feels strange to me to
see sentence-terminating full stops in Thai script.

> > A better modifier would be "-u-lb-noth", meaning "Do not fall back
> > to a Thai dictionary".  Contrariwise, "-u-lb-th" could authorise
> > fallback to a Thai dictionary.  Perhaps "u-lb-la" should authorise
> > dictionary-based line-breaking of scriptio continua Latin.  With
> > these ideas, "pi-u-lb-noth" should let me type Pali without
> > worrying about spurious line-breaks in the middle of words.  (Of
> > course, I still have to watch out for spurious line-breaks in
> > Thai.)  
> 
> Let's not get carried away. If you want Thai based breaking you just
> use lang="th" or do nothing since the default analysis will say: oh
> Thai script, unknown language, assume Thai. Which is a good and
> helpful thing to do.

Are you sure about that?  The default analysis feels more like, "Oh Thai
script, ignore the language, just assume Thai for line-breaking."  That
was the behaviour when I looked at ICU a few years ago.  The precise
logic in ICU was that every language, significantly including English,
used the Thai-language word-boundary detector to do Thai-script
line-breaking.  I was able to create a Pali line-breaker in ICU that
recognised that Pali is not written scriptio continua in the Thai
script.  (Word boundaries are often lost before words beginning with
vowels - not very different to British or US Sanskrit.)
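
To make the difference concrete, here is a rough Python sketch of the
decision being discussed: Thai-script text gets the Thai dictionary
whatever the language, unless the tag opts out.  The values "nodict"
and "noth" are only the proposals from this thread, not registered
values of the "lb" key, and both function names are invented for the
illustration; this is not how ICU is structured.

```python
def unicode_keywords(tag):
    """Return the -u- extension keywords of a BCP 47 tag as a dict.

    Simplified parser: assumes a well-formed tag and ignores
    attributes that precede the first key.
    """
    subtags = tag.lower().replace("_", "-").split("-")
    if "u" not in subtags:
        return {}
    keywords = {}
    key = None
    for subtag in subtags[subtags.index("u") + 1:]:
        if len(subtag) == 1:      # another singleton ends the -u- extension
            break
        if len(subtag) == 2:      # keys are two characters long
            key = subtag
            keywords[key] = []
        elif key is not None:     # longer subtags are values of the key
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


# Hypothetical opt-out values, as proposed in this thread only.
NO_THAI_DICTIONARY = {"nodict", "noth"}


def thai_script_uses_thai_dictionary(tag):
    """Decide whether Thai-script runs are broken with the Thai dictionary.

    The language subtag is deliberately not consulted: the default is
    the Thai dictionary for every language, and only the (proposed)
    lb value switches it off.
    """
    return unicode_keywords(tag).get("lb", "") not in NO_THAI_DICTIONARY


if __name__ == "__main__":
    for tag in ("en", "th", "pi-u-lb-noth", "th-u-lb-nodict"):
        print(tag, thai_script_uses_thai_dictionary(tag))
```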

The problem with this approach lies in adding very lightweight
locales - line-breaking, word-breaking, perhaps collation, and possibly
a very few bits of data, but nothing more.  There may be a build issue
for ICU - ICU uses wetware to convert algorithmic CLDR line- and
word-breaking to its own data format.

For an example in an application, take LibreOffice.  I can switch Thai
spell-checking off by setting the language to Malayalam, which I do as
the easy way of preventing the spelling in the Tai Tham script being
checked as though the language were Siamese (I haven't installed a
Malayalam spell-checker).  Even so, LibreOffice still breaks Thai
script text as though it were Siamese.  The problem here is that
Siamese is usually the best language to assume for line-breaking Thai
script text in the middle of English text, though with its maps from
script class to language, LibreOffice for one could do better - if it
has alternative dictionaries.

(I need to find out how I got LibreOffice, using correct tagging, to do
nod_Lana spell-checking a few years ago.  I think I had to
update Hunspell to a more recent version of Unicode.)

Richard.


