propose th-u-lb-nodict

Martin Hosken via CLDR-Users cldr-users at unicode.org
Sat May 27 02:30:13 CDT 2017


Dear Richard,

> > What makes Thai hard is that you can't analyse a text into
> > orthographic syllables without knowledge of the language.  
> 
> That is solved by assuming one orthographic syllable if it is not
> obvious there are two or possibly three - I'm not sure how many
> orthographic syllables there are in /tua/ 'body' ตัว <TO TAO, MAI
> HAN-AKAT, WO WAEN>.  I think there are two.

I.e. it's hard. There are many clear cases, but just as many ambiguous ones. This differs from, say, Burmese script, where you can work out all syllable breaks algorithmically (I bet you'll find an ambiguous one now, just to prove me wrong!).

> > We would need to be careful about adding emergency breaks. For
> > example, in polysyllabic words, we wouldn't want to break even
> > between two syllables. So my proposal really would be: only break at
> > places that other languages would break, with no recourse to a
> > dictionary.  
> 
> Emergency breaks belong to the domain of hyphenation, which I believe
> is beyond the scope of CLDR.  If a word won't fit in a row of text, it
> usually needs breaking - that even happens with English.

s/emergency/mid word/

I.e. the point is that the line breaker shouldn't be doing syllable breaking. At least that is not what I want for lb-nodict.
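
To make the intent concrete, here is a minimal sketch of what "no dictionary, no syllable breaking" could mean in practice. It is an illustration only, not the proposed CLDR rules: the function name is hypothetical, and it assumes that the only break opportunities are after a run of spaces or after an explicit ZERO WIDTH SPACE (U+200B). Real line breaking has many more rules than this.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, an explicit break opportunity


def nodict_break_opportunities(text: str) -> list[int]:
    """Return the offsets before which a line may be broken.

    Simplifying assumption (not the full line-breaking algorithm):
    a break is allowed only after a run of spaces or after ZWSP.
    No dictionary lookup and no syllable analysis is attempted, so an
    unspaced run of Thai script is never broken in the middle.
    """
    breaks = []
    for i in range(1, len(text)):
        prev, cur = text[i - 1], text[i]
        if prev == ZWSP or (prev == " " and cur != " "):
            breaks.append(i)
    return breaks


if __name__ == "__main__":
    # Thai-script text with one ZWSP and one space; offsets are code points.
    sample = "\u0e15\u0e31\u0e27\u200b\u0e14\u0e35 abc"
    print(nodict_break_opportunities(sample))
```

With this approach a polysyllabic word written without ZWSP or spaces yields no break opportunities at all, which is the behaviour I am asking for.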

> > One example I have is So (Bruic-Katuic-Mon Khmer), but there are
> > plenty of other languages that aren't Tai but that use Thai script.  
> 
> My problem was that the examples I could find were either Tai languages
> or separated words with spaces.  I must say it feels strange to me to
> see sentence-terminating full stops in Thai script.

Indeed. Strange things do happen.

> > Let's not get carried away. If you want Thai based breaking you just
> > use lang="th" or do nothing since the default analysis will say: oh
> > Thai script, unknown language, assume Thai. Which is a good and
> > helpful thing to do.  
> 
> Are you sure about that?  The default analysis feels more like, "Oh Thai
> script, ignore the language, just assume Thai for line-breaking."  That
> was the behaviour when I looked at ICU a few years ago.

Agreed, and it still is. This proposal is an initial, simple attempt to get around that.
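
As a rough sketch of the selection logic being discussed: the breaker would look at the language tag first and fall back to dictionary-based Thai breaking only when the tag does not opt out. Everything here is hypothetical illustration, not ICU's actual code: the function names and the returned labels are made up, and `nodict` is the value proposed in this thread, not an existing one.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Extract key/value pairs from the -u- extension of a language tag.

    Simplified parsing: two-character subtags after "u" are keys, longer
    subtags are appended to the current key's value, and another
    single-character subtag ends the extension.
    """
    subtags = tag.lower().replace("_", "-").split("-")
    if "u" not in subtags:
        return {}
    keywords: dict[str, str] = {}
    key = None
    for sub in subtags[subtags.index("u") + 1:]:
        if len(sub) == 1:  # start of another extension
            break
        if len(sub) == 2:
            key = sub
            keywords[key] = ""
        elif key is not None:
            keywords[key] = (keywords[key] + "-" + sub).lstrip("-")
    return keywords


def contains_thai(text: str) -> bool:
    """True if any character lies in the Thai block U+0E00..U+0E7F."""
    return any("\u0e00" <= ch <= "\u0e7f" for ch in text)


def choose_breaker(tag: str, text: str) -> str:
    """Pick a (hypothetical) line-breaking strategy label."""
    if unicode_extension_keywords(tag).get("lb") == "nodict":
        return "no-dictionary"    # proposed opt-out
    if contains_thai(text):
        return "thai-dictionary"  # today's behaviour, whatever the language
    return "default"


if __name__ == "__main__":
    thai = "\u0e15\u0e31\u0e27"
    for tag in ("th", "ml", "th-u-lb-nodict"):
        print(tag, choose_breaker(tag, thai))
```

Note that in this sketch a tag such as "ml" still gets the Thai dictionary for Thai-script text, which mirrors the current behaviour you describe; only the explicit opt-out changes the result.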

> The problem with this approach lies in adding very lightweight
> locales - line-breaking, word-breaking, perhaps collation, and possibly
> a very few bits of data, but nothing more.  There may be a build issue
> for ICU - ICU uses wetware to convert algorithmic CLDR line- and
> word-breaking to its own data format.

I think that's called: good engineering. I.e. it could be a problem, but from my analysis I don't think it will be hard to do at all.

> For an example in an application, while in LibreOffice I can switch
> Thai spell-checking off by setting the language to Malayalam (which I
> do as the easy way of preventing the spelling in the Tai Tham script
> being checked as though the language were Siamese - I haven't installed
> a Malayalam spell-checker), LibreOffice still breaks Thai script text as
> though it were Siamese. The problem here is that usually Siamese is the
> best language to assume for line-breaking Thai script text in the
> middle of English text, though with its script class to language
> maps, LibreOffice for one could do better - if it has alternative
> dictionaries.

Correct. And it's ICU that is doing the line breaking.

Yours,
Martin


