From cldr-users at unicode.org Mon May 1 01:58:11 2017
From: cldr-users at unicode.org (Cameron Dutro via CLDR-Users)
Date: Sun, 30 Apr 2017 23:58:11 -0700
Subject: Word break question
In-Reply-To:
References: <20170430021236.62f345a3@JRWUBU2> <20170430141106.2fb797ed@JRWUBU2> <20170430220149.0d183722@JRWUBU2> <20170501000732.0605e779@JRWUBU2>
Message-ID:

Awesome, thanks Mark! And another thanks to Richard for being so willing to help on a Sunday :)

-Cameron

On Sun, Apr 30, 2017 at 4:36 PM, Mark Davis ☕️ via CLDR-Users <
cldr-users at unicode.org> wrote:

> Richard, Cameron, Philippe, thanks for tracking this down... I filed a
> ticket at http://unicode.org/cldr/trac/ticket/10226. If you have any
> comments on the proposed solution, please add them there so we don't lose
> them.
>
> Mark
>
> On Sun, Apr 30, 2017 at 4:07 PM, Richard Wordingham via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> On Sun, 30 Apr 2017 15:09:17 -0700
>> Cameron Dutro via CLDR-Users wrote:
>>
>> > Hey Richard,
>> >
>> > Unfortunately the Hebrew letters cannot be ignored, since the $AHLetter
>> > variable introduces a character class, which is the source of my
>> > confusion. Here are all the variables in question:
>> >
>> > $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
>> > $Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
>> > $Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
>> > $ALetter(1) = \p{Word_Break=ALetter}
>> > $ALetter(2) = ($ALetter(1) $FEZ*)
>> > $FEZ = [$Format $Extend $ZWJ]
>> > $Format = \p{Word_Break=Format}
>> > $Extend = \p{Word_Break=Extend}
>> > $ZWJ = \p{Word_Break=ZWJ}
>>
>> > How should my implementation handle these cases?
>>
>> It would have been friendlier if, instead of doing macro-like
>> expansions, it had compounded finite state machines. Then it would
>> have reported an error at "$AHLetter = [$ALetter(2)
>> $Hebrew_Letter(2)]". Basically, the CLDR definition is wrong!
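The objection can be made concrete in a few lines: a regex character class cannot contain the *sequences* $ALetter(2) and $Hebrew_Letter(2), so the class has to be built from the base sets first, with the trailing $FEZ* appended afterwards. A minimal sketch in Python, using abbreviated ASCII and Hebrew-block ranges as stand-ins for the real Word_Break property data (the ranges are simplifications, not CLDR's definitions):

```python
import re

# Stand-ins for the Word_Break property classes; a real implementation
# would use the full Unicode property data (e.g. via the `regex` module
# or ICU), not these abbreviated ranges.
ALETTER = "A-Za-z"                 # ~ \p{Word_Break=ALetter}
HEBREW_LETTER = "\u05D0-\u05EA"    # ~ \p{Word_Break=Hebrew_Letter}
FEZ = "\u200C\u200D"               # ~ [$Format $Extend $ZWJ]

# $AHLetter = [$ALetter(2) $Hebrew_Letter(2)] is ill-formed: a character
# class cannot hold the sequences ($ALetter $FEZ*).  Build the class from
# the base sets first, then append the trailing $FEZ* once:
AHLETTER_1 = f"[{ALETTER}{HEBREW_LETTER}]"   # $AHLetter(1)
AHLETTER_2 = f"(?:{AHLETTER_1}[{FEZ}]*)"     # $AHLetter(2) = ($AHLetter(1) $FEZ*)

word = re.compile(f"{AHLETTER_2}+")

assert word.fullmatch("ab\u200Dcd")      # a ZWJ stays inside the word
assert word.fullmatch("ab\u05D0\u05D1")  # mixed Latin/Hebrew is one run
```

This is only a demonstration of why the set/sequence distinction matters, not a conforming UAX #29 segmenter.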
What >> CLDR should have is >> >> $AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)] >> $AHLetter(2) = ($AHLetter(1) $FEZ*) >> >> Alternatively, it could have >> $AHLetter = ($ALetter | $Hebrew_Letter) >> >> At this point, you may realise that ICU does not derive the break >> iterators from the CLDR definitions. Instead, they are derived >> manually from the specifications. What can then happen is that >> someone works from what the specification should say, rather than from >> what it does say. >> >> Richard. >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users >> > > > _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From cldr-users at unicode.org Thu May 25 05:13:41 2017 From: cldr-users at unicode.org (Martin Hosken via CLDR-Users) Date: Thu, 25 May 2017 11:13:41 +0100 Subject: propose th-u-lb-nodict Message-ID: <20170525111341.4a91c8d6@sil-mh7> Dear All, When line breaking minority text in, say, the Thai script or any script that uses dictionary based breaking, the dictionary used is for the dominant language. A while back, we addressed this for the Khmer script and I've had no complaints since. Now, we could try to do something similar for other dictionary broken languages. But I would like to suggest a simpler approach that can address fixed texts very well, and that is to add a nodict line break locale property. This property would switch the line break iterator to one that uses a set of rules with no dictionary statement in it. In other words, SA type characters are treated as one great long string and it is up to the source text to have inserted appropriate ZWSP, or other kinds of spaces, to control the breaks. What do folks think? 
From my perspective, this would solve a bunch of bugs that are pointed my way with regard to line breaking and minority languages, even if it is not the best possible solution. It's pretty cheap to do and it doesn't change anything that is already out there. Yours, Martin From cldr-users at unicode.org Thu May 25 08:09:53 2017 From: cldr-users at unicode.org (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?= via CLDR-Users) Date: Thu, 25 May 2017 15:09:53 +0200 Subject: propose th-u-lb-nodict In-Reply-To: <20170525111341.4a91c8d6@sil-mh7> References: <20170525111341.4a91c8d6@sil-mh7> Message-ID: Interesting idea. You should file a ticket with your proposal. {phone} On May 25, 2017 12:14, "Martin Hosken via CLDR-Users" < cldr-users at unicode.org> wrote: > Dear All, > > When line breaking minority text in, say, the Thai script or any script > that uses dictionary based breaking, the dictionary used is for the > dominant language. A while back, we addressed this for the Khmer script and > I've had no complaints since. Now, we could try to do something similar for > other dictionary broken languages. But I would like to suggest a simpler > approach that can address fixed texts very well, and that is to add a > nodict line break locale property. This property would switch the line > break iterator to one that uses a set of rules with no dictionary statement > in it. In other words, SA type characters are treated as one great long > string and it is up to the source text to have inserted appropriate ZWSP, > or other kinds of spaces, to control the breaks. > > What do folks think? From my perspective, this would solve a bunch of bugs > that are pointed my way with regard to line breaking and minority > languages, even if it is not the best possible solution. It's pretty cheap > to do and it doesn't change anything that is already out there. 
>
> Yours,
> Martin
> _______________________________________________
> CLDR-Users mailing list
> CLDR-Users at unicode.org
> http://unicode.org/mailman/listinfo/cldr-users
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Thu May 25 10:38:19 2017
From: cldr-users at unicode.org (Markus Scherer via CLDR-Users)
Date: Thu, 25 May 2017 08:38:19 -0700
Subject: propose th-u-lb-nodict
In-Reply-To: <20170525111341.4a91c8d6@sil-mh7>
References: <20170525111341.4a91c8d6@sil-mh7>
Message-ID:

So this would be not just "no dictionary", it would be "no breaks at all in any script that uses no spaces between words". It would be nice to come up with a 5-8 letter abbreviation for what it does, rather than what it doesn't do.

Also, is it more useful to have no breaks in, say, Thai strings, at all (gross under-segmentation) -- or to have breaks between orthographic syllables (over-segmentation)? (That would be a yet different subtag.)

markus
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From cldr-users at unicode.org Thu May 25 15:00:56 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Thu, 25 May 2017 21:00:56 +0100
Subject: propose th-u-lb-nodict
In-Reply-To:
References: <20170525111341.4a91c8d6@sil-mh7>
Message-ID: <20170525210056.7918cf2e@JRWUBU2>

On Thu, 25 May 2017 08:38:19 -0700
Markus Scherer via CLDR-Users wrote:

> So this would be not just "no dictionary", it would be "no breaks at
> all in any script that uses no spaces between words".

No, the behaviour would be to treat SA as AL. While this can cause major problems for newspaper columns, the effect for wider text such as memoranda would rather be numerous extents of white space. I presume books would get at least some type-setting treatment, i.e. line-break opportunities would be inserted manually.

Treating all SA as AL is not entirely appropriate.
For example, treating U+0E46 THAI CHARACTER MAI YAMOK as 'Exclamation' (EX) would be better; it should not be split from the alphabetical characters preceding it.

> It would be nice to come up with a 5-8 letter abbreviation for what it
> does, rather than what it doesn't do.
>
> Also, is it more useful to have no breaks in, say, Thai strings, at
> all (gross under-segmentation) -- or to have breaks between
> orthographic syllables (over-segmentation)?
> (That would be a yet different subtag.)

Do not believe Indian claims about the primacy of orthographic syllables. The natural division for within-word line-breaking in the Thai and Lao scripts is the phonetic syllable. Indeed, Lao line-breaking tends to happen at syllable boundaries.

There are, of course, several levels of line-breaking. Artificial breaks are more at the level of hyphenation. If you want a suggestion for simple emergency breaks in Thai and Lao, the best place is before preposed vowels. The next obvious place is after the visargas, though the Thai language (which, of course, is not the subject of the suggestion) does have some exceptions such as silenced consonants following U+0E30.

The locale example given is, of course, almost oxymoronic. In general, of course, a *Thai language* dictionary should not be used for another language. Unfortunately, I am trying to think of a good example of a scriptio continua language for which a Thai dictionary is clearly completely useless. (Pali, Pattani Malay and Northern Khmer in the Thai script are *not* scriptio continua.) For a Tai language like Northern Thai, a Thai dictionary is not completely useless. However, this raises the next point. For Northern Thai, nod_TH (or more precisely nod-Thai_TH), one would normally want a Northern Thai dictionary. The intention behind nod-u-lb-nodict should be not to use a Thai dictionary for line-breaking, not to avoid using a Northern Thai dictionary.
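The suggested remapping can be sketched as a tiny class function for a rules-only breaker: SA characters are treated as AL, except U+0E46 MAI YAMOK, which gets an EX-like "glue left" class. Note that UAX #14 actually assigns U+0E46 class SA; the EX treatment is the suggestion above, and the break test is a deliberately crude illustration:

```python
# Rules-only (no-dictionary) sketch: SA is treated as AL, except
# U+0E46 MAI YAMOK, which is given an EX-like class so it can never be
# separated from what precedes it.
MAI_YAMOK = "\u0E46"

def lb_class(ch):
    if ch == MAI_YAMOK:
        return "EX"            # glue to the preceding character
    if ch in (" ", "\u200B"):
        return "SP"            # space/ZWSP: break opportunity after it
    return "AL"                # everything else, including SA-as-AL

def can_break(prev, nxt):
    """Break only after a space/ZWSP, and never before an EX character."""
    return lb_class(prev) == "SP" and lb_class(nxt) != "EX"

assert can_break("\u200B", "\u0E01")       # ZWSP then KO KAI: break allowed
assert not can_break("\u0E01", "\u0E02")   # inside an SA run: no break
assert not can_break("\u200B", MAI_YAMOK)  # never strand MAI YAMOK
```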
A better modifier would be "-u-lb-noth", meaning "Do not fall back to a Thai dictionary". Contrariwise, "-u-lb-th" could authorise fallback to a Thai dictionary. Perhaps "u-lb-la" should authorise dictionary-based line-breaking of scriptio continua Latin. With these ideas, "pi-u-lb-noth" should let me type Pali without worrying about spurious line-breaks in the middle of words. (Of course, I still have to watch out for spurious line-breaks in Thai.) Richard. From cldr-users at unicode.org Thu May 25 16:39:58 2017 From: cldr-users at unicode.org (Peter Edberg via CLDR-Users) Date: Thu, 25 May 2017 14:39:58 -0700 Subject: propose th-u-lb-nodict In-Reply-To: <593544C8-49DA-4112-8D02-7D681975E588@mac.com> References: <20170525111341.4a91c8d6@sil-mh7> <593544C8-49DA-4112-8D02-7D681975E588@mac.com> Message-ID: <0617441D-BBD9-4396-A074-654197119B82@apple.com> (resending from correct account, so it goes to the list) > On May 25, 2017, at 10:14 AM, Peter Edberg wrote: > > Martin, > > CLDR already defines following -u-lb- and -u-lw- extensions for controlling linebreak behavior ( see http://www.unicode.org/reports/tr35/#Key_Type_Definitions ): > > A Unicode Line Break Style Identifier defines a preferred line break style corresponding to the CSS level 3 line-break option . Specifying "lb" in a locale identifier overrides the locale?s default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml . > "lb" Line break style "strict" CSS level 3 line-break=strict, e.g. treat CJ as NS > "normal" CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh > "loose" CSS lev 3 line-break=loose > A Unicode Line Break Word Identifier defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option . The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml . 
> "lw" Line break word handling "normal" CSS level 3 word-break=normal, normal script/language behavior for midword breaks > "breakall" CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting > "keepall" CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks > > We cannot add -lb-nodict- because regardless of dictionary usage we still need to be able to select among CSS strict/normal/loose behavior. > > I see two options: > > 1. One option is to add something that goes beyond the -lw-keepall- option to prohibit midword breaks *including* dictionary breaks. > > 2. The other, which I prefer, is to add a new, independent option for controlling dictionary breaks. This could be -u-ld- with options like the following (the options have to be 5-8 alphanum): > > -u-ld-nodict (no dictionary at all) > # and then perhaps options for specific dictionaries. Right now use of dictionaries is a function of script range, so the options might need to allow specification of scriopt range and then dictionary, e.g. > > -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum) > -u-ld-thai0-sanskrit > > Perhaps the -nodict should also be by script, e.g. > -u-ld-thai0-nodict > still allows dictionary use for CJK, just none for Thai script. > > - Peter E > > >> On May 25, 2017, at 8:38 AM, Markus Scherer via CLDR-Users > wrote: >> >> So this would be not just "no dictionary", it would be "no breaks at all in any script that uses no spaces between words". >> It would be nice to come up with a 5-8 letter abbreviation for what it does, rather than what it doesn't do. >> >> Also, is it more useful to have no breaks in, say, Thai strings, at all (gross under-segmentation) -- or to have breaks between orthographic syllables (over-segmentation)? >> (That would be a yet different subtag.) 
> > >> On May 25, 2017, at 3:13 AM, Martin Hosken via CLDR-Users > wrote: >> >> Dear All, >> >> When line breaking minority text in, say, the Thai script or any script that uses dictionary based breaking, the dictionary used is for the dominant language. A while back, we addressed this for the Khmer script and I've had no complaints since. Now, we could try to do something similar for other dictionary broken languages. But I would like to suggest a simpler approach that can address fixed texts very well, and that is to add a nodict line break locale property. This property would switch the line break iterator to one that uses a set of rules with no dictionary statement in it. In other words, SA type characters are treated as one great long string and it is up to the source text to have inserted appropriate ZWSP, or other kinds of spaces, to control the breaks. >> >> What do folks think? From my perspective, this would solve a bunch of bugs that are pointed my way with regard to line breaking and minority languages, even if it is not the best possible solution. It's pretty cheap to do and it doesn't change anything that is already out there. >> >> Yours, >> Martin >> _______________________________________________ >> CLDR-Users mailing list >> CLDR-Users at unicode.org >> http://unicode.org/mailman/listinfo/cldr-users > -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From cldr-users at unicode.org Thu May 25 18:30:36 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Fri, 26 May 2017 00:30:36 +0100
Subject: propose th-u-lb-nodict
In-Reply-To: <0617441D-BBD9-4396-A074-654197119B82@apple.com>
References: <20170525111341.4a91c8d6@sil-mh7> <593544C8-49DA-4112-8D02-7D681975E588@mac.com> <0617441D-BBD9-4396-A074-654197119B82@apple.com>
Message-ID: <20170526003036.428edce1@JRWUBU2>

On Thu, 25 May 2017 14:39:58 -0700
Peter Edberg via CLDR-Users wrote:

> > -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum)
> > -u-ld-thai0-sanskrit

I'm not sure why there should be a line-breaking 'dictionary' for Pali in the Thai script.

> > Perhaps the -nodict should also be by script, e.g.
> > -u-ld-thai0-nodict
> > still allows dictionary use for CJK, just none for Thai script.

Most dictionaries should be identified by language, not script. The problem being addressed is the use of a Siamese dictionary for breaking text in other languages.

There is something practical that we haven't touched on. Should we be defining the language to be assumed for embedded foreign text?

Richard.
From cldr-users at unicode.org Thu May 25 18:55:29 2017 From: cldr-users at unicode.org (Peter Edberg via CLDR-Users) Date: Thu, 25 May 2017 16:55:29 -0700 Subject: propose th-u-lb-nodict In-Reply-To: <20170526003036.428edce1@JRWUBU2> References: <20170525111341.4a91c8d6@sil-mh7> <593544C8-49DA-4112-8D02-7D681975E588@mac.com> <0617441D-BBD9-4396-A074-654197119B82@apple.com> <20170526003036.428edce1@JRWUBU2> Message-ID: <2C18B250-8F3C-4735-A4A1-C08A59E8A89A@apple.com> > On May 25, 2017, at 4:30 PM, Richard Wordingham via CLDR-Users wrote: > > On Thu, 25 May 2017 14:39:58 -0700 > Peter Edberg via CLDR-Users wrote: > >>> -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum) >>> -u-ld-thai0-sanskrit > > I'm not sure why there should be line-breaking 'dictionary' for Pali in > Thai script, > >>> Perhaps the -nodict should also be by script, e.g. >>> -u-ld-thai0-nodict >>> still allows dictionary use for CJK, just none for Thai script. > > Most dictionaries should be identified by language, not script. The > problem being addressed is the use of a Siamese dictionary for breaking > text in other languages. The issue is that libraries that implement this spec, such as ICU , would typically choose a dictionary to use based on script range. So one needs to be able to specify, e.g. - For Thai script, use xxx dictionary. - For Khmer script, use yyy dictionary. The xxx and yyy would specify language, but you still need to associate them with a script. - Peter E > > There is something practical that we haven't touched on. Should we be > defining the language to be assumed for embedded foreign text? > > Richard. 
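The point that the dictionary is chosen per script run, so any override must also be keyed by script, can be sketched like this. The ranges and dictionary codes below are illustrative only; ICU's actual tables are far more complete:

```python
# Sketch of script-range dictionary selection: the dictionary is chosen
# from the script of the text run, so per-locale overrides also have to
# be keyed by script.  Ranges and codes are illustrative, not ICU's data.
SCRIPT_RANGES = {
    "Thai": (0x0E00, 0x0E7F),
    "Khmer": (0x1780, 0x17FF),
}
DEFAULT_DICTIONARY = {"Thai": "th", "Khmer": "km"}

def pick_dictionary(run, overrides=None):
    """Dictionary language for a text run.  Overrides like {'Thai': 'nod'}
    (prefer a Northern Thai dictionary) or {'Thai': None} (no dictionary
    for Thai script) model the per-script control being discussed."""
    cp = ord(run[0])  # assume the run is script-uniform; probe its first char
    for script, (lo, hi) in SCRIPT_RANGES.items():
        if lo <= cp <= hi:
            if overrides and script in overrides:
                return overrides[script]
            return DEFAULT_DICTIONARY[script]
    return None  # non-dictionary script: purely rule-based breaking

assert pick_dictionary("\u0E44\u0E17\u0E22") == "th"           # a Thai run
assert pick_dictionary("\u0E44\u0E17\u0E22", {"Thai": None}) is None
```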
> _______________________________________________ > CLDR-Users mailing list > CLDR-Users at unicode.org > http://unicode.org/mailman/listinfo/cldr-users From cldr-users at unicode.org Fri May 26 02:52:44 2017 From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users) Date: Fri, 26 May 2017 08:52:44 +0100 Subject: propose th-u-lb-nodict In-Reply-To: <2C18B250-8F3C-4735-A4A1-C08A59E8A89A@apple.com> References: <20170525111341.4a91c8d6@sil-mh7> <593544C8-49DA-4112-8D02-7D681975E588@mac.com> <0617441D-BBD9-4396-A074-654197119B82@apple.com> <20170526003036.428edce1@JRWUBU2> <2C18B250-8F3C-4735-A4A1-C08A59E8A89A@apple.com> Message-ID: <20170526085244.39faa1e2@JRWUBU2> On Thu, 25 May 2017 16:55:29 -0700 Peter Edberg via CLDR-Users wrote: > > On May 25, 2017, at 4:30 PM, Richard Wordingham via CLDR-Users > > wrote: > > > > On Thu, 25 May 2017 14:39:58 -0700 > > Peter Edberg via CLDR-Users wrote: > > > >>> -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum) > >>> -u-ld-thai0-sanskrit > > > > I'm not sure why there should be line-breaking 'dictionary' for > > Pali in Thai script, > > > >>> Perhaps the -nodict should also be by script, e.g. > >>> -u-ld-thai0-nodict > >>> still allows dictionary use for CJK, just none for Thai script. > > > > Most dictionaries should be identified by language, not script. The > > problem being addressed is the use of a Siamese dictionary for > > breaking text in other languages. > > The issue is that libraries that implement this spec, such as ICU , > would typically choose a dictionary to use based on script range. That's a fault. They should first consider the language. Now, there is a related issue of whether a locale should be able to specify the language of stretches in an unexpected script. Word processors often do a tripartition of scripts into simple, complex, and CJK, though the corresponding standards fail to define the three categories, and use that to select the font and sometimes (usually?) the language. 
This works well for most multi-script paragraphs once the tripartition has stabilised. > So > one needs to be able to specify, e.g. > - For Thai script, use xxx dictionary. > - For Khmer script, use yyy dictionary. > > The xxx and yyy would specify language, but you still need to > associate them with a script. I believe that for Northeastern Thai one needs a preference list - prefer a NE Thai dictionary, allow fall back to a Siamese dictionary. Now, in the Lao script (e.g. for Tai Noi) that gets more complicated if one wants to cater for modern language rather than just transcribing old manuscripts. Systematic omission of tone marks could confuse a line-breaker that looks for word boundaries between correctly spelt words. For Northern Thai, one may find it better to prefer a Northern Thai dictionary but refuse to use a Siamese dictionary. Richard. From cldr-users at unicode.org Fri May 26 03:55:48 2017 From: cldr-users at unicode.org (Martin Hosken via CLDR-Users) Date: Fri, 26 May 2017 09:55:48 +0100 Subject: propose th-u-lb-nodict In-Reply-To: <20170525210056.7918cf2e@JRWUBU2> References: <20170525111341.4a91c8d6@sil-mh7> <20170525210056.7918cf2e@JRWUBU2> Message-ID: <20170526095548.5efe2c77@sil-mh7> Dear Richard, > > So this would be not just "no dictionary", it would be "no breaks at > > all in any script that uses no spaces between words". > > ?No, the behaviour would be to treat SA as AL. While this can cause > major problems for newspaper columns, the effect for wider text such as > memoranda would rather be numerous extents of white space. I presume > books would get at least some type-setting treatment, i.e. line-break > opportunities would be inserted manually. Correct. The assumption is that the text has been appropriately broken using ZWSP. One wouldn't select this for text that was not in that state. Notice this isn't some fallback behaviour. 
This is specifically chosen for a run of text by the document creator, in full knowledge of its potential impact. > Treating all SA as AL is not entirely appropriate. For example, > treating U+0E46 THAI CHARACTER MAI YAMOK as 'Exclamation' (EX) would be > better; should not be split from the alphabetical > characters preceding it. The good news is that since it is just another set of break iterator rules, we can do things like that, so long as we make it clear what we are doing and why. > > It would be nice to come up with a 5-8 letter abbreviation for what it > > does, rather than what it doesn't do. > > > > Also, is it more useful to have no breaks in, say, Thai strings, at > > all (gross under-segmentation) -- or to have breaks between > > orthographic syllables (over-segmentation)? > > (That would be a yet different subtag.) What makes Thai hard is that you can't analyse a text into orthographic syllables without knowledge of the language. > Do not believe Indian claims about the primacy of orthographic > syllables. The natural division within-word line-breaking in the Thai > and Lao scripts is the phonetic syllable. Indeed, Lao line-breaking > tends to happen at syllable boundaries. > > There are, of course, several levels of line-breaking. Artificial > breaks are more at the level of hyphenation. If you want a suggestion > for simple emergency breaks in Thai and Lao, the best place is before > preposed vowels. The next obvious place is after the visargas, though > the Thai language (which, of course, is not the subject of the > suggestion) does have some exceptions such as silenced consonants > following U+0E30. We would need to be careful about adding emergency breaks. For example, in polysyllabic words, we wouldn't want to break even between two syllables. So my proposal really would be: only break at places that other languages would break, with no recourse to a dictionary. > The locale example given is, of course, almost oxymoronic. 
In general, > of course, a *Thai language* dictionary should not be used for another > language. Unfortunately, I am trying to think of a good example > of a scriptio continua language for which a Thai dictionary is clearly > completely useless. (Pali, Pattani Malay and Northern Khmer in the > Thai script are *not* scriptio continua.) For a Tai language like > Northern Thai, a Thai dictionary is not completely useless. However, > this raised the next point. One example I have is So (Bruic-Katuic-Mon Khmer), but there are plenty of other languages that aren't Tai but that use Thai script. And again, nobody *has* to use this thing. You only turn it on if you want to say: yes I have broken this thing into words myself, please don't break it up any more through the use of a dictionary. That's all it's saying. It's not trying to be clever. It's not trying to make anyone's life easier. It's saying: *stop* trying to be clever and think you know better than the document author. Just break where I say you can break and be done with it. > A better modifier would be "-u-lb-noth", meaning "Do not fall back to > a Thai dictionary". Contrariwise, "-u-lb-th" could authorise fallback > to a Thai dictionary. Perhaps "u-lb-la" should authorise > dictionary-based line-breaking of scriptio continua Latin. With these > ideas, "pi-u-lb-noth" should let me type Pali without worrying about > spurious line-breaks in the middle of words. (Of course, I still have > to watch out for spurious line-breaks in Thai.) Let's not get carried away. If you want Thai based breaking you just use lang="th" or do nothing since the default analysis will say: oh Thai script, unknown language, assume Thai. Which is a good and helpful thing to do. 
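The "break only where the author says" behaviour being proposed is tiny to implement. A minimal sketch, assuming pre-segmented text in which ZWSP (U+200B) or spaces mark the permitted boundaries:

```python
# Minimal sketch of the proposed lb=nodict behaviour: an SA run is one
# long unbreakable string, and the only break opportunities are the ones
# the author wrote in as ZWSP (U+200B) or spaces.
ZWSP = "\u200B"

def nodict_breaks(text):
    """Indices at which a line may break: immediately after each
    author-inserted ZWSP or space, nowhere else."""
    return [i + 1 for i, ch in enumerate(text) if ch in (ZWSP, " ")]

# 'สวัสดี' + ZWSP + 'ครับ': the author marked one word boundary.
thai = "\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35" + ZWSP + "\u0E04\u0E23\u0E31\u0E1A"
assert nodict_breaks(thai) == [7]   # break only at the ZWSP
assert nodict_breaks("\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35") == []  # no marks, no breaks
```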
Yours,
Martin

From cldr-users at unicode.org Fri May 26 14:13:20 2017
From: cldr-users at unicode.org (Richard Wordingham via CLDR-Users)
Date: Fri, 26 May 2017 20:13:20 +0100
Subject: propose th-u-lb-nodict
In-Reply-To: <20170526095548.5efe2c77@sil-mh7>
References: <20170525111341.4a91c8d6@sil-mh7> <20170525210056.7918cf2e@JRWUBU2> <20170526095548.5efe2c77@sil-mh7>
Message-ID: <20170526201320.401d5039@JRWUBU2>

On Fri, 26 May 2017 09:55:48 +0100
Martin Hosken via CLDR-Users wrote:

> > > Also, is it more useful to have no breaks in, say, Thai strings,
> > > at all (gross under-segmentation) -- or to have breaks between
> > > orthographic syllables (over-segmentation)?
> > > (That would be a yet different subtag.)
>
> What makes Thai hard is that you can't analyse a text into
> orthographic syllables without knowledge of the language.

That is solved by assuming one orthographic syllable if it is not obvious there are two or possibly three - I'm not sure how many orthographic syllables there are in /tua/ 'body' ตัว <TO TAO, MAI HAN-AKAT, WO WAEN>. I think there are two.

> > There are, of course, several levels of line-breaking. Artificial
> > breaks are more at the level of hyphenation. If you want a
> > suggestion for simple emergency breaks in Thai and Lao, the best
> > place is before preposed vowels. The next obvious place is after
> > the visargas, though the Thai language (which, of course, is not
> > the subject of the suggestion) does have some exceptions such as
> > silenced consonants following U+0E30.
>
> We would need to be careful about adding emergency breaks. For
> example, in polysyllabic words, we wouldn't want to break even
> between two syllables. So my proposal really would be: only break at
> places that other languages would break, with no recourse to a
> dictionary.

Emergency breaks belong to the domain of hyphenation, which I believe is beyond the scope of CLDR. If a word won't fit in a row of text, it usually needs breaking - that even happens with English.
> > The locale example given is, of course, almost oxymoronic. In
> > general, of course, a *Thai language* dictionary should not be used
> > for another language. Unfortunately, I am trying to think of a
> > good example of a scriptio continua language for which a Thai
> > dictionary is clearly completely useless. (Pali, Pattani Malay and
> > Northern Khmer in the Thai script are *not* scriptio continua.)
> > For a Tai language like Northern Thai, a Thai dictionary is not
> > completely useless. However, this raises the next point.
>
> One example I have is So (Bruic-Katuic-Mon Khmer), but there are
> plenty of other languages that aren't Tai but that use Thai script.

My problem was that the examples I could find were either Tai languages or separated words with spaces. I must say it feels strange to me to see sentence-terminating full stops in Thai script.

> > A better modifier would be "-u-lb-noth", meaning "Do not fall back
> > to a Thai dictionary". Contrariwise, "-u-lb-th" could authorise
> > fallback to a Thai dictionary. Perhaps "u-lb-la" should authorise
> > dictionary-based line-breaking of scriptio continua Latin. With
> > these ideas, "pi-u-lb-noth" should let me type Pali without
> > worrying about spurious line-breaks in the middle of words. (Of
> > course, I still have to watch out for spurious line-breaks in
> > Thai.)
>
> Let's not get carried away. If you want Thai based breaking you just
> use lang="th" or do nothing since the default analysis will say: oh
> Thai script, unknown language, assume Thai. Which is a good and
> helpful thing to do.

Are you sure about that? The default analysis feels more like, "Oh Thai script, ignore the language, just assume Thai for line-breaking." That was the behaviour when I looked at ICU a few years ago. The precise logic in ICU was that every language, significantly including English, uses the Thai-language word-boundary detector to do Thai-script line-breaking.
I was able to create a Pali line-breaker in ICU that recognised that Pali is not written scriptio continua in the Thai script. (Word boundaries are often lost before words beginning with vowels - not very different to British or US Sanskrit.) The problem with this approach lies in adding very lightweight locales - line-breaking, word-breaking, perhaps collation, and possibly a very few bits of data, but nothing more. There may be a build issue for ICU - ICU uses wetware to convert algorithmic CLDR line- and word-breaking to its own data format. For an example in an application, while in LibreOffice I can switch Thai spell-checking off by setting the language to Malayalam (which I do as the easy way of preventing the spelling in the Tai Tham script being checked as though the language were Siamese - I haven't installed a Malayalam spell-checker), LibreOffice still breaks Thai script text as though it were Siamese. The problem here is that usually Siamese is the best language to assume for line-breaking Thai script text in the middle of English text, though with its script class to language maps, LibreOffice for one could do better - if it has alternative dictionaries. (I need to find out how I got LibreOffice, using correct tagging, to do nod_Lana spell-checking a few years ago. I think I had to update Hunspell to a more recent version of Unicode.) Richard. From cldr-users at unicode.org Sat May 27 02:30:13 2017 From: cldr-users at unicode.org (Martin Hosken via CLDR-Users) Date: Sat, 27 May 2017 08:30:13 +0100 Subject: propose th-u-lb-nodict In-Reply-To: <20170526201320.401d5039@JRWUBU2> References: <20170525111341.4a91c8d6@sil-mh7> <20170525210056.7918cf2e@JRWUBU2> <20170526095548.5efe2c77@sil-mh7> <20170526201320.401d5039@JRWUBU2> Message-ID: <20170527083013.2a9f0bf1@sil-mh7> Dear Richard, > > What makes Thai hard is that you can't analyse a text into > > orthographic syllables without knowledge of the language. 
>
> That is solved by assuming one orthographic syllable if it is not
> obvious there are two or possibly three - I'm not sure how many
> orthographic syllables there are in /tua/ 'body' ตัว <TO TAO, MAI
> HAN-AKAT, WO WAEN>. I think there are two.

I.e. it's hard. There are many clear cases, but there are as many ambiguous cases. This differs from, say, Burmese script, where you can algorithmically work out all syllable breaks (I bet you'll find an ambiguous one now, just to prove me wrong!)

> > We would need to be careful about adding emergency breaks. For
> > example, in polysyllabic words, we wouldn't want to break even
> > between two syllables. So my proposal really would be: only break at
> > places that other languages would break, with no recourse to a
> > dictionary.
>
> Emergency breaks belong to the domain of hyphenation, which I believe
> is beyond the scope of CLDR. If a word won't fit in a row of text, it
> usually needs breaking - that even happens with English.

s/emergency/mid word/ I.e. the point is that the line breaker shouldn't be doing syllable breaking. At least that is not what I want for lb-nodict.

> > One example I have is So (Bruic-Katuic-Mon Khmer), but there are
> > plenty of other languages that aren't Tai but that use Thai script.
>
> My problem was that the examples I could find were either Tai languages
> or separated words with spaces. I must say it feels strange to me to
> see sentence-terminating full stops in Thai script.

Indeed. Strange things do happen.

> > Let's not get carried away. If you want Thai based breaking you just
> > use lang="th" or do nothing since the default analysis will say: oh
> > Thai script, unknown language, assume Thai. Which is a good and
> > helpful thing to do.
>
> Are you sure about that? The default analysis feels more like, "Oh Thai
> script, ignore the language, just assume Thai for line-breaking." That
> was the behaviour when I looked at ICU a few years ago.

Agreed and it still is.
And this is an initial simple attempt to get around that. > The problem with this approach lies in adding very lightweight > locales - line-breaking, word-breaking, perhaps collation, and possibly > a very few bits of data, but nothing more. There may be a build issue > for ICU - ICU uses wetware to convert algorithmic CLDR line- and > word-breaking to its own data format. I think that's called: good engineering. I.e. it could be a problem, but from my analysis I don't think it will be hard to do at all. > For an example in an application, while in LibreOffice I can switch > Thai spell-checking off by setting the language to Malayalam (which I > do as the easy way of preventing the spelling in the Tai Tham script > being checked as though the language were Siamese - I haven't installed > a Malayalam spell-checker), LibreOffice still breaks Thai script text as > though it were Siamese. The problem here is that usually Siamese is the > best language to assume for line-breaking Thai script text in the > middle of English text, though with its script class to language > maps, LibreOffice for one could do better - if it has alternative > dictionaries. Correct. And it's ICU that is doing the line breaking. Yours, Martin