propose th-u-lb-nodict

Peter Edberg via CLDR-Users cldr-users at unicode.org
Thu May 25 16:39:58 CDT 2017


(resending from correct account, so it goes to the list)

> On May 25, 2017, at 10:14 AM, Peter Edberg wrote:
> 
> Martin,
> 
> CLDR already defines the following -u-lb- and -u-lw- extensions for controlling line break behavior (see http://www.unicode.org/reports/tr35/#Key_Type_Definitions):
> 
> A Unicode Line Break Style Identifier <http://www.unicode.org/reports/tr35/#UnicodeLineBreakStyleIdentifier> defines a preferred line break style corresponding to the CSS level 3 line-break option <https://drafts.csswg.org/css-text/#line-break-property>. Specifying "lb" in a locale identifier overrides the locale's default style (which may correspond to "normal" or "strict"). The valid values are those name attribute values in the type elements of key name="lb" in bcp47/segmentation.xml <http://www.unicode.org/repos/cldr/tags/latest/common/bcp47/segmentation.xml>.
> 
>   "lb"  Line break style          "strict"    CSS level 3 line-break=strict, e.g. treat CJ as NS
>                                   "normal"    CSS level 3 line-break=normal, e.g. treat CJ as ID, break before hyphens for ja,zh
>                                   "loose"     CSS level 3 line-break=loose
> 
> A Unicode Line Break Word Identifier <http://www.unicode.org/reports/tr35/#UnicodeLineBreakWordIdentifier> defines preferred line break word handling behavior corresponding to the CSS level 3 word-break option <https://drafts.csswg.org/css-text/#word-break-property>. The valid values are those name attribute values in the type elements of key name="lw" in bcp47/segmentation.xml <http://www.unicode.org/repos/cldr/tags/latest/common/bcp47/segmentation.xml>.
> 
>   "lw"  Line break word handling  "normal"    CSS level 3 word-break=normal, normal script/language behavior for midword breaks
>                                   "breakall"  CSS level 3 word-break=break-all, allow midword breaks unless forbidden by lb setting
>                                   "keepall"   CSS level 3 word-break=keep-all, prohibit midword breaks except for dictionary breaks
> 
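
To make the existing keys concrete, here is a minimal Python sketch of reading "lb" and "lw" out of a locale identifier. It is an illustration only, not ICU or CLDR code: the function name line_break_options and the KNOWN_TYPES table are hypothetical, and the allowed values are simply copied from the list quoted above rather than read from bcp47/segmentation.xml.

```python
# Sketch only: the values below are copied from the quoted TR35 excerpt.
# A real implementation would load them from bcp47/segmentation.xml.
KNOWN_TYPES = {
    "lb": {"strict", "normal", "loose"},
    "lw": {"normal", "breakall", "keepall"},
}


def line_break_options(locale_id):
    """Return the lb/lw settings found in the -u- extension of locale_id."""
    subtags = locale_id.lower().split("-")
    try:
        start = subtags.index("u") + 1
    except ValueError:
        return {}  # no -u- extension present

    options = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # another singleton: the -u- extension has ended
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
        elif key in KNOWN_TYPES:
            if subtag not in KNOWN_TYPES[key]:
                raise ValueError(f"unknown value {subtag!r} for key {key!r}")
            options[key] = subtag
    return options


print(line_break_options("th-u-lb-strict-lw-keepall"))
```

With this table, "th-u-lb-nodict" is rejected, because "nodict" is not among the values currently defined for "lb".
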
> We cannot add -lb-nodict, because regardless of dictionary usage we still need to be able to use the -lb- value to select among the CSS strict/normal/loose behaviors.
> 
> I see two options:
> 
> 1. One option is to add something that goes beyond the -lw-keepall- option to prohibit midword breaks *including* dictionary breaks.
> 
> 2. The other, which I prefer, is to add a new, independent option for controlling dictionary breaks. This could be -u-ld- with options like the following (the options have to be 5-8 alphanum):
> 
> -u-ld-nodict (no dictionary at all)
> # and then perhaps options for specific dictionaries. Right now, use of dictionaries is a function of script range, so the options might need to allow specification of a script range and then a dictionary, e.g.
> 
> -u-ld-thai0-pali0 (using 0 to pad the subtags to 5 alphanum)
> -u-ld-thai0-sanskrit
> 
> Perhaps the -nodict should also be by script, e.g.
> -u-ld-thai0-nodict
> which would still allow dictionary use for CJK, just none for Thai script.
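
The proposed -u-ld- syntax can be sketched in Python as follows. Everything here is hypothetical: "ld" is only a proposal in this message, and the helper names pad_subtag, build_ld and parse_ld, the "*" marker for "all scripts", and the idea of accepting several script/dictionary pairs are assumptions made for illustration. The 5-character minimum and the "0" padding follow the description above.

```python
def pad_subtag(name):
    """Pad a short name with '0' up to the 5-character minimum described above."""
    name = name.lower()
    if not (name.isascii() and name.isalnum()) or len(name) > 8:
        raise ValueError(f"not usable as a subtag: {name!r}")
    return name.ljust(5, "0")


def build_ld(script, choice):
    """Build a hypothetical -u-ld- extension for one script, e.g. thai + pali."""
    return "-".join(["u", "ld", pad_subtag(script), pad_subtag(choice)])


def parse_ld(extension):
    """Map each script to its dictionary choice; '*' stands for all scripts."""
    parts = extension.lower().strip("-").split("-")
    if parts[:2] != ["u", "ld"]:
        raise ValueError("not a -u-ld- extension")
    values = [p.rstrip("0") for p in parts[2:]]  # drop the '0' padding
    if values == ["nodict"]:
        return {"*": "nodict"}  # no dictionary at all
    if not values or len(values) % 2:
        raise ValueError("expected script/dictionary pairs")
    return dict(zip(values[0::2], values[1::2]))


print(build_ld("thai", "pali"))
print(parse_ld("-u-ld-thai0-nodict"))
```

One design question the sketch makes visible: a bare "-u-ld-nodict" and a per-script "-u-ld-thai0-nodict" have different shapes, so a parser has to special-case the single-subtag form.
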
> 
> - Peter E
> 
> 
>> On May 25, 2017, at 8:38 AM, Markus Scherer via CLDR-Users <cldr-users at unicode.org> wrote:
>> 
>> So this would be not just "no dictionary", it would be "no breaks at all in any script that uses no spaces between words".
>> It would be nice to come up with a 5-8 letter abbreviation for what it does, rather than what it doesn't do.
>> 
>> Also, is it more useful to have no breaks in, say, Thai strings, at all (gross under-segmentation) -- or to have breaks between orthographic syllables (over-segmentation)?
>> (That would be a yet different subtag.)
> 
> 
>> On May 25, 2017, at 3:13 AM, Martin Hosken via CLDR-Users <cldr-users at unicode.org> wrote:
>> 
>> Dear All,
>> 
>> When line breaking minority-language text in, say, the Thai script or any script that uses dictionary-based breaking, the dictionary used is for the dominant language. A while back, we addressed this for the Khmer script and I've had no complaints since. Now, we could try to do something similar for other dictionary-broken languages. But I would like to suggest a simpler approach that can address fixed texts very well, and that is to add a nodict line break locale property. This property would switch the line break iterator to one that uses a set of rules with no dictionary statement in it. In other words, SA-type characters are treated as one great long string and it is up to the source text to have inserted appropriate ZWSP, or other kinds of spaces, to control the breaks.
>> 
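
The effect of the proposed nodict behavior on a run of Thai text can be sketched in Python as below. This is a deliberately tiny model, not the ICU rule set and not a UAX #14 implementation: it assumes the only break opportunities are after a space or a ZWSP (U+200B), and the function names are made up for the example.

```python
ZWSP = "\u200b"
BREAK_AFTER = {" ", ZWSP}  # assumption: the only characters that permit a break


def break_opportunities(text):
    """Offsets where a line may end: after a run of spaces/ZWSP, never mid-run."""
    return [
        i + 1
        for i in range(len(text) - 1)
        if text[i] in BREAK_AFTER and text[i + 1] not in BREAK_AFTER
    ]


def segments(text):
    """Split text into the unbreakable pieces a line breaker would see."""
    cuts = [0] + break_opportunities(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


# Nine Thai consonants with one ZWSP and one space inserted by the author.
sample = "\u0e01\u0e02\u0e04\u200b\u0e07\u0e08\u0e09 \u0e0a\u0e0b"
print(break_opportunities(sample))
print(len(segments(sample)))
```

Without the ZWSP and the space, the same nine letters form a single segment, which is the "one great long string" case: the text can only break where its author has marked a break.
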
>> What do folks think? From my perspective, this would solve a bunch of bugs that are pointed my way with regard to line breaking and minority languages, even if it is not the best possible solution. It's pretty cheap to do and it doesn't change anything that is already out there.
>> 
>> Yours,
>> Martin
> 
