propose th-u-lb-nodict

Martin Hosken via CLDR-Users cldr-users at unicode.org
Thu May 25 05:13:41 CDT 2017


Dear All,

When line breaking minority-language text in, say, the Thai script, or in any script that uses dictionary-based breaking, the dictionary used is the one for the dominant language, so the breaks it finds are often wrong for the minority language. A while back, we addressed this for the Khmer script and I've had no complaints since. Now, we could try to do something similar for other dictionary-broken languages. But I would like to suggest a simpler approach that can address fixed texts very well, and that is to add a nodict value for the line break locale property (as in th-u-lb-nodict). This value would switch the line break iterator to one that uses a set of rules with no dictionary statement in it. In other words, a run of SA-class characters is treated as one long unbreakable string, and it is up to the source text to have inserted appropriate ZWSP (U+200B), or other kinds of spaces, to control the breaks.
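
To make the intended behaviour concrete, here is a rough Python sketch. It is not the ICU rule set and not a full UAX #14 implementation; the function names are made up for illustration, and it only models the one point at issue: with no dictionary, a break is allowed only after spaces or ZWSP that the text itself supplies.

```python
# Hypothetical sketch of "nodict" breaking: no dictionary lookup at all.
# Only the characters below create a break opportunity (after them);
# everything else, including a whole run of Thai letters, stays unbroken.
ZWSP = "\u200b"
BREAK_AFTER = {" ", ZWSP}  # assumption: a simplified stand-in for the SP and ZW classes


def nodict_break_opportunities(text):
    """Return the offsets at which a line break would be allowed."""
    breaks = []
    for i in range(1, len(text)):
        # Allow a break after a run of spaces/ZWSP, just before the next
        # ordinary character.
        if text[i - 1] in BREAK_AFTER and text[i] not in BREAK_AFTER:
            breaks.append(i)
    return breaks


def nodict_segments(text):
    """Split text into the unbreakable pieces implied by those offsets."""
    offsets = [0] + nodict_break_opportunities(text) + [len(text)]
    return [text[a:b] for a, b in zip(offsets, offsets[1:])]


if __name__ == "__main__":
    marked = "กขค" + ZWSP + "งจ ฉช"   # breaks supplied by the text itself
    unmarked = "กขคงจฉช"              # no ZWSP or spaces: one unbreakable run
    print(nodict_break_opportunities(marked))
    print(nodict_break_opportunities(unmarked))
```

With text that carries its own ZWSP the breaks fall exactly where the author put them; text without any gets no break opportunities at all, which is the trade-off being proposed.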

What do folks think? From my perspective, this would solve a bunch of bugs that are pointed my way with regard to line breaking and minority languages, even if it is not the best possible solution. It's pretty cheap to do and it doesn't change anything that is already out there.

Yours,
Martin

