Word break question

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Sat Apr 29 20:12:36 CDT 2017


On Sat, 29 Apr 2017 17:42:03 -0700
Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:

> Hey CLDR users,
> 
> I have a question regarding the word break rules from CLDR v31.
> Consider the following word break test:
> 
> ÷ 0001 × 0308 ÷ 0041 ÷
> 
> I believe rule #5 should apply between 0308 and 0041, which looks
> like this:
> 
> $AHLetter × $AHLetter
> 
> 0308 has a word break property of "Extend" which $AHLetter matches,
> and 0041 has a word break property of ALetter which $AHLetter also
> matches. The thing is, rule #5 indicates no break should occur
> between these characters. Furthermore, there are only two rules in
> which a break is indicated (3.1 and 3.2), neither of which applies
> in this case. What am I missing?

You're missing the shape of the brackets in "<variable
id="$ALetter">($ALetter $FEZ*)</variable>".  The brackets are round,
not square, so the variable denotes a sequence rather than a set:
<U+0308> does not match $ALetter because it is not a string starting
with something for which Word_Break=ALetter.  Obviously
<U+0001, U+0308> does not match either.
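
To make the bracket distinction concrete, here is a minimal Python
sketch.  It is not CLDR rule syntax and uses no CLDR library: it assumes
a hand-written property table for the three code points in the test and
encodes each Word_Break value as one letter, so that an ordinary regex
can stand in for the rule variables.  The names WORD_BREAK, CODE,
ALETTER_SEQ and ALETTER_SET are invented for illustration.

```python
import re

# Hypothetical, hand-written table covering only the code points in the
# test; real values come from the Unicode Character Database.
WORD_BREAK = {0x0001: "Other", 0x0308: "Extend", 0x0041: "ALetter"}

# One letter per property value so a regex can model the rule variables.
CODE = {"Other": "O", "Extend": "E", "Format": "F", "ZWJ": "Z", "ALetter": "A"}

FEZ = "[FEZ]"                  # square brackets: a set of single characters
ALETTER_SEQ = f"(?:A{FEZ}*)"   # round brackets: ALetter, then zero or more $FEZ
ALETTER_SET = "[AFEZ]"         # the misreading: any one of these properties

def encode(code_points):
    """Turn a list of code points into a string of property letters."""
    return "".join(CODE[WORD_BREAK[cp]] for cp in code_points)

def matches(pattern, code_points):
    """True if the whole sequence of code points matches the pattern."""
    return re.fullmatch(pattern, encode(code_points)) is not None

print(matches(ALETTER_SEQ, [0x0308]))          # sequence reading
print(matches(ALETTER_SET, [0x0308]))          # set misreading
print(matches(ALETTER_SEQ, [0x0041, 0x0308]))  # ALetter followed by Extend
print(matches(ALETTER_SEQ, [0x0001, 0x0308]))  # Other followed by Extend
```

Under the sequence reading only <U+0041, U+0308> matches; a lone
U+0308 matches only under the set misreading.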

Secondly, if you read
http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you
will see that the final rule of "Any ÷ Any" is implicit, so a break
occurs wherever no earlier rule applies.
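
The effect of the implicit rule can be sketched in Python as follows.
This is a toy evaluator under stated assumptions, not the CLDR
implementation: Word_Break values are encoded as single letters, RULES
holds only simplified stand-ins for rules 4 and 5, and the names RULES
and is_break are invented for illustration.  Start and end of text are
not handled.

```python
import re

# Hypothetical one-letter codes for Word_Break values:
# O = Other, E = Extend, F = Format, Z = ZWJ, A = ALetter.
# Each rule is (left pattern, right pattern, breaks?), tried in order.
RULES = [
    # Simplified stand-in for rule 4: no break before Format/Extend/ZWJ.
    # (The real rule also restricts what may appear on the left.)
    (r".", r"[FEZ]", False),
    # Simplified rule 5 with $ALetter read as the sequence (ALetter $FEZ*).
    (r"A[FEZ]*", r"A", False),
]

def is_break(encoded, pos):
    """Decide whether there is a boundary just before encoded[pos]."""
    before, after = encoded[:pos], encoded[pos:]
    for left, right, breaks in RULES:
        # The left side must end at the boundary; the right side starts there.
        if re.search(f"(?:{left})$", before) and re.match(right, after):
            return breaks
    return True  # implicit final rule: Any ÷ Any

text = "OEA"  # U+0001 (Other), U+0308 (Extend), U+0041 (ALetter)
for pos in range(1, len(text)):
    print(pos, "÷" if is_break(text, pos) else "×")
```

With these assumptions, no listed rule matches between U+0308 and
U+0041, so the fallback yields the ÷ seen in the test line.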

Richard.


