Additional Word Break Questions

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Tue Aug 8 20:06:37 CDT 2017


Dear CLDR users,

As you may recall, I emailed this list a few months ago with a question
about the word break rules. Today I've run into several more cases where I
think the word break rules and the published word break test cases
disagree.

*First Issue*

This is the word break test case in question: ÷ 200D ÷ 261D ÷

It would appear that rule 3.3 matches at index 1, i.e. the index between
the two characters. Rule 3.3 is: $ZWJ × ($Extended_Pict | $EmojiNRK)

Character 200D has word break property values of Extend and ZWJ, while
character 261D has a word break property value of E_Base. Therefore, the
left-hand side of rule 3.3 matches 200D and the right-hand side matches
261D. Since the rule indicates no break, I'm confused by the break that the
test case expects at that position. What am I doing wrong here?
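
For anyone following along, here is a minimal sketch of how I read the test
case notation, assuming ÷ marks an expected break and × marks an expected
non-break at each boundary. The function name parse_test_case is only
illustrative and is not part of any CLDR or ICU API.

```python
BREAK = "÷"      # a break is expected at this boundary
NO_BREAK = "×"   # no break is expected at this boundary


def parse_test_case(line):
    """Split a test line into code points and expected break flags.

    The line alternates boundary marks and hex code points, starting and
    ending with a mark, so there is one more mark than code points.
    """
    tokens = line.split()
    marks = tokens[0::2]
    code_points = [int(token, 16) for token in tokens[1::2]]
    # expected[i] is True when a break is expected before code_points[i];
    # the last entry covers the end of the text.
    expected = [mark == BREAK for mark in marks]
    return code_points, expected


code_points, expected = parse_test_case("÷ 200D ÷ 261D ÷")
print([hex(cp) for cp in code_points], expected)
```

Under this reading, expected[1] is the boundary at index 1 between 200D and
261D, which is where the test case asks for a break.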

*Second Issue*

The other test cases my implementation is failing to pass are these:

÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 × 200D ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 200D × 1F1E7 ÷ 1F1E8 ÷ 0062 ÷
÷ 0061 ÷ 1F1E6 × 1F1E7 ÷ 1F1E8 × 1F1E9 ÷ 0062 ÷

In all cases, the issue lies with the expected non-break between the second
and third characters, e.g. 1F1E6 and 1F1E7. The word break property value of
both these characters is Regional_Indicator. The only rule that looks like
it might match is 15: ^$Regional_Indicator × $Regional_Indicator. However,
in my implementation rule 15 does not match.
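
To make the expected results concrete, here is a rough sketch of the pairing
behavior these four test cases seem to describe: regional indicators are
grouped in pairs from the left, so a break between two of them is expected
only when an even number of indicators precede the boundary. This follows my
understanding of WB15/WB16 in UAX #29 and is not derived from the CLDR rule
text above. It also assumes that only 200D needs to be skipped when counting,
which is a simplification. The function names are hypothetical.

```python
ZWJ = 0x200D


def is_regional_indicator(cp):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= cp <= 0x1F1FF


def break_between_indicators(code_points, i):
    """Return True if a break is expected before code_points[i].

    Assumes code_points[i] and the nearest preceding non-ZWJ character
    are both regional indicators. Counts the run of indicators that ends
    just before index i, skipping ZWJ (assumption: nothing else needs
    to be skipped in these test cases).
    """
    count = 0
    j = i - 1
    while j >= 0:
        cp = code_points[j]
        if cp == ZWJ:
            j -= 1
            continue
        if not is_regional_indicator(cp):
            break
        count += 1
        j -= 1
    # An odd count means the indicator before the boundary is still unpaired.
    return count % 2 == 0


first = [0x0061, 0x1F1E6, 0x1F1E7, 0x1F1E8, 0x0062]
print(break_between_indicators(first, 2))  # boundary 1F1E6 | 1F1E7
print(break_between_indicators(first, 3))  # boundary 1F1E7 | 1F1E8
```

With this counting, the boundary between 1F1E6 and 1F1E7 comes out as a
non-break and the one between 1F1E7 and 1F1E8 as a break, which agrees with
the test cases listed above.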

Thanks for your help in advance!

-Cameron