Word break question

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sun Apr 30 00:41:33 CDT 2017


Hey Richard,

Thank you for your response, it's been quite helpful.

I'm aware of the difference between the various types of brackets. I think
my implementation is treating the round brackets as literal characters
because rules are compiled into a regular expression. Variable replacement
gives this as (nearly) the final expanded form of the rule in question:

[(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}] ×
[(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}]

As you can see, the parentheses exist *within* the character class, and are
therefore treated as literal characters.
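
To illustrate the difference, here is a minimal Python sketch. It is not the
actual implementation discussed in this thread. Python's `re` module has no
\p{Word_Break=...} support, so ALETTER and FEZ below are rough stand-ins
chosen purely for illustration, and all names are hypothetical. The sketch
shows that parentheses inside a character class are literal set members,
while the grouped form "ALetter followed by FEZ*" matches a sequence and so
rejects a bare U+0308.

```python
import re

# Stand-ins, NOT real Unicode property data (Python's `re` has no \p{...}):
ALETTER = r"[A-Za-z]"            # assumed approximation of Word_Break=ALetter
FEZ = r"[\u0300-\u036F\u200D]"   # assumed approximation of Format/Extend/ZWJ

# Parentheses inside [...] are ordinary set members, not grouping operators.
literal_parens = re.compile(r"[(a)]")

# Grouped form: one ALetter followed by zero or more FEZ characters.
aletter_sequence = re.compile(f"(?:{ALETTER}{FEZ}*)")


def is_aletter_sequence(text: str) -> bool:
    """Return True if the whole string is ALetter followed by FEZ*."""
    return aletter_sequence.fullmatch(text) is not None


if __name__ == "__main__":
    print(bool(literal_parens.fullmatch("(")))   # "(" is a member of the set
    print(is_aletter_sequence("A\u0308"))        # letter plus combining mark
    print(is_aletter_sequence("\u0308"))         # bare combining mark
```

Compiling the variable as a non-capturing group rather than splicing it into
a bracket expression is one way to keep the sequence semantics intact.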

I understand that the transformation rules are Unicode sets as opposed to
true regular expressions. Is there documentation available that explains
the differences between the two, or perhaps the syntax and expected
behavior of a Unicode set?

Thank you,

-Cameron

On Sat, Apr 29, 2017 at 6:12 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sat, 29 Apr 2017 17:42:03 -0700
> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > Hey CLDR users,
> >
> > I have a question regarding the word break rules from CLDR v31.
> > Consider the following word break test:
> >
> > ÷ 0001 × 0308 ÷ 0041 ÷
> >
> > I believe rule #5 should apply between 0308 and 0041, which looks
> > like this:
> >
> > $AHLetter × $AHLetter
> >
> > 0308 has a word break property of "Extend" which $AHLetter matches,
> > and 0041 has a word break property of ALetter which $AHLetter also
> > matches. The thing is, rule #5 indicates no break should occur
> > between these characters. Furthermore, there are only two rules in
> > which a break is indicated (3.1 and 3.2), both of which don't apply
> > in this case. What am I missing?
>
> You're missing the shape of the brackets in "<variable
> id="$ALetter">($ALetter $FEZ*)</variable>".  The brackets are round,
> not square, so <U+0308> does not match $ALetter as it is not a string
> starting with something for which Word_Break=ALetter.  Obviously
> <U+0001, U+0308> does not match either.
>
> Secondly, if you read
> http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you
> will see that the final rule of "Any ÷ Any" is implicit.
>
> Richard.