Word break question

Philippe Verdy via CLDR-Users cldr-users at unicode.org
Sun Apr 30 07:49:37 CDT 2017


OK, the problem is with the extra parentheses, which are transcluded as-is
during expansion of $variables, even if these $variables are then used
within a character class.
This means that <variable id="$ALetter">($ALetter $FEZ*)</variable>
cannot be used in a character class: including $ALetter in a
character class is invalid when its expansion is not a single character.
Even if you drop the (unnecessary) parentheses, as in
<variable id="$ALetter">$ALetter $FEZ*</variable>, it will still not be
correct.
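
To illustrate the point, here is a minimal sketch in Python. It assumes an
implementation that pastes the expanded variable text straight into a regular
expression character class. Python's re module has no \p{...} support, so the
single letters "a", "e" and "h" are hypothetical stand-ins for
\p{Word_Break=ALetter}, $FEZ and \p{Word_Break=Hebrew_Letter}; the alternation
at the end is only one possible way to keep the grouping, not something the
CLDR data prescribes.

```python
import re

# Hypothetical stand-ins (not real CLDR values):
#   A ~ \p{Word_Break=ALetter}, E ~ $FEZ, H ~ \p{Word_Break=Hebrew_Letter}
A, E, H = "a", "e", "h"

# Mimics the expansion of ($ALetter $FEZ*), with whitespace dropped.
expanded = f"({A}{E}*)"

# Naive textual substitution into a character class: [(ae*)h]
naive_class = re.compile(f"[{expanded}{H}]")

# Inside [...] the characters ( ) and * are ordinary set members.
literal_hits = [ch for ch in "()*aeh" if naive_class.fullmatch(ch)]
print(literal_hits)

# A character class matches exactly one character, so the intended
# "letter followed by extenders" sequence does not match at all.
print(naive_class.fullmatch("aee"))

# Assumption: an alternation of groups preserves the intended meaning.
grouped = re.compile(f"(?:{A}{E}*|{H})")
print(grouped.fullmatch("aee") is not None)
print(grouped.fullmatch("(") is None)
```

In this sketch the naive class accepts a bare "(" or "*" as if it were a
letter, while the grouped form accepts the whole sequence and rejects the
stray punctuation.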

In fact this variable definition is silly because it references
itself, so it would expand to
<variable id="$ALetter">(($ALetter $FEZ*) $FEZ*)</variable>, then
<variable id="$ALetter">((($ALetter $FEZ*) $FEZ*) $FEZ*)</variable>, and so
on infinitely.
Removing the parentheses would still expand it to
<variable id="$ALetter">$ALetter $FEZ* $FEZ*</variable>, then
<variable id="$ALetter">$ALetter $FEZ* $FEZ* $FEZ*</variable>, and so on
infinitely.
I think that the defined variable should be renamed...
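
The runaway expansion can be sketched in a few lines of Python, assuming a
purely textual expander that keeps substituting variable names until none are
left. The helper expand_once and the renamed variable $ALetterEx are
hypothetical names used only for this illustration.

```python
def expand_once(text, variables):
    """Replace each variable name in text with its definition, one pass."""
    for name, value in variables.items():
        text = text.replace(name, value)
    return text

# Self-referencing definition, as in the rule file.
variables = {"$ALetter": "($ALetter $FEZ*)"}

text = "$ALetter"
steps = [text]
for _ in range(3):  # capped at 3 passes; the text would otherwise keep growing
    text = expand_once(text, variables)
    steps.append(text)

for step in steps:
    print(step)

# With a renamed variable (hypothetical name), expansion reaches a fixed point.
renamed = {"$ALetterEx": "($ALetter $FEZ*)"}
first = expand_once("$ALetterEx", renamed)
second = expand_once(first, renamed)
print(first == second)
```

Each pass of the self-referencing version wraps the previous result in one
more "(... $FEZ*)" layer, whereas the renamed version stops changing after the
first pass.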

This is clearly a bug, IMHO

2017-04-30 7:41 GMT+02:00 Cameron Dutro via CLDR-Users <
cldr-users at unicode.org>:

> Hey Richard,
>
> Thank you for your response, it's been quite helpful.
>
> I'm aware of the difference between the various types of brackets. I think
> my implementation is treating the round brackets as literal characters
> because rules are compiled into a regular expression. Variable replacement
> gives this as (nearly) the final expanded form of the rule in question:
>
> [(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}] ×
> [(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}]
>
> As you can see, the parentheses exist *within* the character class, and
> are therefore treated as literal characters.
>
> I understand that the transformation rules are Unicode sets as opposed to
> true regular expressions. Is there documentation available that explains
> the differences between the two, or perhaps the syntax and expected
> behavior of a Unicode set?
>
> Thank you,
>
> -Cameron
>
> On Sat, Apr 29, 2017 at 6:12 PM, Richard Wordingham via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> On Sat, 29 Apr 2017 17:42:03 -0700
>> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
>>
>> > Hey CLDR users,
>> >
>> > I have a question regarding the word break rules from CLDR v31.
>> > Consider the following word break test:
>> >
>> > ÷ 0001 × 0308 ÷ 0041 ÷
>> >
>> > I believe rule #5 should apply between 0308 and 0041, which looks
>> > like this:
>> >
>> > $AHLetter × $AHLetter
>> >
>> > 0308 has a word break property of "Extend" which $AHLetter matches,
>> > and 0041 has a word break property of ALetter which $AHLetter also
>> > matches. The thing is, rule #5 indicates no break should occur
>> > between these characters. Furthermore, there are only two rules in
>> > which a break is indicated (3.1 and 3.2), neither of which applies
>> > in this case. What am I missing?
>>
>> You're missing the shape of the brackets in "<variable
>> id="$ALetter">($ALetter $FEZ*)</variable>".  The brackets are round,
>> not square, so <U+0308> does not match $ALetter as it is not a string
>> starting with something for which Word_Break=ALetter.  Obviously
>> <U+0001, U+0308> does not match either.
>>
>> Secondly, if you read
>> http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you
>> will see that the final rule of "Any ÷ Any" is implicit.
>>
>> Richard.
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>