Word break question

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Mon May 1 01:58:11 CDT 2017


Awesome, thanks Mark! And another thanks to Richard for being so willing to
help on a Sunday :)

-Cameron

On Sun, Apr 30, 2017 at 4:36 PM, Mark Davis ☕️ via CLDR-Users <
cldr-users at unicode.org> wrote:

> Richard, Cameron, Philippe, thanks for tracking this down... I filed a
> ticket at http://unicode.org/cldr/trac/ticket/10226. If you have any
> comments on the proposed solution, please add them there so we don't lose
> them.
>
> Mark
>
> On Sun, Apr 30, 2017 at 4:07 PM, Richard Wordingham via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> On Sun, 30 Apr 2017 15:09:17 -0700
>> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
>>
>> > Hey Richard,
>> >
>> > Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
>> > variable introduces a character class, which is the source of my
>> > confusion. Here are all the variables in question:
>> >
>> > $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
>> > $Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
>> > $Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
>> > $ALetter(1) = \p{Word_Break=ALetter}
>> > $ALetter(2) = ($ALetter(1) $FEZ*)
>> > $FEZ = [$Format $Extend $ZWJ]
>> > $Format = \p{Word_Break=Format}
>> > $Extend = \p{Word_Break=Extend}
>> > $ZWJ = \p{Word_Break=ZWJ}
>> >
>>
>> <snip>
>>
>> > How should my implementation handle these cases?
>>
>> It would have been friendlier if instead of doing macro-like
>> expansions, it had compounded finite state machines.  Then, it would
>> have reported an error at "$AHLetter = [$ALetter(2)
>> $Hebrew_Letter(2)]".  Basically, the CLDR definition is wrong!  What
>> CLDR should have is
>>
>> $AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)]
>> $AHLetter(2) = ($AHLetter(1) $FEZ*)
>>
>> Alternatively, it could have
>> $AHLetter = ($ALetter | $Hebrew_Letter)
>>
>> At this point, you may realise that ICU does not derive the break
>> iterators from the CLDR definitions.  Instead, they are derived
>> manually from the specifications.  What can then happen is that
>> someone works from what the specification should say, rather than from
>> what it does say.
>>
>> Richard.
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at unicode.org
>> http://unicode.org/mailman/listinfo/cldr-users
>>
>
>
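The failure Richard describes can be reproduced with a small sketch. It
assumes a purely textual, macro-like expander; the names expand,
sequence_inside_class, BROKEN and FIXED are hypothetical and are not part of
CLDR or ICU. The "(1)"/"(2)" suffixes follow the thread's way of labelling
successive redefinitions of one variable. The check is deliberately crude: it
only looks for grouping or repetition syntax that ends up inside square
brackets after substitution.

```python
import re

# Variable definitions as quoted in the thread.
BROKEN = {
    "$Format": r"\p{Word_Break=Format}",
    "$Extend": r"\p{Word_Break=Extend}",
    "$ZWJ": r"\p{Word_Break=ZWJ}",
    "$FEZ": "[$Format $Extend $ZWJ]",
    "$ALetter(1)": r"\p{Word_Break=ALetter}",
    "$ALetter(2)": "($ALetter(1) $FEZ*)",
    "$Hebrew_Letter(1)": r"\p{Word_Break=Hebrew_Letter}",
    "$Hebrew_Letter(2)": "($Hebrew_Letter(1) $FEZ*)",
    "$AHLetter": "[$ALetter(2) $Hebrew_Letter(2)]",
}

# Richard's first suggested repair: build the set from the plain
# properties first, then append $FEZ* to the set as a whole.
FIXED = {
    **BROKEN,
    "$AHLetter(1)": "[$ALetter(1) $Hebrew_Letter(1)]",
    "$AHLetter(2)": "($AHLetter(1) $FEZ*)",
}

# A variable reference, optionally followed by a "(n)" version suffix.
VAR = re.compile(r"\$\w+(?:\(\d+\))?")


def expand(name, defs):
    """Recursively substitute variable references as plain text."""
    return VAR.sub(lambda m: expand(m.group(0), defs), defs[name])


def sequence_inside_class(expr):
    """Return True if grouping or repetition syntax appears inside [...]."""
    depth = 0
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth > 0 and ch in "(*|":
            return True
    return False


if __name__ == "__main__":
    for label, name, defs in [
        ("broken", "$AHLetter", BROKEN),
        ("fixed", "$AHLetter(2)", FIXED),
    ]:
        text = expand(name, defs)
        print(label, sequence_inside_class(text), text)
```

With the definitions as quoted, the expanded $AHLetter has a parenthesised
sequence and a "*" inside the outer brackets, so it no longer describes a set
of characters. With the repaired definitions, the brackets contain only
properties and the repetition stays outside them.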