Word break question

Mark Davis ☕️ via CLDR-Users cldr-users at unicode.org
Sun Apr 30 18:36:22 CDT 2017


Richard, Cameron, Philippe, thanks for tracking this down... I filed a
ticket at http://unicode.org/cldr/trac/ticket/10226. If you have any
comments on the proposed solution, please add them there so we don't lose
them.

Mark

On Sun, Apr 30, 2017 at 4:07 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sun, 30 Apr 2017 15:09:17 -0700
> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > Hey Richard,
> >
> > Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
> > variable introduces a character class, which is the source of my
> > confusion. Here are all the variables in question:
> >
> > $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
> > $Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
> > $Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
> > $ALetter(1) = \p{Word_Break=ALetter}
> > $ALetter(2) = ($ALetter(1) $FEZ*)
> > $FEZ = [$Format $Extend $ZWJ]
> > $Format = \p{Word_Break=Format}
> > $Extend = \p{Word_Break=Extend}
> > $ZWJ = \p{Word_Break=ZWJ}
> >
>
> <snip>
>
> > How should my implementation handle these cases?
>
> It would have been friendlier if instead of doing macro-like
> expansions, it had compounded finite state machines.  Then, it would
> have reported an error at "$AHLetter = [$ALetter(2)
> $Hebrew_Letter(2)]".  Basically, the CLDR definition is wrong!  What
> CLDR should have is
>
> $AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)]
> $AHLetter(2) = ($AHLetter(1) $FEZ*)
>
> Alternatively, it could have
> $AHLetter = ($ALetter | $Hebrew_Letter)
>
> At this point, you may realise that ICU does not derive the break
> iterators from the CLDR definitions.  Instead, they are derived
> manually from the specifications.  What can then happen is that
> someone works from what the specification should say, rather than from
> what it does say.
>
> Richard.
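
To make the point about character classes concrete, here is a minimal Python sketch. It is not the CLDR or ICU implementation; the CharSet and Seq types, the char_class and matches helpers, and the tiny stand-in character sets are all hypothetical, chosen only for illustration. It models a rule variable as either a set of single code points or a sequence (one code point followed by $FEZ*), and rejects a sequence used as a member of a [...] class, which is the error a set-based (rather than macro-expanding) reader might be expected to report for "$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharSet:
    """A set of single code points, e.g. \\p{Word_Break=ALetter}."""
    chars: frozenset


@dataclass(frozen=True)
class Seq:
    """One code point from `head`, then zero or more from `tail`."""
    head: CharSet
    tail: CharSet


def char_class(*members):
    """Model [A B ...]: a union that is only defined for plain sets."""
    for member in members:
        if not isinstance(member, CharSet):
            raise TypeError("a character class may contain sets, not sequences")
    return CharSet(frozenset().union(*(m.chars for m in members)))


def matches(seq, text):
    """True if `text` is exactly one head character followed by tail*."""
    return (
        bool(text)
        and text[0] in seq.head.chars
        and all(ch in seq.tail.chars for ch in text[1:])
    )


# Toy stand-ins (assumption): the real sets come from \p{Word_Break=...}.
ALETTER_1 = CharSet(frozenset("abc"))
HEBREW_LETTER_1 = CharSet(frozenset("\u05D0\u05D1"))
FEZ = CharSet(frozenset("\u0301\u200D"))

# $ALetter(2) and $Hebrew_Letter(2) are sequences, not sets.
ALETTER_2 = Seq(ALETTER_1, FEZ)
HEBREW_LETTER_2 = Seq(HEBREW_LETTER_1, FEZ)

# Problematic form: $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
try:
    char_class(ALETTER_2, HEBREW_LETTER_2)
    problem_detected = False
except TypeError:
    problem_detected = True

# Form suggested in the thread:
#   $AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)]
#   $AHLetter(2) = ($AHLetter(1) $FEZ*)
AHLETTER_1 = char_class(ALETTER_1, HEBREW_LETTER_1)
AHLETTER_2 = Seq(AHLETTER_1, FEZ)
```

With these definitions, matches(AHLETTER_2, "a\u0301") accepts a letter followed by a combining mark, while the bracketed form built from the (2) variables is refused up front instead of being silently expanded into something unintended.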