Word break question

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Sun Apr 30 18:07:32 CDT 2017


On Sun, 30 Apr 2017 15:09:17 -0700
Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:

> Hey Richard,
> 
> Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
> variable introduces a character class, which is the source of my
> confusion. Here are all the variables in question:
> 
> $AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
> $Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
> $Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
> $ALetter(1) = \p{Word_Break=ALetter}
> $ALetter(2) = ($ALetter(1) $FEZ*)
> $FEZ = [$Format $Extend $ZWJ]
> $Format = \p{Word_Break=Format}
> $Extend = \p{Word_Break=Extend}
> $ZWJ = \p{Word_Break=ZWJ}
> 

<snip>

> How should my implementation handle these cases?

It would have been friendlier if, instead of doing macro-like
expansions, it had compounded finite state machines.  Then it would
have reported an error at
"$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]", because a character
class cannot contain a sequence such as ($ALetter(1) $FEZ*).
Basically, the CLDR definition is wrong!  What CLDR should have is

$AHLetter(1) = [$ALetter(1) $Hebrew_Letter(1)]
$AHLetter(2) = ($AHLetter(1) $FEZ*)

Alternatively, it could have

$AHLetter = ($ALetter | $Hebrew_Letter)
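
To make the difference concrete, here is a minimal Python sketch of
macro-like expansion.  It is only an illustration, not how CLDR or ICU
process the rules.  The table RULES, the helpers expand() and
sequence_inside_set(), and the _1/_2/_bad/_alt suffixes are all
hypothetical names; the suffixes stand in for the "(1)"/"(2)" notation
used above.  Reading the unsuffixed $ALetter and $Hebrew_Letter in the
alternative as the "(2)" forms is also an assumption.

```python
import re

# Hypothetical variable table mirroring the definitions quoted above.
RULES = {
    "Format": r"\p{Word_Break=Format}",
    "Extend": r"\p{Word_Break=Extend}",
    "ZWJ": r"\p{Word_Break=ZWJ}",
    "FEZ": "[$Format $Extend $ZWJ]",
    "ALetter_1": r"\p{Word_Break=ALetter}",
    "ALetter_2": "($ALetter_1 $FEZ*)",
    "Hebrew_Letter_1": r"\p{Word_Break=Hebrew_Letter}",
    "Hebrew_Letter_2": "($Hebrew_Letter_1 $FEZ*)",
    # As quoted: a set whose members are sequences.
    "AHLetter_bad": "[$ALetter_2 $Hebrew_Letter_2]",
    # First proposal: build the set first, then append $FEZ*.
    "AHLetter_1": "[$ALetter_1 $Hebrew_Letter_1]",
    "AHLetter_2": "($AHLetter_1 $FEZ*)",
    # Second proposal: alternation of the two sequences (assumed "(2)" forms).
    "AHLetter_alt": "($ALetter_2 | $Hebrew_Letter_2)",
}

VAR = re.compile(r"\$(\w+)")


def expand(name, rules=RULES):
    """Textually substitute $variables until none remain (macro-style)."""
    return VAR.sub(lambda m: expand(m.group(1), rules), rules[name])


def sequence_inside_set(expanded):
    """Return True if a '(' appears while at least one '[' is still open."""
    depth = 0
    for ch in expanded:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "(" and depth > 0:
            return True
    return False


if __name__ == "__main__":
    for name in ("AHLetter_bad", "AHLetter_2", "AHLetter_alt"):
        text = expand(name)
        print(name, "->", text)
        print("  sequence inside set:", sequence_inside_set(text))
```

Under these assumptions, only the expansion of AHLetter_bad places a
parenthesised sequence inside square brackets; both proposed forms keep
the brackets limited to property sets.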

At this point, you may realise that ICU does not derive the break
iterators from the CLDR definitions.  Instead, they are derived
manually from the specifications.  What can then happen is that
someone works from what the specification should say, rather than from
what it does say.

Richard.

