Word break question

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sun Apr 30 17:09:17 CDT 2017

Hey Richard,

Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
variable introduces a character class, which is the source of my confusion.
Here are all the variables in question:

$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
$Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
$Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
$ALetter(1) = \p{Word_Break=ALetter}
$ALetter(2) = ($ALetter(1) $FEZ*)
$FEZ = [$Format $Extend $ZWJ]
$Format = \p{Word_Break=Format}
$Extend = \p{Word_Break=Extend}
$ZWJ = \p{Word_Break=ZWJ}

The regular expressions for either side of the rule can be constructed
using a series of simple substitutions:

[$ALetter(2) $Hebrew_Letter(2)]

[($ALetter(1) $FEZ*) ($Hebrew_Letter(1) $FEZ*)]

[(\p{Word_Break=ALetter} $FEZ*) (\p{Word_Break=Hebrew_Letter} $FEZ*)]

[(\p{Word_Break=ALetter} [$Format $Extend $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter} [$Format $Extend $ZWJ]*)]

[(\p{Word_Break=ALetter} [\p{Word_Break=Format} $Extend $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} $Extend $ZWJ]*)]

[(\p{Word_Break=ALetter}
  [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter}
  [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*)]

[(\p{Word_Break=ALetter}
  [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*)
 (\p{Word_Break=Hebrew_Letter}
  [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*)]
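
For reference, the substitution steps above can be reproduced mechanically. This is only a rough sketch of naive textual substitution, not of how CLDR itself resolves variables: the DEFS table copies the definitions listed earlier, and the _1/_2 suffixes (standing in for the (1)/(2) labels) and the expand() helper are names invented for this illustration.

```python
import re

# Definitions as listed earlier in this message. The (1)/(2) labels are
# written as _1/_2 suffixes, a naming convention assumed for this sketch.
DEFS = {
    "AHLetter": "[$ALetter_2 $Hebrew_Letter_2]",
    "Hebrew_Letter_1": r"\p{Word_Break=Hebrew_Letter}",
    "Hebrew_Letter_2": "($Hebrew_Letter_1 $FEZ*)",
    "ALetter_1": r"\p{Word_Break=ALetter}",
    "ALetter_2": "($ALetter_1 $FEZ*)",
    "FEZ": "[$Format $Extend $ZWJ]",
    "Format": r"\p{Word_Break=Format}",
    "Extend": r"\p{Word_Break=Extend}",
    "ZWJ": r"\p{Word_Break=ZWJ}",
}

# A variable reference is a dollar sign followed by word characters.
VAR = re.compile(r"\$(\w+)")


def expand(text):
    """Replace variable references until none remain (hypothetical helper)."""
    while VAR.search(text):
        text = VAR.sub(lambda m: DEFS[m.group(1)], text)
    return text


expanded = expand("$AHLetter")
print(expanded)
```

Run as written, this should end at the same fully expanded form as the last step above, with the parenthesized groups sitting inside the outer square brackets.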

As you can see, the resulting regular expression contains parentheses
within character classes (and, in fact, character classes nested inside
other character classes). How should my implementation handle these cases?



On Sun, Apr 30, 2017 at 2:01 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sun, 30 Apr 2017 12:28:51 -0700
> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
> > Richard, thanks again for clarifying the notation in use by the
> > segmentation rules - I now understand the left- and right-hand sides
> > to be regular expressions. It's still not clear to me how to interpret
> > parentheses *inside* character classes however. Consider the following
> > generalized case:
> >
> > [(abc d*)]
> You should not be getting parentheses inside character classes.
> Ignoring Hebrew letters as a distraction, the rule is
> $ALetter × $ALetter
> and $ALetter has the value
> (\p{Word_Break=ALetter} $FEZ*)
> This is a regular expression; it is not defined by a single character
> class (or Unicode set).
> At each point for which no break decision has been made, there shall
> be no break if a string immediately before and a string immediately
> after match that pattern.
> Richard.
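
Richard's description of the simplified rule ($ALetter × $ALetter, with
$ALetter being a regular expression rather than a single character class)
could be sketched roughly as follows. Python's re module has no
\p{Word_Break=...} support, so the ALETTER and FEZ character sets below are
small stand-ins assumed for illustration, and no_break_between_letters() is
a hypothetical helper. The sketch checks this one rule in isolation; it does
not model the requirement that no earlier rule has already decided the
position.

```python
import re

# Stand-ins for the Unicode property classes. These small sets are
# assumptions for illustration, not the real Word_Break property values.
ALETTER = "[A-Za-z]"
FEZ = "[\u00AD\u0300-\u036F\u200D]"  # soft hyphen, combining marks, ZWJ

# $ALetter as a regular expression: one letter, then any number of FEZ.
ALETTER_SEQ = f"(?:{ALETTER}{FEZ}*)"

LEFT = re.compile(ALETTER_SEQ + r"\Z")  # must match ending at the position
RIGHT = re.compile(ALETTER_SEQ)         # must match starting at the position


def no_break_between_letters(text, pos):
    """True if the simplified rule forbids a break at index pos."""
    before = LEFT.search(text[:pos]) is not None
    after = RIGHT.match(text, pos) is not None
    return before and after
```

With these stand-ins, the position between "a" plus a combining acute accent and a following "b" is reported as no break, because the left-hand pattern absorbs the trailing combining mark.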