Word break question

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sun Apr 30 17:09:17 CDT 2017


Hey Richard,

Unfortunately the Hebrew letters cannot be ignored since the $AHLetter
variable introduces a character class, which is the source of my confusion.
Here are all the variables in question:

$AHLetter = [$ALetter(2) $Hebrew_Letter(2)]
$Hebrew_Letter(1) = \p{Word_Break=Hebrew_Letter}
$Hebrew_Letter(2) = ($Hebrew_Letter(1) $FEZ*)
$ALetter(1) = \p{Word_Break=ALetter}
$ALetter(2) = ($ALetter(1) $FEZ*)
$FEZ = [$Format $Extend $ZWJ]
$Format = \p{Word_Break=Format}
$Extend = \p{Word_Break=Extend}
$ZWJ = \p{Word_Break=ZWJ}

The regular expressions for either side of the rule can be constructed
using a series of simple substitutions:

$AHLetter
[$ALetter(2) $Hebrew_Letter(2)]
[($ALetter(1) $FEZ*) ($Hebrew_Letter(1) $FEZ*)]
[(\p{Word_Break=ALetter} $FEZ*) (\p{Word_Break=Hebrew_Letter} $FEZ*)]
[(\p{Word_Break=ALetter} [$Format $Extend $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter} [$Format $Extend $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} $Extend $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} $Extend $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*)
 (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} \p{Word_Break=Extend} $ZWJ]*)]
[(\p{Word_Break=ALetter} [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*)
 (\p{Word_Break=Hebrew_Letter} [\p{Word_Break=Format} \p{Word_Break=Extend} \p{Word_Break=ZWJ}]*)]
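
The same substitution can be done mechanically. The following is only a
rough sketch in Python, not how CLDR tooling actually processes the rules.
The VARIABLES table, the VAR_REF pattern and the expand function are
hypothetical names made up for illustration, the spelling $Hebrew_Letter is
used throughout, and the (1)/(2) suffixes are treated as plain parts of the
variable names.

```python
import re

# Variable table as listed above. The "(1)"/"(2)" suffixes are kept as
# part of each name; this is an assumption made for illustration.
VARIABLES = {
    "$AHLetter": "[$ALetter(2) $Hebrew_Letter(2)]",
    "$Hebrew_Letter(1)": r"\p{Word_Break=Hebrew_Letter}",
    "$Hebrew_Letter(2)": "($Hebrew_Letter(1) $FEZ*)",
    "$ALetter(1)": r"\p{Word_Break=ALetter}",
    "$ALetter(2)": "($ALetter(1) $FEZ*)",
    "$FEZ": "[$Format $Extend $ZWJ]",
    "$Format": r"\p{Word_Break=Format}",
    "$Extend": r"\p{Word_Break=Extend}",
    "$ZWJ": r"\p{Word_Break=ZWJ}",
}

# A variable reference: "$", a name, and an optional "(n)" suffix.
VAR_REF = re.compile(r"\$\w+(?:\(\d+\))?")


def expand(expr, variables=VARIABLES):
    """Replace variable references repeatedly until none remain."""
    while True:
        # A function is used as the replacement so that backslashes in
        # the substituted text (e.g. "\p") are inserted literally.
        expanded = VAR_REF.sub(lambda m: variables[m.group(0)], expr)
        if expanded == expr:
            return expanded
        expr = expanded


print(expand("$AHLetter"))
```

This replaces every reference in each pass rather than one variable at a
time, so it takes fewer steps than the manual expansion, but it should
arrive at the same final expression, parentheses inside the square brackets
included.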

As you can see, the resulting regular expression contains parentheses
within character classes (and in fact some nested ones too). How should my
implementation handle these cases?
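
For comparison, this is how an ordinary regex engine reads the generalized
case [(abc d*)]. It is a sketch using Python's built-in re module, chosen
only for illustration; it is not the UnicodeSet syntax used by the
segmentation rules and says nothing definitive about how they should be
parsed. The names IN_CLASS, AS_GROUP and matches_whole are hypothetical.
Inside square brackets, Python treats the parentheses, the space and the
asterisk as literal members of the set.

```python
import re

# Inside a character class, "(", ")", "*" and " " are ordinary members
# of the set, so the class matches exactly one character.
IN_CLASS = re.compile(r"[(abc d*)]")

# Outside a class, the parentheses group and "*" repeats the "d".
AS_GROUP = re.compile(r"(abc d*)")


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


print(matches_whole(IN_CLASS, "("))        # a single literal parenthesis
print(matches_whole(IN_CLASS, "abc ddd"))  # more than one character
print(matches_whole(AS_GROUP, "abc ddd"))  # "abc " followed by repeated "d"
```

Read this way, the bracketed form can never match a letter followed by a
run of Format, Extend or ZWJ characters, which is why the expansion above
looks wrong.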

Thanks,

-Cameron

On Sun, Apr 30, 2017 at 2:01 PM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sun, 30 Apr 2017 12:28:51 -0700
> Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > Richard, thanks again for clarifying the notation in use by the
> > segmentation rules - I now understand the left- and right-hand sides
> > to be regular expressions. It's still not clear to me how to interpret
> > parentheses *inside* character classes however. Consider the following
> > generalized case:
> >
> > [(abc d*)]
>
> You should not be getting parentheses inside character classes.
> Ignoring Hebrew letters as a distraction, the rule is
>
> $ALetter × $ALetter
>
> and $ALetter has the value
>
> (\p{Word_Break=ALetter} $FEZ*)
>
> This is a regular expression; it is not defined by a single character
> class (or Unicode set).
>
> At each point for which no break decision has been made, there shall
> be no break if a string immediately before and a string immediately
> after match that pattern.
>
> Richard.