Word break question

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Sun Apr 30 16:01:49 CDT 2017


On Sun, 30 Apr 2017 12:28:51 -0700
Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:

> Richard, thanks again for clarifying the notation in use by the
> segmentation rules - I now understand the left- and right-hand sides
> to be regular expressions. It's still not clear to me how to interpret
> parentheses *inside* character classes however. Consider the following
> generalized case:
> 
> [(abc d*)]

You should not be getting parentheses inside character classes.
Ignoring Hebrew letters as a distraction, the rule is

$ALetter × $ALetter

and $ALetter has the value

(\p{Word_Break=ALetter} $FEZ*)

This is a regular expression; it is not defined by a single character
class (or Unicode set).

At each point for which no break decision has been made, there shall
be no break if a string ending immediately before the point and a
string starting immediately after it each match that pattern.
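
As a rough illustration only, the check for this one rule might be
sketched in Python as below. Python's re module has no
\p{Word_Break=...} support, so the two character sets are stand-ins I
am assuming for the example (ASCII letters for ALetter; combining
diacritics and ZWJ for $FEZ), and the names ALETTER, FEZ and no_break
are hypothetical. It tests a single rule in isolation and does not
model the earlier rules that may already have decided a position.

```python
import re

# Stand-ins (assumptions): ASCII letters approximate \p{Word_Break=ALetter};
# combining diacritics U+0300-U+036F plus ZERO WIDTH JOINER U+200D
# approximate $FEZ. The real sets come from the Unicode data files.
ALETTER = r"[A-Za-z]"
FEZ = r"[\u0300-\u036F\u200D]"

# $ALetter as a regular expression: one letter followed by any number of
# $FEZ characters. The parentheses group a sequence; this is not a set.
ALETTER_EXPR = f"(?:{ALETTER}{FEZ}*)"

BEFORE = re.compile(ALETTER_EXPR + r"\Z")  # match must end at the boundary
AFTER = re.compile(ALETTER_EXPR)           # match must start at the boundary


def no_break(text: str, pos: int) -> bool:
    """Return True if '$ALetter x $ALetter' alone forbids a break at pos."""
    return bool(BEFORE.search(text[:pos])) and bool(AFTER.match(text[pos:]))
```

With these stand-ins, no_break("ab", 1) holds, and so does
no_break("a\u0301b", 2), because the combining accent is absorbed by
the $FEZ* part of the left-hand match.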

Richard.


