Word break question

Richard Wordingham via CLDR-Users cldr-users at unicode.org
Sun Apr 30 06:21:44 CDT 2017


On Sat, 29 Apr 2017 22:41:33 -0700
Cameron Dutro via CLDR-Users <cldr-users at unicode.org> wrote:

> Hey Richard,
> 
> Thank you for your response, it's been quite helpful.
> 
> I'm aware of the difference between the various types of brackets. I
> think my implementation is treating the round brackets as literal
> characters because rules are compiled into a regular expression.
> Variable replacement gives this as (nearly) the final expanded form
> of the rule in question:
> 
> [(\p{Word_Break=ALetter} $FEZ*) \p{Word_Break=Hebrew_Letter}] ×
> [(\p{Word_Break=ALetter} $FEZ*) \\p{Word_Break=Hebrew_Letter}]

I trust this expansion has been abbreviated for citing in an email -
ALetter and Hebrew_Letter should be treated the same.

> As you can see, the parentheses exist *within* the character class,
> and are therefore treated as literal characters.
> 
> I understand that the transformation rules are unicode sets as
> opposed to true regular expressions. Is there documentation available
> that explains the differences between the two, or perhaps the syntax
> and expected behavior of a unicode set?

I'm not sure what you mean by a 'true regular expression'; whatever
one might think of Unicode sets, they are true definitions of regular
languages on the alphabet of code points.  They are thus
regular expressions, though possibly not as you know them.  There is a
slight gap in the set of strings of code points; one cannot have a
leading surrogate followed by a trailing surrogate if one is processing
a well-formed Unicode string, for that is the UTF-16 representation of
a single code point in a supplementary plane.
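
To make the surrogate point concrete, here is a minimal sketch in
Python (the language and the choice of U+1D11E as the example are
illustrative assumptions, not taken from the specification).  It shows
that a leading surrogate followed by a trailing surrogate in UTF-16 is
one code point, not two:

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies in a supplementary plane.
ch = "\U0001D11E"

# Encode to UTF-16 (big-endian, no BOM) and split into 16-bit code units.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]

lead, trail = units  # one leading and one trailing surrogate

# Standard UTF-16 decoding arithmetic for a surrogate pair.
combined = 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00)

print(len(ch), [hex(u) for u in units], hex(combined))
```

So the two-element sequence <lead, trail> never occurs as a string of
two code points in well-formed text; it is always the single code point
`combined`.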

There is documentation on the syntax of Unicode sets at
http://unicode.org/reports/tr35/tr35.html#Unicode_Sets.  You will see
that Version 31 Section 5.3.3.4 is very restrictive in what string
specifications it allows.  Moreover, the opening paragraph of Section
5.3.3 says, "A UnicodeSet represents a finite set of Unicode code
points and strings".  "(\p{Word_Break=ALetter} $FEZ*)" defines an
infinite set of strings, so would be illegal even if it were
syntactically correct.
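
The "infinite set" point can be sketched as follows.  Python's `re`
module does not support \p{Word_Break=...}, so the two character
classes below are hypothetical stand-ins: [A-Za-z] for ALetter, and the
combining diacritics block plus ZWJ for $FEZ (presumably Format, Extend
and ZWJ).  Only the shape of the expression matters here:

```python
import re

# Hypothetical stand-ins, NOT the real Unicode property values.
ALETTER = "[A-Za-z]"
FEZ = "[\u0300-\u036F\u200D]"

# Sequence "ALetter FEZ*": one letter followed by any number of FEZ characters.
seq = re.compile(f"{ALETTER}{FEZ}*")

def matches(n: int) -> bool:
    """True if 'a' followed by n combining acute accents is in the language."""
    return seq.fullmatch("a" + "\u0301" * n) is not None

# Every length matches, so no finite list of strings can enumerate the language.
print(all(matches(n) for n in range(50)))
```

Because of the `*`, there is a matching string for every n, which is
why such an expression cannot be a UnicodeSet as that section defines
one.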

The rules themselves are given in the form:

regular_expression boundary_decision regular_expression

These regular expressions are usually not Unicode sets:

1) They are not given in such notation.
2) At http://www.unicode.org/reports/tr29/#Notation there is the
statement, "The left and right sides use the boundary property values in
regular expressions."
3) The sets of applicable boundary strings are mostly infinite.
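
A minimal sketch, in Python, of applying rules of the form
"regular_expression boundary_decision regular_expression" at each
position in a string.  Everything in it is an assumption made for
illustration: the character classes are stand-ins (Python's `re` lacks
\p{Word_Break=...}), the three toy rules are not the real UAX #29 rule
set, and the helper names `decide` and `boundaries` are hypothetical.
The first matching rule wins, and the last rule is a catch-all break.

```python
import re

# Hypothetical stand-ins, NOT the real Unicode property values.
ALETTER = "[A-Za-z]"
FEZ = "[\u0300-\u036F\u200D]"

# (left regex, decision, right regex); "×" = no break, "÷" = break.
RULES = [
    (r"[\s\S]", "×", FEZ),                # toy: never break before a FEZ char
    (f"{ALETTER}{FEZ}*", "×", ALETTER),   # toy: letter (FEZ*) × letter
    ("", "÷", ""),                        # catch-all: otherwise break
]

def decide(text: str, i: int) -> str:
    """Return the decision of the first rule that applies between i-1 and i."""
    before, after = text[:i], text[i:]
    for left, decision, right in RULES:
        # The left side must match a suffix of what precedes the position,
        # the right side a prefix of what follows it.
        if re.search(f"(?:{left})" + r"\Z", before) and re.match(f"(?:{right})", after):
            return decision
    return "÷"

def boundaries(text: str) -> list[int]:
    """Interior positions where the toy rules decide on a break."""
    return [i for i in range(1, len(text)) if decide(text, i) == "÷"]

print(boundaries("ca\u0301t on"))
```

Note that the left and right sides are matched as sequences, with
grouping and repetition outside any character class.  That is the
distinction from a Unicode set that matters for the expansion quoted
earlier.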

Richard.


