Word break question

Cameron Dutro via CLDR-Users cldr-users at unicode.org
Sun Apr 30 14:28:51 CDT 2017


Philippe, thanks for your response. As Richard said, the variable entries
are "evaluated" from first to last, which obviates the self-referencing
problem you mentioned. My Ruby implementation follows this rule, so I don't
think that's the problem.
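
The first-to-last evaluation can be sketched as follows. This is a minimal
Python illustration, not the Ruby implementation mentioned above; the function
name `expand_variables` and the sample values `[a-c]` and `[xyz]` are
hypothetical stand-ins for real CLDR definitions.

```python
import re


def expand_variables(definitions):
    """Evaluate (name, value) pairs in order, first to last.

    A value may only see variables resolved earlier, so a definition that
    mentions its own name picks up that name's previous value instead of
    recursing forever.
    """
    resolved = {}
    for name, value in definitions:
        def substitute(match):
            # Unknown names are left untouched rather than guessed at.
            return resolved.get(match.group(0), match.group(0))

        resolved[name] = re.sub(r"\$\w+", substitute, value)
    return resolved


# Hypothetical definitions; real CLDR values are Unicode property sets.
definitions = [
    ("$FEZ", "[xyz]"),
    ("$ALetter", "[a-c]"),
    ("$ALetter", "($ALetter $FEZ*)"),  # refers to the earlier $ALetter
]

expanded = expand_variables(definitions)
print(expanded["$ALetter"])
```

Under these assumptions the second `$ALetter` definition expands exactly once,
using the value assigned by the first one.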

Richard, thanks again for clarifying the notation in use by the
segmentation rules - I now understand the left- and right-hand sides to be
regular expressions. It's still not clear to me, however, how to interpret
parentheses *inside* character classes. Consider the following
generalized case:

[(abc d*)]

The above regular expression is nonsensical unless one considers the
parentheses and asterisk to be individual characters in the character class
as opposed to a grouping of characters that must match in order (e.g. a
capturing group). Of the programming languages I've used, I know of none
that would treat the parentheses as anything but literal characters. As far
as I can tell, such behavior isn't even mentioned in UTS #18
<http://unicode.org/reports/tr18>. There is the provision for character
*groups* within character classes - for example [{abc}{def}] - but that
doesn't take repetitions like * and + into account.
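
The literal-character behaviour is easy to check in one mainstream engine. The
sketch below uses Python's `re` module purely as an example; other engines may
differ in detail, and the variable names are illustrative only.

```python
import re

# The pattern from the message, handed unchanged to Python's `re` module.
char_class = re.compile(r"[(abc d*)]")

# Every character between the brackets is a literal member of the set,
# including the parentheses, the space, and the asterisk.
members = [ch for ch in "(abc d*)" if char_class.fullmatch(ch)]

# A multi-character string cannot match, because a set consumes exactly
# one character.
matches_abc = char_class.fullmatch("abc") is not None

print(members)
print(matches_abc)
```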

That said, it appears that any conformant implementation of the
segmentation rules must make an allowance for parentheses inside character
classes. How then should [(abc d*)] be interpreted? I can think of several
interpretations:


   1. Simply remove the square brackets.

   2. Replace the grouping symbols () with the grouping symbols {}, which are
   explicitly supported in Unicode regular expressions (UTS #18).
   Unfortunately, this still leaves open the question of how to interpret
   repetition symbols.

   3. Allow repetition symbols in character classes. This would require
   rewriting the regular expression above as something like (?:abc|d*).
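
Interpretations 1 and 3 accept different strings, which a short Python
comparison makes concrete. This is a sketch under two assumptions that the
segmentation rules themselves do not confirm here: that whitespace inside the
rule is insignificant (modelled with `re.VERBOSE`), and that interpretation 3
turns each space-separated item into an alternative. Interpretation 2 is left
out because Python's `re` has no `{abc}` string-in-set syntax.

```python
import re

# Interpretation 1: drop the square brackets. With re.VERBOSE the space is
# ignored, so the group is "abc" followed by zero or more "d".
as_sequence = re.compile(r"(abc d*)", re.VERBOSE)

# Interpretation 3: each item becomes an alternative, keeping its repetition.
as_alternation = re.compile(r"(?:abc|d*)")

samples = ["abc", "abcdd", "dd", ""]

# Map each sample to (matches interpretation 1, matches interpretation 3).
results = {
    s: (
        as_sequence.fullmatch(s) is not None,
        as_alternation.fullmatch(s) is not None,
    )
    for s in samples
}

for sample, (seq, alt) in results.items():
    print(repr(sample), "sequence:", seq, "alternation:", alt)
```

Only "abc" is accepted by both readings, so the choice of interpretation
changes which text a rule matches.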

Any guidance would be much appreciated!

-Cameron

On Sun, Apr 30, 2017 at 6:11 AM, Richard Wordingham via CLDR-Users <
cldr-users at unicode.org> wrote:

> On Sun, 30 Apr 2017 14:49:37 +0200
> Philippe Verdy via CLDR-Users <cldr-users at unicode.org> wrote:
>
> > ... Even if you drop the
> > (unnecessary) parentheses in <variable id="$ALetter">$ALetter
> > $FEZ*</variable> it will not be correct.
>
> > In fact this variable definition is silly because it is
> > self-referencing itself, so it would expand to
> > <variable id="$ALetter">(($ALetter $FEZ*) $FEZ*)</variable>, then
> > <variable id="$ALetter">((($ALetter $FEZ*) $FEZ*) $FEZ*)</variable>,
> > and so on infinitely.
>
> That is another reason for calling it a *variable*; it is not a
> *constant*.  If you read the text at
> http://unicode.org/reports/tr35/tr35-general.html#Segmentations, you
> would find the statement "The ordering of variables is important; they
> are evaluated in order from first to last (see Section 9.1 Segmentation
> Inheritance)".
>
> Richard.