Evaluating segmentation rules question

Andy Heninger andy.heninger at gmail.com
Fri Jul 10 19:08:21 CDT 2020


>
> The root rules for word boundaries end with:
>     <rule id="16"> [^$RI] ($RI $RI)* $RI × $RI </rule>
>     <!-- Otherwise, break everywhere (including around ideographs). -->
> Which makes the final rule "Any ÷ Any" implicit as I read the spec?


This looks like an omission in the CLDR word rules. The UAX #29 word break
rules <https://unicode.org/reports/tr29/#WB999>, from which the CLDR rules
derive, include the default:

Do not break within emoji flag sequences. That is, do not break between
regional indicator (RI) symbols if there is an odd number of RI characters
before the break point.

WB15    sot (RI RI)* RI × RI

WB16    [^RI] (RI RI)* RI × RI

Otherwise, break everywhere (including around ideographs).

WB999    Any ÷ Any
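
As a rough illustration of the rules quoted above, here is a minimal Python
sketch. It models only WB15, WB16 and WB999 and ignores every other word break
rule, so it is not a word breaker. The function names are made up for this
example, and it assumes the RI class is exactly the code points
U+1F1E6..U+1F1FF.

```python
def is_regional_indicator(ch: str) -> bool:
    # Assumption: RI is the block of regional indicator symbols.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def breaks_between_ri(text: str, pos: int) -> bool:
    """Return True if there is a break just before text[pos].

    Only WB15/WB16 (no break inside a flag pair) and the default
    WB999 (Any ÷ Any) are modelled here.
    """
    if not 0 < pos < len(text):
        raise ValueError("pos must be an interior position")

    if not (is_regional_indicator(text[pos - 1])
            and is_regional_indicator(text[pos])):
        return True  # WB999: nothing earlier matched, so break

    # Count the run of RI characters that ends just before pos.
    count = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        count += 1
        i -= 1

    # Odd count: WB15/WB16 apply, no break. Even count: WB999 breaks.
    return count % 2 == 0
```

Note that the look-behind has to walk back over the whole run of RI
characters, which is why the text before the candidate position must stay
available.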

> At any point in a given subject string, the whole string is considered
> during the match process.  In my implementation I have
> been discarding text as I “move” the pointer forward.


Yes, although it turns out you never need to look further back than the
previous break position, assuming you are finding the breaks in order.
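
To make that concrete, here is a small Python sketch of testing each position
against sentence rules 9, 10 and 11 plus a default of "Any × Any". The
character classes are crude stand-ins that I am assuming for the example (the
real $SATerm, $Close, $Sp and $ParaSep are Unicode property sets), and the
names RULES, is_break and find_breaks are invented here.

```python
import re

# Simplified stand-ins for the real property classes (assumptions).
SATERM = r"[.?!]"
CLOSE = r'[)"]'
SP = r" "
PARASEP = r"\n"

# (rule id, pattern before the position, pattern after it, is it a break?)
RULES = [
    ("9", rf"{SATERM}{CLOSE}*", rf"(?:{CLOSE}|{SP}|{PARASEP})", False),
    ("10", rf"{SATERM}{CLOSE}*{SP}*", rf"(?:{SP}|{PARASEP})", False),
    ("11", rf"{SATERM}{CLOSE}*{SP}*{PARASEP}?", r"", True),
    ("998", r"", r"", False),  # default: Any × Any
]


def is_break(text, pos, start=0):
    """Test one position; text[start:pos] is the available look-behind."""
    before, after = text[start:pos], text[pos:]
    for rule_id, left, right, breaks in RULES:
        # The left side must match text ending exactly at pos.
        if re.search(left + r"\Z", before) and re.match(right, after):
            return breaks, rule_id
    raise AssertionError("unreachable: the default rule always matches")


def find_breaks(text):
    """Scan in order, keeping look-behind only back to the last break."""
    found, last = [], 0
    for pos in range(1, len(text)):
        hit, _ = is_break(text, pos, start=last)
        if hit:
            found.append(pos)
            last = pos
    return found
```

With these assumptions, find_breaks("One. Two.") reports a single break at
offset 5, between the space and the "T", via rule 11. Calling
is_break("One. Two.", 5, start=5) throws away the look-behind and falls
through to the default, which is the symptom described in the question.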

  -- Andy

On Fri, Jul 10, 2020 at 3:36 PM Kip Cole <kipcole9 at gmail.com> wrote:

> Andy, thanks for the assist, much appreciated.
>
> >  If some set of rules failed to include a default, that would be a bug
> in the rules.
>
> The root rules for word boundaries end with:
>
>     <rule id="16"> [^$RI] ($RI $RI)* $RI × $RI </rule>
>     <!-- Otherwise, break everywhere (including around ideographs). -->
>
> Which makes the final rule "Any ÷ Any" implicit as I read the spec?
>
> > Between the space and the "T", rule SB11
>
> I think I see the light now. At any point in a given subject string, the
> whole string is considered during the match process. In my implementation I
> have been discarding text as I “move” the pointer forward. Therefore when
> the pointer is at the “ Two.” point I no longer have the prior “.” in scope,
> and hence SB11 fails.
>
> Back to that part of the drawing board …..
>
> Many thanks, —Kip
>
>
>
> On 11 Jul 2020, at 4:41 am, Andy Heninger <andy.heninger at gmail.com> wrote:
>
> Kip Cole writes:
>
>> I have been assuming that the general algorithm for evaluating
>> segmentation rules is this:
>
> ...
>
>
> I think you are quite close, with just a couple of comments...
>
>
>> 3. If the rule matches, break or don’t break depending on the operator of
>> the rule (one of “×÷”) *and then move the string pointer forward*
>
> ...
>
>> 6. Repeat until the end of the string
>
>
> The algorithm tests any single arbitrary position in the string for being
> or not being a boundary. If you want to apply it to every position of a
> string, in sequence, that's fine, but it's not required.
>
>> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
>> always match and break and then advance the string pointer
>
>
> For some types of boundaries the default is "Any × Any"; for others it
> is "Any ÷ Any". In any event, the default is always included explicitly in
> the rules, so the algorithm itself doesn't need to mention it. If some set
> of rules failed to include a default, that would be a bug in the rules.
>
> For the specific question on the sentence break of “One. Two.”
>
>> The string pointer is here:  “. Two.”
>
>
> Between the first "." and the space there is no boundary, with the rules
> applied as you described.
>
> Between the space and the "T", rule SB11
>
>     SATerm Close* Sp* ParaSep? ÷
>
> applies, causing a boundary. The space character binds to the preceding
> sentence, not the following one.
>
> It's often easier to look at the rules in Unicode UAX 29
> <https://unicode.org/reports/tr29/#Sentence_Boundary_Rules> than in CLDR.
> The UAX rules usually match the root CLDR rules.
>
>   -- Andy
>
>
>
>
>
> On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <
> cldr-users at unicode.org> wrote:
>
>> I have been assuming that the general algorithm for evaluating
>> segmentation rules is this:
>>
>> 1. At the current pointer in the subject string
>> 2. Evaluate rules in order until a rule “passes” (i.e. matches)
>> 3. If the rule matches, break or don’t break depending on the operator of
>> the rule (one of “×÷”) and then move the string pointer forward
>> 4. If the rule does not match, try the next rule
>> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
>> always match and break and then advance the string pointer
>> 6. Repeat until the end of the string
>>
>> However when applying this approach to the sentence break rules in the
>> root locale for the string “One. Two.”  the following is resolved:
>>
>> The string pointer is here:  “. Two.” Apply the following sentence break
>> rules (partial)
>>
>> <!-- Break after sentence terminators, but include closing punctuation,
>> trailing spaces, and any paragraph separator. [See note below.] Include
>> closing punctuation, trailing spaces, and (optionally) a paragraph
>> separator. -->
>> <rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
>> <!-- Note the fix to $Sp*, $Sep? -->
>> <rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
>> <rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>
>>
>> Rule 9 will match:
>>   "$SATerm $Close*" matches the “.”
>>   "( $Close | $Sp | $ParaSep )" matches the “ Two.”
>>
>> Since it matches, and is a `no break` match, rule processing finishes
>> and the string pointer is advanced. Therefore there is never a sentence
>> break. Removing rule 9 lets rule processing reach rule 11, which
>> matches and then breaks as expected.
>>
>> Am I incorrectly understanding the flow of rule evaluation?
>>
>> Thanks for the help as always,
>>
>> —Kip
>>
>> _______________________________________________
>> CLDR-Users mailing list
>> CLDR-Users at corp.unicode.org
>> https://corp.unicode.org/mailman/listinfo/cldr-users
>>
>
>