Evaluating segmentation rules question

Kip Cole kipcole9 at gmail.com
Fri Jul 10 02:29:49 CDT 2020


I have been assuming that the general algorithm for evaluating segmentation rules is this:

1. Start at the current pointer in the subject string
2. Evaluate the rules in order until a rule “passes” (i.e. matches)
3. If a rule matches, break or don’t break depending on the operator of the rule (one of “×÷”), then move the string pointer forward
4. If a rule does not match, try the next rule
5. If no rule matches, apply the default rule "Any ÷ Any", which always matches and breaks, then advance the string pointer
6. Repeat until the end of the string
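
The steps above can be sketched roughly as follows. This is only an illustration of the assumed flow, not the CLDR or ICU implementation: the rule representation (a regex for the text before the pointer, an operator, a regex for the text after it), the function name `find_breaks`, and the choice to advance the pointer one position at a time are all assumptions made for the example.

```python
import re

BREAK, NO_BREAK = "÷", "×"

def find_breaks(text, rules):
    """Return the offsets in `text` where a break is placed.

    `rules` is an ordered list of (before_regex, operator, after_regex)
    tuples; this shape is hypothetical, chosen only for the sketch.
    """
    breaks = []
    pos = 1  # first candidate boundary: between text[0] and text[1]
    while pos < len(text):
        before, after = text[:pos], text[pos:]
        operator = BREAK  # step 5: default "Any ÷ Any"
        for before_re, rule_operator, after_re in rules:
            # A rule matches when its left side ends exactly at the
            # pointer and its right side starts exactly at the pointer.
            if re.search(f"(?:{before_re})\\Z", before) and re.match(after_re, after):
                operator = rule_operator  # step 3: first match decides
                break
        if operator == BREAK:
            breaks.append(pos)
        pos += 1  # assumption: advance by one position
    return breaks

# Hypothetical rule: never break between two lowercase letters.
rules = [(r"[a-z]", NO_BREAK, r"[a-z]")]
print(find_breaks("ab cd", rules))
```

With the single made-up rule shown, breaks are placed only around the space, because every other boundary is covered by the no-break rule.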

However, when I apply this approach to the sentence break rules in the root locale for the string “One. Two.”, I get the following result:

The string pointer is here: “. Two.” The following sentence break rules (partial) are applied:

<!-- Break after sentence terminators, but include closing punctuation, trailing spaces, and any paragraph separator. [See note below.] Include closing punctuation, trailing spaces, and (optionally) a paragraph separator. -->
<rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
<!-- Note the fix to $Sp*, $Sep? -->
<rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
<rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>

Rule 9 will match:
  "$SATerm $Close*" matches the “.”
  "( $Close | $Sp | $ParaSep )" matches the space at the start of “ Two.”

Since rule 9 matches and is a `no break` rule, rule processing finishes and the string pointer is advanced. Therefore there is never a sentence break. Removing rule 9 lets rule processing reach rule 11, which matches and then breaks as expected.
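
One way to probe this is to test the three quoted rules at more than one boundary position. The sketch below does that under loud assumptions: the character classes are crude stand-ins for the real `$SATerm`, `$Close`, `$Sp` and `$ParaSep` definitions, the rules are translated by hand into regular expressions, and `decide` is a hypothetical helper, not part of any CLDR tooling.

```python
import re

# Simplified stand-ins for the real variable definitions (assumptions).
SATERM = r"[.?!]"
CLOSE = r"[\"')\]]"
SP = r" "
PARASEP = r"\n"

# (rule id, text before the boundary, operator, text after the boundary)
RULES = [
    ("9", SATERM + CLOSE + "*", "×", f"(?:{CLOSE}|{SP}|{PARASEP})"),
    ("10", SATERM + CLOSE + "*" + SP + "*", "×", f"(?:{SP}|{PARASEP})"),
    ("11", SATERM + CLOSE + "*" + SP + "*" + PARASEP + "?", "÷", ""),
]

def decide(text, pos):
    """Return (rule id, operator) for the boundary before text[pos]."""
    before, after = text[:pos], text[pos:]
    for rule_id, before_re, operator, after_re in RULES:
        # First rule whose left side ends at the boundary and whose
        # right side starts at the boundary decides the outcome.
        if re.search(f"(?:{before_re})\\Z", before) and re.match(after_re, after):
            return rule_id, operator
    return None, "÷"  # fallback: the default "Any ÷ Any"

text = "One. Two."
for pos in (4, 5):
    # pos 4 is between "." and " "; pos 5 is between " " and "T".
    print(pos, repr(text[:pos]), repr(text[pos:]), decide(text, pos))
```

In this simplified model rule 9 decides the boundary between the period and the space (no break), while at the next boundary, between the space and “T”, rules 9 and 10 no longer match and rule 11 produces a break. If the real rules behave like this simplification, the break would fall after the trailing space rather than directly after the period.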

Am I incorrectly understanding the flow of rule evaluation?

Thanks for the help as always,

—Kip
