Evaluating segmentation rules question

Andy Heninger andy.heninger at gmail.com
Fri Jul 10 15:41:02 CDT 2020


Kip Cole writes:

> I have been assuming that the general algorithm for evaluating
> segmentation rules is this:

...


I think you are quite close, with just a couple of comments...


> 3. If the rule matches, break or don’t break depending on the operator of
> the rule (one of “×÷”) *and then move the string pointer forward*

...

> 6. Repeat until the end of the string


The algorithm tests any single arbitrary position in the string for being
or not being a boundary. If you want to apply it to every position of a
string, in sequence, that's fine, but it's not required.

> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer


For some types of boundaries, the default is "Any × Any"; for others it
is "Any ÷ Any". In any event, the default is always included explicitly in
the rules, so the algorithm itself doesn't need to mention it. If some set
of rules failed to include a default, that would be a bug in the rules.
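
As a rough illustration of the two points above, here is a minimal Python
sketch. The names (Rule, TOY_RULES, is_boundary, boundaries) and the toy
rules are made up for this example and are not taken from CLDR or ICU. Each
rule is modelled as a regex for the text ending at the position, a regex for
the text starting at the position, and a break / no-break flag. The default
is simply the last rule in the list.

```python
import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    left: str     # regex that must match text ending at the position
    right: str    # regex that must match text starting at the position
    breaks: bool  # True for "÷" (break), False for "×" (no break)


ANY = r"[\s\S]"  # any single character, including newlines

# Toy rules (an assumption for illustration): keep letters together,
# break everywhere else.
TOY_RULES = [
    Rule(r"[A-Za-z]", r"[A-Za-z]", False),  # Letter × Letter
    Rule(ANY, ANY, True),                   # Any ÷ Any, the explicit default
]


def is_boundary(text: str, pos: int, rules: List[Rule]) -> bool:
    """Test one position; no state is carried over from other positions."""
    before, after = text[:pos], text[pos:]
    for rule in rules:
        if re.search(f"(?:{rule.left})\\Z", before) and re.match(rule.right, after):
            return rule.breaks  # the first matching rule decides
    raise ValueError("no rule matched: the rule set lacks a default")


def boundaries(text: str, rules: List[Rule]) -> List[int]:
    """Optional convenience: test every interior position in sequence."""
    return [pos for pos in range(1, len(text)) if is_boundary(text, pos, rules)]
```

With these toy rules, boundaries("ab cd", TOY_RULES) gives [2, 3]. Calling
is_boundary with the default rule removed raises an error at position 2,
which is the "bug in the rules" case. Start and end of text are left out of
the sketch; real rule sets handle them with their own rules.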

For the specific question on the sentence break of “One. Two.”:

> The string pointer is here:  “. Two.”


Between the first "." and the space there is no boundary, with the rules
applied as you described.

Between the space and the "T", rule SB11

SATerm Close* Sp* ParaSep? ÷

applies, causing a boundary. The space character binds to the preceding
sentence, not the following one.
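
To make the two positions concrete, here is a small Python sketch of rules
9, 10 and 11 plus the "Any × Any" default. The character classes are
deliberately simplified stand-ins (assumptions) for the real Unicode
properties, and classify_position is a hypothetical helper, not an ICU or
CLDR API.

```python
import re

# Simplified stand-ins for the UAX 29 property classes (assumptions).
SATERM = r"[.?!]"
CLOSE = r'[")\]]'
SP = r"[ ]"
PARASEP = r"\n"
ANY = r"[\s\S]"

# (rule id, text ending at the position, text starting at it, is_break)
SENTENCE_RULES = [
    ("9", f"{SATERM}{CLOSE}*", f"(?:{CLOSE}|{SP}|{PARASEP})", False),
    ("10", f"{SATERM}{CLOSE}*{SP}*", f"(?:{SP}|{PARASEP})", False),
    ("11", f"{SATERM}{CLOSE}*{SP}*{PARASEP}?", "", True),
    ("998", ANY, ANY, False),  # Any × Any, the default for sentences
]


def classify_position(text, pos):
    """Return (rule id, is_break) for the first rule matching at pos."""
    before, after = text[:pos], text[pos:]
    for rule_id, left, right, is_break in SENTENCE_RULES:
        if re.search(f"(?:{left})\\Z", before) and re.match(right, after):
            return rule_id, is_break
    raise ValueError("no rule matched")


text = "One. Two."
print(classify_position(text, 4))  # between "." and " "  -> ('9', False)
print(classify_position(text, 5))  # between " " and "T"  -> ('11', True)
```

Rule 9 decides the position after the "." and nothing more. The position
after the space is tested from scratch, rules 9 and 10 fail there because
the next character is "T", and rule 11 produces the break.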

It's often easier to look at the rules in Unicode UAX 29
<https://unicode.org/reports/tr29/#Sentence_Boundary_Rules> than in CLDR.
The UAX rules usually match the root CLDR rules.

  -- Andy





On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:

> I have been assuming that the general algorithm for evaluating
> segmentation rules is this:
>
> 1. At the current pointer in the subject string
> 2. Evaluate rules in order until a rule “passes” (ie matches)
> 3. If the rule matches, break or don’t break depending on the operator of
> the rule (one of “×÷”) and then move the string pointer forward
> 4. If the rule does not match, try the next rule
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer
> 6. Repeat until the end of the string
>
> However when applying this approach to the sentence break rules in the
> root locale for the string “One. Two.”  the following is resolved:
>
> The string pointer is here:  “. Two.” Apply the following sentence break
> rules (partial)
>
> <!-- Break after sentence terminators, but include closing punctuation,
> trailing spaces, and any paragraph separator. [See note below.] Include
> closing punctuation, trailing spaces, and (optionally) a paragraph
> separator. -->
> <rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
> <!-- Note the fix to $Sp*, $Sep? -->
> <rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
> <rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>
>
> Rule 9 will match:
>   "$SATerm $Close*" matches the “.”
>   "( $Close | $Sp | $ParaSep )" matches the “ Two.”
>
> Since it matches, and is a `no break` match then rule processing finishes
> and the string pointer is advanced. Therefore there is never a sentence
> break. Removing rule 9 results in rule processing to get to Rule 11 which
> matches and then breaks as expected.
>
> Am I incorrectly understanding the flow of rule evaluation?
>
> Thanks for the help as always,
>
> —Kip
>