Evaluating segmentation rules question
Andy Heninger
andy.heninger at gmail.com
Fri Jul 10 15:41:02 CDT 2020
Kip Cole writes:
> I have been assuming that the general algorithm for evaluating
> segmentation rules is this:
...
I think you are quite close, with just a couple of comments...
> 3. If the rule matches, break or don’t break depending on the operator of
> the rule (one of “×÷”) *and then move the string pointer forward*
...
> 6. Repeat until the end of the string
The algorithm tests any single arbitrary position in the string for being
or not being a boundary. If you want to apply it to every position of a
string, in sequence, that's fine, but it's not required.
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer
For some types of boundaries, the default is "Any × Any"; for others it
is "Any ÷ Any". In any event, the default is always included explicitly in
the rules, so the algorithm itself doesn't need to mention it. If some set
of rules failed to include a default, that would be a bug in the rules.
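
A minimal sketch of that evaluation in Python, under some loud assumptions: the names TOY_RULES and is_boundary are made up for illustration, the two toy rules are not taken from UAX 29 or CLDR, and each rule is modeled as a pair of regular expressions for the text before and after the position. It only shows the shape of the algorithm: one position is tested in isolation, the first matching rule decides, and the default is just the last rule in the list.

```python
import re

# Hypothetical toy rule set: (before, after, breaks).
# "before" must match the text ending at the position, "after" the text
# starting at it; breaks=True stands for "÷", breaks=False for "×".
ANY = r"[\s\S]"
TOY_RULES = [
    (r"[A-Za-z]", r"[A-Za-z]", False),  # Letter × Letter
    (ANY, ANY, True),                   # Any ÷ Any, the explicit default
]

def is_boundary(text, pos, rules=TOY_RULES):
    """Test one interior position (0 < pos < len(text)) for a boundary."""
    before, after = text[:pos], text[pos:]
    for left, right, breaks in rules:
        # First rule whose two sides both match decides the outcome.
        if re.search(f"(?:{left})\\Z", before) and re.match(right, after):
            return breaks
    # Reaching this point means the rule set lacks a default: a rules bug.
    raise ValueError("rule set has no default rule")

# Applying the test to every position in sequence is optional.
text = "ab cd"
boundaries = [p for p in range(1, len(text)) if is_boundary(text, p)]
print(boundaries)
```

Note that nothing in is_boundary moves a string pointer; the caller chooses which positions to test.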
For the specific question on the sentence break of “One. Two.”
> The string pointer is here: “. Two.”
Between the first "." and the space there is no boundary, with the rules
applied as you described.
Between the space and the "T", rule SB11
SATerm Close* Sp* ParaSep? ÷
applies, causing a boundary. The space character binds to the preceding
sentence, not the following one.
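
The two positions above can be checked with a rough Python sketch. Everything here is an assumption made for illustration: the character classes are drastically simplified stand-ins for the real Sentence_Break properties, only SB9, SB10, SB11 and a final "Any × Any" default are modeled (the earlier sentence rules are omitted), and the names SB_RULES and decide are hypothetical.

```python
import re

# Simplified stand-ins for the Sentence_Break classes (assumption).
SATERM = r"[.?!]"
CLOSE = r"[)\]'\"]"
SP = r"[ ]"
PARASEP = r"[\r\n]"
ANY = r"[\s\S]"

# (id, before, after, breaks): breaks=True is "÷", breaks=False is "×".
SB_RULES = [
    ("SB9", f"{SATERM}{CLOSE}*", f"(?:{CLOSE}|{SP}|{PARASEP})", False),
    ("SB10", f"{SATERM}{CLOSE}*{SP}*", f"(?:{SP}|{PARASEP})", False),
    ("SB11", f"{SATERM}{CLOSE}*{SP}*{PARASEP}?", "", True),
    ("SB998", ANY, ANY, False),  # explicit default: Any × Any
]

def decide(text, pos):
    """Return (rule id, breaks) for one interior position of text."""
    before, after = text[:pos], text[pos:]
    for rule_id, left, right, breaks in SB_RULES:
        if re.search(f"(?:{left})\\Z", before) and re.match(right, after):
            return rule_id, breaks
    raise ValueError("rule set has no default rule")

text = "One. Two."
print(decide(text, 4))  # between "." and the space
print(decide(text, 5))  # between the space and "T"
```

At position 4 the sketch stops at SB9 (no break). At position 5, SB9 no longer matches because the text before the position ends in a space rather than in SATerm Close*, SB10 fails because "T" is neither Sp nor ParaSep, and SB11 then matches and breaks.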
It's often easier to look at the rules in Unicode UAX 29
<https://unicode.org/reports/tr29/#Sentence_Boundary_Rules> than in CLDR.
The UAX rules usually match the root CLDR rules.
-- Andy
On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <
cldr-users at unicode.org> wrote:
> I have been assuming that the general algorithm for evaluating
> segmentation rules is this:
>
> 1. At the current pointer in the subject string
> 2. Evaluate rules in order until a rule “passes” (ie matches)
> 3. If the rule matches, break or don’t break depending on the operator of
> the rule (one of “×÷”) and then move the string pointer forward
> 4. If the rule does not match, try the next rule
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer
> 6. Repeat until the end of the string
>
> However when applying this approach to the sentence break rules in the
> root locale for the string “One. Two.” the following is resolved:
>
> The string pointer is here: “. Two.” Apply the following sentence break
> rules (partial)
>
> <!-- Break after sentence terminators, but include closing punctuation,
> trailing spaces, and any paragraph separator. [See note below.] Include
> closing punctuation, trailing spaces, and (optionally) a paragraph
> separator. -->
> <rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
> <!-- Note the fix to $Sp*, $Sep? -->
> <rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
> <rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>
>
> Rule 9 will match:
> "$SATerm $Close*" matches the “.”
> "( $Close | $Sp | $ParaSep )" matches the “ Two.”
>
> Since it matches, and is a `no break` match then rule processing finishes
> and the string pointer is advanced. Therefore there is never a sentence
> break. Removing rule 9 results in rule processing to get to Rule 11 which
> matches and then breaks as expected.
>
> Am I incorrectly understanding the flow of rule evaluation?
>
> Thanks for the help as always,
>
> —Kip