Evaluating segmentation rules question
Kip Cole
kipcole9 at gmail.com
Fri Jul 10 17:35:51 CDT 2020
Andy, thanks for the assist, much appreciated.
> If some set of rules failed to include a default, that would be a bug in the rules.
The root rules for word boundaries end with:
<rule id="16"> [^$RI] ($RI $RI)* $RI × $RI </rule>
<!-- Otherwise, break everywhere (including around ideographs). -->
That makes the final rule "Any ÷ Any" implicit, as I read the spec?
> Between the space and the "T", rule SB11
I think I see the light now. At any point in a given subject string, the whole
string is considered during the match process. In my implementation I have
been discarding text as I “move” the pointer forward. Therefore, when the pointer
is at the “ Two.” point, I no longer have the prior “.” in scope and hence SB11
fails.
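
To make that concrete, here is a minimal Python sketch of testing a single position with the whole string in scope. It is only an illustration, not the ICU or CLDR implementation: the character classes are crude ASCII stand-ins for the real $SATerm, $Close, $Sp and $ParaSep properties, the names RULES and is_boundary are hypothetical, only rules 9 to 11 plus the final "Any × Any" default are included, and the start-of-text and end-of-text rules are left out.

```python
import re

# Crude ASCII stand-ins for the UAX #29 properties (assumptions for illustration).
SATERM = r"[.?!]"
CLOSE = r"[\"')\]]"
SP = r"[ \t]"
PARASEP = r"[\n\r]"

# (rule id, pattern that must end at the position, pattern that must start there, breaks?)
RULES = [
    ("9", rf"{SATERM}{CLOSE}*", rf"(?:{CLOSE}|{SP}|{PARASEP})", False),
    ("10", rf"{SATERM}{CLOSE}*{SP}*", rf"(?:{SP}|{PARASEP})", False),
    ("11", rf"{SATERM}{CLOSE}*{SP}*{PARASEP}?", r"", True),
    ("998", r".", r".", False),  # explicit default for sentence break: Any × Any
]


def is_boundary(text, pos):
    """Test one interior position; return (breaks, id of the first matching rule)."""
    # Slice from the complete string every time, so the left context is never lost.
    before, after = text[:pos], text[pos:]
    for rule_id, pre, post, breaks in RULES:
        left_ok = re.search(rf"(?:{pre})\Z", before, re.S)  # must end exactly at pos
        right_ok = re.match(post, after, re.S)              # must start exactly at pos
        if left_ok and right_ok:
            return breaks, rule_id
    raise ValueError("rule set has no default rule")


text = "One. Two."
print(is_boundary(text, 4))  # between "." and " "
print(is_boundary(text, 5))  # between " " and "T"
```

With the full string available, position 4 is resolved by rule 9 (no break) and position 5 by rule 11 (break). Calling is_boundary(" Two.", 1) on the truncated string instead falls through to the default and reports no break, which is the behaviour I was seeing.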
Back to that part of the drawing board …..
Many thanks, —Kip
> On 11 Jul 2020, at 4:41 am, Andy Heninger <andy.heninger at gmail.com> wrote:
>
> Kip Cole writes:
> I have been assuming that the general algorithm for evaluating segmentation rules is this:
> ...
>
> I think you are quite close, with just a couple of comments...
>
> 3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) and then move the string pointer forward
> ...
> 6. Repeat until the end of the string
>
> The algorithm tests any single arbitrary position in the string for being or not being a boundary. If you want to apply it to every position of a string, in sequence, that's fine, but it's not required.
>
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
>
> For some types of boundaries, the default is "Any × Any"; for others it is "Any ÷ Any". In any event, the default is always included explicitly in the rules, so the algorithm itself doesn't need to mention it. If some set of rules failed to include a default, that would be a bug in the rules.
>
> For the specific question on the sentence break of “One. Two.”
> The string pointer is here: “. Two.”
>
> Between the first "." and the space there is no boundary, with the rules applied as you described.
>
> Between the space and the "T", rule SB11
> SATerm Close* Sp* ParaSep? ÷
> applies, causing a boundary. The space character binds to the preceding sentence, not the following one.
>
> It's often easier to look at the rules in Unicode UAX 29 <https://unicode.org/reports/tr29/#Sentence_Boundary_Rules> than in CLDR. The UAX rules usually match the root CLDR rules.
>
> -- Andy
>
> On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:
> I have been assuming that the general algorithm for evaluating segmentation rules is this:
>
> 1. At the current pointer in the subject string
> 2. Evaluate rules in order until a rule “passes” (i.e. matches)
> 3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) and then move the string pointer forward
> 4. If the rule does not match, try the next rule
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
> 6. Repeat until the end of the string
>
> However when applying this approach to the sentence break rules in the root locale for the string “One. Two.” the following is resolved:
>
> The string pointer is here: “. Two.” Apply the following sentence break rules (partial)
>
> <!-- Break after sentence terminators, but include closing punctuation, trailing spaces, and any paragraph separator. [See note below.] Include closing punctuation, trailing spaces, and (optionally) a paragraph separator. -->
> <rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
> <!-- Note the fix to $Sp*, $Sep? -->
> <rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
> <rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>
>
> Rule 9 will match:
> "$SATerm $Close*" matches the “.”
> "( $Close | $Sp | $ParaSep )" matches the “ Two.”
>
> Since it matches, and is a `no break` match, rule processing finishes and the string pointer is advanced. Therefore there is never a sentence break. Removing rule 9 lets rule processing reach Rule 11, which matches and then breaks as expected.
>
> Am I incorrectly understanding the flow of rule evaluation?
>
> Thanks for the help as always,
>
> —Kip