Evaluating segmentation rules question

Kip Cole kipcole9 at gmail.com
Fri Jul 10 17:35:51 CDT 2020


Andy, thanks for the assist, much appreciated.

>  If some set of rules failed to include a default, that would be a bug in the rules.

The root rules for word boundaries end with:

    <rule id="16"> [^$RI] ($RI $RI)* $RI × $RI </rule>
    <!-- Otherwise, break everywhere (including around ideographs). -->

Which makes the final rule "Any ÷ Any" implicit, as I read the spec?
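
To make rule 16 concrete, here is a minimal Python sketch of that single rule followed by a fall-through to a break. It is illustrative only: the function names are hypothetical, every other word-break rule is ignored, and the start-of-text variant (rule 15 in UAX #29) is folded into the same check.

```python
def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def ri_break_allowed(text, pos):
    """True if a break is allowed between text[pos-1] and text[pos].

    Only the regional-indicator rule is modelled; anything it does not
    cover falls through to a break (the "Any ÷ Any" case).
    """
    if not 0 < pos < len(text):
        raise ValueError("pos must be strictly inside the string")
    if not is_regional_indicator(text[pos]):
        return True  # rule does not apply, fall through to a break
    # Count the unbroken run of regional indicators just before pos.
    run = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        run += 1
        i -= 1
    # ($RI $RI)* $RI × $RI: an odd-length run before pos means no break.
    return run % 2 == 0


flags = "\U0001F1FA\U0001F1F8\U0001F1EB\U0001F1F7"  # "US" then "FR"
print([ri_break_allowed(flags, p) for p in (1, 2, 3)])
```

Note that even this one rule has to look back over an arbitrarily long run of preceding characters to decide a single position.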

> Between the space and the "T", rule SB11 

I think I see the light now. At any point in a given subject string, the whole
string is considered during the match process. In my implementation I have
been discarding text as I “move” the pointer forward. Therefore, when the pointer
is at the “ Two.” point, I no longer have the prior “.” in scope and hence SB11
fails.
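
A minimal Python sketch of the corrected approach, testing one position against the whole string. Everything here is an assumption made for illustration: the names (`RULES`, `is_boundary`, `sentences`) are hypothetical, the character classes are crude ASCII stand-ins for the real Sentence_Break property values, and only rules 9, 10 and 11 plus the explicit "Any × Any" default for sentence breaks are included.

```python
import re

# ASCII stand-ins for the real property classes (assumption, not UAX #29 data).
SATERM = r"[.?!]"
CLOSE = r"[)\]']"
SP = r" "
PARASEP = r"\n"

# (rule id, pattern before the position, operator, pattern after the position)
RULES = [
    ("9", rf"{SATERM}{CLOSE}*", "×", rf"(?:{CLOSE}|{SP}|{PARASEP})"),
    ("10", rf"{SATERM}{CLOSE}*{SP}*", "×", rf"(?:{SP}|{PARASEP})"),
    ("11", rf"{SATERM}{CLOSE}*{SP}*{PARASEP}?", "÷", r""),
    ("998", r"", "×", r""),  # explicit default: Any × Any
]


def is_boundary(text, pos):
    """Test the single position between text[pos-1] and text[pos].

    The "before" pattern is matched against everything to the left of pos,
    so earlier characters stay in scope. Returns (breaks, rule_id).
    """
    before, after = text[:pos], text[pos:]
    for rule_id, left, op, right in RULES:
        if re.search(rf"(?:{left})\Z", before) and re.match(right, after):
            return op == "÷", rule_id
    raise AssertionError("rule set lacks a default rule")


def sentences(text):
    # Interior positions only; start and end of text always bound a segment.
    cuts = [p for p in range(1, len(text)) if is_boundary(text, p)[0]]
    edges = [0, *cuts, len(text)]
    return [text[a:b] for a, b in zip(edges, edges[1:])]


text = "One. Two."
print(is_boundary(text, 4))      # between "." and " "
print(is_boundary(text, 5))      # between " " and "T"
print(is_boundary(text[4:], 1))  # same spot after discarding "One."
print(sentences(text))
```

With the full string in scope, rule 9 prevents a break before the space and rule 11 produces the break before the "T". In the third call the text before the pointer has been thrown away, so only the default rule can match and the break is lost, which is the failure described above.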

Back to that part of the drawing board ….. 

Many thanks, —Kip



> On 11 Jul 2020, at 4:41 am, Andy Heninger <andy.heninger at gmail.com> wrote:
> 
> Kip Cole writes: 
> I have been assuming that the general algorithm for evaluating segmentation rules is this:
> ... 
> 
> I think you are quite close, with just a couple of comments...
>  
> 3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) and then move the string pointer forward
> ... 
> 6. Repeat until the end of the string
> 
> The algorithm tests any single arbitrary position in the string for being or not being a boundary. If you want to apply it to every position of a string, in sequence, that's fine, but it's not required.
> 
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
> 
> For some types of boundaries, the default is  "Any × Any"; for others it is "Any ÷ Any". In any event, the default is always included explicitly in the rules, so the algorithm itself doesn't need to mention it. If some set of rules failed to include a default, that would be a bug in the rules.
> 
> For the specific question on the sentence break of  “One. Two.”
> The string pointer is here:  “. Two.”
>  
> Between the first "." and the space there is no boundary, with the rules applied as you described.
> 
> Between the space and the "T", rule SB11 
> SATerm Close* Sp* ParaSep? ÷
> applies, causing a boundary. The space character binds to the preceding sentence, not the following one.
> 
> It's often easier to look at the rules in Unicode UAX 29 <https://unicode.org/reports/tr29/#Sentence_Boundary_Rules> than in CLDR. The UAX rules usually match the root CLDR rules.
> 
>   -- Andy
> 
> 
> 
> 
> 
> On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <cldr-users at unicode.org> wrote:
> I have been assuming that the general algorithm for evaluating segmentation rules is this:
> 
> 1. At the current pointer in the subject string
> 2. Evaluate rules in order until a rule “passes” (ie matches)
> 3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) and then move the string pointer forward
> 4. If the rule does not match, try the next rule
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
> 6. Repeat until the end of the string
> 
> However when applying this approach to the sentence break rules in the root locale for the string “One. Two.”  the following is resolved:
> 
> The string pointer is here:  “. Two.” Apply the following sentence break rules (partial)
> 
> <!-- Break after sentence terminators, but include closing punctuation, trailing spaces, and any paragraph separator. [See note below.] Include closing punctuation, trailing spaces, and (optionally) a paragraph separator. -->
> <rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule>
> <!-- Note the fix to $Sp*, $Sep? -->
> <rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule>
> <rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule>
> 
> Rule 9 will match:
>   "$SATerm $Close*" matches the “.”
>   "( $Close | $Sp | $ParaSep )" matches the “ Two.”
> 
> Since it matches, and is a `no break` match, rule processing finishes and the string pointer is advanced. Therefore there is never a sentence break. With rule 9 removed, processing reaches rule 11, which matches and then breaks as expected.
> 
> Am I incorrectly understanding the flow of rule evaluation?
> 
> Thanks for the help as always,
> 
> —Kip
> 
