<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Andy, thanks for the assist, much appreciated.<div class=""><br class=""></div><div class="">> If some set of rules failed to include a default, that would be a bug in the rules.</div><div class=""><br class=""></div><div class="">The root rules for word boundaries end with:</div><div class=""><br class=""></div><div class=""><div class="">    <rule id="16"> [^$RI] ($RI $RI)* $RI × $RI </rule></div><div class="">    <!-- Otherwise, break everywhere (including around ideographs). —></div><div class=""><br class=""></div><div class="">Which makes the final rule "Any ÷ Any" implicit, as I read the spec?</div><div class=""><br class=""></div><div class="">> Between the space and the "<b class="">T</b>", rule SB11 </div><div class=""><br class=""></div><div class="">I think I see the light now. At any point in a given subject string, the whole</div><div class="">string is considered during the match process. In my implementation I have</div><div class="">been discarding text as I “move” the pointer forward. Therefore, when the pointer</div><div class="">is at the “ Two.” point, I no longer have the prior “.” in scope and hence SB11</div><div class="">fails. </div><div class=""><br class=""></div><div class="">Back to that part of the drawing board … 
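<div class=""><br class=""></div><div class="">For illustration, here is a minimal Python sketch of testing a single position with the whole string in scope, as described above. It is only a sketch under assumptions: the character classes are simplified stand-ins for the real UAX 29 properties, only rules 9, 10 and 11 plus an explicit "Any × Any" default are modelled, and the names rule_at and sentence_breaks are hypothetical, not part of CLDR or ICU.</div>
<pre class="">
```python
import re

# Simplified stand-ins for the UAX 29 properties (assumptions, not the real sets).
SATERM = r"[.!?]"
CLOSE = r"[)\]'\"]"
SP = r"[ ]"
PARASEP = r"[\n]"

# Each rule: (id, pattern for the text before the position,
#             pattern for the text after the position, operator).
# The "before" patterns end in \Z so they must match right up to the position.
RULES = [
    ("SB9", SATERM + CLOSE + r"*\Z", "(?:" + CLOSE + "|" + SP + "|" + PARASEP + ")", "×"),
    ("SB10", SATERM + CLOSE + "*" + SP + r"*\Z", "(?:" + SP + "|" + PARASEP + ")", "×"),
    ("SB11", SATERM + CLOSE + "*" + SP + "*" + PARASEP + r"?\Z", "", "÷"),
    ("SB998", "", "", "×"),  # explicit default for sentence breaks: Any × Any
]


def rule_at(text, pos):
    """Return (rule id, operator) of the first rule matching at pos."""
    # The whole string stays in scope: everything before pos is still visible.
    before, after = text[:pos], text[pos:]
    for rule_id, left, right, op in RULES:
        if re.search(left, before) and re.match(right, after):
            return rule_id, op
    raise ValueError("rule set lacks a default rule")


def sentence_breaks(text):
    """Test every interior position independently and collect the breaks."""
    return [pos for pos in range(1, len(text)) if rule_at(text, pos)[1] == "÷"]


print(sentence_breaks("One. Two."))   # positions where a break is found
print(rule_at("One. Two.", 5))        # between the space and the "T"
print(rule_at(" Two.", 1))            # same spot, but with the prior text discarded
```
</pre>
<div class="">With the full string available, the position between the space and the "T" is decided by SB11. When the text before the pointer has been discarded, the same position falls through to the default, which is the failure described above.</div>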
</div><div class=""><br class=""></div><div class="">Many thanks, —Kip</div><div class=""><br class=""></div><div class=""><br class=""></div><div><br class=""><blockquote type="cite" class=""><div class="">On 11 Jul 2020, at 4:41 am, Andy Heninger <<a href="mailto:andy.heninger@gmail.com" class="">andy.heninger@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="">Kip Cole writes: </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I have been assuming that the general algorithm for evaluating segmentation rules is this:</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">... </blockquote><div class=""><br class=""></div><div class="">I think you are quite close, with just a couple of comments...</div><div class=""> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) <b class="">and then move the string pointer forward</b></blockquote><div class="">... </div><div class=""><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">6. Repeat until the end of the string</blockquote><div class=""><br class=""></div></div><div class="">The algorithm tests any single arbitrary position in the string for being or not being a boundary. If you want to apply it to every position of a string, in sequence, that's fine, but it's not required.</div><div class=""><br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">5. 
If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer</blockquote><div class=""><br class=""></div><div class="">For some types of boundaries, the default is "Any × Any"; for others it is "Any ÷ Any". In any event, the default is always included explicitly in the rules, so the algorithm itself doesn't need to mention it. If some set of rules failed to include a default, that would be a bug in the rules.<br class=""></div><div class=""><br class=""></div><div class="">For the specific question on the sentence break of “<b class="">One. Two.</b>” </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The string pointer is here: “. Two.”</blockquote><div class=""> </div><div class="">Between the first "." and the space there is no boundary, with the rules applied as you described.</div><div class=""><br class=""></div><div class="">Between the space and the "<b class="">T</b>", rule SB11 </div><blockquote style="margin:0 0 0 40px;border:none;padding:0px" class=""><div class=""><b class="">SATerm Close* Sp* ParaSep? ÷</b></div></blockquote><div class="">applies, causing a boundary. The space character binds to the preceding sentence, not the following one.</div><div class=""><br class=""></div><div class="">It's often easier to look at the rules in Unicode <a href="https://unicode.org/reports/tr29/#Sentence_Boundary_Rules" class="">UAX 29</a> than in CLDR. 
The UAX rules usually match the root CLDR rules.</div><div class=""><br class=""></div><div class="">  -- Andy</div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div><div class=""><br class=""></div></div><br class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 10, 2020 at 12:31 AM Kip Cole via CLDR-Users <<a href="mailto:cldr-users@unicode.org" class="">cldr-users@unicode.org</a>> wrote:<br class=""></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;" class="">I have been assuming that the general algorithm for evaluating segmentation rules is this:<div class=""><br class=""></div><div class="">1. At the current pointer in the subject string</div><div class="">2. Evaluate rules in order until a rule “passes” (i.e. matches)</div><div class="">3. If the rule matches, break or don’t break depending on the operator of the rule (one of “×÷”) and then move the string pointer forward</div><div class="">4. If the rule does not match, try the next rule</div><div class="">5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer</div><div class="">6. Repeat until the end of the string</div><div class=""><br class=""></div><div class="">However, when I apply this approach to the sentence break rules in the root locale for the string “One. Two.”, I get the following:</div><div class=""><br class=""></div><div class="">The string pointer is here: “. Two.” Apply the following sentence break rules (partial):</div><div class=""><br class=""></div><div class=""><div class=""><!-- Break after sentence terminators, but include closing punctuation, trailing spaces, and any paragraph separator. [See note below.] Include closing punctuation, trailing spaces, and (optionally) a paragraph separator. 
--></div><div class=""><rule id="9"> $SATerm $Close* × ( $Close | $Sp | $ParaSep ) </rule></div><div class=""><!-- Note the fix to $Sp*, $Sep? --></div><div class=""><rule id="10"> $SATerm $Close* $Sp* × ( $Sp | $ParaSep ) </rule></div><div class=""><rule id="11"> $SATerm $Close* $Sp* $ParaSep? ÷ </rule></div></div><div class=""><br class=""></div><div class="">Rule 9 will match:</div><div class="">  "<font class=""><span class="">$SATerm $Close*" matches the “.”</span></font></div><div class=""><font class=""><span class="">  </span></font><span style="" class="">  "</span><font class=""><span class="">( $Close | $Sp | $ParaSep )" matches the “ Two.”</span></font></div><div class=""><font class=""><span class=""><br class=""></span></font></div><div class=""><font class="">Since it matches and is a `no break` match, rule processing finishes and the string pointer is advanced, so there is never a sentence break. Removing rule 9 lets rule processing reach rule 11, which matches and then breaks as expected.</font></div><div class=""><font class=""><br class=""></font></div><div class=""><font class="">Am I misunderstanding the flow of rule evaluation?</font></div><div class=""><font class=""><br class=""></font></div><div class=""><font class="">Thanks for the help as always,</font></div><div class=""><font class=""><br class=""></font></div><div class=""><font class="">—Kip</font></div><div class=""><font class=""><br class=""></font></div></div>_______________________________________________<br class="">
CLDR-Users mailing list<br class="">
<a href="mailto:CLDR-Users@corp.unicode.org" target="_blank" class="">CLDR-Users@corp.unicode.org</a><br class="">
<a href="https://corp.unicode.org/mailman/listinfo/cldr-users" rel="noreferrer" target="_blank" class="">https://corp.unicode.org/mailman/listinfo/cldr-users</a><br class="">
</blockquote></div>
</div></blockquote></div><br class=""></div></body></html>