From kipcole9 at gmail.com Fri Jul 10 02:29:49 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Fri, 10 Jul 2020 15:29:49 +0800
Subject: Evaluating segmentation rules question
Message-ID: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>

I have been assuming that the general algorithm for evaluating segmentation rules is this:

1. Start at the current pointer in the subject string
2. Evaluate rules in order until a rule “passes” (i.e. matches)
3. If the rule matches, break or don’t break depending on the operator of the rule (one of ÷ or ×) and then move the string pointer forward
4. If the rule does not match, try the next rule
5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
6. Repeat until the end of the string

However, when applying this approach to the sentence break rules in the root locale for the string “One. Two.” the following is resolved:

The string pointer is here: “. Two.” Apply the following sentence break rules (partial):

    $SATerm $Close* × ( $Close | $Sp | $ParaSep )
    $SATerm $Close* $Sp* × ( $Sp | $ParaSep )
    $SATerm $Close* $Sp* $ParaSep? ÷

Rule 9 will match:
"$SATerm $Close*" matches the “.”
"( $Close | $Sp | $ParaSep )" matches the “ Two.”

Since it matches, and is a `no break` match, rule processing finishes and the string pointer is advanced. Therefore there is never a sentence break. Removing rule 9 lets rule processing reach Rule 11, which matches and then breaks as expected.

Am I incorrectly understanding the flow of rule evaluation?

Thanks for the help as always,

-- Kip

From andy.heninger at gmail.com Fri Jul 10 15:41:02 2020
From: andy.heninger at gmail.com (Andy Heninger)
Date: Fri, 10 Jul 2020 13:41:02 -0700
Subject: Evaluating segmentation rules question
In-Reply-To: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:

Kip Cole writes:

> I have been assuming that the general algorithm for evaluating
> segmentation rules is this: ...

I think you are quite close, with just a couple of comments...

> 3. If the rule matches, break or don’t break depending on the operator of
> the rule (one of ÷ or ×) *and then move the string pointer forward*
...
> 6. Repeat until the end of the string

The algorithm tests any single arbitrary position in the string for being or not being a boundary. If you want to apply it to every position of a string, in sequence, that's fine, but it's not required.

> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer

For some types of boundaries, the default is "Any ÷ Any"; for others it is "Any × Any". In any event, the default is always included explicitly in the rules, so the algorithm itself doesn't need to mention it. If some set of rules failed to include a default, that would be a bug in the rules.

For the specific question on the sentence break of “One. Two.”

> The string pointer is here: “. Two.”

Between the first "." and the space there is no boundary, with the rules applied as you described. Between the space and the "T", rule SB11 (SATerm Close* Sp* ParaSep? ÷) applies, causing a boundary. The space character binds to the preceding sentence, not the following one.

It's often easier to look at the rules in Unicode UAX 29 than in CLDR. The UAX rules usually match the root CLDR rules.

-- Andy
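
A minimal Python sketch of the point-wise evaluation described in the reply above. This is not ICU's or CLDR's implementation: the character classes are simplified ASCII stand-ins for the real Unicode properties, only SB9, SB10, SB11 and the explicit default SB998 (Any × Any) are included, and the names `RULES` and `is_boundary` are hypothetical. Each rule is a pair of patterns that must match the text ending at, and the text starting at, the position being tested. The first rule that matches decides the outcome.

```python
import re

# Simplified, hypothetical stand-ins for the Unicode property classes.
SATERM = r"[.!?]"
CLOSE = r'[)\]"]'
SP = r"[ \t]"
PARASEP = r"[\n\r]"
ANY = r"[\s\S]"

# (name, pattern before the position, pattern after the position, is_break)
RULES = [
    ("SB9", rf"{SATERM}{CLOSE}*", rf"(?:{CLOSE}|{SP}|{PARASEP})", False),
    ("SB10", rf"{SATERM}{CLOSE}*{SP}*", rf"(?:{SP}|{PARASEP})", False),
    ("SB11", rf"{SATERM}{CLOSE}*{SP}*{PARASEP}?", r"", True),
    ("SB998", ANY, ANY, False),  # explicit default: Any × Any
]


def is_boundary(text, pos):
    """Test one interior position (0 < pos < len(text)) in isolation.

    Returns (rule_name, is_break) for the first rule whose left side
    matches the text ending at pos and whose right side matches the
    text starting at pos. No pointer is advanced between tests.
    """
    if not 0 < pos < len(text):
        raise ValueError("this sketch only handles interior positions")
    before, after = text[:pos], text[pos:]
    for name, left, right, is_break in RULES:
        if re.search(rf"(?:{left})\Z", before) and re.match(right, after):
            return name, is_break
    raise ValueError("rule set lacks a default rule")


text = "One. Two."
print(is_boundary(text, 4))  # between "." and " ": ('SB9', False)
print(is_boundary(text, 5))  # between " " and "T": ('SB11', True)
print([p for p in range(1, len(text)) if is_boundary(text, p)[1]])  # [5]
```

Because every position is tested independently, SB9 suppressing a break after the full stop does not prevent SB11 from applying one position later. The full sentence break rules have further rules ahead of SB9 that this sketch omits.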

From kipcole9 at gmail.com Fri Jul 10 17:35:51 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Sat, 11 Jul 2020 06:35:51 +0800
Subject: Evaluating segmentation rules question
In-Reply-To:
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:

Andy, thanks for the assist, much appreciated.

> If some set of rules failed to include a default, that would be a bug in the rules.

The root rules for word boundaries end with:

    [^$RI] ($RI $RI)* $RI × $RI

From andy.heninger at gmail.com Fri Jul 10 19:08:21 2020
From: andy.heninger at gmail.com (Andy Heninger)
Date: Fri, 10 Jul 2020 17:08:21 -0700
Subject: Evaluating segmentation rules question
In-Reply-To:
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:

> The root rules for word boundaries end with:
> [^$RI] ($RI $RI)* $RI × $RI

From kipcole9 at gmail.com Mon Jul 20 11:27:38 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Tue, 21 Jul 2020 00:27:38 +0800
Subject: Looking for transform rules for Any-Latin
Message-ID: <070637DE-622D-4DD1-BE83-C9819CB42CC8@gmail.com>

ICU includes a transliterator for `Any-Latin` but for the life of me I cannot find its rules. It’s not in the CLDR transforms directory that I can see, as either a forward or a backward transform, and it’s not an implicit transform as best I can tell from TR35. I can’t even find it by text-searching the repo, which suggests it’s derived somehow, but I can’t work out how.

Any suggestions on where to look to find the rules defining the `Any-Latin` transform?

Many thanks,

-- Kip