From kipcole9 at gmail.com Fri Jul 10 02:29:49 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Fri, 10 Jul 2020 15:29:49 +0800
Subject: Evaluating segmentation rules question
Message-ID: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
I have been assuming that the general algorithm for evaluating segmentation rules is this:
1. At the current pointer in the subject string
2. Evaluate rules in order until a rule "passes" (i.e. matches)
3. If the rule matches, break or don't break depending on the operator of the rule (÷ or ×) and then move the string pointer forward
4. If the rule does not match, try the next rule
5. If no rule matches, apply the default rule of "Any ÷ Any" which will always match and break and then advance the string pointer
6. Repeat until the end of the string
However, when applying this approach to the sentence break rules in the root locale for the string "One. Two." the following is resolved:
The string pointer is here: ". Two." Apply the following sentence break rules (partial)
$SATerm $Close* × ( $Close | $Sp | $ParaSep )
$SATerm $Close* $Sp* × ( $Sp | $ParaSep )
$SATerm $Close* $Sp* $ParaSep? ÷
Rule 9 will match:
"$SATerm $Close*" matches the "."
"( $Close | $Sp | $ParaSep )" matches the " Two."
Since it matches, and is a `no break` match, rule processing finishes and the string pointer is advanced. Therefore there is never a sentence break. Removing rule 9 lets rule processing reach rule 11, which matches and then breaks as expected.
Am I incorrectly understanding the flow of rule evaluation?
Thanks for the help as always,
--Kip
From andy.heninger at gmail.com Fri Jul 10 15:41:02 2020
From: andy.heninger at gmail.com (Andy Heninger)
Date: Fri, 10 Jul 2020 13:41:02 -0700
Subject: Evaluating segmentation rules question
In-Reply-To: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:
Kip Cole writes:
> I have been assuming that the general algorithm for evaluating
> segmentation rules is this:
...
I think you are quite close, with just a couple of comments...
> 3. If the rule matches, break or don't break depending on the operator of
> the rule (÷ or ×) *and then move the string pointer forward*
...
> 6. Repeat until the end of the string
The algorithm tests any single arbitrary position in the string for being
or not being a boundary. If you want to apply it to every position of a
string, in sequence, that's fine, but it's not required.
> 5. If no rule matches, apply the default rule of "Any ÷ Any" which will
> always match and break and then advance the string pointer
For some types of boundaries, the default is "Any ÷ Any"; for others it
is "Any × Any". In any event, the default is always included explicitly in
the rules, so the algorithm itself doesn't need to mention it. If some set
of rules failed to include a default, that would be a bug in the rules.
For the specific question on the sentence break of "*One. Two.*"
> The string pointer is here: ". Two."
Between the first "." and the space there is no boundary, with the rules
applied as you described.
Between the space and the "*T*", rule SB11
SATerm Close* Sp* ParaSep? ÷
applies, causing a boundary. The space character binds to the preceding
sentence, not the following one.
It's often easier to look at the rules in Unicode UAX 29
than in CLDR.
The UAX rules usually match the root CLDR rules.
-- Andy
From kipcole9 at gmail.com Fri Jul 10 17:35:51 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Sat, 11 Jul 2020 06:35:51 +0800
Subject: Evaluating segmentation rules question
In-Reply-To:
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:
Andy, thanks for the assist, much appreciated.
> If some set of rules failed to include a default, that would be a bug in the rules.
The root rules for word boundaries end with:
[^$RI] ($RI $RI)* $RI × $RI
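For readers unfamiliar with that rule, here is a small Python sketch of what it expresses: do not break between two regional indicator (RI) characters when the run of RIs ending just before the position has odd length, so flag pairs stay together. This is only an illustration under my own assumptions, not the CLDR or ICU implementation. The function names are hypothetical, and counting the run backwards also covers the start-of-text variant of the rule.

```python
def is_regional_indicator(ch):
    # Regional Indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def ri_rule_forbids_break(text, pos):
    """True if "(RI RI)* RI × RI" matches at the position before text[pos]."""
    if not 0 < pos < len(text):
        return False
    if not (is_regional_indicator(text[pos - 1])
            and is_regional_indicator(text[pos])):
        return False
    # Count the run of RI characters that ends just before pos.
    run = 0
    i = pos - 1
    while i >= 0 and is_regional_indicator(text[i]):
        run += 1
        i -= 1
    # An odd run means text[pos-1] is the first half of a pair.
    return run % 2 == 1


# "a" followed by two flags: AU (U+1F1E6 U+1F1FA) and US (U+1F1FA U+1F1F8).
sample = "a\U0001F1E6\U0001F1FA\U0001F1FA\U0001F1F8"
print([ri_rule_forbids_break(sample, pos) for pos in range(1, len(sample))])
# [False, True, False, True]
```

Where the function returns False the rule simply does not match, and whatever rule or default follows it in the rule set decides that position.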
From andy.heninger at gmail.com Fri Jul 10 19:08:21 2020
From: andy.heninger at gmail.com (Andy Heninger)
Date: Fri, 10 Jul 2020 17:08:21 -0700
Subject: Evaluating segmentation rules question
In-Reply-To:
References: <36D48479-959C-46D8-8292-196B12980C5B@gmail.com>
Message-ID:
>
> The root rules for word boundaries ends with:
> [^$RI] ($RI $RI)* $RI × $RI
>
From kipcole9 at gmail.com Mon Jul 20 11:27:38 2020
From: kipcole9 at gmail.com (Kip Cole)
Date: Tue, 21 Jul 2020 00:27:38 +0800
Subject: Looking for transform rules for Any-Latin
Message-ID: <070637DE-622D-4DD1-BE83-C9819CB42CC8@gmail.com>
ICU includes a transliterator for `Any-Latin`, but for the life of me I cannot find its rules. It's not in the CLDR transforms directory that I can see, as either a forward or a backward transform, and it's not an implicit transform as best I can tell from TR35. I can't even find it by text-searching the repo, which suggests it's derived somehow, but I can't find how.
Any suggestions on where to look to find the rules defining the `Any-Latin` transform?
Many thanks, --Kip