Potential contradiction between the WordBreak test data and UAX #29
Tom Hacohen
tom at osg.samsung.com
Wed Nov 23 05:28:41 CST 2016
On 23/11/16 11:20, Philippe Verdy wrote:
> 2016-11-23 12:00 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:
>
>
> Also take another look at
> http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
> specifically the table that shows another way of writing the ignore
> rule. This again shows my understanding of rule 4 is correct.
>
> In particular, look at the following equivalence:
> X Y × Z W ⇒ X (Extend | Format)* Y (Extend | Format)* ×
> Z (Extend | Format)* W
>
>
> This expansion does not occur before rule WB4; it cannot be used to
> transform rules WB1 to WB3c; this is explicitly stated in the algorithm.
> And because rule WB3c handles your case, you are misinterpreting the
> spec as if the expansion applied there too.
>
I took a look at the ICU sources, and they explicitly mention this case,
so it seems I was mistaken in interpreting the intention of the UAX. I
still find it confusing, but based on this thread, it seems to be just me.
Sorry for the noise.
The comment from the ICU source code:
# Rule 3c ZWJ x (Extended_Pict | EmojiNRK). Precedes WB4, so no
# intervening Extend chars allowed.
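To make the difference concrete, here is a minimal Python sketch of the two readings of rule 3c. It is only an illustration, not the ICU or UAX #29 implementation: PICT_SAMPLE is a hypothetical stand-in for the (Extended_Pict | EmojiNRK) classes, is_extend_or_format is a rough approximation of (Extend | Format) based on general categories, and both function names are made up for this example.

```python
import unicodedata

ZWJ = "\u200d"

# Hypothetical stand-in for the (Extended_Pict | EmojiNRK) classes; the real
# sets come from the Unicode data files, not from this small sample.
PICT_SAMPLE = {"\u2764", "\U0001F468", "\U0001F469"}


def is_extend_or_format(ch):
    """Rough approximation of (Extend | Format): marks and Cf, except ZWJ."""
    if ch == ZWJ:
        return False
    return unicodedata.category(ch) in {"Mn", "Me", "Cf"}


def wb3c_literal(text, i):
    """Rule 3c as written: ZWJ must sit immediately before the pictograph."""
    return 0 < i < len(text) and text[i - 1] == ZWJ and text[i] in PICT_SAMPLE


def wb3c_expanded(text, i):
    """The mistaken reading: skip Extend/Format between ZWJ and pictograph."""
    if not (0 < i < len(text)) or text[i] not in PICT_SAMPLE:
        return False
    j = i - 1
    while j >= 0 and is_extend_or_format(text[j]):
        j -= 1
    return j >= 0 and text[j] == ZWJ


direct = "\U0001F468\u200d\U0001F469"             # pict, ZWJ, pict
interrupted = "\U0001F468\u200d\u0301\U0001F469"  # pict, ZWJ, Extend, pict
```

With these inputs, both functions match at index 2 of `direct`, but only the expanded reading matches at index 3 of `interrupted`. The literal reading, which is what the ICU comment describes, does not let rule 3c suppress the break once an Extend character sits between the ZWJ and the pictograph.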
Thanks for your help,
Tom