Potential contradiction between the WordBreak test data and UAX #29

Tom Hacohen tom at osg.samsung.com
Wed Nov 23 05:28:41 CST 2016

On 23/11/16 11:20, Philippe Verdy wrote:
> 2016-11-23 12:00 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:
>     Also take another look at
>     http://www.unicode.org/reports/tr29/#Grapheme_Cluster_and_Format_Rules
>     specifically the table that shows another way of writing the ignore
>     rule. This again shows my understanding of rule 4 is correct.
>     Look in particular at the following equivalence:
>     X Y × Z W       ⇒       X (Extend | Format)* Y (Extend | Format)* ×
>     Z (Extend | Format)* W
> This expansion does not occur before rule WB4; it cannot be used to
> transform rules WB1 to WB3c; this is explicitly stated in the algorithm.
> And because rule WB3c handles your case, you are misreading the spec
> as if the expansion applied there too...

I took a look at the ICU sources, and they explicitly mention this case,
so it seems I was mistaken in my interpretation of the UAX's intent. I
still find it confusing, but based on this thread, it seems to be just me.

Sorry for the noise.

The comment from the ICU source code:
# Rule 3c   ZWJ x (Extended_Pict | EmojiNRK).  Precedes WB4, so no intervening Extend chars allowed.
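
To make the difference between the two readings concrete, here is a rough Python sketch. It is not ICU's code and not a word-break implementation; it only checks whether rule 3c would match at a given offset. The names TOY_PICT, is_extend_like, wb3c_literal and wb3c_expanded are made up for illustration, the pictographic set is a tiny stand-in for the real property data, and the Extend/Format test is an approximation based on general categories.

```python
import unicodedata

ZWJ = "\u200d"

# Hypothetical stand-in for the pictographic class in rule 3c.
# Real implementations read this from the Unicode property data files.
TOY_PICT = {"\U0001F466", "\u2764"}


def is_extend_like(ch):
    # Rough approximation of (Extend | Format): nonspacing marks,
    # enclosing marks and format characters, with ZWJ kept separate.
    return ch != ZWJ and unicodedata.category(ch) in ("Mn", "Me", "Cf")


def wb3c_literal(text, i):
    """Rule 3c read literally: ZWJ must be immediately before offset i."""
    return 0 < i < len(text) and text[i - 1] == ZWJ and text[i] in TOY_PICT


def wb3c_expanded(text, i):
    """The misreading: skip Extend/Format first, as if WB4 already applied."""
    if not (0 < i < len(text)) or text[i] not in TOY_PICT:
        return False
    j = i - 1
    while j >= 0 and is_extend_like(text[j]):
        j -= 1
    return j >= 0 and text[j] == ZWJ


# ZWJ directly followed by a pictograph: both readings match.
adjacent = "\U0001F468\u200d\U0001F466"
# ZWJ, then a combining acute accent, then the pictograph:
# only the expanded (incorrect) reading matches.
intervening = "\U0001F468\u200d\u0301\U0001F466"

print(wb3c_literal(adjacent, 2), wb3c_expanded(adjacent, 2))
print(wb3c_literal(intervening, 3), wb3c_expanded(intervening, 3))
```

The second case is the one I got wrong: with an Extend character between the ZWJ and the pictograph, the literal reading of rule 3c does not match, which is what the ICU comment says.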

Thanks for your help,
