Potential contradiction between the WordBreak test data and UAX #29

Daniel Bünzli daniel.buenzli at erratique.ch
Wed Nov 23 04:01:59 CST 2016


On Tuesday 22 November 2016 at 13:07, Tom Hacohen wrote:
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extended GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity 
> between them. 

I'd say this is not the right operational model. From [1]: 

"The rules are processed from top to bottom. As soon as a rule matches and produces a boundary status (boundary or no boundary) for that offset, the process is terminated."

So in this case, between COMBINING DIAERESIS and HEAVY BLACK HEART, rule WB4 kicks in. It does not produce a boundary status; it only changes your offset context to ZWJ GAZ, as you mention. You then continue applying the rules sequentially: WB6 does not match, WB7 does not match, and so on, until you reach WB999, which matches and produces a boundary status.

After WB4 you do not restart the matching process from the beginning. Restarting is what leads you to say that WB3c should apply.
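
To make the top-to-bottom model concrete, here is a minimal sketch of it in Python. It is only an illustration under loud assumptions: the function name boundary_status is hypothetical, only WB3c, WB4, WB5 and WB999 are modelled, the word-break property values are passed in directly rather than looked up, and the WB4 exceptions after CR, LF and Newline are left out.

```python
# Simplified sketch of "process the rules from top to bottom" for one offset.
# Assumption: only WB3c, WB4, WB5 and WB999 are modelled; everything else in
# UAX #29 is omitted, so this is not a conforming word breaker.

IGNORED = {"Extend", "Format", "ZWJ"}  # classes absorbed by WB4


def boundary_status(props, i):
    """Return (rule, is_break) for the offset between props[i-1] and props[i].

    props is a list of word-break property names, 0 < i < len(props).
    """
    left, right = props[i - 1], props[i]

    # WB3c: ZWJ x Glue_After_Zwj. It comes before WB4, so it is tested
    # against the raw context, not the rewritten one.
    if left == "ZWJ" and right == "Glue_After_Zwj":
        return ("WB3c", False)

    # WB4: X (Extend | Format | ZWJ)* -> X.
    # Side effect: never break before an ignored character.
    if right in IGNORED:
        return ("WB4", False)
    # Otherwise WB4 yields no status. It only rewrites the left context to X
    # by skipping back over ignored characters; we then fall through to the
    # rules below and do NOT go back to WB3c.
    j = i - 1
    while j > 0 and props[j] in IGNORED:
        j -= 1
    left = props[j]

    # WB5: ALetter x ALetter, evaluated on the rewritten context.
    if left == "ALetter" and right == "ALetter":
        return ("WB5", False)

    # WB999: Any / Any, break everywhere else.
    return ("WB999", True)


# ZWJ, COMBINING DIAERESIS (Extend), HEAVY BLACK HEART (Glue_After_Zwj)
sample = ["ZWJ", "Extend", "Glue_After_Zwj"]
print(boundary_status(sample, 1))  # ('WB4', False)
print(boundary_status(sample, 2))  # ('WB999', True)
```

With this ordering, WB3c only fires when ZWJ is immediately followed by GAZ, e.g. boundary_status(["ZWJ", "Glue_After_Zwj"], 1) gives ('WB3c', False).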

Best, 

Daniel


[1] http://www.unicode.org/reports/tr29/#Notation



