Potential contradiction between the WordBreak test data and UAX #29

Tom Hacohen tom at osg.samsung.com
Tue Nov 22 06:07:16 CST 2016


I recently updated libunibreak[1] to Unicode 9.0.0. I thought I had 
implemented it correctly, but it fails two of the tests in the 
reference test data:

÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] 
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART 
(Glue_After_Zwj) ÷ [0.3]


÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0] 
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]

More specifically, both fail at the position right after the "combining 
diaeresis". The test data marks a break there (÷), whereas my 
implementation does not. The reference implementation, as expected, 
agrees with the test data.
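
For reference, the boundary markers in a test line can be read 
mechanically. The following is a minimal Python sketch, not code from 
libunibreak or the reference implementation; parse_break_test is a 
hypothetical helper name. It assumes the usual layout of the test data, 
where ÷ marks a break opportunity, × marks no break, and everything 
after # is a comment.

```python
def parse_break_test(line):
    """Split a break-test line into code points and break flags.

    Hypothetical helper for illustration only.
    """
    # Drop the trailing comment, then split on whitespace.
    tokens = line.split("#", 1)[0].split()
    # Tokens alternate: marker, code point, marker, ..., marker.
    markers = tokens[0::2]
    code_points = [int(tok, 16) for tok in tokens[1::2]]
    # breaks[i] is True when a break is allowed before code_points[i];
    # the last entry is the boundary at end of text.
    breaks = [m == "\u00f7" for m in markers]  # U+00F7 DIVISION SIGN
    return code_points, breaks


cps, breaks = parse_break_test(
    "\u00f7 200D \u00d7 0308 \u00f7 2764 \u00f7 # comment"
)
print([hex(cp) for cp in cps])  # ['0x200d', '0x308', '0x2764']
print(breaks)                   # [True, False, True, True]
```

Here breaks[2] is the boundary between U+0308 and U+2764, the position 
in question.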

However, looking at the test case and the UAX[2], this does not look 
correct. Rule 4 (WB4) makes the Extend character transparent, so
ZWJ Extend GAZ -> ZWJ GAZ
(GAZ being Glue_After_Zwj), and then, according to rule 3c (WB3c), 
there should be no break opportunity between them. The same reasoning 
applies to EBG in the second test. The reference implementation, 
however, uses rule 999 here, which I believe is incorrect.
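
To make this reading concrete, here is a minimal Python sketch of it. 
It is only an illustration under stated assumptions: the property table 
is hand-written and covers just the four code points in the two tests, 
the function names are hypothetical, and the special cases of rule 4 
(start of text, CR, LF, Newline) are left out. It does not claim to be 
how libunibreak or the reference implementation works.

```python
# Hand-written, hypothetical table for the four code points in the tests.
# A real implementation would load WordBreakProperty.txt instead.
WORD_BREAK = {
    0x200D: "ZWJ",
    0x0308: "Extend",
    0x2764: "Glue_After_Zwj",
    0x1F466: "EBG",
}

IGNORED = {"Extend", "Format", "ZWJ"}


def collapse_wb4(props):
    """Rule 4 as read in this message: X (Extend | Format | ZWJ)* -> X."""
    collapsed = []
    for prop in props:
        if collapsed and prop in IGNORED:
            continue  # absorbed into the preceding character X
        collapsed.append(prop)
    return collapsed


def wb3c_forbids_break(left, right):
    """Rule 3c: ZWJ x (Glue_After_Zwj | EBG)."""
    return left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}


for last in (0x2764, 0x1F466):
    props = [WORD_BREAK[cp] for cp in (0x200D, 0x0308, last)]
    collapsed = collapse_wb4(props)
    print(collapsed, wb3c_forbids_break(collapsed[0], collapsed[1]))
```

Under this reading both sequences collapse to a ZWJ followed directly 
by Glue_After_Zwj or EBG, so rule 3c would apply before rule 999 is 
ever reached.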

Am I missing anything, or is this an issue with the reference test data 
and reference implementation?


[1]: https://github.com/adah1972/libunibreak
[2]: http://www.unicode.org/reports/tr29/#WB1
