Potential contradiction between the WordBreak test data and UAX #29
tom at osg.samsung.com
Tue Nov 22 06:07:16 CST 2016
I recently updated libunibreak to Unicode 9.0.0. I thought I had
implemented it correctly; however, it fails two of the tests in the
reference test data:
÷ 200D × 0308 ÷ 2764 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
(Glue_After_Zwj) ÷ [0.3]
÷ 200D × 0308 ÷ 1F466 ÷ # ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
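For anyone who wants to check these lines by hand, here is a minimal Python sketch of how such a test line can be read. The function name parse_break_test is made up for illustration; it is not part of libunibreak or of the Unicode tooling. It assumes the usual layout of the test file: break markers (÷ for break, × for no break) alternating with hexadecimal code points, followed by a # comment.

```python
def parse_break_test(line):
    """Split a WordBreakTest-style line into code points and break flags.

    Returns (code_points, breaks). breaks[i] is True if a break is allowed
    before code_points[i]; the final entry covers the end of text.
    """
    data = line.split("#", 1)[0]  # drop the trailing comment
    tokens = data.split()
    # Tokens alternate: marker, code point, marker, code point, ..., marker.
    code_points = [int(t, 16) for t in tokens[1::2]]
    breaks = [t == "\u00f7" for t in tokens[0::2]]  # U+00F7 is the break marker
    return code_points, breaks


cps, brk = parse_break_test("÷ 200D × 0308 ÷ 2764 ÷ # comment omitted")
```

For the first failing test this gives the code points 200D, 0308, 2764 and the flags break, no break, break, break, i.e. the test data expects a break between 0308 and 2764.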
More specifically, both tests fail after the COMBINING DIAERESIS. My
implementation does not mark a break there, whereas the test data does
(÷ [999.0]). The reference implementation, as expected, agrees with the
test data.
However, comparing the test cases with the UAX, this does not look
correct. Rule 4 tells us to ignore the Extend character, so:
ZWJ Extend GAZ -> ZWJ GAZ
Then, according to rule 3c, there should be no break opportunity
between the ZWJ and the Glue_After_Zwj character (rule 3c covers EBG as
well, so the same applies to the second test). The reference
implementation, however, applies rule 999 here, which I believe is
incorrect.
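To make the two readings concrete, here is a toy Python sketch. It is not libunibreak code and not the reference implementation; the function name and the collapse_before_3c flag are made up for illustration, and only rules 3c, 4 (for Extend only) and 999 are modelled.

```python
# Word_Break classes that appear in the two failing tests.
ZWJ, EXTEND, GAZ, EBG = "ZWJ", "Extend", "Glue_After_Zwj", "E_Base_GAZ"


def break_before_last(classes, collapse_before_3c):
    """Return True if a break is allowed before the last class in `classes`.

    collapse_before_3c=True  : rule 4 strips trailing Extend before rule 3c
                               is checked (the reading argued above).
    collapse_before_3c=False : rule 3c only sees the immediately preceding
                               character.
    """
    left = list(classes[:-1])
    right = classes[-1]
    if collapse_before_3c:
        # Rule 4 (simplified): X Extend* -> X. Only Extend is stripped here.
        while len(left) > 1 and left[-1] == EXTEND:
            left.pop()
    # Rule 3c: ZWJ x (Glue_After_Zwj | EBG)
    if left and left[-1] == ZWJ and right in (GAZ, EBG):
        return False
    # Rule 999: otherwise, break everywhere.
    return True


with_collapse = break_before_last([ZWJ, EXTEND, GAZ], collapse_before_3c=True)
without_collapse = break_before_last([ZWJ, EXTEND, GAZ], collapse_before_3c=False)
```

With collapse_before_3c=True the sequence ZWJ, Extend, Glue_After_Zwj gives no break, which is the reading argued above. With collapse_before_3c=False it gives a break via rule 999, which matches the test data.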
Am I missing anything, or is this an issue with the reference test data
and reference implementation?