Potential contradiction between the WordBreak test data and UAX #29

Philippe Verdy verdy_p at wanadoo.fr
Tue Nov 22 20:49:08 CST 2016

IMHO, following your examples, the ZWJ should glue with the last symbol.
But the combining diaeresis following the ZWJ extends it (even if in my
opinion it is "defective" and would likely display on a dotted circle in
renderers, it is not defective under the string definition of combining
sequences). So ignore it and test whether the last symbol glues with the ZWJ
(it should, so there is no break in the reference implementation).

WB4: X (Extend | Format | ZWJ)*→X

Extend: Grapheme_Extend = Yes. This includes:
  General_Category = Nonspacing_Mark (this includes the combining diaeresis)
  General_Category = Enclosing_Mark
  plus a few General_Category = Spacing_Mark needed for canonical equivalence
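
To double-check the General_Category values of the characters in the two test cases, a short Python sketch using the standard unicodedata module may help. Note that unicodedata does not expose Word_Break or Grapheme_Extend, so this only confirms the General_Category part of the definition; the helper name general_category is made up for this example.

```python
import unicodedata


def general_category(cp: int) -> str:
    """Return the General_Category of a code point (hypothetical helper)."""
    return unicodedata.category(chr(cp))


# The code points that appear in the two failing test cases.
for cp in (0x200D, 0x0308, 0x2764, 0x1F466):
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}: {general_category(cp)}")
```

U+0308 reports Mn (Nonspacing_Mark), which is why it falls under Extend.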

So yes, rule WB4 would eliminate the combining mark from the input:

ZWJ "COMBINING DIAERESIS" (EBG | Glue_After_Zwj) → ZWJ (EBG | Glue_After_Zwj)

But rule WB3c comes before and prohibits it:

WB3c: ZWJ × (Glue_After_Zwj | EBG)

This means that you have first:


and this does not match rule WB4, which does not apply to:

X × (Extend | Format | ZWJ)*→X

(WB4 cannot remove the extenders if there is a no-break before them; it is
valid only while the break opportunity is still unspecified.) As soon as a
rule has produced a "break here" or "no break here" at a given position, you
must advance past this position (the rules are based on a small finite
state machine). So after:


it just remains in your input queue:

"COMBINING DIAERESIS" EBG  (because "ZWJ ×" is already processed, and so ZWJ
is eliminated)

Now comes WB4: X (Extend | Format | ZWJ)* → X

There is no longer any "X" to match before the combining diaeresis: your input
queue starts with the combining diaeresis, which matches "X". The following
character (EBG) does not match "(Extend | Format | ZWJ)*" (which here
matches the empty string and does not contain the combining diaeresis already
matched by "X"), so rule WB4 has no replacement effect and preserves the
initial "X" (i.e. the combining diaeresis).
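
The question is whether WB3c looks at the raw neighbouring characters or at the sequence after WB4 has collapsed the extenders. As a rough illustration only, here is a toy Python sketch that models just WB3c, WB4 and WB999 on a list of Word_Break class names. The function names and the wb4_first switch are invented for this example; this is not the reference implementation, nor libunibreak's.

```python
# Toy model: only WB3c, WB4 and WB999 are considered; classes are plain strings.
SKIPPABLE = {"Extend", "Format", "ZWJ"}      # right-hand side of WB4
GLUE_TARGETS = {"Glue_After_Zwj", "EBG"}     # right-hand side of WB3c


def break_before(classes, i, wb4_first):
    """Return '×' (no break) or '÷' (break) between classes[i-1] and classes[i]."""
    right = classes[i]
    # WB4: X (Extend | Format | ZWJ)* stays together, so never break
    # before one of these characters.
    if right in SKIPPABLE:
        return "×"
    left = classes[i - 1]
    if wb4_first:
        # Assumed alternative reading: drop Extend/Format first, so that
        # "ZWJ Extend GAZ" is seen by WB3c as "ZWJ GAZ".
        j = i - 1
        while j > 0 and classes[j] in {"Extend", "Format"}:
            j -= 1
        left = classes[j]
    # WB3c: ZWJ × (Glue_After_Zwj | EBG)
    if left == "ZWJ" and right in GLUE_TARGETS:
        return "×"
    # WB999: otherwise break.
    return "÷"


def boundaries(classes, wb4_first):
    """Decisions for every interior position of the sequence."""
    return [break_before(classes, i, wb4_first) for i in range(1, len(classes))]


seq = ["ZWJ", "Extend", "EBG"]               # U+200D U+0308 U+1F466
print(boundaries(seq, wb4_first=False))      # WB3c sees the raw neighbours
print(boundaries(seq, wb4_first=True))       # WB3c sees the collapsed sequence
```

With wb4_first=False the sketch breaks before EBG, as the quoted test data does ("÷ [999.0]"); with wb4_first=True it does not, which is the reading described in the quoted message.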


2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:

> Dear,
> I recently updated libunibreak[1] according to unicode 9.0.0. I thought I
> implemented it correctly, however it fails against two of the tests in the
> reference test data:
> ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
> (Glue_After_Zwj) ÷ [0.3]
> and
> ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
> More specifically, it fails in both after the "combining diaeresis". My
> implementation does not mark it as a break, whereas the test data does. The
> reference implementation, as expected, agrees with the test data.
> However, looking at the test case and the UAX[2], this does not look
> correct. More specifically, because of rule 4:
> ZWJ Extend GAZ -> ZWJ GAZ
> And then according to rule 3c, there should be no break opportunity
> between them. The reference implementation, however, uses rule 999 here,
> which I believe is incorrect.
> Am I missing anything, or is this an issue with the reference test data
> and reference implementation?
> Thanks,
> Tom.
> [1]: https://github.com/adah1972/libunibreak
> [2]: http://www.unicode.org/reports/tr29/#WB1