Potential contradiction between the WordBreak test data and UAX #29

Tom Hacohen tom at osg.samsung.com
Wed Nov 23 03:13:28 CST 2016


You said:
 > So ignore it and test whether the last symbol glues with ZWJ (it should,
 > so there's no break in the reference implementation).

Which makes me think you misread the example I quoted. There is a break 
in the reference implementation, though I argue (like you just did) that 
there shouldn't be. So I think you agree with me and also think it's broken.

Otherwise, I'm not sure I fully understand what you are saying, but if 
what you are saying is correct, then following the same logic, other 
rules would fail, specifically:

÷ 0061 × 2060 × 0030 ÷  #  ÷ [0.2] LATIN SMALL LETTER A (ALetter) × 
[4.0] WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]

There is no break after the Format character (Format_FE) here because:
ALetter Format Numeric -> ALetter Numeric
which, following rule 9.0, is a no-break.

This is exactly the rule (4) as described in my previous email, just 
with a different follow-up rule (9 instead of 3c). I don't see how rule 
precedence would matter here, as there is no case for which two rules apply.
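
To make the comparison concrete, here is a minimal Python sketch of that
reading: apply rule 4 as a plain rewrite first, then look up the follow-up
rule. The property table, the rule_at helper and the reduced rule set (only
3c, 4, 9 and 999) are illustrative assumptions covering just the code points
quoted in this thread; this is not the libunibreak code nor the reference
implementation.

```python
# Reduced Word_Break property table: only the code points quoted in this
# thread, with the classes shown in the test data comments.
WORD_BREAK = {
    0x0061: "ALetter",          # LATIN SMALL LETTER A
    0x0030: "Numeric",          # DIGIT ZERO
    0x2060: "Format",           # WORD JOINER
    0x200D: "ZWJ",              # ZERO WIDTH JOINER
    0x0308: "Extend",           # COMBINING DIAERESIS
    0x2764: "Glue_After_Zwj",   # HEAVY BLACK HEART
    0x1F466: "EBG",             # BOY
}

IGNORED = {"Extend", "Format", "ZWJ"}


def rule_at(cps, i):
    """Name the rule deciding the boundary before cps[i] (0 < i < len(cps)).

    Hypothetical helper: treats WB4 as a plain rewrite, then checks only
    WB3c and WB9; every other rule of UAX #29 is left out.
    """
    right = WORD_BREAK[cps[i]]
    if right in IGNORED:
        return "WB4"  # no break before Extend/Format/ZWJ
    # WB4 as a rewrite: X (Extend | Format | ZWJ)* -> X, so walk back to X.
    j = i - 1
    while j > 0 and WORD_BREAK[cps[j]] in IGNORED:
        j -= 1
    left = WORD_BREAK[cps[j]]
    if left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:
        return "WB3c"  # no break
    if left == "ALetter" and right == "Numeric":
        return "WB9"  # no break (Hebrew_Letter omitted for brevity)
    return "WB999"  # break everywhere else


print(rule_at([0x0061, 0x2060, 0x0030], 2))   # WB9
print(rule_at([0x200D, 0x0308, 0x2764], 2))   # WB3c
print(rule_at([0x200D, 0x0308, 0x1F466], 2))  # WB3c
```

Under this reading the two ZWJ cases resolve through rule 3c in the same way
the WORD JOINER case resolves through rule 9, whereas the test data reports
[999.0] for them.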

--
Tom.

On 23/11/16 02:49, Philippe Verdy wrote:
> IMHO, the ZWJ should glue with the last symbol following your examples.
> But the combining diaeresis following the ZWJ extends it (even if in my
> opinion it is "defective" and would likely display on a dotted circle
> in renderers, but not defective for the string definition of combining
> sequences).
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).
>
> WB4: X (Extend | Format | ZWJ)*→X
>
> Extend: [ExtendGrapheme_Extend=Yes]  This includes:
>   General_Category = Nonspacing_Mark (this includes the combining diaeresis)
>   General_Category = Enclosing_Mark
>   U+200C ZERO WIDTH NON-JOINER
>   plus a few General_Category = Spacing_Mark needed for canonical
> equivalence.
>
> So yes we have: ZWJ "COMBINING DIAERESIS" (EBG|Glue_After_Zwj) → ZWJ
> (EBG|Glue_After_Zwj), because rule WB4 eliminates the combining mark from
> the input queue
>
> But rule WB3c comes before and prohibits it:
>
> WB3c: ZWJ × (Glue_After_Zwj | EBG)
>
> This means that you have first:
>
> ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG
>
> and this does not match rule WB4, which does not apply to:
>
> X × (Extend | Format | ZWJ)*→X
>
> (it cannot remove the extenders if there's a no-break before them, it is
> valid only when the break opportunity is still unspecified). As soon as a
> rule has produced a "break here" or "nobreak here" at a given position,
> you must advance after this position (the rules are based on a small
> finite state machine). So after:
>
> ZWJ "COMBINING DIAERESIS" GAZ →  ZWJ × "COMBINING DIAERESIS" EBG
>
> it just remains in your input queue:
>
> "COMBINING DIAERESIS" EBG  (because "ZWJ ×" is already processed, and so
> ZWJ is eliminated)
>
> Now comes WB4: X (Extend | Format | ZWJ)* → X
>
> There is no longer any "X" to match before the combining diaeresis: your
> input queue starts with the combining diaeresis matching "X", the
> following character (EBG) does not match within "(Extend | Format |
> ZWJ)*" (which matches an empty string and does not contain the combining
> diaeresis already matched in "X"), so rule WB4 has no replacement
> effect and preserves the initial "X" (i.e. the combining diaeresis).
>
> 2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:
>
>     Dear,
>
>     I recently updated libunibreak[1] according to Unicode 9.0.0. I
>     thought I implemented it correctly, however it fails against two of
>     the tests in the reference test data:
>
>     ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
>     COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
>     (Glue_After_Zwj) ÷ [0.3]
>
>     and
>
>     ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) ×
>     [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>
>
>     More specifically, it fails in both after the "combining diaeresis".
>     My implementation marks it as no break, whereas the test data marks it
>     as a break. The reference implementation, as expected, agrees with the
>     test data.
>
>
>     However, looking at the test case and the UAX[2], this does not look
>     correct. More specifically, because of rule 4:
>     ZWJ Extend GAZ -> ZWJ GAZ
>     And then according to rule 3c, there should be no break opportunity
>     between them. The reference implementation, however, uses rule 999
>     here, which I believe is incorrect.
>
>
>     Am I missing anything, or is this an issue with the reference test
>     data and reference implementation?
>
>     Thanks,
>     Tom.
>
>     [1]: https://github.com/adah1972/libunibreak
>     [2]: http://www.unicode.org/reports/tr29/#WB1
>
>


