Potential contradiction between the WordBreak test data and UAX #29

Philippe Verdy verdy_p at wanadoo.fr
Tue Nov 22 20:56:39 CST 2016


Note also this statement at the beginning of the specification:

Single boundaries. Each rule has exactly one boundary position. This
restriction is more a limitation on the specification methods, because a
rule with multiple boundaries could be expressed instead as multiple rules.
For example:
 *  “a b ÷ c d ÷ e f” could be broken into two rules “a b ÷ c d e f” and “a
b c d ÷ e f”
 *  “a b × c d × e f” could be broken into two rules “a b × c d e f” and “a
b c d × e f”
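
The splitting described in that statement is mechanical. Here is a minimal sketch of it in Python, assuming rules are written as whitespace-separated tokens with "÷" or "×" as boundary marks; the function name split_rule is only illustrative and does not come from the specification:

```python
def split_rule(rule: str) -> list[str]:
    """Split a rule holding several boundary marks into single-boundary rules.

    Each output rule keeps exactly one of the original marks and drops the
    others, leaving the surrounding context tokens untouched.
    """
    marks = {"÷", "×"}
    tokens = rule.split()
    positions = [i for i, tok in enumerate(tokens) if tok in marks]
    rules = []
    for keep in positions:
        kept = [tok for i, tok in enumerate(tokens)
                if tok not in marks or i == keep]
        rules.append(" ".join(kept))
    return rules


print(split_rule("a b ÷ c d ÷ e f"))
# ['a b ÷ c d e f', 'a b c d ÷ e f']
```

A rule that already has a single boundary comes back unchanged as a one-element list.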

The rules are not built to allow keeping and processing multiple boundary
positions. Only one is considered: once a break or no-break decision is
made on a position, everything that is before that position is discarded
from the input and will no longer be used by any further rule. The engine
loops back to the first rule, starting from that new boundary position, to
find matching rules, without ever looking backward.
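
To make that loop concrete, here is a toy sketch in Python of a forward-only, first-match-wins scan, applied to the ZWJ / Extend / EBG case from the quoted message below. It is not the reference implementation: the pairwise rule table is my own simplification, and the "4" entry (no break before Extend, Format or ZWJ) merely stands in for the "× [4.0]" shown in the test data.

```python
from typing import Callable, List, Tuple

# (rule name, test on left class, test on right class, decision)
Rule = Tuple[str, Callable[[str], bool], Callable[[str], bool], str]


def any_class(_: str) -> bool:
    return True


# Illustrative, simplified rule table: an assumption, not the UAX #29 rules.
RULES: List[Rule] = [
    ("3c", lambda l: l == "ZWJ", lambda r: r in {"GAZ", "EBG"}, "×"),
    ("4", any_class, lambda r: r in {"Extend", "Format", "ZWJ"}, "×"),
    ("999", any_class, any_class, "÷"),
]


def decide_boundaries(classes: List[str]) -> List[Tuple[str, str]]:
    """Decide each interior boundary once, left to right.

    For every boundary the rules are tried from the first one; the first
    match wins, and earlier positions are never revisited.
    """
    decisions = []
    for i in range(1, len(classes)):
        left, right = classes[i - 1], classes[i]
        for name, left_ok, right_ok, mark in RULES:
            if left_ok(left) and right_ok(right):
                decisions.append((mark, name))
                break
    return decisions


print(decide_boundaries(["ZWJ", "Extend", "EBG"]))
# [('×', '4'), ('÷', '999')]
```

Under these assumptions the scan gives no break after ZWJ and a break after the combining mark, as in the test data, whereas ZWJ directly followed by EBG is caught by the "3c" entry.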

2016-11-23 3:49 GMT+01:00 Philippe Verdy <verdy_p at wanadoo.fr>:

> IMHO, the ZWJ should glue with the last symbol in your examples.
> But the combining diaeresis following the ZWJ extends it (even if in my
> opinion it is "defective" and would likely display on a dotted circle in
> renderers, it is not defective under the string definition of combining
> sequences).
> So ignore it and test whether the last symbol glues with ZWJ (it should,
> so there's no break in the reference implementation).
>
> WB4: X (Extend | Format | ZWJ)*→X
>
> Extend: [Grapheme_Extend=Yes]  This includes:
>   General_Category = Nonspacing_Mark (this includes the combining
> diaeresis)
>   General_Category = Enclosing_Mark
>   U+200C ZERO WIDTH NON-JOINER
>   plus a few General_Category = Spacing_Mark needed for canonical
> equivalence.
>
> So yes, rule WB4 gives: ZWJ "COMBINING DIAERESIS" (EBG | Glue_After_Zwj)
> → ZWJ (EBG | Glue_After_Zwj), which eliminates the combining mark from
> the input queue.
>
> But rule WB3c comes before and prohibits it:
>
> WB3c: ZWJ × (Glue_After_Zwj | EBG)
>
> This means that you have first:
>
> ZWJ "COMBINING DIAERESIS" EBG →  ZWJ × "COMBINING DIAERESIS" EBG
>
> and this does not match rule WB4, which does not apply to:
>
> X × (Extend | Format | ZWJ)*→X
>
> (It cannot remove the extenders if there's a no-break before them; it is
> valid only while the break opportunity is still unspecified. As soon as a
> rule has produced a "break here" or "no break here" at a given position,
> you must advance past this position: the rules are based on a small finite
> state machine.) So after:
>
> ZWJ "COMBINING DIAERESIS" EBG →  ZWJ × "COMBINING DIAERESIS" EBG
>
> all that remains in your input queue is:
>
> "COMBINING DIAERESIS" EBG  (because "ZWJ ×" is already processed, and so
> ZWJ is eliminated)
>
> Now comes WB4: X (Extend | Format | ZWJ)* → X
>
> There is no longer any "X" to match before the combining diaeresis: your
> input queue starts with the combining diaeresis, which matches "X"; the
> following character (EBG) does not match "(Extend | Format | ZWJ)*" (which
> matches the empty string and does not contain the combining diaeresis
> already matched by "X"). Rule WB4 then has no replacement effect and
> preserves the initial "X" (i.e. the combining diaeresis).
>
> 2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:
>
>> Dear,
>>
>> I recently updated libunibreak[1] according to Unicode 9.0.0. I thought I
>> implemented it correctly; however, it fails two of the tests in the
>> reference test data:
>>
>> ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
>> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
>> (Glue_After_Zwj) ÷ [0.3]
>>
>> and
>>
>> ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
>> COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>>
>>
>> More specifically, it fails in both after the "combining diaeresis". My
>> implementation marks it as a break, whereas the test data does not. The
>> reference implementation, as expected, agrees with the test data.
>>
>>
>> However, looking at the test case and the UAX[2], this does not look
>> correct. More specifically, because of rule 4:
>> ZWJ Extend GAZ -> ZWJ GAZ
>> And then according to rule 3c, there should be no break opportunity
>> between them. The reference implementation, however, uses rule 999 here,
>> which I believe is incorrect.
>>
>>
>> Am I missing anything, or is this an issue with the reference test data
>> and reference implementation?
>>
>> Thanks,
>> Tom.
>>
>> [1]: https://github.com/adah1972/libunibreak
>> [2]: http://www.unicode.org/reports/tr29/#WB1
>>
>
>