Potential contradiction between the WordBreak test data and UAX #29

Wed Nov 23 05:14:51 CST 2016

You say "theres's no case where two rules apply". I don't think this is
right, rules apply in the precedence order as long as they've not produced
a decision for generating a "break here" or no break here". This is
especially important for rules that generate only a replacement, that are
executed in the displayed order. because multiple rules may have their left
side member match simultaneously.

You have to read them as if this was a:

if (condition1) then (replacement1)
else if (condition2) then (replacement2)
else if (condition3) then (replacement3)
...
else if (conditionnN) then (replacementN)

The order of conditions (i.e. the order of rules) is significant when
several one may be true simultaneously.

Then when handling the replacement, of course you restart from the
begining. But what happens on the input stream is very different if it
contains a "break here" or "no break here" (e.g. rule WB3c), or not (e.g.
rule WB4): in the first case, the substitution will not advance the input
stream, it just transforms it (it changes the internal parser state only),
in the second case, the state is transformed but all elements in the put
stream before the "break here" or no-break here" are discarded from the
input stream, leaving only those on the left part of the "break
here"/"nobreak here".

The input state is a FIFO stack where each element contains:
  { a text buffer (or equivalently an index pointing to the relative end
position in the input stream buffer) cumulating all characters (or bytes)
from the input to which the WB class was assigned;
    a WB class (a small integer) to which this input string was mapped
  }
and the input strema buffer.

The automata processes each rule in the listed order: to see if a rule
match it just uses the seond element (the WB class) of elements starting
from those of the bottom of the stack.

If there's not enough elements in trhe FIFO stack to match a rule
completely (in "hungry" mode if that matching rule contains "*" or "+") it
will read additional bytes or a character from the input stream, to append
to the top of the input buffer until it can assign it a WB class, and that
element will just contain that character and that WB class that will be
pushed to the top of the FIFO stack.

When a tested rule matches one or more elements starting from the bottom of
the FIFO,
* the replacement will transform only these elements in the FIFO: all
characters in their internal text buffers are combined if needed if the
replacement reduces the number of WB class items, otherwise the WB class is
just replaced in the relevant element of the FIFO stack, but characters are
kept unchanged.
* Then if the replacement in that matched rule contains a "break here'" or
"no-break" item, all characters in the bottom of the FIFO up to that
position are output: they are popped out from the FIFO, but other items in
the FIFO are kept.

An automata can optimize this FIFO so that the set of rules (equivalent to
an ordered set of regexps) becomes a finite state automata. But as the set
of regexp is ordered, it is possible that from some input some common
prefix in multiple regexps will match simulteneously: their order is
significant.

This is more complex than in the initial specification of word breakers
where there was no "hungry" regexps and matching occured only on pairs of
characters, so that you did not need a FIFO (or the FIFO always contained a
single element, never more, and the text buffer in that element was reduced
to just one character or their encoded bytes): in that case there was still
a significant order or rules, so that only if multiple ones were potentialy
matching the input pair, their order in the specification determined their
precedence (in that case it was possibly to summarize the ordered set of
rules with a simple 2D lookup table).

But if you look at rule WB4: X (Extend | Format | ZWJ)*→X
(which is "hungry" and not bound in length, and which does not pop out any
characters from the input FIFO but still cumulate them in the input state
until it no longer matches longer inputs with "X (Extend | Format | ZWJ)*),
the simple 2D lookup table array approach does no work: it will match
partial input at the same time as other concurrent rules, but concurrent
rules must be ignored if their precedence is lower (because their rule
number is higher).

So the automata cannot be a finite-state automata whose state is
represented only by a single integer in a small bound set (the set of WB
class values).

Note also that the input stream is complemented with additional
pseudo-characters "sot" and "eot" surrounding it: the automata will be
initialized by pushing a {"", sot} element in the FIFO and when the end of
strem is reached, it will push a {"", eot} element to the FIFO. This is
needed for rules WB1 and WB2 (that have the highest precedence in the set
of regexps to match).

The last rule "WB999: Any ÷ Any" is not "hungry" but is equivalent to a
match-all pairs regexp "..", and because it is the last rule, it has the
lowest precedence: it will always match simultaneously with other rules
matching pairs, but will be ignored unless none of the previous rules match.

Not all rules are matching pairs (or longer sequences), notably not rules
WB3a, WB3b that match isolated newlines, but all other rules are matching
at least a pair of character, this means that rules WB3a and WB3b are in
fact those that have the highest precedence.

These rules not matching pairs are:
  WB3a: (Newline | CR | LF)÷
  WB3b: ÷(Newline | CR | LF)

They are in compact form but are equivalent to the expanded form showing
their replacement:
  WB3a: (Newline | CR | LF)÷  → sot
Effectively this is the only rule that matches a single character, all
other rules are matching pairs.

Rule WB999 will match "sot eot" and will discard "sot" from the FIFO,
leaving "eot" alone. ("Any eot" is matched in rule WB2). There's an
additional final (implicit) rule needed to match "eot" alone: it will
terminate the automata. So all other rules are considering at least one
pair and WB999 will match all of them.

2016-11-23 10:13 GMT+01:00 Tom Hacohen <tom at osg.samsung.com>:

> You said:
> > So ignore it and test whever the last symbols glues with ZWJ (it should,
> > so there's no break in the reference implementation).
>
> Which makes me think you misread the example I quoted. There is a break in
> the reference implementation, though I argue (like you just did) that there
> shouldn't be. So I think you agree with me and also think it's broken.
>
> Otherwise, I'm not sure I fully understand what you are saying, but if
> what you are saying is correct, then following the same logic, other rules
> would fail, specifically:
>
> ÷ 0061 × 2060 × 0030 ÷  #  ÷ [0.2] LATIN SMALL LETTER A (ALetter) × [4.0]
> WORD JOINER (Format_FE) × [9.0] DIGIT ZERO (Numeric) ÷ [0.3]
>
> After the FE here there's no BREAK because:
> ALetter Format Numeric -> ALetter Numeric
> Which then following rule 9.0 is a no-break.
>
> This is exactly the rule (4) as described in my previous email, just with
> a different follow-up rule (9 instead of 3c). I don't see how rule
> precedence would matter here, as there is no case for which two rules apply.
>
> --
> Tom.
>
>
> On 23/11/16 02:49, Philippe Verdy wrote:
>
>> IMHO, the ZWJ should glue with the last symbol following your examples.
>> But the combining diaeresis following the ZWJ extends it (even if in my
>> opinion it is "defective" and would likely display on a dotted ciurcle
>> in renderers, but not defective for the string definition of combining
>> sequences).
>> So ignore it and test whever the last symbols glues with ZWJ (it should,
>> so there's no break in the reference implementation).
>>
>> WB4: X (Extend | Format | ZWJ)*→X
>>
>> Extend: [ExtendGrapheme_Extend=Yes]  This includes:
>>   General_Category = Nonspacing_Mark (this includes the combining
>> diaeresis)
>>   General_Category = Enclosing_Mark
>>   U+200C ZERO WIDTH NON-JOINER
>>   plus a few General_Category = Spacing_Mark needed for canonical
>> equivalence.
>>
>> So yes we have: ZWJ "COMBINING DIERESIS" (EBG|Glue_After_Zwj) → ZWJ
>> (EBG|Glue_After_Zwj) from rule WB4 eliminate the combining mark from the
>> input queue
>>
>> But rule WB3c comes before and prohibits it:
>>
>> WB3c: ZWJ × (Glue_After_Zwj | EBG)
>>
>> This means that you have first:
>>
>> ZWJ "COMBINING DIERESIS" GAZ →  ZWJ × "COMBINING DIERESIS" EBG
>>
>> and this does not match the rule WB4 which is not matching for:
>>
>> X × (Extend | Format | ZWJ)*→X
>>
>> (it cannot remove the extenders if there's a no-break before them, it is
>> valid only when the break oppotunity is still unspecified. As soon as a
>> rule as produced a "break here" or "nobreak here" at a given position,
>> you must advance after this position (the rules are based on a small
>> finite state machine). So after :
>>
>> ZWJ "COMBINING DIERESIS" GAZ →  ZWJ × "COMBINING DIERESIS" EBG
>>
>> it just remains in your input queue:
>>
>> "COMBINING DIERESIS" EBG  (because "ZWJ ×" is already processed, and so
>> ZWJ is elminated)
>>
>> Now comes WB4: X (Extend | Format | ZWJ)* → X
>>
>> There's no more any "X" to match before the combining diaeresis: your
>> input queue starts by the combining diareasis matching "X", the
>> following character (EBG) does not match within "(Extend | Format |
>> ZWJ)*" (which matches an empty string and does not contain the combining
>> diaresis already matched in "X"), rule WB4 has then no replacement
>> effect and preserves the initial "X" (i.e. the combining diaeresis)
>>
>> .
>>
>>
>>
>>
>>
>>
>>
>> 2016-11-22 13:07 GMT+01:00 Tom Hacohen <tom at osg.samsung.com
>> <mailto:tom at osg.samsung.com>>:
>>
>>
>>     Dear,
>>
>>     I recently updated libunibreak[1] according to unicode 9.0.0. I
>>     thought I implemented it correctly, however it fails against two of
>>     the tests in the reference test data:
>>
>>     ÷ 200D × 0308 ÷ 2764 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) × [4.0]
>>     COMBINING DIAERESIS (Extend_FE) ÷ [999.0] HEAVY BLACK HEART
>>     (Glue_After_Zwj) ÷ [0.3]
>>
>>     and
>>
>>     ÷ 200D × 0308 ÷ 1F466 ÷ #  ÷ [0.2] ZERO WIDTH JOINER (ZWJ_FE) ×
>>     [4.0] COMBINING DIAERESIS (Extend_FE) ÷ [999.0] BOY (EBG) ÷ [0.3]
>>
>>
>>     More specifically, it fails in both after the "combining diaeresis".
>>     My implementation marks it as a break, whereas the test data as not.
>>     The reference implementation, as expected, agrees with the test data.
>>
>>
>>     However, looking at the test case and the UAX[2], this does not look
>>     correct. More specifically, because of rule 4:
>>     ZWJ Extended GAZ -> ZWJ GAZ
>>     And then according to rule 3c, there should be no break opportunity
>>     between them. The reference implementation, however, uses rule 999
>>     here, which I believe is incorrect.
>>
>>
>>     Am I missing anything, or is this an issue with the reference test
>>     data and reference implementation?
>>
>>     Thanks,
>>     Tom.
>>
>>     [1]: https://github.com/adah1972/libunibreak
>>     <https://github.com/adah1972/libunibreak>
>>     [2]: http://www.unicode.org/reports/tr29/#WB1
>>     <http://www.unicode.org/reports/tr29/#WB1>
>>
>>
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://unicode.org/pipermail/unicode/attachments/20161123/ff89f1c5/attachment.html>