UAX #29 and WB4
Daniel Bünzli via Unicode
unicode at unicode.org
Wed Mar 4 13:26:42 CST 2020
On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
> On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
>
> > Re-reading the text I suspect I should not restart the rules from the first one when a WB4
> > rewrite occurs but only apply the subsequent rules. Is that correct?
>
> However even if that's correct I don't understand how this test case works:
>
> ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE)
> × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>
> Here the first two chars get rewritten with WB4 to ExtPict; then, if only subsequent rules
> are applied, we end up in WB999 and a break between 200D and 1F6D1.
That's nonsense and not the operational model of the algorithm, which IIRC was once clearly stated on this list by Mark Davis (sorry, I failed to dig out the message): take each boundary position candidate and apply the rules in sequence, taking the first one that matches, and then start over with the next candidate.
In that case applying the rules between 1F6D1 and 200D leads to WB4, which implicitly adds a non-boundary condition. This is not really evident from the formalism, but see the comment above WB4. For that boundary position, this settles the non-boundary condition. Then we start again, applying the rules between 200D and the last 1F6D1, and WB3c matches before WB4 kicks in.
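To make this concrete, here is a minimal Python sketch of that model as I understand it. It is not a conforming UAX #29 implementation: the property table `PROPS`, the helper `prop` and the three-rule list are assumptions made for this one test case, WB4 is reduced to its "no break before Extend, Format or ZWJ" reading, and its exceptions (sot, CR, LF, Newline) are left out.

```python
# Toy Word_Break property table covering only the test case above.
# A real implementation would read the Unicode data files instead.
PROPS = {0x1F6D1: "ExtPict", 0x200D: "ZWJ"}


def prop(cp):
    """Return the (toy) property of a code point."""
    return PROPS.get(cp, "Other")


# Each rule looks at one candidate position, given the code points on its
# left and right. It returns True (break), False (no break), or None when
# the rule does not match and the next rule must be tried.

def wb3c(left, right):
    # WB3c: ZWJ x ExtPict
    if prop(left) == "ZWJ" and prop(right) == "ExtPict":
        return False
    return None


def wb4(left, right):
    # WB4, read at the candidate position: never break before
    # Extend, Format or ZWJ (exceptions omitted in this sketch).
    if prop(right) in ("Extend", "Format", "ZWJ"):
        return False
    return None


def wb999(left, right):
    # WB999: otherwise, break everywhere.
    return True


RULES = [("WB3c", wb3c), ("WB4", wb4), ("WB999", wb999)]


def boundaries(cps):
    """For each interior position, apply the rules in order, keep the
    first one that matches, then start over at the next position."""
    result = []
    for i in range(1, len(cps)):
        for name, rule in RULES:
            verdict = rule(cps[i - 1], cps[i])
            if verdict is not None:
                result.append((i, name, verdict))
                break
    return result


print(boundaries([0x1F6D1, 0x200D, 0x1F6D1]))
# [(1, 'WB4', False), (2, 'WB3c', False)]
```

Both interior positions come out as non-boundaries, the first via WB4 and the second via WB3c, which agrees with the × [4.0] and × [3.3] annotations of the test case.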
I think the behaviour of → rules should be clarified: it's not clear on which data you apply them w.r.t. the boundary position candidate. If I understand correctly, if the match spans the boundary position candidate, that simply turns it into a non-boundary. Otherwise you apply the rule to the left of the boundary position candidate.
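A small Python sketch of that reading, under the same caveats: the property table, `effective_left` and `is_break` are hypothetical names of mine, only WB4, WB5 and WB999 are modelled, and the WB4 exceptions (sot, CR, LF, Newline) as well as WB3c, which needs to see the actual ZWJ, are left out.

```python
# Toy Word_Break properties: 'a', 'b', COMBINING ACUTE ACCENT, ZWJ.
PROPS = {0x61: "ALetter", 0x62: "ALetter", 0x301: "Extend", 0x200D: "ZWJ"}
IGNORABLE = {"Extend", "Format", "ZWJ"}


def prop(cp):
    """Return the (toy) property of a code point."""
    return PROPS.get(cp, "Other")


def effective_left(cps, i):
    """Property seen on the left of position i once the WB4 rewrite
    X (Extend | Format | ZWJ)* -> X has been applied there."""
    j = i - 1
    while j > 0 and prop(cps[j]) in IGNORABLE:
        j -= 1
    return prop(cps[j])


def is_break(cps, i):
    """Decide the candidate between cps[i - 1] and cps[i]."""
    right = prop(cps[i])
    if right in IGNORABLE:
        # The WB4 match spans the candidate: it becomes a non-boundary.
        return False
    left = effective_left(cps, i)
    if left == "ALetter" and right == "ALetter":
        # WB5, applied to the rewritten left-hand side.
        return False
    return True  # WB999


text = [0x61, 0x301, 0x62]  # a, combining acute, b
print([is_break(text, i) for i in range(1, len(text))])
# [False, False]
```

At position 1 the rewrite spans the candidate; at position 2 it lies entirely on the left, so WB5 sees ALetter × ALetter.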
Regarding the question of my original message it seems at a certain point I knew better:
https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
Sorry for the noise.
Daniel
P.S. I still think UAX29 and UAX14 could benefit from clarifying the operational model of the rules a bit (I also have the impression that the formalism used to express all that may not be the right one, but I don't have anything better to propose at this time). It would also be nicer for implementers if they didn't have to factorize rules themselves (e.g. as in the new LB30 rules of UAX14), so that the correctness of implemented rules is easier to assert.