UAX #29 and WB4

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed Mar 4 17:58:57 CST 2020


One thing we have considered for a while is whether to do a rewrite of the
rules to simplify the processing (and avoid the "treat as" rules), but it
would take a fair amount of design work that we haven't had time to do. If
you (or others) are interested in getting involved, please let us know.

Mark


On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <
unicode at unicode.org> wrote:

> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
>
> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buenzli at erratique.ch) wrote:
> >
> > > Re-reading the text I suspect I should not restart the rules from the
> > > first one when a WB4 rewrite occurs but only apply the subsequent
> > > rules. Is that correct?
> >
> > However, even if that's correct I don't understand how this test case
> > works:
> >
> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0] ZERO WIDTH JOINER (ZWJ_FE) × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
> >
> > Here the first two chars get rewritten with WB4 to ExtPict; then, if
> > only subsequent rules are applied, we end up in WB999 and a break
> > between 200D and 1F6D1.
>
> That's nonsense and not the operational model of the algorithm, which IIRC
> was once clearly stated on this list by Mark Davis (sorry, I failed to dig
> out the message): take each boundary position candidate and apply the
> rules in sequence, taking the first one that matches, and then start over
> with the next candidate.
>
> In that case applying the rules between 1F6D1 and 200D leads to WB4, which
> implicitly adds a non-boundary condition (this is not really evident from
> the formalism, but see the comment above WB4). For that boundary position
> this settles the matter: it is a non-boundary. Then we start again,
> applying the rules between 200D and the last 1F6D1, and WB3c matches
> before WB4 kicks in.
>
> I think the behaviour of → rules should be clarified: it's not clear on
> which data you apply them w.r.t. the boundary position candidate. If I
> understand correctly, if the match spans the boundary position candidate,
> that simply turns it into a non-boundary. Otherwise you apply the rule on
> the left of the boundary position candidate.
>
> Regarding the question of my original message, it seems that at a certain
> point I knew better:
>
>   https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>
> Sorry for the noise.
>
> Daniel
>
> P.S. I still think UAX29 and UAX14 could benefit from clarifying the
> operational model of the rules a bit (I also have the impression that the
> formalism used to express all that may not be the right one, but I don't
> have anything better to propose at the moment). Also, it would be nicer
> for implementers if they didn't have to factorize rules themselves (e.g.
> as in the new LB30 rules of UAX14), so that the correctness of implemented
> rules is easier to assert.
>
>
>
>
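
As an illustration of the operational model discussed in the quoted message
(visit each boundary position candidate, try the rules in order, let the
first match decide, then move to the next candidate), here is a minimal
Python sketch. It is not taken from UAX #29 or from any implementation
mentioned in this thread. It covers only WB1, WB2, WB3c, WB4, WB5 and WB999,
the property table is hand-written for the few code points used, and the
names PROPS, prop, is_break and word_boundaries are made up for this
example. The CR/LF/Newline exceptions to WB4 are left out, so this is not a
conforming implementation.

```python
# Hypothetical, hand-written property table for the code points used below.
# A real implementation would derive this from the Unicode data files.
PROPS = {
    0x1F6D1: "ExtPict",  # OCTAGONAL SIGN
    0x200D: "ZWJ",       # ZERO WIDTH JOINER
    0x0301: "Extend",    # COMBINING ACUTE ACCENT
    0x0061: "ALetter",   # LATIN SMALL LETTER A
    0x0062: "ALetter",   # LATIN SMALL LETTER B
}

# Classes that WB4 absorbs into the preceding character.
IGNORED = {"Extend", "Format", "ZWJ"}


def prop(cp):
    """Word-break class of a code point (toy lookup, defaults to Other)."""
    return PROPS.get(cp, "Other")


def is_break(cps, i):
    """Decide the candidate between cps[i-1] and cps[i], for 0 < i < len(cps).

    Rules are tried in order and the first one that matches decides.
    """
    left, right = prop(cps[i - 1]), prop(cps[i])

    # WB3c: ZWJ x ExtPict. Listed before WB4, so it sees the raw ZWJ.
    if left == "ZWJ" and right == "ExtPict":
        return False

    # WB4, case 1: the match "X (Extend|Format|ZWJ)*" spans the candidate,
    # which simply makes it a non-boundary.
    if right in IGNORED:
        return False

    # WB4, case 2: the match lies to the left of the candidate. For the
    # remaining rules, skip the ignored characters to find the effective X.
    j = i - 1
    while j > 0 and prop(cps[j]) in IGNORED:
        j -= 1
    left = prop(cps[j])

    # WB5: ALetter x ALetter (the full rule uses AHLetter).
    if left == "ALetter" and right == "ALetter":
        return False

    # WB999: otherwise, break everywhere.
    return True


def word_boundaries(cps):
    """Return the offsets at which a boundary occurs in the list cps."""
    if not cps:
        return []  # WB1/WB2 do not apply to empty text
    inner = [i for i in range(1, len(cps)) if is_break(cps, i)]
    return [0] + inner + [len(cps)]  # WB1 and WB2: break at sot and eot
```

With this reading, word_boundaries([0x1F6D1, 0x200D, 0x1F6D1]) gives
boundaries only at offsets 0 and 3, which agrees with the test case quoted
above: the first candidate is settled by WB4, the second by WB3c. The second
WB4 case is what lets WB5 see "a" when deciding the candidate between U+0301
and "b" in [0x61, 0x301, 0x62].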