UAX #29 and WB4

Tue Mar 10 00:00:57 CDT 2020

daniel.buenzli wrote:

> I think the behaviour of → rules should be clarified

I wholeheartedly agree.

> If I understand correctly, if the match [or a "treat-as" rule] spans over
> the boundary position candidate, that simply turns it into a
> non-boundary. Otherwise you apply the rule on the left of the boundary
> position candidate.

I have considered the extent of a left-side treat-as match to not continue
beyond the candidate boundary position. This comes into play following a
ZWJ, where the ZWJ may be absorbed into a "treat as" on the left (WB4), while
some other rule triggers on the right side (WB3c). At any rate, this is
what I do in ICU. It gets very confusing, and is tricky to implement.
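
To make that reading concrete, here is a minimal Python sketch. It is not
ICU's code and not the full UAX #29 rule set. It assumes a toy property
table (PROPS) and implements only WB1/WB2, WB3c, WB4, WB5 and WB999; the
newline rules WB3, WB3a and WB3b, and with them the "except after sot, CR,
LF, Newline" part of WB4, are left out. The names prop, is_boundary and
boundaries are made up for the illustration.

```python
# Toy property table: only the handful of code points used below.
# A real implementation would read WordBreakProperty.txt and emoji-data.txt.
PROPS = {
    0x1F6D1: "ExtPict",  # OCTAGONAL SIGN
    0x200D: "ZWJ",       # ZERO WIDTH JOINER
    0x0308: "Extend",    # COMBINING DIAERESIS
    0x0061: "ALetter",   # LATIN SMALL LETTER A
    0x0062: "ALetter",   # LATIN SMALL LETTER B
}
IGNORED = {"Extend", "Format", "ZWJ"}  # classes absorbed by WB4


def prop(cp):
    return PROPS.get(cp, "Other")


def is_boundary(cps, i):
    """Decide the candidate position between cps[i-1] and cps[i].

    Rules are tried in order for this one position; the first match wins.
    """
    if i == 0 or i == len(cps):
        return True  # WB1, WB2: break at start and end of text
    left, right = prop(cps[i - 1]), prop(cps[i])
    # WB3c comes before WB4, so it sees the raw ZWJ on the left.
    if left == "ZWJ" and right == "ExtPict":
        return False
    # WB4, match spanning the candidate: X (Extend|Format|ZWJ)* is one
    # unit, so there is no break in front of an absorbed character.
    if right in IGNORED:
        return False
    # WB4, left side only: skip back over absorbed characters to find X.
    # The skip never looks past the candidate position to the right.
    j = i - 1
    while j > 0 and prop(cps[j]) in IGNORED:
        j -= 1
    left = prop(cps[j])
    # WB5: do not break between letters.
    if left == "ALetter" and right == "ALetter":
        return False
    return True  # WB999: otherwise break


def boundaries(cps):
    return [i for i in range(len(cps) + 1) if is_boundary(cps, i)]


# The test case from the thread: ÷ 1F6D1 × 200D × 1F6D1 ÷
print(boundaries([0x1F6D1, 0x200D, 0x1F6D1]))
```

With these assumptions the position before the ZWJ is settled by WB4 and the
position after it by WB3c, so the only boundaries reported are the start and
the end of the text.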

Reconsidering how the ZWJ rules work could also help, if we could figure
out how to keep them out of the "treat as" rules and use explicit no-break
rules on both sides instead.

  -- Andy

On Wed, Mar 4, 2020 at 4:01 PM Mark Davis ☕️ via Unicode <
unicode at unicode.org> wrote:

> One thing we have considered for a while is whether to do a rewrite of the
> rules to simplify the processing (and avoid the "treat as" rules), but it
> would take a fair amount of design work that we haven't had time to do. If
> you (or others) are interested in getting involved, please let us know.
>
> Mark
>
>
> On Wed, Mar 4, 2020 at 11:30 AM Daniel Bünzli via Unicode <
> unicode at unicode.org> wrote:
>
>> On 4 March 2020 at 18:48:09, Daniel Bünzli (daniel.buenzli at erratique.ch)
>> wrote:
>>
>> > On 4 March 2020 at 18:01:25, Daniel Bünzli (daniel.buenzli at erratique.ch)
>> wrote:
>> >
>> > > Re-reading the text I suspect I should not restart the rules from the
>> first one when a
>> > WB4
>> > > rewrite occurs but only apply the subsequent rules. Is that correct ?
>> >
>> > However even if that's correct I don't understand how this test case
>> works:
>> >
>> > ÷ 1F6D1 × 200D × 1F6D1 ÷ # ÷ [0.2] OCTAGONAL SIGN (ExtPict) × [4.0]
>> ZERO WIDTH JOINER (ZWJ_FE)
>> > × [3.3] OCTAGONAL SIGN (ExtPict) ÷ [0.3]
>> >
>> > Here the first two chars get rewritten with WB4 to ExtPict then if only
>> subsequent rules
>> > are applied we end up in WB999 and a break between 200D and 1F6D1.
>>
>> That's nonsense and not the operational model of the algorithm which IIRC
>> was once clearly stated on this list by Mark Davis (sorry I failed to dig
>> out the message) which is to take each boundary position candidate and
>> apply the rules in sequence, taking the first one that matches, and then
>> start over with the next one.
>>
>> In that case applying the rules between 1F6D1 and 200D leads to WB4 but
>> then that implicitly adds a non boundary condition -- this is not really
>> evident from the formalism but see the comment above WB4, for that boundary
>> position that settles the non boundary condition. Then we start again
>> applying the rules between 200D and the last 1F6D1 and WB3c matches before
>> WB4 kicks in.
>>
>> I think the behaviour of → rules should be clarified: it's not clear on
>> which data you apply it w.r.t. the boundary position candidate. If I
>> understand correctly if the match spans over the boundary position
>> candidate that simply turns it into a non-boundary. Otherwise you apply the
>> rule on the left of the boundary position candidate.
>>
>> Regarding the question of my original message it seems at a certain point
>> I knew better:
>>
>>   https://www.unicode.org/mail-arch/unicode-ml/y2016-m11/0151.html
>>
>> Sorry for the noise.
>>
>> Daniel
>>
>> P.S. I still think the UAX29 and UAX14 could benefit from clarifying the
>> operational model of the rules a bit (I also have the impression that the
>> formalism to express all that may not be the right one, but then I don't
>> have something better to propose at the time). Also it would be nicer for
>> implementers if they didn't have to factorize rules themselves (e.g. like
>> in the new LB30 rules of UAX14) so that correctness of implemented rules is
>> easier to assert.
>>
>>
>>
>>