UAX #29: Ambiguities in WB4, and contributing back testcases

Manish Goregaokar manish at mozilla.com
Thu Dec 22 16:05:18 CST 2016


> Why don't you have the same problem when you determine word breaks in CR Extend LF?

By rule WB4, we don't break between CR and Extend, and treat the
CR×Extend aggregate as CR, and that in turn doesn't break with LF by
WB3.

The rule states that we "treat whatever is on the left side (X
(Format|Extend|ZWJ)*) as if it were whatever is on the right side
(X)".

I guess the confusion is, with → rules, do we apply them globally, or
only apply them when considering subsequent rules?

I suspect the answer here is that you only apply them in order. The
list of rules is not a list of precedences, but rather a list giving
the order in which the rules are applied. So a → rule means "Treat the
left side as if it were the right side in the context of all
subsequent rules".
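
To make the two readings concrete, here is a minimal sketch in Python.
It models only WB3, WB3c, and WB4 plus a default "break" fallback, so
it is not a UAX #29 implementation. The names `ABSORBED`, `GLUE`,
`wb4_base`, `is_break`, `breaks`, and the `wb4_global` flag are made
up for illustration. With `wb4_global=False` the WB4 rewrite is only
visible to rules after WB4 (the "in order" reading); with
`wb4_global=True` it is visible to every rule.

```python
# Sketch only: a tiny subset of the word break rules discussed in this thread.
ABSORBED = {"Extend", "Format", "ZWJ"}   # classes WB4 folds into the preceding X
GLUE = {"Glue_After_Zwj", "EBG"}         # right-hand side of WB3c


def wb4_base(classes, i):
    """Index of the X that position i collapses into under WB4."""
    while i > 0 and classes[i] in ABSORBED:
        i -= 1
    return i


def is_break(classes, i, wb4_global=False):
    """True if this sketch breaks between classes[i-1] and classes[i]."""
    left, right = classes[i - 1], classes[i]
    if wb4_global:
        # Global reading: collapse first, so every rule sees the rewritten text.
        if right in ABSORBED:
            return False
        left = classes[wb4_base(classes, i - 1)]
    # WB3: CR × LF
    if left == "CR" and right == "LF":
        return False
    # WB3c: ZWJ × (Glue_After_Zwj | EBG)
    if left == "ZWJ" and right in GLUE:
        return False
    if not wb4_global:
        # "In order" reading: WB4 only takes effect from here on.
        if right in ABSORBED:
            return False
        # Rules after WB4 would see classes[wb4_base(classes, i - 1)]
        # as the left side; none of them are modeled here.
    return True  # default: break


def breaks(classes, wb4_global=False):
    """Positions i with a break between classes[i-1] and classes[i]."""
    return [i for i in range(1, len(classes))
            if is_break(classes, i, wb4_global)]


simple = ["Emoji_Base", "ZWJ", "EBG"]
longer = ["Emoji_Base", "Extend", "ZWJ", "Extend", "EBG"]
print(breaks(simple), breaks(simple, wb4_global=True))
print(breaks(longer), breaks(longer, wb4_global=True))
```

Under the "in order" reading, WB3c still sees the raw ZWJ in
`Emoji_Base ZWJ EBG`, so that sequence stays together. Under the
global reading the ZWJ has already been folded into Emoji_Base by the
time WB3c is tested, so WB3c can never match after a base character.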

Thanks,
-Manish


On Thu, Dec 22, 2016 at 1:08 PM, Richard Wordingham
<richard.wordingham at ntlworld.com> wrote:
> On Wed, 21 Dec 2016 15:24:21 -0800
> Manish Goregaokar <manish at mozilla.com> wrote:
>
>
>> Aside from that, WB4's[6] greediness is underspecified. In previous
>> versions, the rule was
> <snip>
>
>> However, now the rule is
>>
>> > X (Extend | Format | ZWJ)* → X
>>
>> The problem here is that ZWJ appears in the previous rule as well,
>> WB3c[7]:
>>
>> > ZWJ × (Glue_After_Zwj | EBG)
>>
>> which says that we should not break between a ZWJ and a GAZ ("Glue
>> After ZWJ") character.
>>
>> WB3c has precedence over WB4, which means that a sequence like
>> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
>> ZWJ is collapsed into the Emoji_Base. This is fine.
>>
>> However, more complicated sequences depend on the greediness of the
>> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
>> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
>> since we have an Extend/ZWJ sequence.
>>
>> WB4 can apply in multiple ways. If it is applied greedily, we get
>> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
>> characters). This does break, since no rule forbids a break between
>> Emoji_Base and EBG.
>>
>> However, we can apply it conservatively instead. We can get
>> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
>> collapse.
>
> From your terminology, I think you have an error in your transformation
> to a 'regular' expression.  Why don't you have the same problem when
> you determine word breaks in
>
> CR Extend LF?
>
> I'm guessing that you have some mechanism that makes WB3 (CR × LF)
> redundant.  Rule WB3c does *not* transform to
>
> ZWJ(...) × (Glue_After_Zwj | EBG)
>
>
> Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
> ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.
>
> Richard.
>


