UAX #29: Ambiguities in WB4, and contributing back testcases

Richard Wordingham richard.wordingham at ntlworld.com
Thu Dec 22 15:08:22 CST 2016


On Wed, 21 Dec 2016 15:24:21 -0800
Manish Goregaokar <manish at mozilla.com> wrote:


> Aside from that, WB4's[6] greediness is underspecified. In previous
> versions, the rule was
<snip>

> However, now the rule is
> 
> > X (Extend | Format | ZWJ)* → X  
> 
> The problem here is that ZWJ appears in the previous rule as well,
> WB3c[7]:
> 
> > ZWJ × (Glue_After_Zwj | EBG)  
> 
> which says that we should not break between a ZWJ and a GAZ ("Glue
> After ZWJ") character.
> 
> WB3c has precedence over WB4, which means that a sequence like
> `Emoji_Base ZWJ EBG` becomes `Emoji_Base ZWJ×EBG` *first*, before the
> ZWJ is collapsed into the Emoji_Base. This is fine.
> 
> However, more complicated sequences depend on the greediness of the
> Kleene star in WB4. For example, take the sequence `Emoji_Base Extend
> ZWJ Extend EBG`. WB3c does not apply here. However, WB4 can apply
> since we have an Extend/ZWJ sequence.
> 
> WB4 can apply in multiple ways. If it is applied greedily, we get
> `Emoji_Base(..) EBG` (where ellipses are used to denote WB4-collapsed
> characters). This does break, since no rule prevents a break between
> Emoji_Base and EBG.
> 
> However, we can apply it conservatively instead. We can get
> `Emoji_Base(..) ZWJ(..) EBG`, which does satisfy WB3c, and doesn't
> collapse.

From your terminology, I think you have an error in your transformation
to a 'regular' expression.  Why don't you have the same problem when
you determine word breaks in

CR Extend LF?

I'm guessing that you have some mechanism that makes WB3 (CR × LF)
redundant.  Rule WB3c does *not* transform to

ZWJ(...) × (Glue_After_Zwj | EBG) 


Naively, I would say that WB4 can be reapplied to `Emoji_Base(..)
ZWJ(..) EBG`, yielding `Emoji_Base EBG` and thus a word break.
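
To make that reading concrete, here is a minimal sketch in Python. It is
an illustration under stated assumptions, not a conforming UAX #29
implementation: property names are plain strings standing in for code
points, the function name `breaks` is hypothetical, and only WB3, WB3a,
WB3b, WB3c, WB4, WB14 and WB999 are modelled. WB3 and WB3c are tested on
the raw adjacent pair; only the rules after WB4 skip over Extend, Format
and ZWJ when looking left.

```python
# Simplified sketch: string labels stand in for Word_Break property values.
IGNORED = {"Extend", "Format", "ZWJ"}   # absorbed into the preceding X by WB4
HARD = {"CR", "LF", "Newline"}          # WB3a / WB3b


def breaks(props):
    """Return each index i where a break falls between props[i-1] and props[i]."""
    out = []
    for i in range(1, len(props)):
        left, right = props[i - 1], props[i]
        # WB3: CR x LF, tested on the raw adjacent pair.
        if left == "CR" and right == "LF":
            continue
        # WB3a / WB3b: break after and before CR, LF and Newline.
        if left in HARD or right in HARD:
            out.append(i)
            continue
        # WB3c: ZWJ x (Glue_After_Zwj | EBG), also on the raw adjacent pair.
        if left == "ZWJ" and right in {"Glue_After_Zwj", "EBG"}:
            continue
        # WB4, first half: never break before an ignored character.
        if right in IGNORED:
            continue
        # WB4, second half: later rules look left past ignored characters,
        # but an ignored character directly after CR/LF/Newline is not absorbed.
        j = i - 1
        while j > 0 and props[j] in IGNORED and props[j - 1] not in HARD:
            j -= 1
        base = props[j]
        # WB14: (E_Base | EBG) x E_Modifier, as one example of a later rule.
        if base in {"Emoji_Base", "EBG"} and right == "E_Modifier":
            continue
        # WB999: otherwise, break.
        out.append(i)
    return out


print(breaks(["Emoji_Base", "ZWJ", "EBG"]))
print(breaks(["Emoji_Base", "Extend", "ZWJ", "Extend", "EBG"]))
print(breaks(["CR", "Extend", "LF"]))
```

Under these assumptions the first sequence has no internal break (WB3c
applies to the adjacent ZWJ and EBG), the second breaks before EBG
because the character adjacent to EBG is Extend rather than ZWJ, and
`CR Extend LF` breaks on both sides of the Extend.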

Richard.


