Unifying E_Modifier and Extend in UAX 29 (i.e. the necessity of GB10)

Mark Davis ☕️ via Unicode unicode at unicode.org
Tue Jan 2 04:52:27 CST 2018


BTW, relevant to this discussion is a proposal filed at
http://www.unicode.org/L2/L2017/17434-emoji-rejex-uts51-def.pdf (the date
is wrong; it should be 2017-12-22).

Mark

On Tue, Jan 2, 2018 at 11:41 AM, Mark Davis ☕️ <mark at macchiato.com> wrote:

> We had that originally, but some people objected that some languages
> (Arabic, as I recall) can end a string of letters with a ZWJ, and
> immediately follow it by an emoji (without an intervening space) without
> wanting it to be joined into a grapheme cluster with a following symbol.
> While I personally consider that a degenerate case, we tightened the
> definition to prevent that.
>
> Mark
>
> On Tue, Jan 2, 2018 at 10:41 AM, Manish Goregaokar <manish at mozilla.com>
> wrote:
>
>> In the current draft GB11 mentions Extended_Pictographic Extend* ZWJ x
>> Extended_Pictographic.
>>
>> Can this similarly be distilled to just ZWJ x Extended_Pictographic? This
>> does affect cases like <indic letter, virama, ZWJ, emoji> or <arabic
>> letter, zwj, emoji> and I'm not certain if that counts as a degenerate
>> case. If we do this then all of the rules except the flag emoji one become
>> things which can be easily calculated with local information, which is nice
>> for implementors.
>>
>> (Also, in the current draft I think GB11 needs an `E_Modifier?` somewhere,
>> but if we merge that with Extend that's not going to be necessary anyway.)
>>
>> -Manish
>>
>> On Tue, Jan 2, 2018 at 3:02 PM, Manish Goregaokar <manish at mozilla.com>
>> wrote:
>>
>>> > Note: we are already planning to get rid of the GAZ/EBG distinction (
>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>>
>>>
>>> This is great! I hadn't noticed this when I last saw that draft (I was
>>> focusing on the Virama stuff). Good to know!
>>>
>>>
>>> > Instead, we'd add one line to
>>> *Extend <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>>
>>> Yeah, this is essentially what I was hoping we could do.
>>>
>>> Is there any way to formally propose this? Or is bringing it up here
>>> good enough?
>>>
>>> Thanks,
>>>
>>> -Manish
>>>
>>> On Mon, Jan 1, 2018 at 9:17 PM, Mark Davis ☕️ via Unicode <
>>> unicode at unicode.org> wrote:
>>>
>>>> This is an interesting suggestion, Manish.
>>>>
>>>> <non-emoji-base, skin tone modifier> is a degenerate case, so if we
>>>> follow your suggestion we could also drop E_Base and E_Modifier, along
>>>> with rule GB10.
>>>>
>>>> Instead, we'd add one line to *Extend
>>>> <http://www.unicode.org/reports/tr29/tr29-32.html#Extend>:*
>>>>
>>>> OLD
>>>> Grapheme_Extend = Yes
>>>> *and not* GCB = Virama
>>>>
>>>> NEW
>>>> Grapheme_Extend = Yes, or
>>>> Emoji characters listed as Emoji_Modifier=Yes in emoji-data.txt. See [
>>>> UTS51 <http://www.unicode.org/reports/tr41/tr41-21.html#UTS51>].
>>>> *and not* GCB = Virama
>>>>
>>>> Note: we are already planning to get rid of the GAZ/EBG distinction (
>>>> http://www.unicode.org/reports/tr29/tr29-32.html#GB10) in any event.
>>>>
>>>> Mark
>>>>
>>>> On Mon, Jan 1, 2018 at 3:52 PM, Richard Wordingham via Unicode <
>>>> unicode at unicode.org> wrote:
>>>>
>>>>> On Mon, 1 Jan 2018 13:24:29 +0530
>>>>> Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
>>>>>
>>>>> > <random non-emoji, skin tone modifier> sounds very much like a
>>>>> > degenerate case to me.
>>>>>
>>>>> Generally yes, but I'm not sure that they'd be inappropriate for
>>>>> Egyptian hieroglyphs showing human beings.  The choice of determinative
>>>>> can convey unpronounceable semantic information, though I'm not sure
>>>>> that that can be as sensitive as skin colour.  However, in such a case
>>>>> it would also be appropriate to give a skin tone modifier the property
>>>>> Extend.
>>>>>
>>>>> Richard.
>>>>>
>>>>
>>>>
>>>
>>
>
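
For readers following the thread, here is a minimal Python sketch of the two
ideas discussed above: folding Emoji_Modifier into Extend so that GB9 alone
attaches skin tone modifiers (which would make GB10 unnecessary), and the
strict versus "local" forms of GB11. It is an illustration under stated
assumptions, not an implementation of UAX #29: the PICTOGRAPHIC set is a
hand-picked toy subset of Extended_Pictographic, Grapheme_Extend is
approximated by general categories Mn/Me because Python's unicodedata module
does not expose that property, the "and not GCB = Virama" exclusion is left
out, and all function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"

# ASSUMPTION: toy subset of Extended_Pictographic (man, woman, rocket, heart).
PICTOGRAPHIC = {"\U0001F468", "\U0001F469", "\U0001F680", "\u2764"}


def is_emoji_modifier(ch):
    # Skin tone modifiers U+1F3FB..U+1F3FF have Emoji_Modifier=Yes.
    return "\U0001F3FB" <= ch <= "\U0001F3FF"


def is_extend(ch):
    # NEW-style Extend: Grapheme_Extend=Yes, or Emoji_Modifier=Yes.
    # ASSUMPTION: Grapheme_Extend is approximated by categories Mn/Me.
    return unicodedata.category(ch) in ("Mn", "Me") or is_emoji_modifier(ch)


def gb11_strict(text, i):
    # Extended_Pictographic Extend* ZWJ x Extended_Pictographic.
    # Needs an unbounded look-behind past any run of Extend characters.
    if i < 2 or text[i] not in PICTOGRAPHIC or text[i - 1] != ZWJ:
        return False
    j = i - 2
    while j >= 0 and is_extend(text[j]):
        j -= 1
    return j >= 0 and text[j] in PICTOGRAPHIC


def gb11_local(text, i):
    # ZWJ x Extended_Pictographic: decided from the adjacent pair only.
    return i > 0 and text[i - 1] == ZWJ and text[i] in PICTOGRAPHIC


def clusters(text, gb11=gb11_strict):
    # Only GB9 (x (Extend | ZWJ)) and the chosen GB11 variant are modelled;
    # every other boundary is treated as a break.
    out = []
    for i, ch in enumerate(text):
        joins = i > 0 and (is_extend(ch) or ch == ZWJ or gb11(text, i))
        if joins:
            out[-1] += ch
        else:
            out.append(ch)
    return out


woman_tone = "\U0001F469\U0001F3FD"            # woman + skin tone modifier
arabic_zwj_emoji = "\u0628\u200d\U0001F680"    # beh + ZWJ + rocket

print(clusters(woman_tone))                    # one cluster, via GB9 alone
print(clusters(arabic_zwj_emoji))              # strict: break before rocket
print(clusters(arabic_zwj_emoji, gb11_local))  # local: a single cluster
```

With modifiers treated as Extend, <woman, skin tone> stays together without
any GB10-like rule, and the degenerate <letter, skin tone> also becomes one
cluster. The last two calls show the trade-off from the thread: the local
rule needs no look-behind, but it joins <arabic letter, zwj, emoji>, which
the strict rule keeps apart.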