Are Emoji ZWJ sequences characters?

William_J_G Overington via Unicode unicode at unicode.org
Mon May 15 09:57:00 CDT 2017


I am concerned that emoji ZWJ sequences are being encoded without going through the ISO process, and that Unicode may therefore lose synchronization with ISO/IEC 10646.

I have raised this by email and a very helpful person has advised me that encoding emoji sequences does not cause Unicode and ISO/IEC 10646 to lose synchronization, because ZWJ sequences are not *characters* and so have no implications for ISO/IEC 10646, which does not define ZWJ sequences.

Now I have great respect for the person who advised me. However, I am a researcher and I opine that I need evidence.

Thus I am writing to the mailing list in the hope that there will be a discussion please.

http://www.unicode.org/reports/tr51/tr51-11.html (A proposed update document)

http://www.unicode.org/Public/emoji/5.0/emoji-zwj-sequences.txt

http://www.unicode.org/charts/PDF/U1F300.pdf

http://www.unicode.org/charts/PDF/U1F680.pdf

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even though internally they are sequences.

end quote

In emoji-zwj-sequences.txt there is the following line.

1F468 200D 1F680                            ; Emoji_ZWJ_Sequence  ; man astronaut 
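As a minimal sketch of how such a line can be read by a program, assuming Python 3 and assuming that the fields are separated by semicolons with an optional trailing comment introduced by #, the line can be split as follows. The function name parse_sequence_line is hypothetical and is used only for illustration.

```python
def parse_sequence_line(line):
    """Split one data line into (code points, type field, description)."""
    # Drop any trailing '#' comment, then split the semicolon-separated fields.
    data = line.split("#", 1)[0]
    fields = [field.strip() for field in data.split(";")]
    # The first field is a space-separated list of hexadecimal code points.
    code_points = [int(value, 16) for value in fields[0].split()]
    return code_points, fields[1], fields[2]


line = "1F468 200D 1F680                            ; Emoji_ZWJ_Sequence  ; man astronaut "
code_points, type_field, description = parse_sequence_line(line)
print([f"{cp:04X}" for cp in code_points], type_field, description)
```

The description "man astronaut" is obtained only from the data file; nothing in the three code points themselves supplies it.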

From U1F300.pdf, 1F468 is MAN

200D is ZWJ

From U1F680.pdf, 1F680 is ROCKET
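The same three names can be checked without the code charts. The following is a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name describe is hypothetical.

```python
import unicodedata

# The sequence listed in emoji-zwj-sequences.txt as "man astronaut".
sequence = "\U0001F468\u200D\U0001F680"


def describe(text):
    """Return (code point in hex, character name) for each code point in text."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


for code_point, name in describe(sequence):
    print(code_point, name)
```

This prints MAN, ZERO WIDTH JOINER and ROCKET against the three code points. None of the character names mentions an astronaut.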

The reasoning upon which I base my concern is as follows.

0063 is c

0070 is p

0074 is t

If 0063 200D 0074 is used to specifically request a ct ligature in a display of some text, then the meaning of 0063 200D 0074 is the same as the meaning of 0063 0074. Indeed, a font with a suitable OpenType table could cause a ct ligature to be displayed even if the sequence is 0063 0074 rather than 0063 200D 0074, the sequence used where the ligature glyph is specifically requested. Thus the meaning of ct is not changed by using the ZWJ character.

Now the use of the ct ligature is well-known and frequent.

Suppose now that a fontmaker is making a font of his or her own and decides to include a glyph for a pp ligature, with a swash flourish joining and going beyond the lower ends of the descenders both to the left and to the right.

The fontmaker could note that the ligature might be good in a word like copper but might look wrong in a word like happy due to the tail on the letter y clashing with the rightward side of the swash flourish. So the fontmaker encodes 0070 200D 0070 as a pp ligature but does not encode 0070 0070 as a pp ligature, so that the ligature glyph is only used when specifically requested using a ZWJ character.

However, when the ZWJ character is used, the meaning of the pp sequence is not changed from the meaning when the ZWJ character is not used.

Yet when 1F468 200D 1F680 is used, the meaning of the sequence is different from the meaning of the sequence 1F468 1F680, so much so that the meaning of 1F468 200D 1F680 is listed in a file available from the Unicode website.
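The contrast can be illustrated at the level of code points. This is a minimal sketch, assuming Python 3 and its standard unicodedata module; the helper name without_zwj is hypothetical. It shows only what the character data contains and says nothing about how any particular font renders either sequence.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def without_zwj(text):
    """Return text with every ZWJ character removed."""
    return text.replace(ZWJ, "")


ct_requested = "c" + ZWJ + "t"                  # 0063 200D 0074
astronaut = "\U0001F468" + ZWJ + "\U0001F680"   # 1F468 200D 1F680

# Removing the ZWJ from the ligature request leaves the same two letters.
print(without_zwj(ct_requested) == "ct")

# Removing the ZWJ from the emoji sequence leaves MAN followed by ROCKET.
print([unicodedata.name(ch) for ch in without_zwj(astronaut)])
```

In both cases the code points on either side of the ZWJ are unchanged. The difference is that for ct the reader's interpretation stays the same, whereas for 1F468 200D 1F680 the interpretation "man astronaut" has to be looked up in emoji-zwj-sequences.txt.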

From where do the astronaut's spacesuit and helmet come?

I am reminded that in chemistry if one mixes two chemicals, sometimes one just gets a mixture of two chemicals and sometimes one gets a chemical reaction such that another chemical is produced.

Repeating the quote from earlier in this post.

In tr51-11.html at 2.3 Emoji ZWJ Sequences

quote

To the user of such a system, these behave like single emoji characters, even though internally they are sequences.

end quote

I am concerned that in the future a user of ISO/IEC 10646 will not be able to find from ISO/IEC 10646 the meaning of an emoji that he or she observes being displayed, even if he or she is able to discover which sequence of characters is being used.

So I ask that this matter be discussed please.

William Overington

Monday 15 May 2017


