Use of tag characters in a private encoding - is it valid please?

Erik Carvalhal Miller ecm.unicode at gmail.com
Tue May 7 13:16:19 CDT 2024


On Mon, May 6, 2024 at 11:34 PM Ken Whistler via Unicode <
unicode at corp.unicode.org> wrote:
> This is erroneous. Emoji tag sequences are not "defined as part of [the
> Unicode Consortium's] plain-text standard", i.e. the Unicode Standard.
> Emoji tag sequences are defined in and by UTS #51, which is a *separate*
> specification defined on top of the Unicode Standard. Emoji tag
> sequences make use of the tag characters defined in the Unicode
> Standard, but UTS #51 is defining a protocol for their use which is
> built on top of the Unicode Standard, and not formally a part of it.

Youʼre of course correct about emoji tag sequences being defined in and by
UTS #51.  There are actually three things we call the Unicode Standard: the
nowadays epic‐length book also known as the core specification, a
collection of documents that includes that book (and excludes UTS #51), and
the intangible and utterly complex concept which that collection defines.
The Unicode Standard (the book — hence also the collection), in chapter 23,
§23.9, says, “The current conformant use of the undeprecated 96 tag
characters is specified in Unicode Technical Standard #51, ‘Unicode Emoji.’
 See ED-14a. emoji tag sequence (ETS) and Annex C, Valid Emoji Tag
Sequences in that specification.”  No, the Standard itself (book or
collection) does not define what emoji tag sequences are or which ones are
valid; but that same Standard points to UTS #51 as the definitive
specification of ETSs for “conformant use” of tag characters.  I think itʼs
quite reasonable to read that passage as acknowledging/specifying/defining
ETSsʼ place as part of the Standard (the concept), even if itʼs outsourcing
the details.  Perhaps that separation is a useful fiction, as fictions
sometimes are (“Unicode, Inc. is a person!”), and of course an abstraction
such as the Unicode Standard (the concept, of course) is a malleable
fabrication — so, I wonʼt begrudge you the useful fiction.

> The formal definition of an
> emoji_tag_sequence depends on the definition of a tag_base, which can
> either be an emoji_character or an emoji_modifier_sequence or an
> emoji_presentation_sequence. The problem, for extending any of those to
> PUA, is that all of those entity sets are very clearly and precisely
> defined by enumerations in data files associated with each version of
> the publication of UTS #51. PUA characters are not included in any of
> those lists. Therefore, a PUA character cannot be a tag_base, per UTS #51.

This is erroneous.  The Standard (book/collection) tells us quite clearly
(most extensively in chapter 23, §23.5) that private‐use charactersʼ use
may be determined by agreement and nearly all properties of such characters
may be changed or overridden as per agreement.  There is nothing in the
Standard forbidding PUA characters from being treated as emoji under a
private agreement and therefore as viable candidates for tag_base.

> It doesn't suffice to say, well, I've decided that U+F0000 is going to
> be an emoji character, so I can use it in an emoji_tag_sequence, per UTS
> #51. Rather, what one would have to do is build out a private agreement
> that 1) I am going to treat U+F0000 as an emoji, and 2) I am going to
> be using a private extension of the concept of an emoji_tag_sequence
> which allows my "emoji" U+F0000 as a tag_base. I can document that, and
> if I can get somebody else to buy into that private agreement, then by
> all means, interchange all you want.

In other words, an emoji tag sequence headed by a private‐use emoji can
indeed be plain text.  Good, thatʼs what I thought…!
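
As a rough illustration of what such a private agreement might check, here
is a minimal Python sketch.  It assumes the agreement keeps the UTS #51
shape (tag_base, then one or more tag_spec characters U+E0020..U+E007E, then
tag_end U+E007F) and merely admits private-use code points as tag_base.  The
function and constant names are hypothetical, and this is only a syntax
check under that assumed agreement, not how UTS #51 itself defines validity.

```python
# Hypothetical syntax check for a privately agreed tag sequence.
TAG_SPEC_FIRST, TAG_SPEC_LAST = 0xE0020, 0xE007E  # tag_spec characters
TAG_END = 0xE007F                                 # CANCEL TAG (tag_end)

# The three private-use ranges of the Unicode Standard.
PUA_RANGES = (
    (0xE000, 0xF8FF),       # BMP Private Use Area
    (0xF0000, 0xFFFFD),     # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),   # Supplementary Private Use Area-B
)


def is_private_use(cp: int) -> bool:
    """Return True if the code point lies in a private-use range."""
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def is_private_tag_sequence(text: str) -> bool:
    """Check base + one or more tag_spec + tag_end, with a PUA base.

    Assumption: the private agreement allows any PUA code point as base.
    """
    cps = [ord(ch) for ch in text]
    if len(cps) < 3:  # need base, at least one tag_spec, and tag_end
        return False
    base, spec, end = cps[0], cps[1:-1], cps[-1]
    return (
        is_private_use(base)
        and end == TAG_END
        and all(TAG_SPEC_FIRST <= cp <= TAG_SPEC_LAST for cp in spec)
    )
```

A receiver outside the agreement would of course apply its own rules; the
sketch only shows that the extended syntax is easy to state and to test.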

> But anybody else who happens to
> sample some of that text is under no obligation whatsoever to interpret
> any of that, or to even recognize your private extension of the concept
> of an emoji_tag_sequence to even be syntactically correct, let alone
> interpretable.

Agreed, the risk of interpretation problems outside the context of the
private agreement exists, just as with all other PUA usage.  This
particular usage does add some risk with its exotic syntax possibly
upsetting some conformance gatekeeper.  But how great is that risk in
practical terms?  I hope my example of ⟨􏿽󠁔󠁨󠁩󠁳󠀠󠁩󠁳󠀠󠁡󠀠󠁴󠁥󠁳󠁴󠀮󠁿⟩
(for which I didnʼt create a private agreement, not even in my own mind)
isnʼt crashing anyoneʼs computer…
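
For anyone who wants to build or inspect a sequence of this shape, here is a
small Python sketch.  It relies only on the fact that the tag characters
U+E0020..U+E007E mirror ASCII U+0020..U+007E at a fixed offset of 0xE0000.
The helper names and the "hello" payload are made up for illustration; the
sketch constructs its own sample rather than decoding the one above.

```python
TAG_OFFSET = 0xE0000   # tag character = ASCII code point + 0xE0000
CANCEL_TAG = 0xE007F   # terminates a tag sequence


def to_tags(ascii_text: str) -> str:
    """Map printable ASCII (U+0020..U+007E) to the matching tag characters."""
    out = []
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"not printable ASCII: {ch!r}")
        out.append(chr(ord(ch) + TAG_OFFSET))
    return "".join(out)


def from_tags(text: str) -> str:
    """Recover the ASCII payload, ignoring everything that is not a tag_spec."""
    return "".join(
        chr(ord(ch) - TAG_OFFSET)
        for ch in text
        if 0xE0020 <= ord(ch) <= 0xE007E
    )


# Hypothetical sample: a private-use base, a tagged payload, then CANCEL TAG.
sample = "\U0010FFFD" + to_tags("hello") + chr(CANCEL_TAG)
print(from_tags(sample))                      # hello
print([f"U+{ord(ch):04X}" for ch in sample])  # code points, for inspection
```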