Use of tag characters in a private encoding - is it valid please?

Ken Whistler kenwhistler at sonic.net
Mon May 6 22:28:15 CDT 2024


I'm not going to pile on about what constitutes "higher-level", but ...

On 5/6/2024 7:23 PM, Erik Carvalhal Miller via Unicode wrote:
> If emoji tag sequences, including the existing flag emoji tag 
> sequences, categorically constitute markup, then this markup format is 
> one which Unicode has paradoxically defined as part of its plain‐text 
> standard.
This is erroneous. Emoji tag sequences are not "defined as part of [the 
Unicode Consortium's] plain-text standard", i.e. the Unicode Standard. 
Emoji tag sequences are defined in and by UTS #51, which is a *separate* 
specification defined on top of the Unicode Standard. Emoji tag 
sequences make use of the tag characters defined in the Unicode 
Standard, but UTS #51 is defining a protocol for their use which is 
built on top of the Unicode Standard, and not formally a part of it.
> If Unicodeʼs RGI sequence for a Welsh‐flag emoji is plain text, then 
> an emoji tag sequence headed by a private‐use emoji can be too, as per 
> TUS §23.5 and UTS #51.  Why could it not be? 

Well, because private use is private use. The formal definition of an 
emoji_tag_sequence depends on the definition of a tag_base, which can 
either be an emoji_character or an emoji_modifier_sequence or an 
emoji_presentation_sequence. The problem, for extending any of those to 
PUA, is that all of those entity sets are very clearly and precisely 
defined by enumerations in data files associated with each version of 
the publication of UTS #51. PUA characters are not included in any of 
those lists. Therefore, a PUA character cannot be a tag_base, per UTS #51.

It doesn't suffice to say, well, I've decided that U+F0000 is going to 
be an emoji character, so I can use it in an emoji_tag_sequence, per UTS 
#51. Rather, what one would have to do is build out a private agreement 
that 1) I am going to treat U+F0000 as an emoji, and 2) I am going to 
be using a private extension of the concept of an emoji_tag_sequence 
which allows my "emoji" U+F0000 as a tag_base. I can document that, and 
if I can get somebody else to buy into that private agreement, then by 
all means, interchange all you want. But anybody else who happens to 
sample some of that text is under no obligation whatsoever to interpret 
any of that, or even to recognize your private extension of the concept 
of an emoji_tag_sequence as syntactically correct, let alone 
interpretable.
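
As a rough illustration of the two points above, here is a minimal Python sketch. It assumes the UTS #51 shape emoji_tag_sequence := tag_base tag_spec tag_end, with tag_spec drawn from U+E0020..U+E007E and tag_end being U+E007F. It is deliberately simplified: it only handles the emoji_character case of tag_base, and EMOJI_CHARACTERS is a hypothetical one-entry stand-in for the enumerations in the UTS #51 data files. The names is_emoji_tag_sequence and tag are made up for this example.

```python
# Hypothetical stand-in for the emoji_character set enumerated in the
# UTS #51 data files. Only one entry is listed; the real set is much larger.
EMOJI_CHARACTERS = {0x1F3F4}  # U+1F3F4 WAVING BLACK FLAG

TAG_SPEC_FIRST = 0xE0020  # first tag character allowed in tag_spec
TAG_SPEC_LAST = 0xE007E   # last tag character allowed in tag_spec
TAG_END = 0xE007F         # CANCEL TAG


def is_emoji_tag_sequence(text, emoji_characters=EMOJI_CHARACTERS):
    """Simplified check: tag_base must be a single listed emoji_character."""
    cps = [ord(ch) for ch in text]
    if len(cps) < 3:  # need a base, at least one tag_spec character, and tag_end
        return False
    base, spec, end = cps[0], cps[1:-1], cps[-1]
    if base not in emoji_characters:
        return False
    if end != TAG_END:
        return False
    return all(TAG_SPEC_FIRST <= cp <= TAG_SPEC_LAST for cp in spec)


def tag(ascii_text):
    """Map printable ASCII to the corresponding tag characters."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)


# The RGI Welsh flag sequence, and the same tags on a private-use base.
WALES = "\U0001F3F4" + tag("gbwls") + chr(TAG_END)
PRIVATE = "\U000F0000" + tag("gbwls") + chr(TAG_END)

# A private agreement amounts to both parties extending the set themselves.
PRIVATE_AGREEMENT = EMOJI_CHARACTERS | {0xF0000}
```

With the stock set, is_emoji_tag_sequence(WALES) is true and is_emoji_tag_sequence(PRIVATE) is false. Passing PRIVATE_AGREEMENT as the second argument makes the private sequence pass, but only for parties who have chosen to use that extended set; everyone else keeps getting false.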

--Ken



