<div dir="ltr"><div dir="ltr">On Mon, May 6, 2024 at 7:26 PM James Kass via Unicode &lt;<a href="mailto:unicode@corp.unicode.org">unicode@corp.unicode.org</a>&gt; wrote:<br></div><div class="gmail_quote"><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Asmus wrote, "If you create elaborate conventions for the use of tag <br>
characters you are creating a markup language. It's no different from <br>
re-using ASCII characters for syntax in addition to text."<br>
<br>
The question posed in the thread subject seems to have been answered by <br>
Asmus Freytag.<br>
<br>
PUA(1) + ZWJ + PUA(2) = a ligature glyph combining PUA(1) with PUA(2)<br>
- that's legit. Not higher level.<br>
<br>
PUA(1) + a string of tag characters = something completely different.<br>
- higher level. Even though this can be handled at the font/font engine <br>
level.<br>
<br>
So, if we're on the same page,<br>
1) U+10FFFD followed by the tag versions of !313125 and a CANCEL TAG.<br>
2) COMET plus CIRCUMFLEX followed by the ASCII string "!313125"<br>
... both examples represent a private agreement mark-up, and Unicode <br>
shouldn't care.<br></blockquote><div><br></div>If emoji tag sequences, including the existing flag emoji tag sequences, categorically constitute markup, then this markup format is one which Unicode has paradoxically defined as part of its plain‐text standard. If Unicodeʼs RGI sequence for a Welsh‐flag emoji is plain text, then an emoji tag sequence headed by a private‐use emoji can be too, as per TUS §23.5 and UTS #51. Why could it not be? (I did raise objections to William Overingtonʼs hypothetical constructions on the grounds that ① he additionally discussed non‐emoji tag sequences, for which the Standard makes no provision (outside the deprecated language tagging); ② the suggested tag sequences appeared to be an overly complicated way to encode private‐use characters, with no apparent benefit; and ③ the notion of making the tag characters conditionally visible as a fallback in standard reading mode is nonconformant. But those issues do not affect the conformance of the basic idea of an emoji tag sequence headed by a PUA emoji.)<br><br>There are some significant distinctions among the hypothetical examples of peculiar character sequences which you (James Kass) have been examining:<br><br> • ⟨🆔⟩ followed by a sequence of six tag characters from the range U+E0020‥U+E007E is almost a well‐formed emoji tag sequence: it needs U+E007F CANCEL TAG appended to be well‐formed. But even with that addition itʼs currently invalid, as per UTS #51.<br> • As I have argued, U+10FFFD followed by the tag analogues of ⟨!313125⟩ and then a CANCEL TAG appears to be valid, if U+10FFFD is agreed to be an emoji and the entire sequence is meant to be interpreted as an emoji.<br> • ⟨☄^!313125⟩ is valid Unicode, such as it is. 
If in normal reading mode itʼs meant to be replaced by a different comet or an aardvark or a Klingon symbol for empire or anything other than a representation of the characters ⟨☄^!313125⟩, then the intended interpretation is not valid as Unicode plain text, though it may be perfectly valid markup of some sort or another beyond Unicodeʼs concern. If the idea is for a font to make one of those substitutions, then such a font is not Unicode‐conformant.<br><div> • ⟨&lt;img src="aardvark.jpg"&gt;⟩ is similar to ⟨☄^!313125⟩: though it is not Unicode‐conformant if it is meant to be interpreted in normal reading as anything other than that very sequence of characters, in the HTML context you cited it can serve as perfectly good markup. HTML is not intended to be processed at the font level; weʼre likely to see this sequence rendered as an aardvark image only when itʼs run through a Web browser or similar application. Plain text, on the other hand, is portable: generally speaking, with the proper font support, plain text can be used anywhere with a consistent interpretation. Yes, in various contexts you may run into issues such as markup interpretation, restrictions on allowable characters, and text‐length limits. But broadly, in plain text a ⟨rose⟩ is a ⟨rose⟩ is a ⟨rose⟩, wherever you go; whereas if youʼre finding that a ⟨🌹⟩ is a ⟨🌹⟩, youʼre probably dealing with not‐so‐plain text, even if the source code is plain text. </div></div></div>
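<div><br>To make the well‐formedness distinction among the examples above concrete, here is a minimal Python sketch. It assumes the shape given in UTS #51 for an emoji tag sequence: a base, then one or more tag characters from U+E0020‥U+E007E, then U+E007F CANCEL TAG. The names to_tag_spec and is_well_formed_tag_sequence are hypothetical, invented for this illustration. As a simplification, the sketch treats the base as a single code point and does not check whether the base is an emoji, nor whether the sequence is valid or RGI; it tests well‐formedness only.</div>
<pre>
```python
# Sketch only: checks the *shape* of an emoji tag sequence, not its validity.
TAG_SPEC = range(0xE0020, 0xE007F)    # tag characters U+E0020..U+E007E
TAGS_BLOCK = range(0xE0000, 0xE0080)  # the whole Tags block
CANCEL_TAG = 0xE007F
TAG_OFFSET = 0xE0000                  # ASCII code point + offset = tag analogue


def to_tag_spec(ascii_text: str) -> str:
    """Map printable ASCII (U+0020..U+007E) to its tag-character analogues."""
    tags = []
    for ch in ascii_text:
        if ord(ch) not in range(0x20, 0x7F):
            raise ValueError(f"no tag analogue for {ch!r}")
        tags.append(chr(ord(ch) + TAG_OFFSET))
    return "".join(tags)


def is_well_formed_tag_sequence(text: str) -> bool:
    """True if text is: one base code point, 1+ tag characters, CANCEL TAG.

    Simplifying assumption: the base is a single code point; whether it is
    an emoji (by property or by private agreement) is not checked here.
    """
    if 3 > len(text):
        return False
    base, spec, end = text[0], text[1:-1], text[-1]
    if ord(base) in TAGS_BLOCK:       # the base must not itself be a tag
        return False
    if ord(end) != CANCEL_TAG:        # the sequence must be terminated
        return False
    return all(ord(ch) in TAG_SPEC for ch in spec)


# The sequences discussed in this thread:
welsh_flag = "\U0001F3F4" + to_tag_spec("gbwls") + chr(CANCEL_TAG)
id_unterminated = "\U0001F194" + to_tag_spec("313125")
pua_sequence = "\U0010FFFD" + to_tag_spec("!313125") + chr(CANCEL_TAG)
comet_ascii = "\u2604^!313125"
```
</pre>
<div>Under these assumptions, welsh_flag and pua_sequence pass the check, id_unterminated fails until a CANCEL TAG is appended, and comet_ascii fails because it contains no tag characters at all. Passing says nothing about validity: that still depends on UTS #51 and, for the PUA case, on the private agreement.</div>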