Use of tag characters in a private encoding - is it valid please?

Peter Constable pgcon6 at msn.com
Mon May 6 21:01:56 CDT 2024


> It's when PUA (or even non-PUA) characters are modified by tag characters as part of a private agreement that the scheme becomes higher level.  

If it's a tag sequence scheme defined by Unicode, then it's not higher-level. But if it's a scheme defined elsewhere, not by Unicode, then I agree that it would become higher-level.


Peter

-----Original Message-----
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of James Kass via Unicode
Sent: Monday, May 6, 2024 4:22 PM
To: unicode at corp.unicode.org
Subject: Re: Use of tag characters in a private encoding - is it valid please?



On 2024-05-06 7:29 PM, Peter Constable via Unicode wrote:
> Perhaps what Asmus was reacting to was the mention of "higher-level". I understand you to mean _defined externally to Unicode_. But I think more common use of that term would be in relation to some _application of Unicode text encoding_ involving more than plain text. So, in relation to Unicode PUA, a private agreement on semantics of PUA code points would comprise a protocol, but not a _higher-level_ protocol.
>
My phrasing may have been inept.  For single PUA characters, or even strings of PUA characters, private agreements are not higher level because PUA characters are supposed to be defined by private agreement.

It's when PUA (or even non-PUA) characters are modified by tag characters as part of a private agreement that the scheme becomes higher level.  As Asmus pointed out, this is essentially a private agreement for mark-up.

Asmus wrote, "If you create elaborate conventions for the use of tag characters you are creating a markup language. It's no different from re-using ASCII characters for syntax in addition to text."

The question posed in the thread subject seems to have been answered by Asmus Freytag.

PUA(1) + ZWJ + PUA(2) = a ligature glyph combining PUA(1) with PUA(2)
- that's legit.  Not higher level.

PUA(1) + a string of tag characters = something completely different.
- higher level.  Even though this can be handled at the font/font engine level.

So, if we're on the same page,
1)  U+10FFFD followed by the tag versions of !313125 and a CANCEL TAG.
2)  COMET plus CIRCUMFLEX followed by the ASCII string "!313125"
... both examples represent a private agreement mark-up, and Unicode shouldn't care.
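
For concreteness, the two sequences above could be built roughly as follows. This is only a sketch: the helper name to_tag_chars is made up for illustration, and reading "CIRCUMFLEX" as U+005E CIRCUMFLEX ACCENT (rather than a combining circumflex) is an assumption. The only facts relied on are that the tag characters U+E0020..U+E007E mirror ASCII U+0020..U+007E at an offset of 0xE0000, and that U+E007F is CANCEL TAG.

```python
TAG_OFFSET = 0xE0000          # tag characters sit at U+E0000 + ASCII code point
CANCEL_TAG = "\U000E007F"     # U+E007F CANCEL TAG


def to_tag_chars(ascii_text: str) -> str:
    """Map printable ASCII (U+0020..U+007E) to the matching tag characters.

    Hypothetical helper, not part of any standard library.
    """
    for ch in ascii_text:
        if not 0x20 <= ord(ch) <= 0x7E:
            raise ValueError(f"no tag character for {ch!r}")
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in ascii_text)


# Example 1: a PUA character, the tag versions of "!313125", then CANCEL TAG.
example_1 = "\U0010FFFD" + to_tag_chars("!313125") + CANCEL_TAG

# Example 2: COMET (U+2604), a circumflex (assumed U+005E), then plain ASCII.
example_2 = "\u2604" + "^" + "!313125"

# Show the code points of each sequence.
for label, text in (("1", example_1), ("2", example_2)):
    print(label, " ".join(f"U+{ord(ch):04X}" for ch in text))
```

Either way the payload "!313125" only means something to parties who share the private agreement; the encoding of the individual code points is ordinary Unicode in both cases.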



