Unicode "no-op" Character?
Sławomir Osipiuk via Unicode
unicode at unicode.org
Wed Jul 3 10:44:30 CDT 2019
I’m frustrated at how badly you seem to be missing the point. There is nothing impossible nor self-contradictory here. There is only the matter that Unicode requires all scalar values to be preserved during interchange. This is in many ways a good idea, and I don’t expect it to change, but something else would be possible if this requirement were explicitly dropped for a well-defined small subset of characters (even just one character). A modern-day SYN.
Let’s say it’s U+000F. The standard takes my proposal and makes it a discardable, null-displayable character. What does this mean?
U+000F may appear in any text. It has no (external) semantic value. But it may appear. It may appear a lot.
Display routines (which are already dealing with combining, ligaturing, non-/joiners, variations, initial/medial/final forms) understand that U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move to the next character. Simple.
Security gateways filter it out completely, as a matter of best practice and security-in-depth.
A process, let’s call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it’s to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a UTF-8 encoded database field. Encoding isn’t a problem. The database is happy.
Now Process X runs. Process X is meant to work with Process W and it’s well aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a web search for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.
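To make the W/X agreement concrete, here is a minimal sketch of the "packetize" case, assuming the hypothetical no-op is U+000F as above. The function names and the fixed packet size are invented for illustration; nothing here is part of any standard.

```python
NOOP = "\u000f"  # hypothetical discardable character from this proposal


def process_w_packetize(text: str, size: int) -> bytes:
    """Process W: mark a packet boundary after every `size` characters."""
    # Any U+000F already present carries no meaning for W, so drop it first.
    text = text.replace(NOOP, "")
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    # Store as UTF-8, as W does with its database field.
    return NOOP.join(chunks).encode("utf-8")


def process_x_read_packets(stored: bytes) -> list[str]:
    """Process X: split on U+000F, per its private agreement with W."""
    return stored.decode("utf-8").split(NOOP)


stored = process_w_packetize("HELLO WORLD", 4)
packets = process_x_read_packets(stored)
print(packets)
```

The boundary information travels in-band, but only X knows to look for it.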
But now we have Process Y. Process Y doesn’t care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn’t interpret U+000F. Why would it? It has no semantic value to Process Y.
Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They’re just taking up space. They’re meaningless to Y. It compiles the Morse code sequence into an audio file.
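A rough sketch of what Process Y does with the character, again assuming U+000F. The Morse table is deliberately partial, and the abbreviation and audio steps are left out; the only point is that Y discards the no-op before doing its own work.

```python
NOOP = "\u000f"  # hypothetical discardable character from this proposal

# Partial Morse table, just enough for the example below (illustrative only).
MORSE = {"S": "...", "O": "---", "E": ".", "T": "-"}


def process_y_to_morse(text: str) -> str:
    """Process Y: discard every U+000F, then translate what remains."""
    cleaned = text.replace(NOOP, "")
    # Characters missing from the partial table are skipped in this sketch.
    return " ".join(MORSE[ch] for ch in cleaned.upper() if ch in MORSE)


print(process_y_to_morse("S\u000fO\u000fS"))
```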
But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It’s not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It’s allowed to do that.
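Process Z's "clear, then re-mark" behaviour might look like the sketch below. The consonant set (including whether "y" counts), the choice to put the marker before each cluster, and the function name are all assumptions made for the example.

```python
import re

NOOP = "\u000f"  # hypothetical discardable character from this proposal

# Assumed definition of "Latin consonant" for this sketch; "y" is included.
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
CLUSTER = re.compile(f"[{CONSONANTS}]{{5}}", re.IGNORECASE)


def process_z_mark(text: str) -> bytes:
    """Process Z: drop any existing U+000F, then mark five-consonant runs."""
    cleaned = text.replace(NOOP, "")  # Z owes nothing to W's markers
    # Insert Z's own marker in front of each non-overlapping match.
    marked = CLUSTER.sub(lambda m: NOOP + m.group(0), cleaned)
    return marked.encode("utf-16-le")  # Z writes UTF-16LE


out = process_z_mark("\u000flengths")
print(out.decode("utf-16-le").replace(NOOP, "|"))
```

W's marker at the start of the string is gone, and Z's marker now sits in front of "ngths". The meaning was overridden, not overloaded.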
Nothing impossible happened here. Let’s summarize:
Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.
Process Y ignored U+000F completely because it assigned no meaning to it.
Process Z assigned a completely new meaning to U+000F. That’s permitted because U+000F is special and is guaranteed to have no semantics without private agreement and doesn’t need to be preserved.
There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That’s what makes it special.
I don’t expect to see any of this in official Unicode. But I take exception to the idea that I’m suggesting something impossible.
From: Philippe Verdy [mailto:verdy_p at wanadoo.fr]
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?
Your goal is **impossible** to reach with Unicode. Assume such character is "added" to the UCS, then it can appear in the text. Your goal being that it should be "warrantied" not to be used in any text, means that your "character" cannot be encoded at all.