Unicode "no-op" Character?

Mark E. Shoulson via Unicode unicode at unicode.org
Wed Jul 3 16:44:26 CDT 2019


Um... How could you be sure that Process X would get the no-ops that 
Process W wrote?  After all, it's *discardable*, like you said, and the 
database programs and libraries aren't in on the secret.  The database 
API functions might well strip it out, because it carries no meaning to 
them. Unless you can count on _certain_ programs not discarding it, and 
then you'd need either specialty libraries or some kind of registry or 
terminology for "this program does NOT strip no-ops" vs ones that do... 
But then they wouldn't be discardable, would they?  Not by 
non-discarding programs.  Which would have to have ways to pass them 
around between themselves.

Moreover, as you say, what about when Process Z (or its companions) 
comes along and is using THE SAME MECHANISM for something utterly 
different?  How does it know that Process W wasn't writing no-ops for 
it, but was writing them for Process X?  And of course, Z will trash 
them and insert its own there, and when Process X comes to read the 
string, W's no-ops won't be there. You'd need to make sure that NOBODY is allowed to touch 
the string between *pairs* of generators and consumers of no-ops, 
specifically designated for each other.
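
To make that failure mode concrete, here is a minimal Python sketch. It is 
not from the thread: the function names and W's word-marking scheme are 
invented for illustration, and Z's behavior is only a rough rendering of 
the five-consonant example quoted below.

```python
import re

NOOP = "\u000f"  # the control character proposed as a "no-op" in this thread


def process_w(text: str) -> str:
    """Hypothetical producer: put a NOOP in front of every word."""
    return " ".join(NOOP + word for word in text.split(" "))


def process_x(text: str) -> list[str]:
    """Hypothetical consumer: recover the words that W marked."""
    return [chunk.strip() for chunk in text.split(NOOP) if chunk.strip()]


def process_z(text: str) -> str:
    """Hypothetical third party: discard every NOOP it finds, then insert
    its own in front of each run of five Latin consonants."""
    cleaned = text.replace(NOOP, "")
    return re.sub(
        r"[b-df-hj-np-tv-z]{5}",
        lambda m: NOOP + m.group(0),
        cleaned,
        flags=re.IGNORECASE,
    )


original = "the catchphrase lengths"
stored = process_w(original)

# With nobody in between, X sees exactly what W marked.
print(process_x(stored))

# Once Z has rewritten the stored string, X reads Z's markers as if they
# were W's and gets chunks that W never intended.
print(process_x(process_z(stored)))
```

Nothing in the string tells X whose markers it is looking at, which is 
the pairing problem described above.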

Yes, this is about consensual acts between responsible processes W and 
X, but that's exactly what the PUA is for: being assigned meaning 
between consenting processes. And they are not discardable by 
non-consenting processes, precisely because they mean something to 
someone.  If your no-ops carry meaning, they are going to need to be 
preserved and passed around and not thrown away.  If they carry no 
meaning, why are you dealing with them?  Yes, PUA characters are 
annoying and break up grapheme clusters and stuff.  But they're the only 
way to do what you're trying to do.
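
As a rough illustration of the difference, here is a small Python sketch. 
The sanitizer is hypothetical and stands in for any non-consenting 
process that drops control characters; U+E000 is an arbitrary choice of 
PUA code point. Only the `unicodedata.category` values are facts of the 
standard: U+000F is a control character (Cc), while PUA code points are 
private use (Co).

```python
import unicodedata

NOOP = "\u000f"      # control character, General_Category Cc
PUA_MARK = "\ue000"  # arbitrary Private Use Area code point, General_Category Co


def drop_controls(text: str) -> str:
    """Hypothetical sanitizer: remove control characters other than
    tab, line feed and carriage return, and keep everything else."""
    keep = {"\t", "\n", "\r"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


print(unicodedata.category(NOOP))      # Cc
print(unicodedata.category(PUA_MARK))  # Co

marked = "pizza" + NOOP + "topping" + PUA_MARK + "order"
print(repr(drop_controls(marked)))     # the NOOP is gone, the PUA mark survives
```

A process that has no agreement about U+E000 still has no licence to 
throw it away, so the mark reaches whoever does have the agreement.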

~mark

On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote:
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string 
> it received, or built, or a user entered via keyboard. Maybe it’s to 
> packetize. Maybe to mark every word that is an anagram of the name of 
> a famous 19th-century painter, or that represents a pizza topping. 
> Maybe something else. This is a versatile character. Process W is done 
> adding U+000F to the string. It stores it in a UTF-8 encoded database 
> field. Encoding isn’t a problem. The database is happy.
>
> Now Process X runs. Process X is meant to work with Process W and it’s 
> well aware of how U+000F is used. It reads the string from the 
> database. It sees U+000F and interprets it. It chops the string into 
> packets, or does a websearch for each famous painter, or it orders 
> pizza. The private meaning of U+000F is known to both Process X and 
> Process W. There is useful information encoded in-band, within a 
> limited private context.
>
> But now we have Process Y. Process Y doesn’t care about packets or 
> painters or pizza. Process Y runs outside of the private context that 
> X and W had. Process Y translates strings into Morse code for 
> transmission. As part of that, it replaces common words with 
> abbreviations. Process Y doesn’t interpret U+000F. Why would it? It 
> has no semantic value to Process Y.
>
> Process Y reads the string from the database. Internally, it clears 
> all instances of U+000F from the string. They’re just taking up space. 
> They’re meaningless to Y. It compiles the Morse code sequence into an 
> audio file.
>
> But now we have Process Z. Process Z wants to take a string and mark 
> every instance of five contiguous Latin consonants. It scrapes the 
> database looking for text strings. It finds the string Process W 
> created and marked. Z has no obligation to W. It’s not part of that 
> private context. Process Z clears all instances of U+000F it finds, 
> then inserts its own wherever it finds five-consonant clusters. It 
> stores its results in a UTF-16LE text file. It’s allowed to do that.
>
> Nothing impossible happened here. Let’s summarize:
>
> Processes W and X established a private meaning for U+000F by 
> agreement and interacted based on that meaning.
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
> Process Z assigned a completely new meaning to U+000F. That’s 
> permitted because U+000F is special and is guaranteed to have no 
> semantics without private agreement and doesn’t need to be preserved.
>
> There is no need to escape anything. Escaping is used when a character 
> must have more than one meaning (i.e. it is overloaded, as when it is 
> both text and markup). U+000F only gets one meaning in any context. In 
> a new context, the meaning gets overridden, not overloaded. That’s 
> what makes it special.
>
> I don’t expect to see any of this in official Unicode. But I take 
> exception to the idea that I’m suggesting something impossible.
>
> *From:*Philippe Verdy [mailto:verdy_p at wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a 
> character is "added" to the UCS; then it can appear in the text. Your 
> goal, that it should be "warrantied" not to be used in any text, 
> means that your "character" cannot be encoded at all.
>
