Unicode "no-op" Character?
Sławomir Osipiuk via Unicode
unicode at unicode.org
Sat Jul 13 00:36:15 CDT 2019
Hello again everyone,
Though I initially took the shoo-away, there have been some comments
made since then that I feel compelled to rebut. To avoid spamming the
list, I’ve combined my responses into a single message.
Before that, I will say, again, for the record: I know this NOOP idea
is unlikely to ever happen. Certainly not with the responses I've
gotten. I haven't submitted it, nor even looked into how to. I know it
would be rejected. This is a thought experiment, nothing more. If that
doesn't interest you, please disregard this message.
And again, the hypothetical NOOP is a character whose canonical
equivalent is the absence of a character. The logical consequences of
that statement apply fully.
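The core definition is small enough to sketch. Below, a minimal illustration in Python, using U+000F purely as a stand-in because it appears later in this thread — no such NOOP code point actually exists in Unicode:

```python
# Hypothetical: U+000F stands in for the proposed NOOP code point
# (borrowed from the thread's own example; this is not a real assignment).
NOOP = "\u000f"

def normalize_noop(s: str) -> str:
    """Canonical equivalence to 'the absence of a character': a
    conforming normalizer simply deletes every NOOP from the string."""
    return s.replace(NOOP, "")
```

Under this definition, `normalize_noop("a" + NOOP + "b")` and `"ab"` are the same string for all canonical-equivalence purposes.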
On Wed, Jul 3, 2019 at 8:00 PM Shawn Steele via Unicode
<unicode at unicode.org> wrote:
> Even more complicated is that, as pointed out by others, it's pretty much impossible to say "these n codepoints should be ignored and have no meaning" because some process would try to use codepoints 1-3 for some private meaning. Another would use codepoint 1 for their own thing, and there'd be a conflict.
This is so utterly, completely, and severely missing the point I'm
starting to feel like a madman screaming to the heavens, "Why can't
they just understand?!"
Yes, a different process will have a different private meaning for the
codepoint. That is not a bug, it is a feature. A conflict is always
resolved by the current process saying, "I'm holding the string now.
The old NOOPs are gone, canonically decomposed to nothing. The new
ones mean what I want them to mean, as long as I or my buddies hold
the string. If you didn't want that, you shouldn't have given the
string to me!" This conflict-resolution mechanism is the special
sauce. If a process needs a private marker that will be preserved in
interchange, there are plenty of PUA characters to use, and even a
couple of private control characters.
> I also think that the conversation has pretty much proven that such a system is mathematically impossible. (You can't have a "private" no-meaning codepoint that won't conflict with other "private" uses in a public space).
No such thing has been proven in the slightest. Any conflict is
resolved, in the default case, by normalizing all NOOPs to nothing.
On Wed, Jul 3, 2019 at 5:46 PM Mark E. Shoulson via Unicode
<unicode at unicode.org> wrote:
> Um... How could you be sure that process X would get the no-ops that process W wrote? After all, it's *discardable*, like you said, and the database programs and libraries aren't in on the secret.
Yes, there is a requirement that W and X communicate via some
"NOOP-preserving path" (call it a NOOPPP). Such paths would generally
be very short and direct, because NOOPs are intended to be ephemeral,
not archival! They wouldn't be hard to come by. Memory mappings or
pipes. Direct inter-process comms. Anything that operates at the
byte level. Even simple persisting mechanisms like file storage or
databases can preserve NOOP by doing... nothing. "Discardable" doesn't
mean it must be discarded, merely that it can be. Where there are no
security implications or other need, strings containing NOOP can
simply be passed through and stored as-is. Where any interface,
library, or process does not preserve NOOP, it cannot be part of a
NOOPPP. Tough luck.
> Moreover, as you say, what about when Process Z (or its companions) comes along and is using THE SAME MECHANISM for something utterly different? How does it know that process W wasn't writing no-ops for it, but was writing them for Process X?
It is the responsibility of Process Z (and any process that interprets
NOOPs non-trivially) to be aware of the context/source of what it's
receiving. Prior agreement or advertised contract.
On Wed, Jul 3, 2019 at 2:06 PM Rebecca Bettencourt <beckiergb at gmail.com> wrote:
> And the database driver filters out the U+000F completely as a matter of best practice and security-in-depth.
I'm struggling to see the security implication of "store this string,
verbatim, in your regular VARCHAR (or whatever) text field". I can
store the string "DROP TABLE [STUDENTS];" in a text field and unless
the database is horribly broken it will store that without issue. A
database could strip NOOP out of text fields and still claim to be
Unicode conformant. But I wonder why it would bother to do that. And
even then, you could just store the string in a VARBINARY field or
whatever just accepts bytes.
> You can't say "this character should be ignored everywhere" and "this character should be preserved everywhere" at the same time. That's the contradiction.
I have not said "this character should be preserved everywhere". That
statement is completely false. Unfortunately, that means what I said
is still not being understood at all. Forgive me for being frustrated.
Finally, a general comment:
I think people are getting hung-up on this idea because they’re still
thinking in terms of what is being guaranteed, while this is
explicitly about an inversion of that concept. Not a guarantee, but a
disclaimer. I called it an “ephemeral private sentinel” because that
name captures what it is. It’s not for archiving or interchange,
except for extremely short and direct cases under special conditions.
Most objections I’ve gotten so far arise out of misunderstanding and
attempts to force normal character behaviour on it. I can take
criticism, but not when it’s based on a completely false premise.
Define a character that is canonically equivalent to the absence of a
character. Make it so a conforming receiving process is able to purge it
whenever convenient. That's not hard to implement, especially in
relation to other existing requirements. But would it be useful? I
claim it would be very useful indeed. Many things that can be done
with ordinary characters will not be possible with this one. That's
fine. Other things will be possible.
This idea isn’t really dissimilar to the original intended meanings of
SYN or NUL or DEL, or for that matter to Unicode noncharacters. In
fact if the standard had enforced purging noncharacters during
interchange (instead of vacillating about their illegality before
currently recommending they be preserved or at least U+FFFDed) we’d
already be 99% of the way to what I suggested. The ideal opportunity
to define this behaviour (for a single code point or a set) was almost
three decades ago, but it definitely could have been done, and it
would not have been expensive.
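For comparison, here is a rough sketch of the difference between today's noncharacter guidance and the purge-on-interchange behaviour wished for above (the noncharacter test itself reflects the standard's actual definition; the two handler functions are hypothetical illustrations):

```python
def is_noncharacter(cp: int) -> bool:
    # The 66 Unicode noncharacters: U+FDD0..U+FDEF plus the last
    # two code points of every plane (U+xxFFFE and U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def current_recommendation(s: str) -> str:
    # Today's guidance: preserve noncharacters in interchange, or
    # replace them with U+FFFD at a hardened boundary.
    return "".join("\ufffd" if is_noncharacter(ord(c)) else c for c in s)

def hypothetical_purge(s: str) -> str:
    # The behaviour wished for here: purge them during interchange,
    # which would have been most of the way to a NOOP.
    return "".join(c for c in s if not is_noncharacter(ord(c)))
```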
I just hold onto this idea for that day I get a time machine.