Unicode "no-op" Character?
Mark Davis ☕️ via Unicode
unicode at unicode.org
Wed Jul 3 12:32:36 CDT 2019
Your goal is not achievable. We can't wave a magic wand and have all
processes everywhere suddenly (or even within decades) ignore U+000F in all
processing. That will not happen.
This thread is pointless and should be terminated.
Mark
On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode <
unicode at unicode.org> wrote:
> I’m frustrated at how badly you seem to be missing the point. There is
> nothing impossible nor self-contradictory here. There is only the matter
> that Unicode requires all scalar values to be preserved during interchange.
> This is in many ways a good idea, and I don’t expect it to change, but
> something else would be possible if this requirement were explicitly
> dropped for a well-defined small subset of characters (even just one
> character). A modern-day SYN.
>
> Let’s say it’s U+000F. The standard takes my proposal and makes it a
> discardable, null-displayable character. What does this mean?
>
> U+000F may appear in any text. It has no (external) semantic value. But it
> may appear. It may appear a lot.
>
> Display routines (which are already dealing with combining, ligaturing,
> non-/joiners, variations, initial/medial/final forms) understand that
> U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move
> to the next character. Simple.
>
> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. Maybe it’s to
> packetize. Maybe to mark every word that is an anagram of the name of a
> famous 19th-century painter, or that represents a pizza topping. Maybe
> something else. This is a versatile character. Process W is done adding
> U+000F to the string. It stores it in a UTF-8 encoded database field.
> Encoding isn’t a problem. The database is happy.
>
> Now Process X runs. Process X is meant to work with Process W and it’s
> well aware of how U+000F is used. It reads the string from the database. It
> sees U+000F and interprets it. It chops the string into packets, or does a
> websearch for each famous painter, or it orders pizza. The private meaning
> of U+000F is known to both Process X and Process W. There is useful
> information encoded in-band, within a limited private context.
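
A minimal Python sketch of the W/X exchange described above, offered as an
illustration and not as anything from the proposal itself. The names NOOP,
mark_packets, and read_packets, and the fixed packet size, are all
hypothetical. Having W discard any U+000F already present is an assumption
drawn from the "overridden, not overloaded" rule stated later in the message.

```python
NOOP = "\u000f"  # the hypothetical discardable character (name is illustrative)


def mark_packets(text, size):
    """Process W (hypothetical): mark a packet boundary every `size` characters."""
    # Assumption: pre-existing U+000F carries no meaning for W, so drop it first.
    clean = text.replace(NOOP, "")
    chunks = [clean[i:i + size] for i in range(0, len(clean), size)]
    return NOOP.join(chunks)


def read_packets(marked):
    """Process X (hypothetical): split on U+000F, per its private agreement with W."""
    return marked.split(NOOP)


marked = mark_packets("pizza topping", 5)
# U+000F is an ordinary scalar value, so a UTF-8 round trip preserves it.
stored = marked.encode("utf-8")
packets = read_packets(stored.decode("utf-8"))
```

The marker only carries meaning because both functions agree on it; nothing
in the encoding step treats it specially.
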
>
> But now we have Process Y. Process Y doesn’t care about packets or
> painters or pizza. Process Y runs outside of the private context that X and
> W had. Process Y translates strings into Morse code for transmission. As
> part of that, it replaces common words with abbreviations. Process Y
> doesn’t interpret U+000F. Why would it? It has no semantic value to Process
> Y.
>
> Process Y reads the string from the database. Internally, it clears all
> instances of U+000F from the string. They’re just taking up space. They’re
> meaningless to Y. It compiles the Morse code sequence into an audio file.
>
> But now we have Process Z. Process Z wants to take a string and mark every
> instance of five contiguous Latin consonants. It scrapes the database
> looking for text strings. It finds the string Process W created and marked.
> Z has no obligation to W. It’s not part of that private context. Process Z
> clears all instances of U+000F it finds, then inserts its own wherever it
> finds five-consonant clusters. It stores its results in a UTF-16LE text
> file. It’s allowed to do that.
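
The behaviour of Processes Y and Z described above could be sketched in
Python roughly as follows. This is an illustration under stated assumptions,
not part of the proposal: the function names are hypothetical, "Latin
consonants" is narrowed to the basic ASCII letters other than a, e, i, o, u
(so "y" counts as a consonant), and Z is assumed to place its marker
immediately before each cluster.

```python
import re

NOOP = "\u000f"  # the hypothetical discardable character (name is illustrative)

# Assumption: basic Latin letters only, with "y" treated as a consonant.
FIVE_CONSONANTS = re.compile(r"[b-df-hj-np-tv-z]{5}", re.IGNORECASE)


def strip_noop(text):
    """Process Y (hypothetical): U+000F means nothing here, so discard it."""
    return text.replace(NOOP, "")


def mark_clusters(text):
    """Process Z (hypothetical): clear old markers, then insert its own."""
    clean = strip_noop(text)
    # Put U+000F directly before each non-overlapping run of five consonants.
    return FIVE_CONSONANTS.sub(lambda m: NOOP + m.group(0), clean)


from_w = "stren\x0fgths are long"      # text carrying markers left by W
for_morse = strip_noop(from_w)         # what Y would work with
from_z = mark_clusters(from_w)         # what Z would store
as_utf16 = from_z.encode("utf-16-le")  # Z's output encoding in the example
```

Clearing the old markers first matters: W's marker in the sample input sits
in the middle of a five-consonant run, which Z would otherwise fail to match.
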
>
> Nothing impossible happened here. Let’s summarize:
>
> Processes W and X established a private meaning for U+000F by agreement
> and interacted based on that meaning.
>
> Process Y ignored U+000F completely because it assigned no meaning to it.
>
> Process Z assigned a completely new meaning to U+000F. That’s permitted
> because U+000F is special and is guaranteed to have no semantics without
> private agreement and doesn’t need to be preserved.
>
> There is no need to escape anything. Escaping is used when a character
> must have more than one meaning (i.e. it is overloaded, as when it is both
> text and markup). U+000F only gets one meaning in any context. In a new
> context, the meaning gets overridden, not overloaded. That’s what makes it
> special.
>
> I don’t expect to see any of this in official Unicode. But I take
> exception to the idea that I’m suggesting something impossible.
>
> *From:* Philippe Verdy [mailto:verdy_p at wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in text. Since your goal is that
> it be "guaranteed" never to be used in any text, your "character" cannot
> be encoded at all.
>