Unicode "no-op" Character?

Shawn Steele via Unicode unicode at unicode.org
Wed Jul 3 18:58:37 CDT 2019


I think you're overstating my concern :)

I meant that those things tend to be particular to a certain context and often aren't interesting for interchange.  A text editor might find it convenient to place word boundaries in the middle of something another part of the system thinks is a single unit to be rendered.  At the same time, a rendering engine might find it interesting that there's an "ff" pair and want to mark it to be shown as a ligature, though the text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual processes find interesting.  It's not useful to mark those for interchange, as the text editor's word-breaking marks would interfere with the graphics engine's glyph-breaking marks.  Not to mention the transmission buffer size marks originally mentioned, which could be anywhere.

The "right" thing to do here is to use an internal higher level mechanism to keep track of these things however the component needs.  That can even be interchanged with another component designed to the same principles, via mechanisms like the PUA.  However, those components can't expect their private mechanisms are useful or harmless to other processes.  

Even more complicated is that, as pointed out by others, it's pretty much impossible to say "these n codepoints should be ignored and have no meaning", because one process would claim codepoints 1-3 for some private meaning, another would use codepoint 1 for its own thing, and there'd be a conflict.

As a thought experiment, I think it's certainly fair to ask the question "could such a mechanism be useful?"  It's an intriguing thought, and a reasonable hypothesis that this kind of system could be privately useful to an application.  I also think that the conversation has pretty much proven that such a system is mathematically impossible.  (You can't have a "private" no-meaning codepoint that won't conflict with other "private" uses in a public space.)
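To make that conflict concrete, here's a tiny sketch (Python; the two components and the meanings they attach to the codepoint are entirely hypothetical, just for illustration).  Both independently pick the same PUA codepoint for unrelated private purposes, and neither can safely interpret text the other has touched:

    # Illustrative only: two hypothetical components each claim the same
    # "private" codepoint (U+E000, from the PUA) for unrelated meanings.
    MARK = "\uE000"

    def editor_mark_word_breaks(text):
        # Hypothetical text editor: MARK means "word boundary here".
        return text.replace(" ", " " + MARK)

    def renderer_mark_ligatures(text):
        # Hypothetical rendering engine: MARK means "ligate the pair before me".
        return text.replace("ff", "ff" + MARK)

    s = renderer_mark_ligatures(editor_mark_word_breaks("an off day"))
    print(repr(s))  # 'an \ue000off\ue000 \ue000day'
    # The marks are indistinguishable: the editor now sees a spurious
    # "word boundary" inside "off", and the renderer can't tell which
    # marks are its own.

The same collision happens no matter which codepoint gets set aside: once two processes meet in a public interchange space, their private meanings for it collide.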

It might be worth noting that this kind of thing used to be fairly common in early computing.  Word processors would inject a "CTRL-I" token to toggle italics on or off.  Old printers used to use sequences to mark the start of bold or italic or underlined text or whatever.  Those were private and pseudo-private mechanisms that were used internally and/or documented for others that wanted to interoperate with their systems.  (The printer folks would tell the word processors how to make italics happen, then other printer folks would use the same or similar mechanisms for compatibility - except for the dude that didn't get the memo and made their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup and, instead, to be "plain text," leaving other interesting metadata - whether word breaking, sentence parsing, formatting, buffer sizing, or whatever - to other higher-level protocols.

-Shawn

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Richard Wordingham via Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: unicode at unicode.org
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400
"Mark E. Shoulson via Unicode" <unicode at unicode.org> wrote:

> I think the idea being considered at the outset was not so complex as 
> these (and indeed, the point of the character was to avoid making 
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason for separating a base character and a combining mark.  I was refuting that notion.  Natural text boundaries can get very messy - some languages have word boundaries that can be *within* an indecomposable combining mark.

Richard.


