Unicode "no-op" Character?

Philippe Verdy via Unicode unicode at unicode.org
Sat Jun 29 14:46:57 CDT 2019


If you want to "packetize" arbitrarily long Unicode text, you don't need
any new magic character. Just prefix each packet with a base character,
used as a syntactic delimiter, that does not combine with what follows
under any normalization form.

There's a fine character for that: the TAB control. Except that during
transmission it may turn into a SPACE, which would combine. (The same will
happen with "=", which can combine with a combining slash: "=" followed by
U+0338 composes to U+2260 "≠" under NFC.)

But look at the normalization data (and consider that Unicode guarantees
that no new combining pair starting with the same base character will ever
be added): there are a LOT of suitable base characters in Unicode
that you can use as a syntactic delimiter.

Some examples (in the ASCII subset) include the hyphen-minus, the
apostrophe-quote, the double quotation mark...
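
To check a candidate delimiter against the normalization data, here is a
minimal Python sketch. It relies on the standard unicodedata module; the
function name composing_starters is only illustrative. It is conservative:
it also counts pairs that are excluded from composition, and it does not
model the algorithmic Hangul compositions.

```python
import sys
import unicodedata

def composing_starters() -> set:
    """Collect every character that is the first element of a two-character
    canonical decomposition, i.e. that may compose with what follows it."""
    starters = set()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Empty: no decomposition. Leading "<tag>": compatibility mapping,
        # which never takes part in canonical composition.
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 2:
            starters.add(chr(int(parts[0], 16)))
    return starters

UNSAFE = composing_starters()

# Report which ASCII candidates can start a canonical pair.
for candidate in "-'\"=":
    status = "may combine" if candidate in UNSAFE else "looks safe"
    print(repr(candidate), status)
```

The result depends on the Unicode version shipped with your Python
(unicodedata.unidata_version), so treat it as a check, not as a proof.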

So it's easy to split an arbitrarily long text at an arbitrary character
position, even in the middle of a cluster or combining sequence. It does
not matter that the delimiter may form a "cluster" with the following
character: your "packetized" stream is no longer readable text anyway, only
a transport syntax (just like quoted-printable or Base64).
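
A small Python sketch of that idea, under two assumptions of mine: packets
are cut every N code points, and the payload never contains the delimiter
itself. The names packetize and split_packets are hypothetical.

```python
import unicodedata

def packetize(text: str, size: int, delim: str = "-") -> str:
    """Cut `text` every `size` code points and prefix each chunk with `delim`."""
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    return "".join(delim + chunk for chunk in chunks)

def split_packets(stream: str, delim: str = "-") -> list:
    """Recover the chunks; the piece before the first delimiter is empty."""
    return stream.split(delim)[1:]

# "cafe" + COMBINING ACUTE ACCENT: the cut falls between "e" and its accent.
text = "cafe\u0301 noir"
stream = unicodedata.normalize("NFC", packetize(text, 4))
packets = split_packets(stream)
print(packets)
print("".join(packets) == text)

# With "=" as delimiter, a packet starting with U+0338 swallows the
# delimiter under NFC ("=" + U+0338 becomes U+2260), so a boundary is lost.
bad = unicodedata.normalize("NFC", packetize("a\u0338", 1, "="))
print(len(split_packets(bad, "=")))
```

The first case keeps all three packets through NFC; the second shows why a
delimiter that starts a combining pair is a poor choice.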

You can also freely choose the base character at the end of each packet.
Newlines are not safe, as lines may be merged, but as in Base64, "=" is fine
to terminate each packet, as are two ASCII quotation marks, and in fact
all punctuation and symbols from ASCII (you can even use the ASCII letters
and digits).

If your packets have variable lengths, you may need to use escaping, or you
may prepend the length (in characters or in combining sequences) of your
packet before the expected terminator.
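
As a sketch of the length-prefix variant (my own framing, not a standard
one): each packet is written as the decimal length in code points, then "-"
as the delimiter, then the payload, then "=" as the terminator. This assumes
the stream is not renormalized in a way that changes payload lengths between
framing and parsing.

```python
def frame(payload: str) -> str:
    """Write one packet as <length>-<payload>=, length in code points."""
    return f"{len(payload)}-{payload}="

def parse_frames(stream: str) -> list:
    """Read back the payloads, checking each terminator."""
    packets = []
    pos = 0
    while pos < len(stream):
        dash = stream.index("-", pos)      # end of the decimal length
        length = int(stream[pos:dash])
        start = dash + 1
        end = start + length
        if stream[end:end + 1] != "=":     # expected terminator
            raise ValueError(f"missing terminator at offset {end}")
        packets.append(stream[start:end])
        pos = end + 1
    return packets

# Payloads may contain "-" and "=" freely, and may start with a combining mark.
payloads = ["a-b", "\u0301x=", ""]
stream = "".join(frame(p) for p in payloads)
print(parse_frames(stream) == payloads)
```

Because the length says where the payload ends, no escaping is needed; the
terminator only serves as a consistency check.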

All this is used in MIME for attachments in emails, with the two common
transport syntaxes: Quoted-Printable, which uses escaping, and Base64, which
does not require any length but requires a distinctive terminator (not used
to encode the data part of the "packet") for variable-length "packets".
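
For illustration only, both syntaxes are available in Python's standard
library (the base64 and quopri modules). The sketch below cuts the UTF-8
bytes of a string right between a base letter and its combining accent and
encodes each piece separately; the cut position is my own choice.

```python
import base64
import quopri

data = "e\u0301".encode("utf-8")       # b"e" followed by the two accent bytes
pieces = [data[:1], data[1:]]          # cut between the letter and its accent

# Base64: each piece is encoded on its own; "=" padding marks its end.
b64_packets = [base64.b64encode(p) for p in pieces]
b64_restored = b"".join(base64.b64decode(p) for p in b64_packets)

# Quoted-Printable: non-ASCII bytes are escaped as "=XX".
qp_packets = [quopri.encodestring(p) for p in pieces]
qp_restored = b"".join(quopri.decodestring(p) for p in qp_packets)

print(b64_packets, b64_restored == data)
print(qp_packets, qp_restored == data)
```

In both cases the packets are pure ASCII, so normalization cannot touch them
in transit, and the original sequence is restored on decoding.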





On Sun, 23 Jun 2019 at 02:35, Sławomir Osipiuk via Unicode <
unicode at unicode.org> wrote:

> I assure you, it wasn’t very interesting. :-) Headache-y, more like. The
> diacritic thing was completely inapplicable anyway, as all our text was
> plain English. I really don’t want to get into what the thing was, because
> it sounds stupider the more I try to explain it. But it got the wheels
> spinning in my head, and now that I’ve been reading up a lot about Unicode
> and older standards like 2022/6429, it got me thinking whether there might
> already be an elegant solution.
>
>
>
> But, as an example I’m making up right now, imagine you want to packetize
> a large string. The packets are not all equal sized, the sizes are
> determined by some algorithm. And the packet boundary may occur between a
> base char and a diacritic. You insert markers into the string at the packet
> boundaries. You can then store the string, copy it, display it, or pass it
> to the sending function which will scan the string and know to send the
> next packet when it reaches the marker. And you can now do all that without
> the need to pass around extra metadata (like a list of ints of where the
> packet boundaries are supposed to be) or to re-calculate the boundaries;
> it’s still just a big string. If a different application sees the string,
> it will know to completely ignore the packet markers; it can even strip
> them out if it wants to (the canonical equivalent of the noop character is
> the absence of a character).
>
>
>
> As should be obvious, I’m not recommending this as good practice.
>
>
>
>
>
> *From:* Shawn Steele [mailto:Shawn.Steele at microsoft.com]
> *Sent:* Saturday, June 22, 2019 19:57
> *To:* Sławomir Osipiuk; unicode at unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> + the list.  For some reason the list’s reply header is confusing.
>
>
>
> *From:* Shawn Steele
> *Sent:* Saturday, June 22, 2019 4:55 PM
> *To:* Sławomir Osipiuk <sosipiuk at gmail.com>
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> The original comment about putting it between the base character and the
> combining diacritic seems peculiar.  I’m having a hard time visualizing how
> that kind of markup could be interesting?
>
>
>
> *From:* Unicode <unicode-bounces at unicode.org> *On Behalf Of *Slawomir
> Osipiuk via Unicode
> *Sent:* Saturday, June 22, 2019 2:02 PM
> *To:* unicode at unicode.org
> *Subject:* RE: Unicode "no-op" Character?
>
>
>
> I see there is no such character, which I pretty much expected after
> Google didn’t help.
>
>
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn’t, strictly speaking, about “padding”, but about
> marking – injecting “flag” characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn’t being used anymore.
>