EBCDIC control characters

Richard Wordingham richard.wordingham at ntlworld.com
Fri Jun 19 17:58:12 CDT 2020


On Fri, 19 Jun 2020 13:24:41 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:

> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type
> > strings?  
> 
> I don't see why. And it has nothing to do with Unicode per se, anyway.
> 
> That is just a transform of the question of "the issue of supporting
> 0x00 in C-type strings restricted to ASCII."
> 
> The issue is precisely the same, and the solutions are precisely the
> same -- by design.

There is a solution, but it's not nice.  The solution is to work with
UTF-8 plus one extra code sequence: the overlong pair <0xC0, 0x80> for
U+0000.  In the absence of policemen (strict validators that reject
overlong forms), it works.
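
For illustration only, here is a minimal Python sketch of that trick.
The function names encode_c_safe and decode_c_safe are made up for this
example and are not taken from any library.  It relies on two facts
about well-formed UTF-8: the byte 0x00 only ever encodes U+0000, and
the byte 0xC0 never appears at all, so a plain byte substitution is
unambiguous in both directions.

```python
def encode_c_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    The result contains no 0x00 byte, so it can sit inside a
    NUL-terminated C string.  (Hypothetical helper for illustration.)
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_c_safe(data: bytes) -> str:
    """Reverse encode_c_safe: turn C0 80 back into 0x00, then decode."""
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    encoded = encode_c_safe("a\x00b")
    print(encoded)                          # b'a\xc0\x80b'
    print(decode_c_safe(encoded) == "a\x00b")

    # A "policeman": Python's strict UTF-8 decoder rejects the pair.
    try:
        encoded.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("strict decoder refused:", exc.reason)
```

The last few lines show the catch: any conforming UTF-8 validator
between producer and consumer will refuse the data.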

While Ken and Asmus both live (I can't remember whose lifetime it
is), one can use scalar values beyond 0x10FFFF for character-like
non-character entities, such as byte values with bit 7 or higher set
(a widespread possibility for file names), or some enormous CJK glyph
sets.  I understand Emacs does that sort of thing, storing them using
an extension of UTF-8, and seems to get away with it.  I believe they're
also used for Bucky-bitted 'characters' from keyboards. Outside Emacs,
such things also provide reliable, private non-characters.  Again, one
has to watch out for policemen, which can make life fraught in
complicated environments.
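
As a rough sketch of what such an extension can look like, the Python
below simply continues the UTF-8 bit pattern past 0x10FFFF, using the
longer 5- and 6-byte forms of the original UTF-8 design.  It is not
claimed to be Emacs's actual internal format.  The name encode_extended
and the base value RAW_BYTE_BASE are assumptions invented for this
example.

```python
# Hypothetical private range for raw bytes; chosen only for illustration.
RAW_BYTE_BASE = 0x110000


def encode_extended(value: int) -> bytes:
    """Encode a non-negative integer below 2**31 using the UTF-8 bit
    layout, without the 0x10FFFF ceiling.  (Surrogates are not checked.)"""
    if value < 0:
        raise ValueError("negative value")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, sequence length, lead-byte marker)
    forms = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in forms:
        if value < limit:
            break
    else:
        raise ValueError("value too large")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # low six bits per trail byte
        value >>= 6
    return bytes([lead | value] + tail[::-1])


def raw_byte_scalar(byte: int) -> int:
    """Map an undecodable byte (0x80..0xFF) to a private scalar value."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("expected a byte with bit 7 set")
    return RAW_BYTE_BASE + byte


if __name__ == "__main__":
    print(encode_extended(0x20AC).hex())                 # e282ac
    print(encode_extended(0x110000).hex())               # f4908080
    print(encode_extended(raw_byte_scalar(0xFF)).hex())
```

Values up to 0x10FFFF come out as ordinary UTF-8; anything above is
exactly what a strict decoder will reject, which is the policeman
problem again.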

Richard.

