EBCDIC control characters
richard.wordingham at ntlworld.com
Fri Jun 19 17:58:12 CDT 2020
On Fri, 19 Jun 2020 13:24:41 -0700
Ken Whistler via Unicode <unicode at unicode.org> wrote:
> On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > Isn't there still the issue of supporting U+0000 in C-type
> > strings?
> I don't see why. And it has nothing to do with Unicode per se, anyway.
> That is just a transform of the question of "the issue of supporting
> 0x00 in C-type strings restricted to ASCII."
> The issue is precisely the same, and the solutions are precisely the
> same -- by design.
There is a solution, but it's not nice. The solution is to work with
UTF-8 plus one other character code - <0xC0, 0x80> for U+0000. In the
absence of policemen, it works.
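As a rough illustration of the <0xC0, 0x80> trick, here is a minimal Python sketch. The function names are invented for this example and do not come from any particular library. It writes U+0000 as the overlong two-byte form so that the encoded text never contains a 0x00 byte, and it shows how a strict UTF-8 decoder (a "policeman") refuses the result.

```python
def encode_nul_safe(text: str) -> bytes:
    """Encode as UTF-8, but write U+0000 as the overlong pair C0 80.

    In well-formed UTF-8 the byte 0x00 only ever encodes U+0000, so a
    plain byte-level replace is enough.
    """
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_nul_safe(data: bytes) -> str:
    """Inverse of encode_nul_safe.

    The byte 0xC0 never occurs in well-formed UTF-8, so the pair C0 80
    can only be the substitute for U+0000.
    """
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


def strict_decoder_accepts(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "a\x00b"
encoded = encode_nul_safe(sample)
print(encoded)                          # no 0x00 byte, safe to NUL-terminate
print(decode_nul_safe(encoded) == sample)
print(strict_decoder_accepts(encoded))  # the policeman objects
```

The encoded bytes can be handed to code that treats 0x00 as a terminator, but anything that validates UTF-8 will reject them, which is the "not nice" part.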
While Ken and Asmus both live (I can't remember whose lifetime it
is), one can use scalar values beyond 0x10FFFF for character-like
non-character entities, such as byte values with bit 7 or higher set
(a widespread possibility for file names), or some enormous CJK glyph
sets. I understand Emacs does that sort of thing, storing them using
an extension of UTF-8, and seems to get away with it. I believe they're
also used for Bucky-bitted 'characters' from keyboards. Outside Emacs,
such things also provide reliable, private non-characters. Again, one
has to watch out for policemen, which can make life fraught in
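To make the idea of scalar values beyond 0x10FFFF concrete, the following Python sketch extends the UTF-8 bit pattern to sequences of up to six bytes, as the original UTF-8 definition allowed before it was restricted to U+10FFFF. It is an assumption-laden illustration, not a description of Emacs's actual internal format: the function names and the RAW_BYTE_BASE mapping for bytes with bit 7 set are hypothetical.

```python
# Hypothetical private range for raw, undecodable bytes 0x80..0xFF.
RAW_BYTE_BASE = 0x110000


def encode_extended(cp: int) -> bytes:
    """Encode 0..0x7FFFFFFF using the 1-6 byte UTF-8 bit pattern."""
    if not 0 <= cp <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    if cp < 0x80:
        return bytes([cp])
    # (exclusive upper limit, sequence length, lead-byte marker)
    table = ((0x800, 2, 0xC0), (0x10000, 3, 0xE0), (0x200000, 4, 0xF0),
             (0x4000000, 5, 0xF8), (0x80000000, 6, 0xFC))
    for limit, length, lead in table:
        if cp < limit:
            break
    out = bytearray(length)
    # Fill continuation bytes from the end, six bits at a time.
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (cp & 0x3F)
        cp >>= 6
    out[0] = lead | cp
    return bytes(out)


def decode_extended(data: bytes) -> int:
    """Decode exactly one sequence produced by encode_extended."""
    first = data[0]
    if first < 0x80:
        length, cp = 1, first
    elif first < 0xC0:
        raise ValueError("unexpected continuation byte")
    elif first < 0xE0:
        length, cp = 2, first & 0x1F
    elif first < 0xF0:
        length, cp = 3, first & 0x0F
    elif first < 0xF8:
        length, cp = 4, first & 0x07
    elif first < 0xFC:
        length, cp = 5, first & 0x03
    elif first < 0xFE:
        length, cp = 6, first & 0x01
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong sequence length")
    for b in data[1:]:
        if b & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    return cp


# A raw byte 0xE9 from a file name, mapped into the hypothetical range.
scalar = RAW_BYTE_BASE + 0xE9
encoded = encode_extended(scalar)
print(encoded.hex(), decode_extended(encoded) == scalar)
```

For values up to 0x10FFFF (surrogates aside) this produces the same bytes as standard UTF-8. Anything above that is exactly what a conformant decoder is required to reject, so such data has to stay inside the program that understands it.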