EBCDIC control characters

Ken Whistler kenwhistler at sonic.net
Sat Jun 20 09:45:45 CDT 2020


On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
>     My guess is that Ken is alluding to not storing arbitrary text in
>     strings, but rather in arrays of code units along with appropriate
>     length parameters.
>
>
> Oh, yes, I see thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantic right? The main 
> use case would be test cases?

Yes, that was basically what I was alluding to.

Richard is making the purist point that U+0000 is a Unicode character, 
and therefore should be transmissible as part of any Unicode plain text 
stream.

But the C string is not actually "plain text" -- it is a convention for 
representing a string that uses 0x00 as a "syntactic" character to 
terminate the string without storing its length. And that was already 
true back in 7-bit ASCII days, of course. People's workaround, if they 
needed to represent NULL *in* the character data of a "string" in a C 
program, was simply to use char arrays, manage the length externally to 
the "string" stored in the array, and avoid the regular C string runtime 
library calls when manipulating it, because those depend on 0x00 as a 
signal of string termination.
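
A minimal sketch of that workaround (names and values here are purely 
illustrative, not anyone's production code):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* Character data that happens to contain a NULL byte, with the
           length tracked separately instead of inferred from 0x00. */
        char data[] = { 'a', 'b', '\0', 'c', 'd' };
        size_t data_len = sizeof data;              /* 5 */

        /* strlen() stops at the embedded 0x00 and reports 2; the real
           length lives in data_len, and copies must use memcpy(), not
           strcpy(). */
        printf("strlen sees %zu, the array holds %zu\n",
               strlen(data), data_len);
        return 0;
    }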

Such cases need not be limited to test cases. One can envision real 
cases: for example, packing a data store full of null-terminated 
strings and then wanting to manipulate that entire data store as a 
chunk. It is, of course, full of NULL bytes from the null-terminated 
strings. But the answer is simply to keep track of the size of the 
entire data store and use memcpy() instead of strcpy(). I've had to 
deal with precisely such cases in real production code.
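
A sketch of that pattern (the store contents and sizes are just for 
illustration):

    #include <string.h>

    int main(void)
    {
        /* A data store packed with null-terminated strings; the 0x00
           bytes are part of the content of the chunk as a whole. */
        char store[] = "alpha\0beta\0gamma";
        size_t store_size = sizeof store;   /* 17, counting every NULL */

        char copy[sizeof store];
        /* strcpy(copy, store) would stop after "alpha"; copying the
           whole chunk requires the externally tracked size. */
        memcpy(copy, store, store_size);
        return 0;
    }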

Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character, 
but in UTF-8 it is, of course, represented as a single 0x00 code unit. 
And for the ASCII subset of Unicode, you cannot even tell the difference 
-- it is byte-for-byte identical, as far as C strings and their 
manipulation are concerned. Which was precisely my point:

7-bit ASCII: One cannot represent NULL (0x00) as part of the content of 
a C string. Resort to char arrays.

Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the 
content of a C string. Resort to char arrays.
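
To make the parallel concrete (a sketch with illustrative values; the 
variable names are mine):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* UTF-8 for U+00E9 (0xC3 0xA9), then U+0000 as its single code
           unit 0x00, then "x". The C string machinery cannot tell that
           U+0000 apart from its own terminator. */
        char utf8[] = { (char)0xC3, (char)0xA9, 0x00, 'x', 0x00 };
        size_t content_len = 4;   /* code units of actual content */

        printf("strlen sees %zu of %zu code units\n",
               strlen(utf8), content_len);   /* 2 of 4 */
        return 0;
    }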

The convention of using non-shortest UTF-8 to represent embedded NULLs 
in C strings was simply a non-interoperable hack that people tried 
because they fervently believed that NULLs *should* be embeddable in C 
strings, after all. The UTC put a spike in that one by ruling that 
non-shortest UTF-8 was ill-formed for any purpose.
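
For the record, the hack in question encoded U+0000 as the two-byte 
sequence 0xC0 0x80. A sketch of the check a conformant validator 
applies (the function name is mine, purely illustrative):

    #include <stdbool.h>
    #include <stdio.h>

    /* Lead bytes 0xC0 and 0xC1 can only begin overlong (non-shortest)
       two-byte sequences, so a conformant UTF-8 validator rejects them
       outright -- which kills the 0xC0 0x80 trick for an embedded NULL. */
    static bool overlong_two_byte_lead(unsigned char b)
    {
        return b == 0xC0 || b == 0xC1;
    }

    int main(void)
    {
        unsigned char hack[] = { 0xC0, 0x80 };   /* "embedded NULL" hack */
        printf("ill-formed: %s\n",
               overlong_two_byte_lead(hack[0]) ? "yes" : "no");
        return 0;
    }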

This whole issue has been a permanent confusion for C programmers, I 
think, largely because C is so loosey goosey about pointers, where a 
pointer is really just an index register wolf in sheep's clothing. With 
a char* pointer in hand, one cannot really tell whether it is referring 
to an actual C string following the null-termination convention, or a 
char array full of characters interpreted as a "string", but without 
null termination, or a char array full of arbitrary byte values meaning 
anything. And from that source flow thousands upon thousands of C 
program bugs. :(
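
One common discipline (not a fix to C itself, and the names here are 
just illustrative) is to never let the pointer travel without its 
length:

    #include <stddef.h>

    /* A counted "string view": embedded 0x00 code units are just data,
       because the length is authoritative. C enforces none of this;
       it is purely a convention the programmer must maintain. */
    struct span {
        const char *data;   /* may contain 0x00 anywhere */
        size_t      len;    /* length in char code units */
    };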

--Ken
