EBCDIC control characters
Ken Whistler
kenwhistler at sonic.net
Sat Jun 20 09:45:45 CDT 2020
On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.
>
>
> Oh, yes, I see, thanks.
> It's a special case of "null-terminated strings were a mistake".
> But U+0000 has no other use or alternative semantics, right? The main
> use case would be test cases?
Yes, that was basically what I was alluding to.
Richard is making the purist point that U+0000 is a Unicode character,
and therefore should be transmissible as part of any Unicode plain text
stream.
But the C string is not actually "plain text" -- it is a convention for
representing a string which uses 0x00 as a "syntactic" character
to terminate the string instead of storing its length. And that was
already true back in 7-bit ASCII days, of course. People's workaround,
if they needed to represent NULL *in* the character data of a "string" in
a C program, was simply to use char arrays, manage the length externally
to the "string" stored in the array, and then avoid the regular C string
runtime library calls when manipulating them, because those depend on
0x00 as the signal of string termination.
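In sketch form, that workaround looks something like this (buffer size
and contents invented purely for illustration):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Five bytes of content: "AB", an embedded NUL, then "CD". */
        char buf[16];
        size_t len = 5;                 /* length managed externally */
        memcpy(buf, "AB\0CD", len);

        /* strlen() stops at the embedded 0x00 ... */
        printf("strlen() sees %zu bytes\n", strlen(buf));    /* 2 */

        /* ... but the externally managed length keeps all the data. */
        printf("the array holds %zu bytes\n", len);          /* 5 */
        return 0;
    }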
Such cases need not be limited to test cases. One can envision real
ones: for example, packing a data store full of null-terminated
strings and then wanting to manipulate that entire data store as a
chunk. The store is, of course, full of NULL bytes terminating the
individual strings. But the answer is simply to keep track of the size of
the entire data store and use memcpy() instead of strcpy(). I've had to
deal with precisely such cases in real production code.
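For concreteness, the pattern is roughly this (sizes and contents are
just an illustration):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Pack several null-terminated strings into one data store,
         * keeping each terminating 0x00 as part of the data. */
        const char *words[] = { "alpha", "beta", "gamma" };
        char store[64];
        size_t used = 0;

        for (size_t i = 0; i < 3; i++) {
            size_t n = strlen(words[i]) + 1;   /* include the NUL */
            memcpy(store + used, words[i], n);
            used += n;
        }

        /* Manipulate the whole store as a chunk: memcpy() with the
         * tracked size, never strcpy(), which stops at the first NUL. */
        char copy[64];
        memcpy(copy, store, used);
        printf("copied %zu bytes in one chunk\n", used);
        return 0;
    }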
Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character,
but in UTF-8 it is, of course, represented as a single 0x00 code unit.
And for the ASCII subset of Unicode, you cannot even tell the difference
-- it is precisely identical, as far as C strings and their manipulation
are concerned. Which was precisely my point:
7-bit ASCII: One cannot represent NULL (0x00) as part of the content of
a C string. Resort to char arrays.
Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
content of a C string. Resort to char arrays.
The convention of using non-shortest UTF-8 to represent embedded NULLs
in C strings was simply a non-interoperable hack that people tried
because they fervently believed that NULLs *should* be embeddable in C
strings, after all. The UTC put a spike in that one by ruling that
non-shortest UTF-8 was ill-formed for any purpose.
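For anyone who has not run into that hack: the non-shortest form in
question is the two-byte sequence 0xC0 0x80 standing in for U+0000
(Java's "modified UTF-8" still uses it internally). A strict validator
has to treat it as ill-formed; the relevant check looks roughly like
this (a sketch of just that check, not a full validator):

    #include <stdbool.h>
    #include <stddef.h>

    /* Reject the overlong two-byte encoding of U+0000 (0xC0 0x80).
     * A conforming UTF-8 validator rejects all non-shortest forms;
     * in fact any lead byte of 0xC0 or 0xC1 is ill-formed, since it
     * can only introduce a non-shortest sequence. */
    bool has_overlong_nul(const unsigned char *p, size_t len) {
        for (size_t i = 0; i + 1 < len; i++) {
            if (p[i] == 0xC0 && p[i + 1] == 0x80)
                return true;
        }
        return false;
    }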
This whole issue has been a permanent source of confusion for C
programmers, I think, largely because C is so loosey-goosey about
pointers, where a pointer is really just an index register wolf in
sheep's clothing. With a char* pointer in hand, one cannot really tell
whether it refers to an actual C string following the null-termination
convention, to a char array full of characters interpreted as a "string"
but without null termination, or to a char array full of arbitrary byte
values meaning anything. And from that source flow thousands upon
thousands of C program bugs. :(
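One defensive habit (again, just a sketch, with invented names rather
than any standard API) is to stop passing bare char* around and carry
the length with the pointer, so those three cases stop looking identical:

    #include <stddef.h>
    #include <string.h>

    /* A bare char* cannot say which of the three cases it points to.
     * Carrying an explicit length alongside the pointer removes the
     * guesswork; embedded NULs are just more data. */
    struct byte_view {
        const char *data;
        size_t      len;
    };

    /* For arbitrary contents, embedded NULs included. */
    struct byte_view view_of(const char *data, size_t len) {
        struct byte_view v = { data, len };
        return v;
    }

    /* Only for data already known to follow the null-termination
     * convention. */
    struct byte_view view_of_cstr(const char *s) {
        return view_of(s, strlen(s));
    }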
--Ken