EBCDIC control characters

Corentin corentin.jabot at gmail.com
Sat Jun 20 12:00:51 CDT 2020


On Sat, 20 Jun 2020 at 16:45, Ken Whistler <kenwhistler at sonic.net> wrote:

>
> On 6/20/2020 5:11 AM, Corentin via Unicode wrote:
>
>>> My guess is that Ken is alluding to not storing arbitrary text in
>>> strings, but rather in arrays of code units along with appropriate
>>> length parameters.
>>>
>>
>> Oh, yes, I see, thanks.
>> It's a special case of "null-terminated strings were a mistake".
>> But U+0000 has no other use or alternative semantic, right? The main use
>> case would be test cases?
>
> Yes, that was basically what I was alluding to.
>
> Richard is making the purist point that U+0000 is a Unicode character, and
> therefore should be transmissible as part of any Unicode plain text stream.
>
> But the C string is not actually "plain text" -- it is a convention for
> representing a string which uses 0x00 as a "syntactic" character to
> terminate the string instead of storing its length. And that was already
> true back in 7-bit ASCII days, of course. People's workaround, if they needed
> to represent NULL *in* character data in a "string" in a C program, was to
> simply use char arrays, manage length external to the "string" stored in
> the array, and then avoid the regular C string runtime library calls when
> manipulating them, because those depend on 0x00 as a signal of string
> termination.
>
> Such cases need not be limited to test cases. One can envision real cases,
> as for example, packing a data store full of null-terminated strings and
> then wanting to manipulate that entire data store as a chunk. It is, of
> course, full of NULL bytes for the null-terminated strings. But the answer,
> of course, is to just keep track of the size of the entire data store and
> use memcpy() instead of strcpy(). I've had to deal with precisely such
> cases in real production code.
>
> Now fast forward to Unicode and UTF-8. U+0000 is a Unicode character, but
> in UTF-8 it is, of course, represented as a single 0x00 code unit. And for
> the ASCII subset of Unicode, you cannot even tell the difference -- it is
> precisely identical, as far as C strings and their manipulation are
> concerned. Which was precisely my point:
>
> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content of a
> C string. Resort to char arrays.
>
> Unicode UTF-8: One cannot represent U+0000 NULL (0x00) as part of the
> content of a C string. Resort to char arrays.
>
> The convention of using non-shortest UTF-8 to represent embedded NULLs in
> C strings was simply a non-interoperable hack that people tried because
> they fervently believed that NULLs *should* be embeddable in C strings,
> after all. The UTC put a spike in that one by ruling that non-shortest
> UTF-8 was ill-formed for any purpose.
>
> This whole issue has been a permanent confusion for C programmers, I
> think, largely because C is so loosey-goosey about pointers, where a
> pointer is really just an index register wolf in sheep's clothing. With a
> char* pointer in hand, one cannot really tell whether it is referring to an
> actual C string following the null-termination convention, or a char array
> full of characters interpreted as a "string", but without null termination,
> or a char array full of arbitrary byte values meaning anything. And from
> that source flow thousands upon thousands of C program bugs. :(
>
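
To make the memcpy() point above concrete, here is a minimal sketch (the
store contents and sizes are made up for illustration): the total size has
to be tracked outside the data, because strlen()/strcpy() stop at the
first NUL.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* A "data store" packed with null-terminated strings.
           sizeof gives 12: the NUL after each string is part of the data. */
        const char store[] = "abc\0de\0fghi";
        size_t store_len = sizeof store;   /* tracked externally, not via strlen() */

        char copy[sizeof store];
        memcpy(copy, store, store_len);    /* copies all 12 bytes, NULs included */

        /* strlen() only sees the first string: 3 vs 12. */
        printf("strlen: %zu, real size: %zu\n", strlen(store), store_len);
        return 0;
    }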
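
The non-shortest-form hack is also easy to show: the overlong two-byte
sequence 0xC0 0x80 decodes to U+0000 but contains no 0x00 byte, so the C
string machinery never notices it (illustrative only -- a conforming UTF-8
decoder must reject these bytes as ill-formed):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* 'A', overlong encoding of U+0000 (0xC0 0x80), 'Z'.
           Well-formed UTF-8 would encode U+0000 as a single 0x00 byte. */
        const char hacked[] = "A\xC0\x80Z";

        /* strlen() walks right past the "embedded NUL" and prints 4 --
           which is the whole appeal of the hack, and why the UTC ruled
           non-shortest-form sequences ill-formed. */
        printf("%zu\n", strlen(hacked));
        return 0;
    }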

To be super pedantic, strings *are* arrays, but they decay to pointers
really easily, at which point the only way to know their size is to look
for 0x00 -- which made sense at one point in 1964. If you never use strlen
you are fine. In fact, it is common for people to use multiple NULs as
string delimiters within a larger array.
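
Something like this (a minimal sketch of that convention -- the same shape
as, e.g., double-NUL-terminated string lists): the terminators delimit the
items, and an empty string marks the end of the list.

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Several strings packed into one char array, each ended by a NUL;
           an extra NUL (an empty string) marks the end of the whole list. */
        const char list[] = "red\0green\0blue\0";

        /* Walk item by item; stop when we hit the empty string. */
        for (const char *p = list; *p != '\0'; p += strlen(p) + 1) {
            printf("item: %s\n", p);
        }
        return 0;
    }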


> --Ken
>

