EBCDIC control characters

Sat Jun 20 07:11:15 CDT 2020

On Sat, 20 Jun 2020 at 13:14, Richard Wordingham via Unicode
<unicode at unicode.org> wrote:

> On Sat, 20 Jun 2020 10:57:10 +0200
> Corentin via Unicode <unicode at unicode.org> wrote:
>
> > On Fri, 19 Jun 2020 at 22:30, Ken Whistler via Unicode
> > <unicode at unicode.org> wrote:
> >
> > >
> > > On 6/19/2020 3:48 AM, Richard Wordingham via Unicode wrote:
> > > > Isn't there still the issue of supporting U+0000 in C-type
> > > > strings?
> > >
> > > I don't see why. And it has nothing to do with Unicode per se,
> > > anyway.
> > >
> > > That is just a transform of the question of "the issue of
> > > supporting 0x00 in
> > > C-type strings restricted to ASCII."
> > >
> > > The issue is precisely the same, and the solutions are precisely
> > > the same -- by design.
> > >
> >
> > I'm not sure I understand that issue, could you clarify?
> > In both C and C++, U+0000 is interpreted as the null character
> > (which marks the end of the string, depending on context), which is the
> > same behavior as the equivalent
> > ASCII character.
>
> One immediate consequence of that assertion is that one cannot in
> general store a line of Unicode text in a 'string'.  There have been
> Unicode test cases that deliberately include a null in the middle of
> the text, and if the program thinks it has stored the line in a
> 'string', it will fail the test, because the null character and beyond
> are not part of the text being interpreted.
>
> One of the early tricks to store general character sequences in
> strings was to use non-shortest form UTF-8 encodings to avoid characters
> being interpreted as control characters with undesired
> characteristics.  This form of UTF-8 is now invalid.  Java was
> especially noted for using the encoding <C0, 80> to store zero bytes in
> byte code in UTF-8.
>
> My guess is that Ken is alluding to not storing arbitrary text in
> strings, but rather in arrays of code units along with appropriate
> length parameters.
>
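
To check that I follow the non-shortest form trick, here is a minimal sketch in C. The function name encode_avoiding_zero_byte is my own invention, not any library's API, and it only covers code points up to U+FFFF. It does not reject surrogates, and it does not try to reproduce everything Java does. The only point is that U+0000 comes out as <C0, 80>, so no zero byte ever lands in the buffer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (illustration only): encode one code point in the
 * range U+0000..U+FFFF, writing U+0000 as the non-shortest form <C0, 80>.
 * Returns the number of bytes written to out, or 0 if cp is above U+FFFF.
 * Surrogate code points are not rejected in this sketch.
 * The output for cp == 0 is NOT well-formed UTF-8 by current rules.
 */
size_t encode_avoiding_zero_byte(uint32_t cp, unsigned char out[3])
{
    if (cp == 0) {
        /* Two-byte form of zero: 110 00000, 10 000000. */
        out[0] = 0xC0;
        out[1] = 0x80;
        return 2;
    }
    if (cp < 0x80) {
        /* Ordinary one-byte (ASCII) form. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Ordinary two-byte form. */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {
        /* Ordinary three-byte form. */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}
```

A conforming UTF-8 decoder must reject <C0, 80>, so a buffer produced this way is only safe to hand to code that knows about the convention.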

Oh, yes, I see, thanks.
It's a special case of "null-terminated strings were a mistake".
But U+0000 has no other use or alternative semantics, right? The main use
case would be test cases?
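
For the archive, a small sketch in C of what I understand the problem to be. The struct utf8_span and the function names are made up for illustration; the only real library call is strlen. The same bytes are measured once as a C string and once as an array of code units with an explicit length.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: a run of UTF-8 code units plus an explicit length. */
struct utf8_span {
    const char *units;
    size_t length;      /* number of bytes, zero bytes included */
};

/* "ab", U+0000, "cd" in UTF-8. The literal also gets a trailing zero byte. */
static const char sample_line[] = "ab\0cd";

/* Length as a C string sees it: counting stops at the first zero byte. */
size_t length_as_c_string(void)
{
    return strlen(sample_line);
}

/* Length when the size travels with the data; only the literal's
   implicit terminator is dropped. */
struct utf8_span sample_as_span(void)
{
    struct utf8_span span = { sample_line, sizeof sample_line - 1 };
    return span;
}
```

Under these assumptions length_as_c_string() reports 2 while sample_as_span().length is 5, so anything after the embedded U+0000 is invisible to code that treats the buffer as a C string.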

>
> Richard.