EBCDIC control characters

Harriet Riddle harjitmoe at outlook.com
Sat Jun 20 11:43:26 CDT 2020


Richard Wordingham via Unicode wrote:
> Prompted by the pain of Unicode test files with embedded nulls and 
> even embedded end of file.
Embedded nulls might indeed be used disruptively in user-submitted 
content: to induce truncation or, if something downstream of a 
sanitisation step removes or ignores nulls, even to mask malicious 
sequences from that step. In such applications, there may be a need to 
deal with them somehow (even if that is simply replacing U+0000 
instances with U+FFFD, as stipulated in the spec for e.g. CommonMark).
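
A minimal Python sketch of the masking problem, for illustration only. 
The filter and the null-stripping stage are hypothetical stand-ins (not 
any real sanitiser's API), and the "<script" check is deliberately 
naive:

```python
def naive_filter(text: str) -> bool:
    """Hypothetical sanitisation step: accept text lacking '<script'."""
    return "<script" not in text.lower()


def strip_nulls(text: str) -> str:
    """Hypothetical downstream stage that silently drops U+0000."""
    return text.replace("\x00", "")


def replace_nulls(text: str) -> str:
    """Replace each U+0000 with U+FFFD instead of deleting it."""
    return text.replace("\x00", "\ufffd")


submitted = "<scr\x00ipt>alert(1)</scr\x00ipt>"

# The filter finds no "<script" because the null splits the keyword...
passes_filter = naive_filter(submitted)

# ...but removing the nulls afterwards reassembles it.
unmasked = strip_nulls(submitted)

# Replacing nulls with U+FFFD first keeps the sequence broken,
# whatever a later stage then does with nulls.
safe = strip_nulls(replace_nulls(submitted))
```

Here the submitted text passes the filter, yet "unmasked" contains 
"<script>", whereas "safe" does not, since the U+FFFD stays in place 
of each null.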

But so long as the decoder can accurately output the string and its 
length in code units, it's not really the decoder's job to sort this out.

> I could never work out why isolated UTF-16 code units should be handled, but there was no need to handle isolated UTF-8 code units.
Depends on the context you are working in.

Python's PEP 383 ( https://www.python.org/dev/peps/pep-0383/ ) does 
define a scheme (the "surrogateescape" error handler) for passing 
isolated 8-bit code units through a decoder and encoder unchanged, 
actually in much the same way as tends to be done for UTF-16, i.e. by 
passing around isolated surrogate codes: each undecodable byte is 
represented as a lone surrogate in the range U+DC80 to U+DCFF. This is 
not the default behaviour, but it arose as a solution to the problem of 
handling potentially invalid data in Unix filenames (similar to the 
issue of potentially invalid UTF-16 data in Windows filenames).
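
As a rough illustration of that round trip, using the error handler 
that PEP 383 added to Python 3 (the byte string is just a made-up 
filename containing a stray Latin-1 byte):

```python
raw = b"caf\xe9.txt"  # 0xE9 followed by "." is not valid UTF-8

# Each undecodable byte 0xNN is decoded to the lone surrogate U+DCNN.
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_tripped = name.encode("utf-8", errors="surrogateescape")

# Without the handler, the default strict decoding rejects the input.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```

Here "name" comes out as "caf\udce9.txt" and "round_tripped" equals 
"raw", while the strict decode raises UnicodeDecodeError.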

-- Har

>> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
>> of a C string. Resort to char arrays.
> Actually, you can.  As the size of char is at least 8 bits, you have
> 128 spare codes. :-)
>
> Richard.


