EBCDIC control characters
Harriet Riddle
harjitmoe at outlook.com
Sat Jun 20 11:43:26 CDT 2020
Richard Wordingham via Unicode wrote:
> Prompted by the pain of Unicode test files with embedded nulls and
> even embedded end of file.
Embedded nulls might indeed be used disruptively in user-submitted
content (to induce truncation or, if the nulls are removed or ignored by
something downstream of a sanitisation step, even to hide malicious
sequences from that step). In such
applications, there may be a need to deal with them somehow (even if
that is simply replacing U+0000 instances with U+FFFD, as stipulated in
the spec for e.g. CommonMark).
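
To illustrate the masking problem, here is a minimal Python sketch. The
filter and the downstream null-stripping stage are hypothetical stand-ins
invented for this example, not taken from any real library; only the
U+0000 to U+FFFD replacement reflects what CommonMark asks for.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replace_nulls(text: str) -> str:
    """Replace every U+0000 with U+FFFD before any other processing."""
    return text.replace("\x00", REPLACEMENT)


def naive_filter_passes(text: str) -> bool:
    """Hypothetical sanitisation check: reject text containing '<script'."""
    return "<script" not in text.lower()


def downstream_strip(text: str) -> str:
    """Hypothetical later stage that silently drops nulls."""
    return text.replace("\x00", "")


# The embedded null splits the sequence the filter is looking for.
payload = "<scr\x00ipt>alert(1)</scr\x00ipt>"

# The filter sees nothing wrong, yet the stripped output contains the tag.
slipped_through = naive_filter_passes(payload) and (
    "<script" in downstream_strip(payload)
)

# With the nulls replaced up front, stripping nulls later cannot
# reassemble the tag, and the length in code points is unchanged.
safe = downstream_strip(replace_nulls(payload))
```

Replacing rather than deleting means nothing downstream can join the two
halves back together, and offsets into the string stay the same.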
But so long as the decoder can accurately output the string and its
length in code units, it's not really the decoder's job to sort this out.
> I could never work out why isolated UTF-16 code units should be
> handled, but there was no need to handle isolated UTF-8 code units.
Depends on the context you are working in.
Python's PEP 383 ( https://www.python.org/dev/peps/pep-0383/ ) does
define a scheme for passing isolated 8-bit code units through a decoder
and encoder unchanged, actually in much the same way as tends to be done
for UTF-16, i.e. passing around isolated surrogate codes. This is not
the default behaviour, but it arose as a solution to the problem of
handling potentially invalid data in Unix filenames (similar to the
issue of potentially invalid UTF-16 data in Windows filenames).
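
As a rough sketch of what that looks like in practice, using Python's
"surrogateescape" error handler (the filename bytes below are made up for
the example):

```python
# Latin-1 bytes for a filename; 0xE9 followed by "." is not valid UTF-8.
raw = b"caf\xe9.txt"

# The default (strict) handler refuses the invalid byte.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, the undecodable byte 0xE9 becomes the isolated
# low surrogate U+DCE9 (that is, 0xDC00 + 0xE9).
name = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler turns U+DCE9 back into the byte 0xE9,
# so the original bytes survive the round trip unchanged.
round_trip = name.encode("utf-8", errors="surrogateescape")
```

Here `name` is "caf\udce9.txt". It is not a valid Unicode string in the
strict sense, so encoding it as UTF-8 without the handler raises
UnicodeEncodeError.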
-- Har
>> 7-bit ASCII: One cannot represent NULL (0x00) as part of the content
>> of a C string. Resort to char arrays.
> Actually, you can. As the size of char is at least 8 bits, you have
> 128 spare codes. :-)
>
> Richard.