What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Sławomir Osipiuk sosipiuk at gmail.com
Sat Jun 6 08:56:23 CDT 2020


On Sat, Jun 6, 2020 at 7:04 AM Eli Zaretskii via Unicode
<unicode at unicode.org> wrote:
>
> What do you mean by "make no sense"?  A general-purpose editor is
> presented with a byte stream and needs to decide how to interpret and
> display it.  It usually has no meta-data about the byte stream to help
> it decide what does and doesn't make sense.  It doesn't even know
> whether the byte stream is human-readable text or just raw binary
> bytes.

Escape sequences may be present in UTF-8, but SI and SO cannot be, nor
can most designation sequences (a special subset of escape sequences),
not only because they make no sense, but because ISO 10646 explicitly
forbids them:

"Code extension control functions for the ISO/IEC 2022 code extension
techniques (such as designation escape sequences, single shift, and
locking shift) shall not be used with this coded character set."

The presence of these in a UTF-8 stream indicates an error of some
kind. It's not completely impossible for them to appear in something
that is otherwise valid UTF-8, but in my opinion they should be
treated the same as overlong sequences or encoded surrogates; i.e. the
UTF-8 bit arithmetic works, but the result isn't valid. This can occur
due to faulty conversion from another encoding, giving something that
is close to UTF-8 but not quite right. This brings up the question of
how error-tolerant Karl's algorithm is.

7-bit ISO 2022 encodings would clearly show such errors.
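
To make the suggested treatment concrete, here is a minimal Python
sketch. It is not Karl's algorithm, and the names check_stream and
_CODE_EXTENSION are hypothetical, chosen only for this illustration.
It assumes that "code extension" means just SO, SI, and ESC followed
by '$' or one of '(' ')' '*' '+' '-' '.' '/' (the usual start of a
character-set designation); single shifts and the other locking
shifts are left out for brevity.

```python
import re

# Assumption: only these byte patterns are treated as ISO/IEC 2022 code
# extension. SO is 0x0E, SI is 0x0F, and ESC (0x1B) followed by '$' or a
# byte in 0x28-0x2F begins a character-set designation sequence.
_CODE_EXTENSION = re.compile(rb"[\x0e\x0f]|\x1b[\x24\x28-\x2f]")


def check_stream(data: bytes) -> str:
    """Classify a byte stream as 'ok', 'suspect', or 'not UTF-8'."""
    try:
        # Python's strict decoder already rejects overlong forms and
        # encoded surrogates.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8"
    if _CODE_EXTENSION.search(data):
        # Well-formed UTF-8 bytes, but carrying controls that ISO 10646
        # says shall not be used with this coded character set.
        return "suspect"
    return "ok"


if __name__ == "__main__":
    samples = [
        b"\x1b[31mred\x1b[0m",            # CSI escape sequences
        "日本語".encode("iso2022_jp"),     # 7-bit ISO 2022 text
        b"\xc0\x80",                      # overlong encoding of NUL
    ]
    for sample in samples:
        print(sample, check_stream(sample))
```

Under these assumptions the CSI-coloured sample is reported as "ok",
since ordinary escape sequences are allowed, while the ISO-2022-JP
sample is all 7-bit bytes and therefore decodes as UTF-8, yet it is
flagged as "suspect" because of its designation sequences.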

Also: I did not receive the email from Harriet Riddle that Eli is
replying to. Is there a problem with the mailing list? I may be
missing other messages.

Sławomir Osipiuk


