What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Eli Zaretskii eliz at gnu.org
Sat Jun 6 09:27:33 CDT 2020


> From: Sławomir Osipiuk <sosipiuk at gmail.com>
> Date: Sat, 6 Jun 2020 09:56:23 -0400
> 
> Escape sequences may be present in UTF-8, but SI and SO cannot be, nor
> can most designation sequences (a special subset of escape sequences),
> not only because they make no sense, but because ISO 10646 explicitly
> forbids them:
> 
> "Code extension control functions for the ISO/IEC 2022 code extension
> techniques (such as designation escape sequences, single shift, and
> locking shift) shall not be used with this coded character set."

Alas, the stuff one bumps into out there doesn't always follow written
standards, let alone recent enough standards.

> The presence of these in a UTF-8 stream indicates an error of some
> kind. It's not completely impossible for them to appear in something
> that is otherwise valid UTF-8, but they should be treated, in my
> opinion, the same as overlong sequences or surrogates; i.e. the UTF-8
> math works, but the code point isn't valid.

What to do when these irregularities are found is a separate (though
very important) issue.  The issue discussed here is whether assuming
UTF-8 "until proven otherwise" is sufficient in practice.  I don't
think it is, and I have provided a few examples of why.
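
To make the point concrete, here is a minimal sketch in Python; it is an
illustration under stated assumptions, not a complete detector.  The
function name classify and its three return labels are hypothetical, and
the regular expression only approximates the ISO/IEC 2022 designation
escape sequences (ESC followed by "$" and/or an intermediate byte in
0x28-0x2F, then a final byte).  It shows that a 7-bit ISO 2022 stream
passes a strict UTF-8 validity check, so validity alone does not settle
the encoding.

```python
import re

# SO (0x0E) and SI (0x0F): the ISO/IEC 2022 locking shifts.
LOCKING_SHIFTS = re.compile(rb"[\x0e\x0f]")

# Rough approximation of designation escape sequences, e.g.
# ESC $ B, ESC $ ( D, ESC ( B.  This is an assumption made for
# illustration, not a full ISO/IEC 2022 parser.
DESIGNATION = re.compile(rb"\x1b(?:\$[\x28-\x2f]?|[\x28-\x2f])[\x30-\x7e]")


def classify(data: bytes) -> str:
    """Return a coarse label for a byte string (hypothetical helper)."""
    try:
        # Python's strict decoder rejects overlong forms and surrogates.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "not-utf8"
    if LOCKING_SHIFTS.search(data) or DESIGNATION.search(data):
        # Well-formed UTF-8, yet it carries code extension controls
        # that ISO 10646 says shall not be used with it.
        return "utf8-with-iso2022-controls"
    return "utf8"


print(classify("Sławomir".encode("utf-8")))   # utf8
print(classify(b"\x1b$B$3$s\x1b(B"))          # utf8-with-iso2022-controls
print(classify(b"\x1b[31mred\x1b[0m"))        # utf8 (CSI, not a designation)
print(classify(b"\xc0\xaf"))                  # not-utf8 (overlong form)
```

The second input consists solely of ASCII bytes, so it decodes as UTF-8
without error, although it looks like ISO-2022-JP text.  What to do once
such a stream has been flagged is left open here.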

