What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Sat Jun 6 06:17:48 CDT 2020

Frequency analysis of bigrams and trigrams, provided the text is not too short, can reveal the encoding and even the language. But this is not normally the province of text editors and word processing software.

Jonathan Rosenne
-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Eli Zaretskii via Unicode
Sent: Saturday, June 6, 2020 1:58 PM
To: Harriet Riddle
Cc: Shawn.Steele at microsoft.com; tom at honermann.net; alisdairm at me.com; unicode at unicode.org
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

> From: Harriet Riddle <harjitmoe at outlook.com>
> CC: "tom at honermann.net" <tom at honermann.net>, "alisdairm at me.com"
> 	<alisdairm at me.com>, "unicode at unicode.org" <unicode at unicode.org>
> Date: Sat, 6 Jun 2020 10:05:49 +0000
> 
> In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and
> the Gx-set designating ESC sequences make no sense in UTF-8.

What do you mean by "make no sense"?  A general-purpose editor is
presented with a byte stream and needs to decide how to interpret and
display it.  It usually has no meta-data about the byte stream to help
it decide what does and doesn't make sense.  It doesn't even know
whether the byte stream is human-readable text or just raw binary
bytes.

I understand that, given enough of the byte stream, one can analyze it
and see whether interpreting it as one encoding or another will make
more sense.  But these decisions are sometimes required after only a
small portion of the material has arrived (a case in point: a process
or a network connection that outputs text in relatively small chunks).

In any case, I was responding to a proposal to treat any text as UTF-8
"unless proven otherwise".  My point is that with ISO 2022 encoding,
and perhaps also others, such a proof is not really at hand.

> If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a
> set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the
> sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.

Treating ESC sequences as telltale signs of ISO 2022 is not foolproof,
either.  For example, you may be looking at UTF-8 text interspersed
with terminal control sequences, such as SGR (Select Graphic Rendition).
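
A rough Python sketch of this pitfall. The two patterns and the function names are hypothetical and far from exhaustive: one matches a handful of ISO-2022-JP(-2) charset designations, the other matches ECMA-48 CSI sequences, the family that color and attribute sequences belong to. It assumes CPython's built-in `iso2022_jp` codec for the sample data.

```python
import re

# Hypothetical, non-exhaustive patterns for illustration only.
# A few ISO-2022-JP(-2) designations: ESC $ @, ESC $ A, ESC $ B,
# ESC $ ( C, ESC $ ( D, ESC ( B, ESC ( J.
ISO2022_DESIGNATION = re.compile(rb"\x1b(?:\$[@AB]|\$\([CD]|\([BJ])")
# ECMA-48 CSI sequence: ESC [, parameter bytes, intermediate bytes, final byte.
CSI_SEQUENCE = re.compile(rb"\x1b\[[0-?]*[ -/]*[@-~]")


def naive_is_iso2022(data: bytes) -> bool:
    """Flag any ESC, SO or SI byte as a sign of ISO 2022."""
    return any(ctl in data for ctl in (b"\x1b", b"\x0e", b"\x0f"))


def narrower_is_iso2022(data: bytes) -> bool:
    """Ignore CSI sequences, then look for a known charset designation."""
    stripped = CSI_SEQUENCE.sub(b"", data)
    return ISO2022_DESIGNATION.search(stripped) is not None


# UTF-8 text wrapped in terminal color sequences (red on, attributes off).
colored_utf8 = "\x1b[31mcafé\x1b[0m".encode("utf-8")
japanese = "日本語".encode("iso2022_jp")

print(naive_is_iso2022(colored_utf8), narrower_is_iso2022(colored_utf8))
print(naive_is_iso2022(japanese), narrower_is_iso2022(japanese))
```

The naive check misclassifies the colored UTF-8 sample, while the narrower one does not. Even the narrower check remains a heuristic: UTF-8 text or binary data may contain the bytes ESC $ B for unrelated reasons.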

Bottom line: the real world out there is not as clean as we might
think, and those rare corner cases keep breaking any simple-minded
decision rules such as "assume UTF-8 by default".