What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Eli Zaretskii
eliz at gnu.org
Sat Jun 6 05:57:57 CDT 2020
> From: Harriet Riddle <harjitmoe at outlook.com>
> CC: "tom at honermann.net" <tom at honermann.net>, "alisdairm at me.com"
> <alisdairm at me.com>, "unicode at unicode.org" <unicode at unicode.org>
> Date: Sat, 6 Jun 2020 10:05:49 +0000
>
> In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and
> the Gx-set designating ESC sequences make no sense in UTF-8.
What do you mean by "make no sense"? A general-purpose editor is
presented with a byte stream and needs to decide how to interpret and
display it. It usually has no meta-data about the byte stream to help
it decide what does and doesn't make sense. It doesn't even know
whether the byte stream is human-readable text or just raw binary
bytes.

I understand that, given enough of the byte stream, one can analyze it
and see whether interpreting it as one encoding or another will make
more sense. But these decisions are sometimes required after only a
small portion of the material has arrived (a case in point: a process
or a network connection that outputs text in relatively small chunks).
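
To make the small-chunk problem concrete, here is a minimal sketch in Python. The candidate list, the helper name `still_plausible`, and the sample chunks are all made up for illustration; a real editor would weigh far more evidence than "does it decode without error".

```python
# Hypothetical candidate list; real editors consider many more encodings.
CANDIDATES = ["ascii", "utf-8", "latin-1", "iso2022_jp"]

def still_plausible(data):
    """Return the candidate encodings that decode `data` without error."""
    ok = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok

# Made-up output of a process that writes in small chunks.
chunks = [b"Status: ", b"caf\xc3\xa9 ready\n"]

# After the first chunk nothing has been ruled out: pure ASCII is
# consistent with every candidate.
after_first = still_plausible(chunks[0])

# Only the later chunk carries bytes that rule out plain ASCII.
# Note that latin-1 accepts any byte, so it is never ruled out this way.
after_all = still_plausible(b"".join(chunks))

print(after_first)
print(after_all)
```

If the decision has to be made when only the first chunk is available, a test of this kind gives no basis for preferring one candidate over another.
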
In any case, I was responding to a proposal to treat any text as UTF-8
"unless proven otherwise". My point is that with ISO 2022 encoding,
and perhaps also others, such a proof is not really at hand.
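
As a rough illustration (a sketch assuming Python's standard `iso2022_jp` codec; the sample string is arbitrary), ISO-2022-JP output consists only of 7-bit bytes, so a strict UTF-8 decode of it never raises an error and never supplies the "proof otherwise":

```python
text = "日本語"  # arbitrary sample string

# Encode with 7-bit ISO-2022-JP; the result starts with the
# designation sequence ESC $ B.
raw = text.encode("iso2022_jp")

# Every byte has the high bit clear ...
seven_bit = all(b < 0x80 for b in raw)

# ... so a strict UTF-8 decode succeeds without complaint,
# but yields escape sequences and ASCII letters, not the text.
as_utf8 = raw.decode("utf-8", errors="strict")

# Decoding with the right codec recovers the original.
round_trip = raw.decode("iso2022_jp")

print(seven_bit)
print(as_utf8 == text)
print(round_trip == text)
```

The UTF-8 decode is "valid" in the formal sense, yet the result is mojibake, so validity alone cannot settle the question.
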
> If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a
> set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the
> sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.
Treating ESC sequences as telltale signs of ISO 2022 is not foolproof,
either. For example, you may be looking at UTF-8 text interspersed
with terminal control sequences, such as SGR.
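
A small sketch of that failure mode, in Python. The rule `naive_looks_like_iso2022` is hypothetical and deliberately simple-minded (any ESC, SO or SI byte counts as evidence of ISO 2022); it is not taken from any actual editor.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

def naive_looks_like_iso2022(data):
    """Hypothetical rule: any ESC, SO or SI byte means ISO 2022."""
    return any(b in (ESC, SO, SI) for b in data)

# UTF-8 text wrapped in SGR sequences: ESC [ 1 m (bold) and ESC [ 0 m (reset).
colored = "\x1b[1mna\u00efve caf\u00e9\x1b[0m".encode("utf-8")

# Genuine ISO-2022-JP text, for comparison.
iso = "日本語".encode("iso2022_jp")

# The rule fires on both inputs, so it cannot tell them apart.
misfire = naive_looks_like_iso2022(colored)
genuine = naive_looks_like_iso2022(iso)

print(misfire)
print(genuine)
```

Both inputs trip the rule, although only the second one is ISO 2022.
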
Bottom line: the real world out there is not as clean as we might
think, and those rare corner cases keep breaking any simple-minded
decision rules such as "assume UTF-8 by default".