What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
harjitmoe at outlook.com
Sat Jun 6 07:20:40 CDT 2020
Point taken about it not necessarily being human readable text. I was mainly considering the case of distinguishing between a collection of files, the older ones being in ISO-2022-JP and the newer ones in UTF-8.
In response to the comment about SGR sequences:
ISO/IEC 2022 (ECMA-35, JIS X 0202), specifically section 13 (referencing the ECMA version), ultimately defines the format of all ANSI/ISO compliant escape sequences, whether in an actual ISO/IEC 2022 encoding (including both 7-bit code versions, and also 8-bit code versions such as ISO-8859-1) or in ISO 10646 / Unicode. The main difference is that ISO/IEC 10646 adds the requirement that they be padded to the code unit width, which is only relevant in the context of UTF-16 or UTF-32.
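To illustrate the padding point, here is a small sketch using Python's standard codecs on the SGR "bold" sequence ESC [ 1 m. The variable names are only for illustration, and the big-endian codec variants are chosen so that no BOM is added to the output.

```python
# The SGR "bold" sequence: ESC [ 1 m
sgr_bold = "\x1b[1m"

# In UTF-8 the code unit is one byte, so the sequence is unchanged.
utf8 = sgr_bold.encode("utf-8")

# In UTF-16 and UTF-32 each byte of the sequence is padded out to
# one full code unit (2 or 4 bytes respectively).
utf16 = sgr_bold.encode("utf-16-be")
utf32 = sgr_bold.encode("utf-32-be")

print(utf8.hex(" "))
print(utf16.hex(" ", 2))
print(utf32.hex(" ", 4))
```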
However, "type Fe" escape sequences, i.e. ESC 0x40 (ESC @) through ESC 0x5F (ESC _) with no intervening bytes, are delegated to the C1 control code set in use, normally ISO/IEC 6429 (ECMA-48, JIS X 0211). The escape sequence ESC 0x5B (ESC [), which is the CSI control in turn used at the start of SGR, CUP etc. sequences, is one of these.
The sequence ESC $ B (ESC 0x24 0x42), on the other hand, is a "type 4F" escape sequence, with a function defined by ISO/IEC 2022 itself.
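As a rough sketch of that classification, assuming I am reading the byte ranges in ECMA-35 section 13 correctly, the type can be read off the byte that follows ESC. The function name classify_escape is made up for this example.

```python
def classify_escape(seq: bytes) -> str:
    """Name the ECMA-35 escape sequence type from the byte after ESC."""
    if len(seq) < 2 or seq[0] != 0x1B:
        raise ValueError("not an escape sequence")
    first = seq[1]
    if 0x20 <= first <= 0x2F:
        # Intermediate byte: type nF, where n is the low nibble,
        # so ESC $ (0x24) gives "4F".
        return f"{first & 0x0F}F"
    if 0x30 <= first <= 0x3F:
        return "Fp"  # private use
    if 0x40 <= first <= 0x5F:
        return "Fe"  # delegated to the C1 control set, e.g. ESC [ = CSI
    if 0x60 <= first <= 0x7E:
        return "Fs"  # standardised single functions
    raise ValueError("invalid byte after ESC")


print(classify_escape(b"\x1b[1m"))  # Fe
print(classify_escape(b"\x1b$B"))   # 4F
```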
And yes, some of these sequences are supported by e.g. xterm, but mainly for their ISO 2022 code-switching purpose, e.g. using ESC - F to switch from ISO-8859-1 to ISO-8859-7, or ESC % G to switch from an ISO 2022 code version (such as ISO 8859) to UTF-8.
So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.
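A minimal sketch of that heuristic, for the case of a mixed collection of ISO-2022-JP and UTF-8 files, might look like the following. The function name decode_guess and the fallback policy are assumptions made for the example. It uses Python's built-in iso2022_jp and utf-8 codecs, and it only checks for ESC $ B, not for the other designation sequences.

```python
ESC_DOLLAR_B = b"\x1b$B"  # designates JIS X 0208 as the G0 set


def decode_guess(data: bytes):
    """Return (text, codec_name) using ESC $ B as the ISO-2022-JP signal."""
    # 7-bit ISO 2022 never sets the high bit, so require pure 7-bit data
    # as well as the designation sequence before trying ISO-2022-JP.
    if ESC_DOLLAR_B in data and all(b < 0x80 for b in data):
        try:
            return data.decode("iso2022_jp"), "iso2022_jp"
        except UnicodeDecodeError:
            pass  # fall through to UTF-8
    # Otherwise treat the data as UTF-8 (plain ASCII is valid UTF-8 too).
    return data.decode("utf-8", errors="replace"), "utf-8"


old_file = "日本語".encode("iso2022_jp")
new_file = "日本語".encode("utf-8")
print(decode_guess(old_file))
print(decode_guess(new_file))
```

This needs the whole byte string up front, so it does not address the case of text arriving in small chunks.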
If you're dealing with non-ISO-compliant escape sequences used by some terminal, then fair enough.
From: Eli Zaretskii <eliz at gnu.org>
Sent: 06 June 2020 12:57
To: Harriet Riddle <harjitmoe at outlook.com>
Cc: Shawn.Steele at microsoft.com <Shawn.Steele at microsoft.com>; tom at honermann.net <tom at honermann.net>; alisdairm at me.com <alisdairm at me.com>; unicode at unicode.org <unicode at unicode.org>
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
> From: Harriet Riddle <harjitmoe at outlook.com>
> CC: "tom at honermann.net" <tom at honermann.net>, "alisdairm at me.com"
> <alisdairm at me.com>, "unicode at unicode.org" <unicode at unicode.org>
> Date: Sat, 6 Jun 2020 10:05:49 +0000
> In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and
> the Gx-set designating ESC sequences make no sense in UTF-8.
What do you mean by "make no sense"? A general-purpose editor is
presented with a byte stream and needs to decide how to interpret and
display it. It usually has no meta-data about the byte stream to help
it decide what does and doesn't make sense. It doesn't even know
whether the byte stream is human-readable text or just raw binary.
I understand that, given enough of the byte stream, one can analyze it
and see whether interpreting it as one encoding or another will make
more sense. But these decisions are sometimes required after only a
small portion of the material has arrived (a case in point: a process
or a network connection that outputs text in relatively small chunks).
In any case, I was responding to a proposal to treat any text as UTF-8
"unless proven otherwise". My point is that with ISO 2022 encoding,
and perhaps also others, such a proof is not really at hand.
> If that isn't feasible, then, more moderate measures might include trying 7-bit ISO 2022 and if it runs into a
> set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the
> sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.
Treating ESC sequences as telltale signs of ISO 2022 is not foolproof,
either. For example, you may be looking at UTF-8 text interspersed
with terminal control sequences, like SGR or somesuch.
Bottom line: the real world out there is not as clean as we might
think, and those rare corner cases keep breaking any simple-minded
decision rules such as "assume UTF-8 by default".