What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Eli Zaretskii
eliz at gnu.org
Sat Jun 6 07:53:30 CDT 2020
> From: Harriet Riddle <harjitmoe at outlook.com>
> CC: "Shawn.Steele at microsoft.com" <Shawn.Steele at microsoft.com>,
> "tom at honermann.net" <tom at honermann.net>, "alisdairm at me.com"
> <alisdairm at me.com>, "unicode at unicode.org" <unicode at unicode.org>
> Date: Sat, 6 Jun 2020 12:20:40 +0000
>
> So it is true that detecting ESC on its own will not identify 7-bit ISO 2022, but the specific sequence ESC $ B
> (ESC 0x24 0x42) has only one ANSI/ISO compliant meaning, which is to switch the G0 set to JIS X 0208. In
> UTF-8, there is no such thing as a G0 set (due to it not being fully ISO 2022 based), so it is meaningless.
If you are saying that "ESC $ B" or similar sequences can be
considered as evidence that the text is not in UTF-8, then I might
concur. Whether that amounts to "proof" sufficient to rule out UTF-8,
I'm not sure.
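
To illustrate the difference between evidence and proof here, below is
a minimal Python sketch. It is not taken from any existing detector:
the function names are hypothetical, and it assumes the whole input is
available as a bytes object. It relies on one fact: 7-bit ISO 2022
text consists only of bytes below 0x80, so it always passes a UTF-8
validity check. The escape sequence can therefore only be reported as
a separate hint, and what to do with that hint is left to the caller.

```python
# ESC $ B (0x1B 0x24 0x42): designates JIS X 0208 to G0 in ISO 2022.
ESC_DOLLAR_B = b"\x1b$B"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def has_jis_x_0208_designation(data: bytes) -> bool:
    """Return True if the ESC $ B sequence appears anywhere in the data."""
    return ESC_DOLLAR_B in data


def utf8_evidence(data: bytes) -> dict:
    """Collect both signals without deciding between them.

    A hypothetical helper: it reports the hints and leaves the policy
    (reject UTF-8 or not) to the caller.
    """
    return {
        "valid_utf8": is_valid_utf8(data),
        "iso2022_escape": has_jis_x_0208_designation(data),
    }


# Sample inputs: the same text in ISO-2022-JP and in UTF-8.
iso2022_sample = "日本語".encode("iso2022_jp")
utf8_sample = "日本語".encode("utf-8")
```

For the ISO-2022-JP sample both signals come back true, which is the
ambiguous case under discussion: the bytes are well-formed UTF-8, yet
they contain a sequence that has no meaning in UTF-8.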