What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Harriet Riddle
harjitmoe at outlook.com
Sat Jun 6 05:05:49 CDT 2020
In theory, one decoder can handle both, since 7-bit ISO 2022 generally starts out in ASCII, and SI, SO and the Gx-set designating ESC sequences make no sense in UTF-8. So, handling the left-hand side (bytes with the high bit unset) as (say) ISO-2022-JP-2 and the right-hand side (bytes with the high bit set) as UTF-8 could work, with no ambiguity in practice. I do not recommend this for general use, since allowing this sort of mixed encoding at the receiving end can allow data to bypass upstream XSS sanitisers et cetera, but you presumably know how relevant this concern is to your work. It also probably doesn't make sense to write a decoder from scratch for this, unless you were doing that anyway.
If that isn't feasible, then more moderate measures might include trying 7-bit ISO 2022 and, if it runs into a set high bit, retrying with UTF-8. Or trying UTF-8 and, if the result contains SI, SO or (for instance) the sequence ESC $ B (U+001B U+0024 U+0042), retrying with (for instance) ISO-2022-JP-2.
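For illustration, the second of those fallbacks (UTF-8 first, then retry) might look roughly like the Python sketch below. The function name and the marker list are my own choices for the example, not anything standard; it assumes Python's built-in "iso2022_jp_2" codec is a suitable fallback and that whole-buffer decoding is acceptable.

```python
# Hypothetical helper: try UTF-8 first, and retry as 7-bit ISO 2022 only if
# the decoded text contains characters that make no sense in UTF-8 content.
SHIFT_IN = "\x0f"            # SI
SHIFT_OUT = "\x0e"           # SO
ESC_DOLLAR_B = "\x1b$B"      # ESC $ B, which designates JIS X 0208

def decode_utf8_or_iso2022(data: bytes, fallback: str = "iso2022_jp_2") -> str:
    # Raises UnicodeDecodeError if the data is not valid UTF-8 at all.
    text = data.decode("utf-8")
    markers = (SHIFT_IN, SHIFT_OUT, ESC_DOLLAR_B)
    if any(marker in text for marker in markers):
        # Looks like ISO 2022; decode the original bytes again.
        # This raises if the data also contains bytes with the high bit set.
        return data.decode(fallback)
    return text
```

Note that this deliberately refuses mixed input: data containing both an ISO 2022 marker and high-bit bytes ends in an error rather than a guess, which sidesteps the sanitiser-bypass concern mentioned above.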
________________________________
From: Unicode <unicode-bounces at unicode.org> on behalf of Eli Zaretskii via Unicode <unicode at unicode.org>
Sent: 06 June 2020 09:12
To: Shawn Steele <Shawn.Steele at microsoft.com>
Cc: tom at honermann.net <tom at honermann.net>; alisdairm at me.com <alisdairm at me.com>; unicode at unicode.org <unicode at unicode.org>
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
> CC: "tom at honermann.net" <tom at honermann.net>,
> "alisdairm at me.com"
> <alisdairm at me.com>,
> "unicode at unicode.org" <unicode at unicode.org>
> Date: Sat, 6 Jun 2020 06:58:55 +0000
> From: Shawn Steele via Unicode <unicode at unicode.org>
>
> I mentioned that later.... But there is a lot of content for interchange that are single/double byte (8 bit) rather than requiring escape sequences. The 2022 encodings seem rarer, though it may depend on your data source.
I agree that ISO 2022 is rare these days, but rarity doesn't help when
you need to be accurate in decoding, because mistaking one encoding
for another produces horribly incorrect results, and users complain
vociferously when that happens.