What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Shawn.Steele at microsoft.com
Fri Jun 5 17:33:23 CDT 2020
I’ve been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then consider falling back to some other code page. Pretty much all the other code pages would contain text that, read as UTF-8, looks like unexpected trail bytes, or lead bytes without trail bytes, etc. One can anecdotally find single-word Latin examples that break the pattern (Nestlé® IIRC), but if you want to think of accuracy in terms of “9s”, then this heuristic has pretty much as many nines as you have bytes of input data.
I did find some DBCS CJK text that could look like valid UTF-8, so my “one nine per byte of input” estimate isn’t quite as high there; however, for meaningful runs of text it is still reasonably hard to make sensible text in a double-byte code page look like UTF-8. Note that this “works” partially because the ASCII range of the SBCS/DBCS code pages typically looks like ASCII, as does UTF-8. If you had data in a 7-bit code page with stateful shift sequences, of course that wouldn’t produce errors when decoded as UTF-8. Fortunately for your scenario, source code in 7-bit encodings is very rare nowadays.
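A minimal sketch of the "assume UTF-8, fall back on failure" approach described above, in Python. The function name `decode_source` and the choice of `cp1252` as the fallback are assumptions for illustration only; the right fallback code page depends on where the files came from.

```python
from typing import Tuple


def decode_source(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8.

    `fallback` is a hypothetical default; pick the legacy code page
    that matches the project's history.
    """
    try:
        # Strict decoding rejects stray trail bytes, truncated
        # sequences, and lead bytes without trail bytes.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so treat it as legacy-encoded text.
        # errors="replace" covers bytes the fallback leaves undefined.
        return data.decode(fallback, errors="replace"), fallback


# Legacy text with a single accented letter is not valid UTF-8.
print(decode_source("café".encode("cp1252")))

# Pure ASCII decodes identically either way.
print(decode_source(b"int main() {}"))

# A short legacy sequence that happens to be valid UTF-8:
# "Ã©" in cp1252 is the byte pair C3 A9, which is UTF-8 for "é".
print(decode_source("Ã©".encode("cp1252")))
```

The last call shows the kind of short false positive mentioned above: the heuristic cannot tell those two bytes apart, but longer runs of sensible legacy text rarely stay valid as UTF-8.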
Hope that helps,
From: Tom Honermann <tom at honermann.net>
Sent: Friday, June 5, 2020 15:15
To: Shawn Steele <Shawn.Steele at microsoft.com>
Cc: Alisdair Meredith <alisdairm at me.com>; Unicode Mail List <unicode at unicode.org>
Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
The modern viewpoint is that the BOM should be discouraged in all contexts (along with the advice that you should always be using Unicode encodings, probably UTF-8 or UTF-16). I’d recommend that anyone encountering ASCII-like data presume it is UTF-8 unless proven otherwise.
Are you asking because you’re interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
The latter. In particular, as a differentiator between shiny new UTF-8 encoded source code files and long-in-the-tooth legacy encoded source code files coexisting (perhaps via transitive package dependencies) within a single project.
Anecdotally, if you can decode data without error in UTF-8, then it’s probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short examples that can confuse it.
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Tom Honermann via Unicode
Sent: Friday, June 5, 2020 13:10
To: unicode at unicode.org
Cc: Alisdair Meredith <alisdairm at me.com>
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Unicode 13 section 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 23.8, Specials, for more information.
The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend both to the presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of a BOM as a UTF-8 encoding signature is to be avoided in favor of some other mechanism?
The referenced "Byte Order Mark" subsection in Unicode 13 section 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.
Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ<https://www.unicode.org/faq/utf_bom.html> does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.
So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?
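For concreteness, here is a rough sketch of what "BOM as a UTF-8 signature" means at the byte level, in Python. The helper name `split_utf8_signature` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library. This only illustrates the mechanism and takes no position on whether the signature should be used.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_signature, payload) for a byte string.

    The UTF-8 signature is U+FEFF encoded as UTF-8: EF BB BF.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Signature present: the rest is the actual UTF-8 payload.
        return True, data[len(codecs.BOM_UTF8):]
    # No signature: the encoding must be determined some other way.
    return False, data


signed = codecs.BOM_UTF8 + "int x;".encode("utf-8")
print(split_utf8_signature(signed))
print(split_utf8_signature(b"int x;"))

# The "utf-8-sig" codec strips a leading signature when decoding
# and accepts input without one.
print(signed.decode("utf-8-sig"))
```

A file without the signature tells the reader nothing by itself, which is the case where a fallback heuristic or out-of-band metadata has to take over.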