What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Shawn Steele
Shawn.Steele at microsoft.com
Fri Jun 5 16:47:49 CDT 2020
The modern viewpoint is that the BOM should be discouraged in all contexts (along with the advice that you should always be using a Unicode encoding, probably UTF-8 or UTF-16). I’d recommend that anyone encountering ASCII-like data presume it is UTF-8 unless proven otherwise.
Are you asking because you’re interested in differentiating UTF-8 from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
Anecdotally, if you can decode data without error as UTF-8, then it’s probably UTF-8. Sensible sequences in other encodings rarely look like valid UTF-8, though there are a few short sequences that can fool this check.
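As a rough illustration of that heuristic, here is a minimal Python sketch. The function names `looks_like_utf8` and `guess_decode` and the choice of `cp1252` as the legacy fallback are assumptions made for this example, not anything prescribed by Unicode.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_decode(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 when possible, else as an assumed legacy encoding."""
    if looks_like_utf8(data):
        return data.decode("utf-8")
    # The fallback encoding is a guess; replace anything it cannot map.
    return data.decode(fallback, errors="replace")


# "café" in Latin-1 ends in a lone 0xE9 byte, which is not valid UTF-8.
legacy = "café".encode("latin-1")

# One of the short confusing cases: the Latin-1 bytes for "Ã©" are
# 0xC3 0xA9, which is also the valid UTF-8 encoding of "é".
ambiguous = "Ã©".encode("latin-1")
```

With these inputs, `looks_like_utf8(legacy)` is false while `looks_like_utf8(ambiguous)` is true, which is why the check is a strong hint and not a proof.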
-Shawn
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Tom Honermann via Unicode
Sent: Friday, June 5, 2020 13:10
To: unicode at unicode.org
Cc: Alisdair Meredith <alisdairm at me.com>
Subject: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, states (emphasis mine):
... *Use of a BOM is neither required nor recommended for UTF-8*, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 23.8, Specials, for more information.
The emphasized statement is unconditional regarding the recommendation, but it isn't clear to me that this recommendation is intended to extend both to the presence of a BOM in contexts where the encoding is known to be UTF-8 (where the BOM provides no additional information) and to contexts where the BOM signifies the presence of UTF-8 encoded text (where the BOM does provide additional information). Is the guidance intended to state that, when possible, use of a BOM as an encoding signature is to be avoided in favor of some other mechanism?
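To make the two contexts concrete, the following Python sketch shows a BOM being treated as a signature and being discarded when the encoding is already known. The helper names are hypothetical; `codecs.BOM_UTF8` (the bytes EF BB BF) and the `utf-8-sig` codec come from the Python standard library.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Report whether data starts with the UTF-8 BOM and return the rest."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


def decode_utf8_ignoring_signature(data: bytes) -> str:
    """Decode known-UTF-8 data, dropping a leading BOM if one is present."""
    # "utf-8-sig" skips one leading BOM and otherwise behaves like "utf-8".
    return data.decode("utf-8-sig")
```

Decoding the same bytes with the plain `utf-8` codec keeps the BOM as a leading U+FEFF character, so the consumer has to decide whether it is a signature to strip or part of the text.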
The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 (Specials) contains no similar guidance; it is factual and details some possible consequences of use, but does not apply a judgement. The discussion of use with other character sets could be read as an endorsement for use of a BOM as an encoding signature.
Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ <https://www.unicode.org/faq/utf_bom.html> does not recommend for or against use of a BOM as an encoding signature. It also can be read as endorsing such usage.
So, my question is, what exactly is the intent of the emphasized statement above? Is the recommendation intended to be so broadly worded? Or is it only intended to discourage BOM use in cases where the encoding is known by other means?
Tom.