What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Tom Honermann
tom at honermann.net
Fri Jun 5 17:15:08 CDT 2020
On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>
> The modern viewpoint is that the BOM should be discouraged in all
> contexts. (Along with you should always be using Unicode encodings,
> probably UTF-8 or UTF-16). I’d recommend to anyone encountering
> ASCII-like data to presume it was UTF-8 unless proven otherwise.
>
> Are you asking because you’re interested in differentiating UTF-8 from
> UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
>
The latter. In particular, as a differentiator between shiny new UTF-8
encoded source code files and long-in-the-tooth legacy encoded source
code files coexisting (perhaps via transitive package dependencies)
within a single project.
Tom.
> Anecdotally, if you can decode data without error in UTF-8, then it’s
> probably UTF-8. Sensible sequences in other encodings rarely look
> like valid UTF-8, though there are a few short examples that can
> confuse it.
>
> -Shawn
>
> *From:* Unicode <unicode-bounces at unicode.org> *On Behalf Of* Tom
> Honermann via Unicode
> *Sent:* Friday, June 5, 2020 13:10
> *To:* unicode at unicode.org
> *Cc:* Alisdair Meredith <alisdairm at me.com>
> *Subject:* What is the Unicode guidance regarding the use of a BOM as
> a UTF-8 encoding signature?
>
> Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order,
> states (emphasis mine):
>
> ... *Use of a BOM is neither required nor recommended for UTF-8*,
> but may be encountered in contexts where UTF-8 data is converted
> from other encoding forms that use a BOM or where the BOM
> is used as a UTF-8 signature. See the “Byte Order Mark”
> subsection in Section 23.8, Specials, for more information.
>
> The emphasized statement is unconditional regarding the
> recommendation, but it isn't clear to me that this recommendation is
> intended to extend to both presence of a BOM in contexts where the
> encoding is known to be UTF-8 (where the BOM provides no additional
> information) and to contexts where the BOM signifies the presence of
> UTF-8 encoded text (where the BOM does provide additional
> information). Is the guidance intended to state that, when possible,
> use of a BOM as a UTF-8 encoding signature is to be avoided in favor
> of some other mechanism?
>
> The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8
> (Specials) contains no similar guidance; it is factual and details
> some possible consequences of use, but does not apply a judgement.
> The discussion of use with other character sets could be read as an
> endorsement for use of a BOM as an encoding signature.
>
> Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ
> <https://www.unicode.org/faq/utf_bom.html> does not recommend for or
> against use of a BOM as an encoding signature. It also can be read as
> endorsing such usage.
>
> So, my question is, what exactly is the intent of the emphasized
> statement above? Is the recommendation intended to be so broadly
> worded? Or is it only intended to discourage BOM use in cases where
> the encoding is known by other means?
>
> Tom.
>
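The two signals discussed in this thread, a leading BOM used as a UTF-8 signature and Shawn's heuristic that data which decodes without error as UTF-8 is probably UTF-8, can be sketched roughly as follows. This is only an illustration and not something anyone in the thread proposed: the function name classify_source_bytes and its three return labels are hypothetical, and it assumes the whole file is available as bytes.

```python
import codecs


def classify_source_bytes(data: bytes) -> str:
    """Guess how a source file is encoded (hypothetical helper).

    Returns one of three illustrative labels:
      "utf-8-bom": the data starts with the UTF-8 BOM (EF BB BF)
      "utf-8":     no BOM, but the data is well-formed UTF-8
      "legacy":    the data is not well-formed UTF-8
    """
    # The BOM acts as an explicit signature; codecs.BOM_UTF8 is b"\xef\xbb\xbf".
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"
    # No signature: fall back to the "decodes without error" heuristic.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return "legacy"
    # Pure ASCII also lands here, since ASCII is a subset of UTF-8.
    return "utf-8"


# A file with a BOM signature.
print(classify_source_bytes(b"\xef\xbb\xbfint main() {}"))
# "café" encoded as UTF-8 (no BOM) versus as Latin-1.
print(classify_source_bytes("caf\u00e9".encode("utf-8")))
print(classify_source_bytes("caf\u00e9".encode("latin-1")))
# One of the short confusing cases: Latin-1 "Ã©" is byte-identical to UTF-8 "é".
print(classify_source_bytes("\u00c3\u00a9".encode("latin-1")))
```

The last call shows why the decoding heuristic is only probabilistic: a short legacy-encoded sequence can happen to be well-formed UTF-8, which is the case where a BOM (or some other out-of-band mechanism) provides information that the bytes alone do not.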