What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Fri Jun 5 17:15:08 CDT 2020

On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>
> The modern viewpoint is that the BOM should be discouraged in all 
> contexts. (Likewise, you should always be using Unicode encodings, 
> probably UTF-8 or UTF-16.) I’d recommend that anyone encountering 
> ASCII-like data presume it is UTF-8 unless proven otherwise.
>
> Are you asking because you’re interested in differentiating UTF-8 from 
> UTF-16?  Or UTF-8 from some other legacy non-Unicode encoding?
>
The latter.  In particular, as a differentiator between shiny new UTF-8 
encoded source code files and long-in-the-tooth legacy encoded source 
code files coexisting (perhaps via transitive package dependencies) 
within a single project.

Tom.

> Anecdotally, if you can decode data without error in UTF-8, then it’s 
> probably UTF-8.  Sensible sequences in other encodings rarely look 
> like valid UTF-8, though there are a few short examples that can 
> confuse it.
>
> -Shawn
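
The "decodes without error" heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the sample strings are hypothetical, and a strict decode is just one way to apply the heuristic.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8 without error.

    Hypothetical helper: a successful decode only suggests UTF-8,
    it does not prove it.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical samples: the same text in UTF-8 and in a legacy encoding.
utf8_sample = "na\u00efve caf\u00e9".encode("utf-8")
latin1_sample = "na\u00efve caf\u00e9".encode("latin-1")

# A short legacy sequence that happens to be well-formed UTF-8:
# the Latin-1 bytes C3 A9 are also the UTF-8 encoding of U+00E9.
ambiguous_sample = "\u00c3\u00a9".encode("latin-1")

print(looks_like_utf8(utf8_sample))       # valid UTF-8
print(looks_like_utf8(latin1_sample))     # 0xEF followed by 'v' is ill-formed
print(looks_like_utf8(ambiguous_sample))  # legacy bytes that pass anyway
```

Pure ASCII input always passes, so the check only separates UTF-8 from a legacy encoding when the data contains bytes above 0x7F.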
>
> *From:* Unicode <unicode-bounces at unicode.org> *On Behalf Of *Tom 
> Honermann via Unicode
> *Sent:* Friday, June 5, 2020 13:10
> *To:* unicode at unicode.org
> *Cc:* Alisdair Meredith <alisdairm at me.com>
> *Subject:* What is the Unicode guidance regarding the use of a BOM as 
> a UTF-8 encoding signature?
>
> Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte order, 
> states (emphasis mine):
>
>     ... *Use of a BOM is neither required nor recommended for UTF-8*,
>     but may be encountered in contexts where UTF-8 data is converted
>     from other encoding forms that use a BOM or where the BOM is used
>     as a UTF-8 signature. See the “Byte Order Mark” subsection in
>     Section 23.8, Specials, for more information.
>
> The emphasized statement is unconditional regarding the 
> recommendation, but it isn't clear to me that this recommendation is 
> intended to extend to both presence of a BOM in contexts where the 
> encoding is known to be UTF-8 (where the BOM provides no additional 
> information) and to contexts where the BOM signifies the presence of 
> UTF-8 encoded text (where the BOM does provide additional 
> information).  Is the guidance intended to state that, when possible, 
> use of a BOM as a UTF-8 encoding signature is to be avoided in favor 
> of some other mechanism?
>
> The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8 
> (Specials) contains no similar guidance; it is factual and details 
> some possible consequences of use, but does not apply a judgement.  
> The discussion of use with other character sets could be read as an 
> endorsement for use of a BOM as an encoding signature.
>
> Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode FAQ 
> <https://www.unicode.org/faq/utf_bom.html> does not recommend for or 
> against use of a BOM as an encoding signature.  It also can be read as 
> endorsing such usage.
>
> So, my question is, what exactly is the intent of the emphasized 
> statement above?  Is the recommendation intended to be so broadly 
> worded?  Or is it only intended to discourage BOM use in cases where 
> the encoding is known by other means?
>
> Tom.
>
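
For concreteness, the case where the BOM does carry additional information might look like the following Python sketch. The function name and the cp1252 default are assumptions for illustration only; nothing here is prescribed by the Unicode Standard.

```python
import codecs


def decode_source(data: bytes, default_encoding: str) -> str:
    """Decode a source file, treating a leading BOM as a UTF-8 signature.

    Hypothetical helper: `default_encoding` stands in for whatever
    legacy encoding the project otherwise assumes.
    """
    if data.startswith(codecs.BOM_UTF8):  # the bytes EF BB BF
        # The signature is present: strip it and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the project's assumed encoding.
    return data.decode(default_encoding)


# A signed UTF-8 file and an unsigned legacy file in the same project.
signed = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")
legacy = "caf\u00e9".encode("cp1252")

print(decode_source(signed, "cp1252"))
print(decode_source(legacy, "cp1252"))
```

Both calls yield the same text, which is the mixed-encoding scenario in question: without the signature, the first file would be misread under the legacy default.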
