What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Fri Jun 5 22:28:52 CDT 2020

On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote:
> I’ve been recommending that people assume documents are UTF-8.  If the 
> UTF-8 decoding fails, then consider falling back to some other 
> codepage.   Pretty much all the other code pages would contain text that 
> would look like unexpected trail bytes, or lead bytes without trail 
> bytes, etc.  One can anecdotally find single-word Latin examples that 
> break the pattern (Nestlé® IIRC), but if you want to think of accuracy 
> in terms of “9s”, then that pretty much has as many nines as you have 
> bytes of input data.
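
As a rough illustration of the approach quoted above (a sketch, not anyone's production code): try strict UTF-8 first, and fall back to a legacy code page only when decoding fails. The function name and the choice of CP1252 as the fallback are assumptions made for this example.

```python
from typing import Tuple


def decode_utf8_first(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), preferring strict UTF-8."""
    try:
        # Strict decoding rejects stray trail bytes and truncated sequences.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in the code page become U+FFFD.
        return data.decode(fallback, errors="replace"), fallback
```

For instance, decode_utf8_first("Nestlé".encode("cp1252")) takes the fallback path, because a lone 0xE9 lead byte at the end of the input is not valid UTF-8.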

I have code that attempts to distinguish between UTF-8 and CP1252 
inputs.  It now does a pretty good job; no one has complained in several 
years.   To do this, I resort to some "semantic" analysis of the input. 
If it is syntactically valid UTF-8, but not a script run, it's not 
UTF-8.  Likewise, the texts it will be subjected to are going to be in 
modern commercially-valuable scripts, so not IPA, for example.  And it 
will be important characters, ones whose Age property is 1.1; text won't 
contain C1 controls.  CP1252 is harder than plain ASCII/Latin1/C1 
because many of the C1 controls are co-opted for graphic characters. 
Someone sent me the following example, scraped from some dictionaries, 
that it successfully gets right:

Muvrar\xE1\x9A\x9Aa is a mountain in Norway

is legal 1252, and syntactically legal UTF-8, but the "semantic" tests 
say it isn't UTF-8.
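
To make the idea concrete, here is a much-simplified Python sketch of such a "semantic" check. It is not the code described above. Python's standard library exposes neither the Script nor the Age property, so as a loudly labeled assumption this sketch uses the first word of each character's Unicode name as a crude stand-in for its script, and it skips the Age test entirely. The function names are hypothetical.

```python
import unicodedata


def script_proxy(ch: str) -> str:
    # Assumption: the first word of the character name ("LATIN", "OGHAM", ...)
    # approximates the Script property, which the stdlib does not provide.
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def looks_like_utf8(data: bytes) -> bool:
    """Syntactic UTF-8 check plus two crude plausibility tests."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Real-world text is assumed not to contain C1 controls (U+0080..U+009F).
    if any("\x80" <= ch <= "\x9f" for ch in text):
        return False
    # Every whitespace-delimited word must stay within one script proxy.
    for word in text.split():
        scripts = {script_proxy(ch) for ch in word if ch.isalpha()}
        if len(scripts) > 1:
            return False
    return True
```

On the example above, the bytes E1 9A 9A decode as an Ogham letter in the middle of a Latin word, so looks_like_utf8 returns False and the CP1252 reading ("Muvrarášša") wins. A proxy this crude will wrongly reject text that legitimately mixes scripts inside a word, such as Japanese, so it only illustrates the principle.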

I also have code that tries to distinguish between a UTF-8 POSIX locale 
and a non-UTF-8, and which needs to work on systems without certain C 
library functions that would make it foolproof.  That is less successful 
primarily because of insufficient text available to make a 
determination.  One might think that the operating system error messages 
would be fruitful, but it turns out that many are in English; no one 
bothered to translate them.  The locale's currency symbol is always 
translated, though the dollar sign is commonly used in other languages 
as part of the symbol.  The time and date names are usually translated, 
and I use them.
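
The shape of that heuristic can be sketched in Python as follows. This is an illustration under assumptions, not the actual implementation: it presumes the raw bytes of the locale's day and month names (or similar strings) have already been collected by some other means, and the function name is hypothetical. The three-way result reflects the "insufficient text" problem: all-ASCII samples prove nothing either way.

```python
from typing import Iterable, Optional


def guess_locale_is_utf8(samples: Iterable[bytes]) -> Optional[bool]:
    """True/False when the samples decide the question, None when they cannot."""
    saw_non_ascii = False
    for raw in samples:
        if raw.isascii():
            continue  # ASCII looks the same in UTF-8 and most legacy encodings
        saw_non_ascii = True
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError:
            return False  # non-ASCII bytes that are not well-formed UTF-8
    return True if saw_non_ascii else None
```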

> I did find some DBCS CJK text that could look like valid UTF-8, so my 
> “one nine per byte of input” isn’t quite as high there, however for 
> meaningful runs of text it is still reasonably hard to make sensible 
> text in a double byte codepage look like UTF-8.  Note that this “works” 
> partially because the ASCII range of the SBCS/DBCS code pages typically 
> looks like ASCII, as does UTF-8.  If you had 7-bit codepage data with 
> stateful shift sequences, of course that wouldn’t err in UTF-8.  
> Fortunately for your scenario source code in 7-bit encodings is very 
> rare nowadays.
> 
> Hope that helps,
> 
> -Shawn
> 
> *From:* Tom Honermann <tom at honermann.net>
> *Sent:* Friday, June 5, 2020 15:15
> *To:* Shawn Steele <Shawn.Steele at microsoft.com>
> *Cc:* Alisdair Meredith <alisdairm at me.com>; Unicode Mail List 
> <unicode at unicode.org>
> *Subject:* Re: What is the Unicode guidance regarding the use of a BOM 
> as a UTF-8 encoding signature?
> 
> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
> 
>     The modern viewpoint is that the BOM should be discouraged in all
>     contexts.   (Along with the advice that you should always be using
>     Unicode encodings, probably UTF-8 or UTF-16.)  I’d recommend to anyone
>     encountering ASCII-like data to presume it was UTF-8 unless proven
>     otherwise.
> 
>     Are you asking because you’re interested in differentiating UTF-8
>     from UTF-16?  Or UTF-8 from some other legacy non-Unicode encoding?
> 
> The latter.  In particular, as a differentiator between shiny new UTF-8 
> encoded source code files and long-in-the-tooth legacy encoded source 
> code files coexisting (perhaps via transitive package dependencies) 
> within a single project.
> 
> Tom.
> 
>     Anecdotally, if you can decode data without error in UTF-8, then
>     it’s probably UTF-8.  Sensible sequences in other encodings rarely
>     look like valid UTF-8, though there are a few short examples that
>     can confuse it.
> 
>     -Shawn
> 
>     *From:* Unicode <unicode-bounces at unicode.org> *On Behalf Of *Tom
>     Honermann via Unicode
>     *Sent:* Friday, June 5, 2020 13:10
>     *To:* unicode at unicode.org
>     *Cc:* Alisdair Meredith <alisdairm at me.com>
>     *Subject:* What is the Unicode guidance regarding the use of a BOM
>     as a UTF-8 encoding signature?
> 
>     Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte
>     order, states (emphasis mine):
> 
>         ... *Use of a BOM is neither required nor recommended for
>         UTF-8*, but may be encountered in contexts where UTF-8 data is
>         converted from other encoding forms that use a BOM or where
>         the BOM is used as a UTF-8 signature. See the “Byte Order
>         Mark” subsection in Section 23.8, Specials, for more
>         information.
> 
>     The emphasized statement is unconditional regarding the
>     recommendation, but it isn't clear to me that this recommendation is
>     intended to extend to both presence of a BOM in contexts where the
>     encoding is known to be UTF-8 (where the BOM provides no additional
>     information) and to contexts where the BOM signifies the presence of
>     UTF-8 encoded text (where the BOM does provide additional
>     information).  Is the guidance intended to state that, when
>     possible, use of a BOM as a UTF-8 encoding signature is to be
>     avoided in favor of some other mechanism?
> 
>     The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8
>     (Specials) contains no similar guidance; it is factual and details
>     some possible consequences of use, but does not apply a judgement. 
>     The discussion of use with other character sets could be read as an
>     endorsement for use of a BOM as an encoding signature.
> 
>     Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode
>     FAQ <https://www.unicode.org/faq/utf_bom.html> does not recommend
>     for or against use of a BOM as an encoding signature.  It also can
>     be read as endorsing such usage.
> 
>     So, my question is, what exactly is the intent of the emphasized
>     statement above?  Is the recommendation intended to be so broadly
>     worded?  Or is it only intended to discourage BOM use in cases where
>     the encoding is known by other means?
> 
>     Tom.
>
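
For reference, treating the BOM as a UTF-8 signature amounts to checking for the three bytes EF BB BF at the start of the data. A minimal Python sketch follows; the function name is an assumption for illustration, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the standard library.

```python
import codecs
from typing import Tuple


def split_utf8_signature(data: bytes) -> Tuple[bool, bytes]:
    """Return (has_signature, payload) with any leading UTF-8 BOM removed."""
    if data.startswith(codecs.BOM_UTF8):  # b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

A True result is strong evidence that the file is UTF-8; a False result says nothing either way, which is why the content-based checks discussed in this thread still matter. When the encoding is already known to be UTF-8, data.decode("utf-8-sig") drops a leading BOM if one is present and otherwise behaves like plain UTF-8 decoding.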