What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Karl Williamson public at khwilliamson.com
Sun Jun 7 14:29:50 CDT 2020

On 6/5/20 9:53 PM, Jonathan Rosenne via Unicode wrote:
> I am curious about how your code would work with CP1255 or CP1256?
> Best Regards,
> Jonathan Rosenne

Send me a few problematic strings, and I'll check them out.

> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode
> Sent: Saturday, June 6, 2020 6:29 AM
> To: Shawn Steele; Tom Honermann
> Cc: Alisdair Meredith; Unicode Mail List
> Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
> On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote:
>> I’ve been recommending that people assume documents are UTF-8.  If the
>> UTF-8 decoding fails, then consider falling back to some other
>> codepage.   Pretty much all the other code pages would contain text that
>> would look like unexpected trail bytes, or lead bytes without trail
>> bytes, etc.  One can anecdotally find single-word Latin examples that
>> break the pattern (Nestlé® IIRC), but if you want to think of accuracy
>> in terms of “9s”, then that pretty much has as many nines as you have
>> bytes of input data.
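In Python terms, the "assume UTF-8, fall back only on failure" strategy described above is roughly the following sketch (the choice of cp1252 as the fallback is illustrative, not prescribed):

```python
def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Try strict UTF-8 first; fall back to a legacy code page only
    if UTF-8 decoding raises an error (illustrative sketch)."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)
```

The Nestlé® example behaves as described: the CP1252 bytes fail strict UTF-8 decoding, so the fallback fires.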
> I have code that attempts to distinguish between UTF-8 and CP1252
> inputs.  It now does a pretty good job; no one has complained in several
> years.   To do this, I resort to some "semantic" analysis of the input.
> If it is syntactically valid UTF-8, but not a script run, it's not
> UTF-8.  Likewise, the texts it will be subjected to are going to be in
> modern commercially-valuable scripts, so not IPA, for example.  And it
> will be important characters, ones whose Age property is 1.1; text won't
> contain C1 controls.  CP1252 is harder than plain ASCII/Latin1/C1
> because many of the C1 controls are co-opted for graphic characters.
> Someone sent me the following example, scraped from some dictionaries,
> that it successfully gets right:
> Muvrar\xE1\x9A\x9Aa is a mountain in Norway
> is legal CP1252, and syntactically legal UTF-8, but the "semantic" tests
> say it isn't UTF-8.
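The example can be checked mechanically. The bytes \xE1\x9A\x9A are syntactically valid UTF-8 (they decode to U+169A, in the Ogham block) but á š š in CP1252. A minimal sketch of the idea, where the hard-coded block ranges are only a crude stand-in for the full script-run and Age analysis described above:

```python
def plausible_utf8(data: bytes) -> bool:
    """Syntactic UTF-8 check plus a crude stand-in for the "semantic"
    tests: reject C1 controls and (for illustration) the Ogham and
    Runic blocks, which are not modern commercially-valuable scripts."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # not even syntactically valid UTF-8
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:      # C1 controls
            return False
        if 0x1680 <= cp <= 0x16FF:  # Ogham and Runic blocks
            return False
    return True
```

Applied to the sample, the syntactic test passes but the semantic test rejects it, so the string is classified as CP1252 ("Muvrarášša...").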
> I also have code that tries to distinguish between a UTF-8 POSIX locale
> and a non-UTF-8 one, and which needs to work on systems without certain C
> library functions that would make it foolproof.  That is less successful
> primarily because of insufficient text available to make a
> determination.  One might think that the operating system error messages
> would be fruitful, but it turns out that many are in English; no one
> bothered to translate them.  The locale's currency symbol is always
> translated, though the dollar sign is commonly used in other languages
> as part of the symbol.  The time and date names are usually translated,
> and I use them.
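The core of that month/day-name heuristic can be sketched as follows. Gathering the sample strings themselves (e.g. via nl_langinfo or strftime under the locale being tested) is platform-dependent and omitted here; the sketch only shows the byte-level test:

```python
def samples_confirm_utf8(samples: list[bytes]) -> bool:
    """True only if every sample (e.g. translated month names) is
    valid UTF-8 AND at least one contains non-ASCII bytes; all-ASCII
    samples give no evidence either way, as noted above."""
    saw_non_ascii = False
    for s in samples:
        try:
            s.decode("utf-8")
        except UnicodeDecodeError:
            return False
        if any(b >= 0x80 for b in s):
            saw_non_ascii = True
    return saw_non_ascii
```

This mirrors the problem described above: untranslated English strings leave the function unable to confirm anything.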
>> I did find some DBCS CJK text that could look like valid UTF-8, so my
>> “one nine per byte of input” isn’t quite as high there, however for
>> meaningful runs of text it is still reasonably hard to make sensible
>> text in a double byte codepage look like UTF-8.  Note that this “works”
>> partially because the ASCII range of the SBCS/DBCS code pages typically
>> looks like ASCII, as does UTF-8.  If you had 7-bit code page data with
>> stateful shift sequences, of course that wouldn’t trigger a UTF-8
>> decoding error.
>> Fortunately for your scenario source code in 7 bit encodings is very
>> rare nowadays.
>> Hope that helps,
>> -Shawn
>> *From:* Tom Honermann <tom at honermann.net>
>> *Sent:* Friday, June 5, 2020 15:15
>> *To:* Shawn Steele <Shawn.Steele at microsoft.com>
>> *Cc:* Alisdair Meredith <alisdairm at me.com>; Unicode Mail List
>> <unicode at unicode.org>
>> *Subject:* Re: What is the Unicode guidance regarding the use of a BOM
>> as a UTF-8 encoding signature?
>> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>>      The modern viewpoint is that the BOM should be discouraged in all
>>      contexts.  (Along with the view that you should always be using
>>      Unicode encodings, probably UTF-8 or UTF-16.)  I’d recommend to anyone
>>      encountering ASCII-like data to presume it was UTF-8 unless proven
>>      otherwise.
>>      Are you asking because you’re interested in differentiating UTF-8
>>      from UTF-16?  Or UTF-8 from some other legacy non-Unicode encoding?
>> The latter.  In particular, as a differentiator between shiny new UTF-8
>> encoded source code files and long-in-the-tooth legacy encoded source
>> code files coexisting (perhaps via transitive package dependencies)
>> within a single project.
>> Tom.
>>      Anecdotally, if you can decode data without error in UTF-8, then
>>      it’s probably UTF-8.  Sensible sequences in other encodings rarely
>>      look like valid UTF-8, though there are a few short examples that
>>      can confuse it.
>>      -Shawn
>>      *From:* Unicode <unicode-bounces at unicode.org>
>>      <mailto:unicode-bounces at unicode.org> *On Behalf Of *Tom Honermann
>>      via Unicode
>>      *Sent:* Friday, June 5, 2020 13:10
>>      *To:* unicode at unicode.org <mailto:unicode at unicode.org>
>>      *Cc:* Alisdair Meredith <alisdairm at me.com> <mailto:alisdairm at me.com>
>>      *Subject:* What is the Unicode guidance regarding the use of a BOM
>>      as a UTF-8 encoding signature?
>>      Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte
>>      order, states (emphasis mine):
>>          ... *Use of a BOM is neither required nor recommended for
>>          UTF-8*, but may be encountered in contexts where UTF-8 data is
>>          converted from other encoding forms that use a BOM or where
>>          the BOM is used as a UTF-8 signature. See the “Byte Order
>>          Mark” subsection in Section 23.8, Specials, for more
>>          information.
>>      The emphasized statement is unconditional regarding the
>>      recommendation, but it isn't clear to me that this recommendation is
>>      intended to extend to both presence of a BOM in contexts where the
>>      encoding is known to be UTF-8 (where the BOM provides no additional
>>      information) and to contexts where the BOM signifies the presence of
>>      UTF-8 encoded text (where the BOM does provide additional
>>      information).  Is the guidance intended to state that, when
>>      possible, use of a BOM as an encoding signature is to be avoided in
>>      favor of some other mechanism?
>>      The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8
>>      (Specials) contains no similar guidance; it is factual and details
>>      some possible consequences of use, but does not apply a judgement.
>>      The discussion of use with other character sets could be read as an
>>      endorsement for use of a BOM as an encoding signature.
>>      Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode
>>      FAQ <https://www.unicode.org/faq/utf_bom.html> does not recommend
>>      for or against use of a BOM as an encoding signature.  It also can
>>      be read as endorsing such usage.
>>      So, my question is, what exactly is the intent of the emphasized
>>      statement above?  Is the recommendation intended to be so broadly
>>      worded?  Or is it only intended to discourage BOM use in cases where
>>      the encoding is known by other means?
>>      Tom.
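For concreteness, the "BOM as a UTF-8 encoding signature" mechanism under discussion is just a three-byte prefix check; whether tools should emit or rely on it is exactly the question raised above. A sketch:

```python
UTF8_BOM = b"\xEF\xBB\xBF"  # U+FEFF encoded in UTF-8

def split_utf8_signature(data: bytes) -> tuple[bool, bytes]:
    """Report whether data begins with the UTF-8 encoding signature,
    and return the payload with any signature stripped."""
    if data.startswith(UTF8_BOM):
        return True, data[len(UTF8_BOM):]
    return False, data
```

A consumer using this as a signature learns something only when the prefix is present; its absence proves nothing, which is part of why its value as a signal is debated.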
