What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Karl Williamson
public at khwilliamson.com
Sun Jun 7 14:29:50 CDT 2020
On 6/5/20 9:53 PM, Jonathan Rosenne via Unicode wrote:
> I am curious about how your code would work with CP1255 or CP1256?
>
> Best Regards,
>
> Jonathan Rosenne
Send me a few problematic strings, and I'll check them out.
>
> -----Original Message-----
> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson via Unicode
> Sent: Saturday, June 6, 2020 6:29 AM
> To: Shawn Steele; Tom Honermann
> Cc: Alisdair Meredith; Unicode Mail List
> Subject: Re: What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
>
> On 6/5/20 4:33 PM, Shawn Steele via Unicode wrote:
>> I’ve been recommending that people assume documents are UTF-8. If the
>> UTF-8 decoding fails, then consider falling back to some other
>> codepage. Text in pretty much any other code page would contain byte
>> sequences that look like unexpected trail bytes, or lead bytes without
>> trail bytes, etc. One can anecdotally find single-word Latin examples that
>> break the pattern (Nestlé® IIRC), but if you want to think of accuracy
>> in terms of “9s”, then that pretty much has as many nines as you have
>> bytes of input data.
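For concreteness, a minimal Python sketch of that assume-UTF-8-first
approach (not code from this thread; cp1252 is simply an assumed fallback
code page):

    def decode_guess(data: bytes, fallback: str = "cp1252") -> tuple[str, str]:
        """Try strict UTF-8 first; fall back to a legacy code page on failure."""
        try:
            return data.decode("utf-8"), "utf-8"
        except UnicodeDecodeError:
            # errors="replace" because Python's cp1252 codec leaves a few
            # byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined.
            return data.decode(fallback, errors="replace"), fallback

    # "naïve" in cp1252 is 6E 61 EF 76 65; 0xEF starts a 3-byte UTF-8
    # sequence but is not followed by continuation bytes, so strict UTF-8
    # decoding fails and the fallback is used.
    assert decode_guess(b"na\xefve") == ("naïve", "cp1252")
    assert decode_guess("naïve".encode("utf-8")) == ("naïve", "utf-8")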
>
> I have code that attempts to distinguish between UTF-8 and CP1252
> inputs. It now does a pretty good job; no one has complained in several
> years. To do this, I resort to some "semantic" analysis of the input.
> If it is syntactically valid UTF-8, but not a script run, it's not
> UTF-8. Likewise, the texts it will be subjected to are going to be in
> modern commercially-valuable scripts, so not IPA, for example. And it
> will be important characters, ones whose Age property is 1.1; text won't
> contain C1 controls. CP1252 is harder to distinguish than plain
> ASCII/Latin-1 because many of the C1 control code points are co-opted
> for graphic characters.
> Someone sent me the following example, scraped from some dictionaries,
> that it successfully gets right:
>
> Muvrar\xE1\x9A\x9Aa is a mountain in Norway
>
> is legal CP1252 and syntactically legal UTF-8, but the "semantic" tests
> say it isn't UTF-8.
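A rough Python sketch of that kind of two-stage test (this is not the
code being described; the PLAUSIBLE ranges are only an illustrative
stand-in for the real script-run and Age-property checks):

    # Code points the expected input plausibly uses: ASCII whitespace and
    # printables plus Latin-1 and Latin Extended letters.  A stand-in for
    # proper script-run and Age-property tests.
    PLAUSIBLE = [(0x09, 0x0D), (0x20, 0x7E), (0xA0, 0x24F)]

    def looks_like_utf8(data: bytes) -> bool:
        try:
            text = data.decode("utf-8")        # syntactic check
        except UnicodeDecodeError:
            return False
        # "Semantic" check: every character must land in a plausible range.
        return all(any(lo <= ord(ch) <= hi for lo, hi in PLAUSIBLE)
                   for ch in text)

    # The example above decodes, as UTF-8, to "Muvrar" + U+169A (an Ogham
    # letter) + "a is a mountain in Norway"; U+169A fails the plausibility
    # test, so the bytes are treated as CP1252 instead.
    assert not looks_like_utf8(b"Muvrar\xE1\x9A\x9Aa is a mountain in Norway")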
>
> I also have code that tries to distinguish between a UTF-8 POSIX locale
> and a non-UTF-8 one, and which needs to work on systems lacking certain
> C library functions that would make it foolproof. That one is less
> successful, primarily because there is often insufficient text available
> to make a determination. One might think that the operating system
> error messages would be fruitful, but it turns out that many are left in
> English; no one bothered to translate them. The locale's currency symbol
> is always localized, but the dollar sign is commonly part of the symbol
> in other languages, so it is often plain ASCII. The time and date names
> are usually translated, and I use them.
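A sketch of that decision logic (again, not the code being described; it
assumes the caller has already fetched the localized day/month names and
currency symbol from the C library as raw bytes, e.g. via nl_langinfo):

    def locale_looks_utf8(samples: list[bytes]) -> bool | None:
        """Guess whether a locale is UTF-8 from raw localized strings.

        Returns True or False when there is non-ASCII evidence, and None
        when every sample is plain ASCII and no determination is possible.
        """
        saw_non_ascii = False
        for s in samples:
            if any(b >= 0x80 for b in s):
                saw_non_ascii = True
                try:
                    s.decode("utf-8")
                except UnicodeDecodeError:
                    return False          # non-ASCII, but not valid UTF-8
        return True if saw_non_ascii else None

    # A French month name settles it either way; all-ASCII samples do not.
    assert locale_looks_utf8(["décembre".encode("utf-8")]) is True
    assert locale_looks_utf8(["décembre".encode("latin-1")]) is False
    assert locale_looks_utf8([b"January", b"$"]) is None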
>
>> I did find some DBCS CJK text that could look like valid UTF-8, so my
>> “one nine per byte of input” isn’t quite as high there; however, for
>> meaningful runs of text it is still reasonably hard to make sensible
>> text in a double-byte codepage look like UTF-8. Note that this “works”
>> partially because the ASCII range of the SBCS/DBCS code pages typically
>> looks like ASCII, as does UTF-8. If you had 7-bit codepage data with
>> stateful shift sequences, of course that wouldn’t trigger any UTF-8
>> decoding errors. Fortunately for your scenario, source code in 7-bit
>> encodings is very rare nowadays.
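A quick Python illustration of that last point, using ISO-2022-JP as the
stateful 7-bit example:

    # ISO-2022-JP stays entirely in the 7-bit ASCII range and switches
    # character sets with ESC sequences, so it can never trigger a UTF-8
    # decoding error; it just decodes to escape codes plus ASCII mojibake.
    jp = "日本語".encode("iso2022_jp")
    assert all(b < 0x80 for b in jp)
    jp.decode("utf-8")    # succeeds, but the result is meaningless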
>>
>> Hope that helps,
>>
>> -Shawn
>>
>> *From:* Tom Honermann <tom at honermann.net>
>> *Sent:* Freitag, 5. Juni 2020 15:15
>> *To:* Shawn Steele <Shawn.Steele at microsoft.com>
>> *Cc:* Alisdair Meredith <alisdairm at me.com>; Unicode Mail List
>> <unicode at unicode.org>
>> *Subject:* Re: What is the Unicode guidance regarding the use of a BOM
>> as a UTF-8 encoding signature?
>>
>> On 6/5/20 5:47 PM, Shawn Steele via Unicode wrote:
>>
>> The modern viewpoint is that the BOM should be discouraged in all
>> contexts. (Along with the recommendation that you always use Unicode
>> encodings, probably UTF-8 or UTF-16.) I’d recommend to anyone
>> encountering ASCII-like data to presume it was UTF-8 unless proven
>> otherwise.
>>
>> Are you asking because you’re interested in differentiating UTF-8
>> from UTF-16? Or UTF-8 from some other legacy non-Unicode encoding?
>>
>> The latter. In particular, as a differentiator between shiny new UTF-8
>> encoded source code files and long-in-the-tooth legacy encoded source
>> code files coexisting (perhaps via transitive package dependencies)
>> within a single project.
>>
>> Tom.
>>
>> Anecdotally, if you can decode data without error in UTF-8, then
>> it’s probably UTF-8. Sensible sequences in other encodings rarely
>> look like valid UTF-8, though there are a few short examples that
>> can confuse it.
>>
>> -Shawn
>>
>> *From:* Unicode <unicode-bounces at unicode.org>
>> <mailto:unicode-bounces at unicode.org> *On Behalf Of *Tom Honermann
>> via Unicode
>> *Sent:* Freitag, 5. Juni 2020 13:10
>> *To:* unicode at unicode.org <mailto:unicode at unicode.org>
>> *Cc:* Alisdair Meredith <alisdairm at me.com> <mailto:alisdairm at me.com>
>> *Subject:* What is the Unicode guidance regarding the use of a BOM
>> as a UTF-8 encoding signature?
>>
>> Unicode 13 chapter 2.6 (Encoding Schemes), when discussing byte
>> order, states (emphasis mine):
>>
>> ... *Use of a BOM is neither required nor recommended for
>> UTF-8*, but may be encountered in contexts where UTF-8 data is
>> converted from other encoding forms that use a BOM or
>> where the BOM is used as a UTF-8 signature. See the
>> “Byte Order Mark” subsection in Section 23.8, Specials, for
>> more information.
>>
>> The emphasized statement is unconditional regarding the
>> recommendation, but it isn't clear to me whether it is intended to
>> extend both to the presence of a BOM in contexts where the encoding is
>> known to be UTF-8 (where the BOM provides no additional information)
>> and to contexts where the BOM signifies the presence of UTF-8 encoded
>> text (where the BOM does provide additional information). Is the
>> guidance intended to state that, when possible, use of a BOM as a
>> UTF-8 encoding signature is to be avoided in favor of some other
>> mechanism?
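For concreteness, the "UTF-8 signature" in question is just U+FEFF encoded
as the three bytes EF BB BF at the start of a stream; a minimal Python
sketch of detecting and stripping it:

    UTF8_BOM = b"\xef\xbb\xbf"    # U+FEFF encoded in UTF-8

    def split_utf8_signature(data: bytes) -> tuple[bytes, bool]:
        """Return the data without any leading UTF-8 BOM, plus whether one
        was present."""
        if data.startswith(UTF8_BOM):
            return data[len(UTF8_BOM):], True
        return data, False

    assert split_utf8_signature(b"\xef\xbb\xbfint x;") == (b"int x;", True)
    assert split_utf8_signature(b"int x;") == (b"int x;", False)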
>>
>> The referenced "Byte Order Mark" section in Unicode 13 chapter 23.8
>> (Specials) contains no similar guidance; it is factual and details
>> some possible consequences of use, but does not apply a judgement.
>> The discussion of use with other character sets could be read as an
>> endorsement for use of a BOM as an encoding signature.
>>
>> Likewise, the "UTF-8, UTF-16, UTF-32 & BOM" section in the Unicode
>> FAQ <https://www.unicode.org/faq/utf_bom.html> does not recommend
>> for or against use of a BOM as an encoding signature. It also can
>> be read as endorsing such usage.
>>
>> So, my question is, what exactly is the intent of the emphasized
>> statement above? Is the recommendation intended to be so broadly
>> worded? Or is it only intended to discourage BOM use in cases where
>> the encoding is known by other means?
>>
>> Tom.
>>
>
>
>