What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Doug Ewell doug at ewellic.org
Sat Jun 6 10:19:48 CDT 2020


Shawn Steele wrote:

> I’ve been recommending that people assume documents are UTF-8.  If
> the UTF-8 decoding fails, then consider falling back to some other
> codepage.   Pretty much all the other code pages would contain text
> that would look like unexpected trail bytes, or lead bytes without
> trail bytes, etc.  One can anecdotally find single-word Latin examples
> that break the pattern (Nestlé® IIRC),

That's traditionally been my example. You have to spell it in all caps (NESTLÉ®), which Nestlé seldom does, in order to get an ISO 8859-1 sequence that can be mistaken for UTF-8:

4E 45 53 54 4C C9 AE

where the last two bytes could be UTF-8 for ɮ, U+026E LATIN SMALL LETTER LEZH.

If the é is lowercase, you get:

4E 45 53 54 4C E9 AE

which is not valid UTF-8 (E9 is the lead byte of a three-byte sequence, but only one trail byte follows), and the heuristic that UTF-8 can be reliably auto-detected is reinforced.
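
For anyone who wants to try the two byte sequences, here is a minimal Python sketch of the "assume UTF-8, fall back on failure" approach Shawn describes. The function name decode_with_fallback and the choice of ISO 8859-1 as the fallback are assumptions made for illustration only; a real application would pick its fallback code page from context.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Try strict UTF-8 first; if that fails, decode with the fallback code page.

    Returns the decoded text and the name of the encoding that was used.
    The helper name and the default fallback are hypothetical choices.
    """
    try:
        # Strict decoding rejects stray trail bytes and truncated sequences.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


# ISO 8859-1 bytes for NESTLÉ® (uppercase É = C9, ® = AE).
upper = bytes.fromhex("4E 45 53 54 4C C9 AE")
# Same word with a lowercase é (E9).
lower = bytes.fromhex("4E 45 53 54 4C E9 AE")

for raw in (upper, lower):
    text, encoding = decode_with_fallback(raw)
    print(raw.hex(" "), "->", repr(text), "via", encoding)
```

The first sequence decodes as UTF-8 without error (C9 AE is read as U+026E), so it is the false positive. The second raises a decoding error under UTF-8 and falls through to ISO 8859-1.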

--
Doug Ewell | Thornton, CO, US | ewellic.org




