What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Doug Ewell
doug at ewellic.org
Sat Jun 6 10:19:48 CDT 2020
Shawn Steele wrote:
> I’ve been recommending that people assume documents are UTF-8. If
> the UTF-8 decoding fails, then consider falling back to some other
> codepage. Pretty much all the other code pages would contain text
> that would look like unexpected trail bytes, or lead bytes without
> trail bytes, etc. One can anecdotally find single-word Latin examples
> that break the pattern (Nestlé® IIRC),
That's traditionally been my example. You have to spell it in all caps (NESTLÉ®), which Nestlé seldom does, in order to get an ISO 8859-1 sequence that can be mistaken for UTF-8:
4E 45 53 54 4C C9 AE
where the last two bytes could be UTF-8 for ɮ, U+026E LATIN SMALL LETTER LEZH.
If the é is lowercase, you get:
4E 45 53 54 4C E9 AE
which is not valid UTF-8 (E9 is the lead byte of a three-byte sequence, but only one trail byte follows), and the heuristic that UTF-8 can be reliably auto-detected is reinforced.
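For anyone who wants to check the two byte sequences, here is a minimal Python sketch of the "try UTF-8 first, then fall back" approach Shawn describes. The function name decode_with_fallback and the choice of ISO 8859-1 as the fallback are my own assumptions for illustration, not anything prescribed by Unicode or by Shawn's message.

```python
from typing import Tuple


def decode_with_fallback(data: bytes, fallback: str = "iso-8859-1") -> Tuple[str, str]:
    """Return (text, encoding_used).

    Hypothetical helper: try strict UTF-8 first, and only if that fails
    decode with the fallback code page (assumed here to be ISO 8859-1).
    """
    try:
        # Strict decoding rejects stray trail bytes and truncated sequences.
        return data.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


# The two ISO 8859-1 byte sequences from this message.
all_caps = bytes.fromhex("4E 45 53 54 4C C9 AE")    # NESTLÉ® in ISO 8859-1
lowercase_e = bytes.fromhex("4E 45 53 54 4C E9 AE")  # NESTLé® in ISO 8859-1

for sample in (all_caps, lowercase_e):
    text, encoding = decode_with_fallback(sample)
    print(sample.hex(" "), "->", repr(text), "via", encoding)
```

The all-caps sequence is accepted as UTF-8 and comes out ending in U+026E, which is the rare false positive. The sequence with the lowercase é fails strict UTF-8 decoding and is decoded as ISO 8859-1 instead.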
--
Doug Ewell | Thornton, CO, US | ewellic.org