[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature
James Kass
jameskass at code2001.com
Sun Oct 11 23:28:53 CDT 2020
On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote:
> On 10/11/20 11:32 PM, JF Bastien wrote:
>> It’s a bit odd: if you assume the default is ascii then you don’t
>> need this. If you assume the default is utf8 then you don’t need
>> this... so when do you need the BOM? It seems like making bad prior
>> choices more acceptable... even though they were bad choices. I’m not
>> sure it’s a good idea.
>
> A BOM would be needed when:
>
> 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252,
> etc...) and the UTF-8 text to be produced contains non-ASCII
> characters. Or,
> 2. The default encoding is not ASCII based (e.g., EBCDIC).
>
> Both of these cases presume that the default encoding can't be made
> UTF-8 for backward compatibility reasons.
>
> Tom.
1. UTF-8 text consists only of ASCII characters, even if some ASCII
strings reference non-ASCII characters. It's the same idea as HTML
numeric character references which point to non-ASCII characters while
being composed of ASCII characters. It shouldn't matter whether a
string of ASCII digits forms the character number or a string of UTF-8 hex
bytes forms that number. A Unicode-aware application will display the
string as a special character while legacy applications will show the
string as mojibake. Either way, UTF-8 remains an ASCII-preserving
encoding format.
2. Files using non-standard encodings should be converted to Unicode.
Any plain-text file should be presumed to be UTF-8 unless marked otherwise.
Years ago, the UTF-8 signature was sometimes considered helpful.
Nowadays it seems to be more of an anachronism.
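
For illustration only, here is a minimal Python sketch of the behaviour
discussed above: ASCII text is byte-for-byte identical in UTF-8, a legacy
Windows-1252 reader shows non-ASCII UTF-8 bytes as mojibake, and a reader
that presumes UTF-8 can simply drop the signature when one is present.
The function names are hypothetical and not taken from any proposal; the
"utf-8-sig" codec and codecs.BOM_UTF8 are standard Python.

```python
import codecs

# The UTF-8 signature is the three bytes EF BB BF.
UTF8_SIGNATURE = codecs.BOM_UTF8


def has_utf8_signature(data: bytes) -> bool:
    """Report whether the data starts with the UTF-8 signature."""
    return data.startswith(UTF8_SIGNATURE)


def decode_plain_text(data: bytes) -> str:
    """Presume UTF-8 (hypothetical policy); strip a leading signature if present."""
    return data.decode("utf-8-sig")


# ASCII-only text encodes to the same bytes in ASCII and in UTF-8.
ascii_bytes = "plain text".encode("utf-8")
same_as_ascii = ascii_bytes == "plain text".encode("ascii")

# A non-ASCII character becomes a multi-byte sequence in UTF-8.
utf8_bytes = "café".encode("utf-8")

# A legacy application reading those bytes as Windows-1252 shows mojibake.
legacy_view = utf8_bytes.decode("cp1252")

# A UTF-8 reader gets the same text with or without the signature.
with_signature = decode_plain_text(UTF8_SIGNATURE + utf8_bytes)
without_signature = decode_plain_text(utf8_bytes)
```

Both with_signature and without_signature decode to the same string, so
under a presume-UTF-8 policy the signature carries no extra information.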