[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature
tom at honermann.net
Mon Oct 12 08:35:23 CDT 2020
On 10/12/20 12:28 AM, James Kass via Unicode wrote:
> On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote:
>> On 10/11/20 11:32 PM, JF Bastien wrote:
>>> It’s a bit odd: if you assume the default is ascii then you don’t
>>> need this. If you assume the default is utf8 then you don’t need
>>> this... so when do you need the BOM? It seems like making bad prior
>>> choices more acceptable... even though they were bad choices. I’m
>>> not sure it’s a good idea.
>> A BOM would be needed when:
>> 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252,
>> etc...) and the UTF-8 text to be produced contains non-ASCII
>> characters. Or,
>> 2. The default encoding is not ASCII based (e.g., EBCDIC).
>> Both of these cases presume that the default encoding can't be made
>> UTF-8 for backward compatibility reasons.
> 1. UTF-8 text consists only of ASCII characters. Even if some ASCII
> strings reference non-ASCII characters. It's the same idea as HTML
> numeric character references which point to non-ASCII characters while
> being composed of ASCII characters. It shouldn't matter whether a
> string of ASCII digits forms the character number or a string of UTF-8
> hex bytes forms that number. A Unicode-aware application will display
> the string as a special character while legacy applications will show
> the string as mojibake. Either way, UTF-8 remains an ASCII-preserving
> encoding format.
I don't understand this response. The lead and trailing bytes of
multi-byte UTF-8 sequences all have the high bit set, so they are not
ASCII characters. Perhaps you are using "ASCII" to refer to the set of
8-bit ASCII-based encodings? ASCII is a 7-bit encoding.
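
To make the byte-level point concrete, here is a minimal Python sketch. The
helper name non_ascii_bytes is made up for this illustration; it lists the
UTF-8 code units of a string that fall outside the 7-bit ASCII range.

```python
def non_ascii_bytes(text: str) -> list:
    """Return the UTF-8 code units of `text` that are above 0x7F (not 7-bit ASCII)."""
    return [b for b in text.encode("utf-8") if b > 0x7F]


# Pure ASCII text encodes to the same bytes in ASCII and in UTF-8.
print(non_ascii_bytes("abc"))        # []

# U+00E9 encodes as lead byte 0xC3 followed by trailing byte 0xA9.
print([hex(b) for b in non_ascii_bytes("\u00e9")])   # ['0xc3', '0xa9']
```

Only text restricted to U+0000..U+007F produces UTF-8 output that is also
valid ASCII; anything else introduces bytes a strict ASCII consumer cannot
interpret.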
> 2. Files using non-standard encodings should be converted to Unicode.
> Any plain-text file should be presumed to be UTF-8 unless marked
That doesn't match existing practice on Windows where most applications
assume the encoding of the Active Code Page (e.g., Windows-1252).
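
As a rough illustration of why a signature can matter in that environment,
the sketch below models a consumer whose default is a legacy code page. The
function name decode_with_signature and the choice of "cp1252" as the
default are assumptions made for this example, not a description of how any
particular Windows application behaves.

```python
import codecs


def decode_with_signature(data: bytes, default_encoding: str = "cp1252") -> str:
    """Decode as UTF-8 when a UTF-8 BOM is present, else use the legacy default."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the three signature bytes (EF BB BF) before decoding.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(default_encoding)


payload = "caf\u00e9".encode("utf-8")    # b'caf\xc3\xa9'

# Without a signature the UTF-8 bytes are misread as Windows-1252 (mojibake).
print(decode_with_signature(payload))

# With the signature the same bytes decode as intended.
print(decode_with_signature(codecs.BOM_UTF8 + payload))
```

The same function covers a non-ASCII-based default by passing, for example,
default_encoding="cp037" (an EBCDIC code page available in Python's codec
registry).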
> Years ago, the UTF-8 signature was sometimes considered helpful.
> Nowadays it seems to be more of an anachronism.
I think that is true in some contexts; e.g., on the web and on most
POSIX systems. I don't think it is true in general though.