[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Tom Honermann tom at honermann.net
Mon Oct 12 08:35:23 CDT 2020


On 10/12/20 12:28 AM, James Kass via Unicode wrote:
>
>
> On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote:
>> On 10/11/20 11:32 PM, JF Bastien wrote:
>>> It’s a bit odd: if you assume the default is ascii then you don’t 
>>> need this. If you assume the default is utf8 then you don’t need 
>>> this... so when do you need the BOM? It seems like making bad prior 
>>> choices more acceptable... even though they were bad choices. I’m 
>>> not sure it’s a good idea.
>>
>> A BOM would be needed when:
>>
>> 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252,
>>    etc...) and the UTF-8 text to be produced contains non-ASCII
>>    characters.  Or,
>> 2. The default encoding is not ASCII based (e.g., EBCDIC).
>>
>> Both of these cases presume that the default encoding can't be made 
>> UTF-8 for backward compatibility reasons.
>>
>> Tom.
>
>
> 1.  UTF-8 text consists only of ASCII characters.  Even if some ASCII 
> strings reference non-ASCII characters.  It's the same idea as HTML 
> numeric character references which point to non-ASCII characters while 
> being composed of ASCII characters.  It shouldn't matter whether a 
> string of ASCII digits form the character number or a string of UTF-8 
> hex bytes form that number.  A Unicode-aware application will display 
> the string as a special character while legacy applications will show 
> the string as mojibake.  Either way, UTF-8 remains an ASCII-preserving 
> encoding format.
I don't understand this response.  The lead and trailing bytes of a 
multi-byte UTF-8 sequence are not ASCII characters; each has its high 
bit set.  Perhaps you are using "ASCII" to refer to the set of 8-bit 
ASCII-based encodings?  ASCII is a 7-bit encoding.
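
To illustrate the distinction, here is a minimal Python sketch.  The 
helper name utf8_bytes_outside_ascii is hypothetical and used only for 
illustration.  It lists the UTF-8 code units of a string that fall 
outside the 7-bit ASCII range, and contrasts a directly encoded 
character with an HTML numeric character reference to the same 
character:

```python
def utf8_bytes_outside_ascii(text: str) -> list[int]:
    """Return the UTF-8 code units of `text` that are not 7-bit ASCII."""
    return [b for b in text.encode("utf-8") if b > 0x7F]


# U+00E9 encoded directly: a lead byte (0xC3) and a trailing byte (0xA9),
# neither of which is an ASCII character.
direct = utf8_bytes_outside_ascii("\u00e9")

# The HTML numeric character reference for the same character is made up
# of ASCII characters only, so nothing falls outside the ASCII range.
reference = utf8_bytes_outside_ascii("&#233;")
```

Only the second form is composed of ASCII characters; the first 
requires the consumer to know that the bytes are UTF-8.
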
>
> 2.  Files using non-standard encodings should be converted to Unicode.
>
> Any plain-text file should be presumed to be UTF-8 unless marked 
> otherwise.
That doesn't match existing practice on Windows where most applications 
assume the encoding of the Active Code Page (e.g., Windows-1252).
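
As a rough sketch of what such a consumer might do (Python; the 
function name decode_text and the choice of cp1252 as the default are 
assumptions for illustration, not a description of any particular 
application), a BOM can serve as the signal to decode as UTF-8, with 
the default encoding used otherwise:

```python
import codecs


def decode_text(data: bytes, default_encoding: str = "cp1252") -> str:
    """Decode as UTF-8 if a UTF-8 BOM is present, else use the default."""
    if data.startswith(codecs.BOM_UTF8):
        # Strip the three-byte signature (EF BB BF) and decode the rest.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No signature: fall back to the assumed default encoding.
    return data.decode(default_encoding)


utf8_payload = "\u00e9".encode("utf-8")           # b"\xc3\xa9"
with_bom = decode_text(codecs.BOM_UTF8 + utf8_payload)
without_bom = decode_text(utf8_payload)           # misread as Windows-1252
```

With the signature the text round-trips; without it the same bytes are 
interpreted as two Windows-1252 characters.
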
>
> Years ago, the UTF-8 signature was sometimes considered helpful. 
> Nowadays it seems to be more of an anachronism.

I think that is true in some contexts; e.g., on the web and on most 
POSIX systems.  I don't think it is true in general though.

Tom.


