[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

James Kass jameskass at code2001.com
Sun Oct 11 23:28:53 CDT 2020



On 2020-10-12 3:37 AM, Tom Honermann via Unicode wrote:
> On 10/11/20 11:32 PM, JF Bastien wrote:
>> It’s a bit odd: if you assume the default is ascii then you don’t 
>> need this. If you assume the default is utf8 then you don’t need 
>> this... so when do you need the BOM? It seems like making bad prior 
>> choices more acceptable... even though they were bad choices. I’m not 
>> sure it’s a good idea.
>
> A BOM would be needed when:
>
> 1. The default encoding is ASCII based (ISO-8859-1, Windows-1252,
>    etc...) and the UTF-8 text to be produced contains non-ASCII
>    characters.  Or,
> 2. The default encoding is not ASCII based (e.g., EBCDIC).
>
> Both of these cases presume that the default encoding can't be made 
> UTF-8 for backward compatibility reasons.
>
> Tom.


1.  UTF-8 text consists only of ASCII characters, even if some ASCII 
strings reference non-ASCII characters.  It's the same idea as HTML 
numeric character references, which point to non-ASCII characters while 
being composed of ASCII characters.  It shouldn't matter whether a 
string of ASCII digits forms the character number or a string of UTF-8 
hex bytes forms that number.  A Unicode-aware application will display 
the string as a special character, while legacy applications will show 
the string as mojibake.  Either way, UTF-8 remains an ASCII-preserving 
encoding format.
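
As a rough illustration of the display difference described above, here 
is a minimal Python sketch.  The sample word and the choice of 
Windows-1252 as the "legacy" decoder are assumptions made for the 
example only; any ASCII-based single-byte encoding would behave 
similarly.

```python
# Sample text with one non-ASCII character (U+00E9); chosen for illustration.
text = "caf\u00e9"

# Encode once as UTF-8; the three ASCII letters keep their ASCII byte values.
utf8_bytes = text.encode("utf-8")

# A Unicode-aware reader decodes the bytes as UTF-8 and recovers the text.
unicode_view = utf8_bytes.decode("utf-8")

# A legacy reader (assumed here to use Windows-1252) shows mojibake for the
# non-ASCII character but leaves the ASCII letters untouched.
legacy_view = utf8_bytes.decode("cp1252")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_preserved = "cafe".encode("utf-8") == "cafe".encode("ascii")

print(unicode_view, legacy_view, ascii_preserved)
```

The ASCII portion of the text survives in both views; only the 
non-ASCII character is rendered differently.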

2.  Files using non-standard encodings should be converted to Unicode.

Any plain-text file should be presumed to be UTF-8 unless marked otherwise.
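
One way a reader might apply that presumption is sketched below in 
Python.  The function name decode_plain_text is hypothetical, and 
treating a leading BOM as the only kind of "marking" is an assumption of 
this sketch; out-of-band labels such as MIME charset parameters are not 
handled.

```python
import codecs


def decode_plain_text(data: bytes) -> str:
    """Decode bytes as UTF-8 unless a leading signature marks otherwise."""
    # UTF-32 is checked before UTF-16 because the UTF-32 LE signature
    # begins with the same two bytes as the UTF-16 LE signature.
    signatures = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, encoding in signatures:
        if data.startswith(bom):
            # These codecs consume the signature while decoding.
            return data.decode(encoding)
    # No signature present: presume UTF-8.
    return data.decode("utf-8")
```

With this approach an unmarked file and a file carrying the UTF-8 
signature decode to the same text, so the signature is tolerated but 
never required.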

Years ago, the UTF-8 signature was sometimes considered helpful. 
Nowadays it seems to be more of an anachronism.

