[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Kent Karlsson kent.b.karlsson at bahnhof.se
Mon Oct 12 17:38:30 CDT 2020


> On 12 Oct 2020, at 06:28, James Kass via Unicode <unicode at unicode.org> wrote:

> 1.  UTF-8 text consists only of ASCII characters. 

????

> Even if some ASCII strings reference non-ASCII characters.  It's the same idea as HTML numeric character references which point to non-ASCII characters while being composed of ASCII characters.  It shouldn't matter whether a string of ASCII digits forms the character number or a string of UTF-8 hex bytes forms that number.

That is a HUGE difference. If you are using character references, you rely upon a conversion of those references to the "actual" characters in a target encoding. And it matters whether you "work on" (e.g. view, or let a program process) the source (with character references) or the target (where the character references have been replaced by the characters they represent, if any).

Further, there are several different, and not freely mixable, ways of doing character references. HTML has its way; C++ (and many other programming languages) has another way (and they may differ slightly). So it depends on context how, and if, supposed character references are interpreted as the character referenced (if any). C++ style character references are not interpreted in HTML, and HTML style character references are not interpreted in C++ (as such).
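
To illustrate the point about syntaxes that do not mix, here is a minimal Python sketch. It assumes `html.unescape` as a stand-in for an HTML parser and the `unicode_escape` codec as a stand-in for C++ style `\u` escapes (Python's escape syntax is similar to C++'s but not identical). The names `html_ref`, `c_style_ref`, `interpret_html` and `interpret_backslash_u` are made up for this example.

```python
import html

# Both strings are pure ASCII; each "references" U+00E9 in a different syntax.
html_ref = "caf&#233;"       # HTML numeric character reference
c_style_ref = "caf\\u00e9"   # backslash-u escape (a literal backslash, then u00e9)

def interpret_html(s: str) -> str:
    """Replace HTML character references; backslash escapes are left alone."""
    return html.unescape(s)

def interpret_backslash_u(s: str) -> str:
    """Replace backslash escapes (Python rules, used here as a stand-in for
    C++ style escapes); HTML references are left alone."""
    return s.encode("ascii").decode("unicode_escape")

if __name__ == "__main__":
    for text in (html_ref, c_style_ref):
        print(repr(text),
              "-> as HTML:", repr(interpret_html(text)),
              "| as backslash-u:", repr(interpret_backslash_u(text)))
```

Each interpreter converts only its own syntax and passes the other one through unchanged, so whether a reference "is" the character depends entirely on which interpreter, if any, processes the text.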

Without character references (or where they are not interpreted), you have one character encoding. With (possible) character references, you have a source character encoding and a target character encoding, which need not be the same. In addition, there is the question of which syntax is used for the character references (and there are several different syntaxes).
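
The source/target distinction can be sketched in Python as follows. This is only an illustration under stated assumptions: the source is taken to be ASCII with HTML style references, `html.unescape` stands in for the conversion step, and the names `source_text`, `target_text` and `is_ascii` are made up for the example.

```python
import html

def is_ascii(data: bytes) -> bool:
    """True if every byte is in the ASCII range 0x00..0x7F."""
    return all(b < 0x80 for b in data)

# Source: the reference is spelled out with ASCII characters only.
source_text = "caf&#233;"
source_bytes = source_text.encode("ascii")

# Target: the reference has been replaced by the character it represents.
target_text = html.unescape(source_text)
target_utf8 = target_text.encode("utf-8")      # U+00E9 becomes bytes C3 A9
target_latin1 = target_text.encode("latin-1")  # U+00E9 becomes byte E9

if __name__ == "__main__":
    print(source_bytes, is_ascii(source_bytes))
    print(target_utf8, is_ascii(target_utf8))
    print(target_latin1, is_ascii(target_latin1))
```

The source bytes are all ASCII, but the target is not, whichever target encoding is chosen. In UTF-8 the character is carried by the bytes C3 A9, neither of which is an ASCII character.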

Any "charset" declaration of a string (or file) would be for the source encoding of that string/file.

> A Unicode-aware application will display the string as a special character while legacy applications will show the string as mojibake.  Either way, UTF-8 remains an ASCII-preserving encoding format.

What is a "special character"? Any "non-ASCII" one?? That could be seen as offensive...

/Kent K



