What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Thu Jun 11 00:00:49 CDT 2020

On 6/10/20 12:47 PM, Henri Sivonen via Unicode wrote:
> Tom Honermann wrote:
>> I'm researching support for UTF-8 encoded source files in various C++ compilers.  Here is a brief snapshot of existing practice:
>>
>> Clang only accepts UTF-8 encoded source files.  A UTF-8 BOM is recognized and discarded.
>> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option.  If GCC is expecting UTF-8 source, then a BOM is discarded.  Otherwise, a BOM is *not* honored and its presence is likely to result in a compilation error.  GCC has no support for compiling a translation unit consisting of differently encoded source files.
>> Microsoft Visual C++, by default, interprets source files as encoded according to the Windows Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM.  The default encoding can be overridden with a command line option.
>> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent).  z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files.  Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
>>
>> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
> ...
>> Various methods are being explored for how to support collections of mixed encoding source files.  The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.
> Given your description of existing compiler behavior, I recommend
> making the C++ standard say that if a file (substitute the right ISO
> term for "file") starts with a UTF-8 BOM, the file must be interpreted
> as UTF-8 and the BOM be discarded before further processing. This
> already fits what you say GCC, clang, and MSVC do by default and would
> not be a compatibility-breaking change for IBM compilers (though I
> understood the IBM compilers are being superseded by clang anyway as
> far as implementing C++ versions later than C++11 goes). This would
> also facilitate migration to UTF-8 on Windows and z/OS.

Thank you, Henri; this matches my inclination.  If anyone has a
dissenting opinion, please share it.
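
The rule Henri recommends (a leading UTF-8 BOM forces UTF-8 and is
discarded before further processing) can be sketched as follows.  This
is a minimal illustration in Python, not any compiler's actual
implementation; the helper names and the "cp1252" default (a stand-in
for whatever encoding an implementation assumes when no BOM is present)
are assumptions made for the example.

```python
import codecs


def strip_utf8_bom(data: bytes, default_encoding: str = "cp1252"):
    """Return (encoding, payload) for the raw bytes of a source file.

    Hypothetical helper: a leading UTF-8 BOM (EF BB BF) selects UTF-8
    and is dropped; without a BOM, default_encoding is assumed.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8", data[len(codecs.BOM_UTF8):]
    return default_encoding, data


def decode_source(data: bytes, default_encoding: str = "cp1252") -> str:
    """Decode source bytes strictly, honoring a UTF-8 BOM if present."""
    encoding, payload = strip_utf8_bom(data, default_encoding)
    # Strict decoding: ill-formed input raises UnicodeDecodeError.
    return payload.decode(encoding)
```

With this sketch, a file with a BOM is decoded as UTF-8 regardless of
the default, while a file without one is decoded using the default, so
differently encoded files can coexist in one translation unit.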

(The Clang ports to z/OS support distinct EBCDIC and ASCII modes, so
they do not escape these concerns.)

>
> Shawn Steele wrote:
>> The modern viewpoint is that the BOM should be discouraged in all contexts.
> If you are writing an HTML serializer that 1) is a component distinct
> from the HTTP layer and, therefore, cannot control the HTTP headers
> and 2) mustn't impose restrictions on the shape of the DOM and,
> therefore, mustn't inject a meta element on its own, the best approach
> is to use the UTF-8 BOM.
>
>> I’d recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
> This is problematic in contexts where there is non-UTF-8 legacy, the
> input arrives over time, and streaming processing of the input is
> expected. See https://hsivonen.fi/utf-8-detection/
>
> Eli Zaretskii wrote:
>>> From: Shawn Steele via Unicode <unicode at unicode.org>
>>>
>>> I’ve been recommending that people assume documents are UTF-8.  If the UTF-8 decoding fails, then
>>> consider falling back to some other codepage.
>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
> Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox
> disables the character encoding menu to prevent self-XSS and to
> prevent the user from introducing data corruption to forms. This is a
> bit of a problem with e.g. university servers that have acquired a
> server-wide HTTP-level UTF-8 declaration but that carry occasional
> ancient ISO-2022-JP content. So far, I've decided not to do anything
> about this.
>
> Fortunately, the ISO 2022 series isn't really relevant (as a good
> approximation) to C++.
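
Eli's point about 7-bit ISO 2022 encodings can be checked with a short
Python sketch.  It assumes only the standard "iso2022_jp" codec; the
sample text is arbitrary.  Because every byte of ISO-2022-JP output is
below 0x80, the data is also well-formed UTF-8, so a "try UTF-8 first"
strategy never reaches its fallback.

```python
text = "日本語"

# ISO-2022-JP switches character sets with ESC sequences and uses only
# 7-bit bytes for the payload.
encoded = text.encode("iso2022_jp")
is_seven_bit = all(byte < 0x80 for byte in encoded)

# Decoding the same bytes as UTF-8 raises no error; it just yields the
# escape sequences and ASCII characters instead of the original text.
misdecoded = encoded.decode("utf-8")

# Only decoding with the correct codec recovers the text.
round_trip = encoded.decode("iso2022_jp")
```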

Additionally, falling back to another code page is not appropriate in
contexts where proper diagnosis of ill-formed UTF-8 text is desired.
For source code in particular, falling back to ISO-8859-1 because of
ill-formed UTF-8 in a string literal would result in silent
miscompilation.  The performance overhead of such a fallback would also
be unacceptable for C++ compilation, where compilation performance is
already a challenge.
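
To illustrate the silent miscompilation concern, here is a small Python
sketch.  The function names and the damaged input are made up for the
example: a string literal in which the UTF-8 encoding of "é" (C3 A9)
has lost its second byte.

```python
def decode_with_fallback(data: bytes) -> str:
    """Try UTF-8 first, then quietly fall back to ISO-8859-1."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a character, so this never fails.
        return data.decode("iso-8859-1")


def decode_strict(data: bytes) -> str:
    """Decode as UTF-8 only; ill-formed input raises UnicodeDecodeError."""
    return data.decode("utf-8")


# Hypothetical damaged source line: the lead byte C3 has no continuation byte.
damaged = b'const char *s = "caf\xc3";'

# The fallback path succeeds and the literal now contains U+00C3.
silently_accepted = decode_with_fallback(damaged)

# The strict path reports where the ill-formed sequence starts.
try:
    decode_strict(damaged)
    error_offset = None
except UnicodeDecodeError as exc:
    error_offset = exc.start
```

With the fallback, the damaged literal is accepted without any
diagnostic; with strict decoding, the error offset points at the
offending byte so it can be reported to the user.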

Tom.