What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
hsivonen at hsivonen.fi
Wed Jun 10 11:47:47 CDT 2020
Tom Honermann wrote:
> I'm researching support for UTF-8 encoded source files in various C++ compilers. Here is a brief snapshot of existing practice:
> Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is recognized and discarded.
> GCC accepts UTF-8 encoded source files by default, but the encoding expectation can be overridden with a command line option. If GCC is expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM is *not* honored and its presence is likely to result in compilation error. GCC has no support for compiling a translation unit consisting of differently encoded source files.
> Microsoft Visual C++, by default, interprets source files as encoded according to the Windows Active Code Page (ACP), but supports translation units consisting of differently encoded source files by honoring a UTF-8 BOM. The default encoding can be overridden with a command line option.
> IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes (yes, though you may not have seen a green screen in recent times, mainframes are still busy crunching numbers behind the scenes for websites you frequent). z/OS is an EBCDIC based operating system and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source files. Many EBCDIC code pages exist and the xlC compiler supports an in-source code page annotation that enables compilation of translation units consisting of differently encoded source files.
> The goal of this research is to produce a proposal for the C and C++ standards intended to better enable UTF-8 as a portable source file encoding.
> Various methods are being explored for how to support collections of mixed encoding source files. The intent in asking the question is to help determine if/how use of a UTF-8 BOM fits into the picture.
Given your description of existing compiler behavior, I recommend
making the C++ standard say that if a file (substitute the right ISO
term for "file") starts with a UTF-8 BOM, the file must be interpreted
as UTF-8 and the BOM be discarded before further processing. This
already fits what you say GCC, clang, and MSVC do by default and would
not be a compatibility-breaking change for IBM compilers (though I
understood the IBM compilers are being superseded by clang anyway as
far as implementing C++ versions later than C++11 goes). This would
also facilitate migration to UTF-8 on Windows and z/OS.
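As a rough illustration of the rule proposed above (not wording for the standard, and not how any particular compiler implements it), the check amounts to looking at the first three bytes of the file. The names SourceBytes and strip_utf8_bom are made up for this sketch:

```cpp
#include <cassert>
#include <string_view>

// The UTF-8 encoding of U+FEFF, i.e. the "UTF-8 BOM".
constexpr std::string_view kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical result type: whether a BOM was seen, and the bytes that
// remain for further processing.
struct SourceBytes {
    bool has_utf8_bom;
    std::string_view contents;
};

// If the input starts with a UTF-8 BOM, report that and drop the BOM;
// otherwise return the input unchanged. What encoding to assume in the
// no-BOM case is left to the caller (default, command line option, ...).
SourceBytes strip_utf8_bom(std::string_view bytes) {
    if (bytes.substr(0, kUtf8Bom.size()) == kUtf8Bom) {
        return {true, bytes.substr(kUtf8Bom.size())};
    }
    return {false, bytes};
}

// Small usage check for the sketch.
void check_strip_utf8_bom() {
    const SourceBytes with_bom = strip_utf8_bom("\xEF\xBB\xBF" "int x;");
    assert(with_bom.has_utf8_bom);
    assert(with_bom.contents == "int x;");

    const SourceBytes without_bom = strip_utf8_bom("int x;");
    assert(!without_bom.has_utf8_bom);
    assert(without_bom.contents == "int x;");
}
```

A file with has_utf8_bom set would be decoded as UTF-8 regardless of the default; a file without one would follow whatever the implementation does today.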
Shawn Steele wrote:
> The modern viewpoint is that the BOM should be discouraged in all contexts.
If you are writing an HTML serializer that 1) is a component distinct
from the HTTP layer and, therefore, cannot control the HTTP headers
and 2) mustn't impose restrictions on the shape of the DOM and,
therefore, mustn't inject a meta element on its own, the best approach
is to use the UTF-8 BOM.
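For the serializer case just described, the signature is simply three bytes in front of the serialized output. A minimal sketch, assuming the serializer already produces UTF-8 bytes; with_utf8_bom is a made-up helper name, not part of any real serializer API:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Prepend the UTF-8 BOM (EF BB BF) to already-serialized UTF-8 HTML.
// The DOM is not touched and no meta element is injected.
std::string with_utf8_bom(std::string_view utf8_html) {
    std::string out;
    out.reserve(3 + utf8_html.size());
    out.append("\xEF\xBB\xBF");
    out.append(utf8_html.data(), utf8_html.size());
    return out;
}

// Small usage check for the sketch.
void check_with_utf8_bom() {
    const std::string out = with_utf8_bom("<p>hi</p>");
    assert(out.size() == 3 + 9);
    assert(out.compare(0, 3, "\xEF\xBB\xBF") == 0);
    assert(out.compare(3, std::string::npos, "<p>hi</p>") == 0);
}
```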
> I’d recommend to anyone encountering ASCII-like data to presume it was UTF-8 unless proven otherwise.
This is problematic in contexts where there is non-UTF-8 legacy, the
input arrives over time, and streaming processing of the input is
expected. See https://hsivonen.fi/utf-8-detection/
Eli Zaretskii wrote:
> > From: Shawn Steele via Unicode <unicode at unicode.org>
> > I’ve been recommending that people assume documents are UTF-8. If the UTF-8 decoding fails, then
> > consider falling back to some other codepage.
> That strategy would fail with 7-bit ISO 2022 based encodings, no?
Yes. When HTML is labeled as UTF-8 and is valid UTF-8, Firefox
disables the character encoding menu to prevent self-XSS and to
prevent the user from introducing data corruption to forms. This is a
bit of a problem with e.g. university servers that have acquired a
server-wide HTTP-level UTF-8 declaration but that carry occasional
ancient ISO-2022-JP content. So far, I've decided not to do anything
Fortunately, the ISO 2022 series isn't really relevant (as a good
approximation) to C++.
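To make the ISO 2022 point from the exchange above concrete: ISO-2022-JP uses only 7-bit bytes (escape sequences plus printable ASCII), so a "try UTF-8, fall back if decoding fails" check never fails on it. The sketch below is a plain well-formedness check following the Unicode Standard's table of well-formed UTF-8 byte sequences; is_valid_utf8 is a name made up here, and the ISO-2022-JP sample bytes are my own (intended to encode こんにちは), not taken from any real page.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Returns true if `bytes` is well-formed UTF-8 (no overlongs, no
// surrogates, nothing above U+10FFFF, no truncated sequences).
bool is_valid_utf8(std::string_view bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 <= 0x7F) { ++i; continue; }   // ASCII
        std::size_t len = 0;
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for byte 2
        if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
        else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // reject overlongs
        else if (b0 >= 0xE1 && b0 <= 0xEC) len = 3;
        else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // reject surrogates
        else if (b0 == 0xEE || b0 == 0xEF) len = 3;
        else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // reject overlongs
        else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
        else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // cap at U+10FFFF
        else return false;                   // 80..C1 and F5..FF never lead
        if (n - i < len) return false;       // truncated sequence
        const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (b1 < lo || b1 > hi) return false;
        for (std::size_t k = 2; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if (b < 0x80 || b > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// Small usage check for the sketch.
void check_detection_examples() {
    assert(is_valid_utf8("caf\xC3\xA9"));   // UTF-8 "café"
    assert(!is_valid_utf8("caf\xE9"));      // windows-1252 "café": fallback fires
    // ISO-2022-JP sample: all bytes are 7-bit, so it passes as UTF-8
    // and the fallback never fires.
    assert(is_valid_utf8("\x1B$B$3$s$K$A$O\x1B(B"));
}
```

The windows-1252 input is caught because 0xE9 needs continuation bytes that are not there; the ISO-2022-JP input is indistinguishable from ASCII at the byte level, which is why validity alone cannot detect it.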