What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Tom Honermann
tom at honermann.net
Sun Jun 7 02:47:12 CDT 2020
Thank you to everyone who responded to this thread. The responses have
indicated that I need to be clearer about my motivation for asking.
More details below.
On 6/5/20 7:04 PM, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode
> signature byte sequence -- was popular when Unicode was gaining ground
> but legacy charsets were still widely used.
> Especially on Windows, which had settled on UTF-16 much earlier, lots
> of tools and editors started writing or expecting UTF-8 signatures.
> Other tools (especially in the Linux/Unix world) were never modified
> to expect or even cope with the signature, so ignored it or choked on it.
> There has never been uniform practice on this.
> For the most part, all new and recent text is now UTF-8, and the
> signature byte sequence has fallen out of favor again even where it
> had been used.
Thank you, this is helpful historical perspective.
>
> Having said that, I think the statement is right: "neither required
> nor recommended for UTF-8"
I think different audiences could interpret that guidance in different ways.
As a software tool provider, I can interpret the guidance as meaning
that I should not require a BOM to be present on text that is consumed,
nor produce a BOM in text that is produced. But what is the
recommendation for honoring a BOM that is present in consumed text?
Pragmatically, it seems to me that tools should honor a BOM that is
present either by treating the data following it as UTF-8 encoded or,
if the BOM conflicts with other indications of the expected encoding,
by issuing a diagnostic.
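As a concrete sketch of what "honoring" a BOM on input might look like
in a consuming tool (the function name and structure here are my own
invention, not any particular tool's implementation):

```cpp
#include <cstdio>

// If the first three bytes of the stream are the UTF-8 signature
// EF BB BF, consume them and report that a BOM was present; otherwise
// rewind so the caller sees the stream unchanged.
bool consume_utf8_bom(std::FILE* f) {
    unsigned char buf[3];
    if (std::fread(buf, 1, 3, f) == 3 &&
        buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF)
        return true;      // BOM present; stream now positioned past it
    std::rewind(f);       // no BOM (or short file): put the bytes back
    return false;
}
```

A tool configured to expect some encoding other than UTF-8 could call
this and, on a true return, issue a diagnostic about the conflict
rather than silently misinterpreting the bytes.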
As a protocol developer, I can interpret the guidance as meaning that a
new protocol should either mandate a particular encoding or use some
mechanism other than a BOM to negotiate encoding.
As a text author, I can interpret the guidance as meaning that I should
not place a BOM in text that I author without strong motivation, nor
should I expect a tool to require one.
Back to my motivation for asking the question...
I'm researching support for UTF-8 encoded source files in various C++
compilers. Here is a brief snapshot of existing practice:
* Clang only accepts UTF-8 encoded source files. A UTF-8 BOM is
recognized and discarded.
* GCC accepts UTF-8 encoded source files by default, but the encoding
  expectation can be overridden with a command-line option. If GCC is
  expecting UTF-8 source, then a BOM is discarded. Otherwise, a BOM
  is *not* honored and its presence is likely to result in a
  compilation error. GCC has no support for compiling a translation
  unit consisting of differently encoded source files.
* Microsoft Visual C++, by default, interprets source files as encoded
  according to the Windows Active Code Page (ACP), but supports
  translation units consisting of differently encoded source files by
  honoring a UTF-8 BOM. The default encoding can be overridden with a
  command-line option.
* IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes
  (yes, though you may not have seen a green screen in recent times,
  mainframes are still busy crunching numbers behind the scenes for
  websites you frequent). z/OS is an EBCDIC-based operating system
  and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source
  files. Many EBCDIC code pages exist, and the xlC compiler supports
  an in-source code page annotation that enables compilation of
  translation units consisting of differently encoded source files.
The goal of this research is to produce a proposal for the C and C++
standards intended to better enable UTF-8 as a portable source file
encoding. The following are acknowledged (at least by me) as accepted
constraints:
* Existing compilers are not going to change their default mode of
operation due to backward compatibility constraints.
* Non-UTF-8 encoded source files are still in use, particularly by
commercial software providers.
* Converting source files to UTF-8 is not necessarily an easy task.
It isn't necessarily a simple matter of running the source files
through 'iconv' and committing the results.
* Transition to UTF-8 for source files will be aided by the
possibility of incremental adoption; e.g., use of UTF-8 encoded
header files by a project that has non-UTF-8 encoded source files.
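The conversion pitfall noted above can be illustrated with a
hypothetical example (the names `before` and `after` are mine, chosen
for illustration):

```cpp
// Suppose a Windows-1252 encoded source file contains a string literal
// holding the character "°", which Windows-1252 encodes as the single
// byte 0xB0.  After the file is run through iconv to UTF-8, the same
// character is encoded as the two bytes 0xC2 0xB0.  The escapes below
// stand in for the raw bytes the compiler would see in each case:
const char before[] = "\xB0";       // as compiled from the Windows-1252 file
const char after[]  = "\xC2\xB0";   // as compiled after conversion to UTF-8

// sizeof(before) == 2 but sizeof(after) == 3: the conversion silently
// changed the array's size and the bytes the program emits at run time.
```

A program that writes such a literal to a legacy-encoded output stream,
or that hashes or measures it, behaves differently after the conversion
even though the source "looks" the same, which is why bulk conversion
requires review rather than a single iconv pass.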
Various methods are being explored for how to support collections of
mixed encoding source files. The intent in asking the question is to
help determine if/how use of a UTF-8 BOM fits into the picture.
>
> We might want to review chapter 23 and the FAQ and see if they should
> be updated.
I think that would be useful. In particular, per other comments above,
if the standard or FAQ is to continue offering statements regarding
recommendations or guidance, it may be helpful to tailor the guidance
for different audiences. For example: "Software providers are
encouraged to honor a BOM present in consumed text as signifying that
the text is UTF-8 encoded, and are discouraged from inserting a BOM in
text that is produced. Text authors are discouraged from inserting a
BOM in their UTF-8 encoded documents [unless it is known to be needed;
because UTF-8 should be considered a default, because some tools won't
honor it, etc.]".
Tom.
>
> Thanks,
> markus