What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Tom Honermann tom at honermann.net
Sun Jun 7 02:47:12 CDT 2020


Thank you to everyone who responded to this thread.  The responses 
indicate that I need to be clearer about my motivation for asking.  
More details below.

On 6/5/20 7:04 PM, Markus Scherer via Unicode wrote:
> The BOM -- or for UTF-8 where "byte order" is meaningless, the Unicode 
> signature byte sequence -- was popular when Unicode was gaining ground 
> but legacy charsets were still widely used.
> Especially on Windows, which had settled on UTF-16 much earlier, lots 
> of tools and editors started writing or expecting UTF-8 signatures.
> Other tools (especially in the Linux/Unix world) were never modified 
> to expect or even cope with the signature, so ignored it or choked on it.
> There has never been uniform practice on this.
> For the most part, all new and recent text is now UTF-8, and the 
> signature byte sequence has fallen out of favor again even where it 
> had been used.
Thank you, this is helpful historical perspective.
>
> Having said that, I think the statement is right: "neither required 
> nor recommended for UTF-8"

I think different audiences could interpret that guidance in different ways.

As a software tool provider, I can interpret the guidance as meaning 
that I should not require a BOM to be present on text that is consumed, 
nor produce a BOM in text that is produced.  But what is the 
recommendation for honoring a BOM that is present in consumed text?  
Pragmatically, it seems to me that tools should honor the presence of a 
BOM by either treating the data following it as UTF-8 encoded or issuing 
a diagnostic if the BOM conflicts with other indications of the 
expected encoding.

As a protocol developer, I can interpret the guidance as meaning that a 
new protocol should either mandate a particular encoding or use some 
mechanism other than a BOM to negotiate encoding.

As a text author, I can interpret the guidance as meaning that I should 
not place a BOM in text that I author without strong motivation, nor 
should I expect a tool to require one.

Back to my motivation for asking the question...

I'm researching support for UTF-8 encoded source files in various C++ 
compilers.  Here is a brief snapshot of existing practice:

  * Clang only accepts UTF-8 encoded source files.  A UTF-8 BOM is
    recognized and discarded.
  * GCC accepts UTF-8 encoded source files by default, but the encoding
    expectation can be overridden with a command line option.  If GCC is
    expecting UTF-8 source, then a BOM is discarded.  Otherwise, a BOM
    is *not* honored and its presence is likely to result in a compilation
    error.  GCC has no support for compiling a translation unit
    consisting of differently encoded source files.
  * Microsoft Visual C++, by default, interprets source files as encoded
    according to the Windows Active Code Page (ACP), but supports
    translation units consisting of differently encoded source files by
    honoring a UTF-8 BOM.  The default encoding can be overridden with a
    command line option.
  * IBM z/OS xlC C/C++ is IBM's compiler for C and C++ on mainframes
    (yes, though you may not have seen a green screen in recent times,
    mainframes are still busy crunching numbers behind the scenes for
    websites you frequent).  z/OS is an EBCDIC based operating system
    and IBM's xlC compiler for z/OS only accepts EBCDIC encoded source
    files.  Many EBCDIC code pages exist and the xlC compiler supports
    an in-source code page annotation that enables compilation of
    translation units consisting of differently encoded source files.
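For reference, the command-line overrides mentioned above look like 
this (flag spellings as documented for recent GCC and MSVC releases; 
file names are illustrative):

```shell
# GCC: override the expected source file encoding
# (the default expectation is UTF-8).
gcc -finput-charset=UTF-8 -c main.c

# MSVC: override the default (ACP) source encoding per invocation.
cl /source-charset:utf-8 main.cpp
```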

The goal of this research is to produce a proposal for the C and C++ 
standards intended to better enable UTF-8 as a portable source file 
encoding.  The following are acknowledged (at least by me) as accepted 
constraints:

  * Existing compilers are not going to change their default mode of
    operation due to backward compatibility constraints.
  * Non-UTF-8 encoded source files are still in use, particularly by
    commercial software providers.
  * Converting source files to UTF-8 is not necessarily an easy task; 
    it isn't always a simple matter of running the source files
    through 'iconv' and committing the results.
  * Transition to UTF-8 for source files will be aided by the
    possibility of incremental adoption; e.g., use of UTF-8 encoded
    header files by a project that has non-UTF-8 encoded source files.

Various methods are being explored for how to support collections of 
mixed encoding source files.  The intent in asking the question is to 
help determine if/how use of a UTF-8 BOM fits into the picture.

>
> We might want to review chapter 23 and the FAQ and see if they should 
> be updated.

I think that would be useful.  In particular, per other comments above, 
if the standard or FAQ is to continue offering statements regarding 
recommendations or guidance, it may be helpful to tailor the guidance 
for different audiences.  For example, "Software providers are 
encouraged to honor the presence of a BOM in consumed text as 
signifying that the text is UTF-8 encoded, and are discouraged from 
inserting a BOM in text that is produced.  Text authors are discouraged 
from inserting a BOM in their UTF-8 encoded documents [unless it is 
known to be needed; because UTF-8 should be considered a default, 
because some tools won't honor it, etc.]".

Tom.

>
> Thanks,
> markus

