<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

<meta name="Generator" content="Microsoft Word 15 (filtered medium)">

<style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:"Yu Gothic";

        panose-1:2 11 4 0 0 0 0 0 0 0;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:"Segoe UI Emoji";

        panose-1:2 11 5 2 4 2 4 2 2 3;}

@font-face

        {font-family:"\@Yu Gothic";

        panose-1:2 11 4 0 0 0 0 0 0 0;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

span.EmailStyle20

        {mso-style-type:personal-reply;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page WordSection1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.WordSection1

        {page:WordSection1;}

/* List Definitions */

@list l0

        {mso-list-id:1825581357;

        mso-list-template-ids:-1413206734;}

ol

        {margin-bottom:0in;}

ul

        {margin-bottom:0in;}

--></style><!--[if gte mso 9]><xml>

<o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

<o:shapelayout v:ext="edit">

<o:idmap v:ext="edit" data="1" />

</o:shapelayout></xml><![endif]-->

</head>

<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">

<div class="WordSection1">

<p class="MsoNormal">Nobody’s going to consider #1 regardless of what wordsmithing is done in Unicode; people have had too much difficulty with BOMs for it to be considered a serious standards-based solution.  #4 isn’t portable. 

<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">The “right” approach would be to ensure that the languages have ways of declaring a codepage (like a pragma or a magic/semantic comment, options 2 &amp; 3).<o:p></o:p></p>

<p class="MsoNormal"><br>

The time invested on this problem should be spent on getting agreement with WG21 about what the declaration should be and seeing if there are any “gotchas” to something like #pragma UTF8.  IMO, it’s not worth the effort to try to tweak Unicode’s guidance in order to support the common view that BOMs are bad, which WG21 won’t be considering anyway. 

<br>

<br>

The biggest thing I can think of is that very few codepages would lend themselves to being declared in a portable manner.  Different OSes, software packages, and vendors have different implementations of various codepages.  Even ones that are nominally similar are often mistagged or have subtle differences.  <br>

<br>

In other words, “UTF8” is about the only “safe” encoding that won’t have edge cases. Something like “shift-jis” has multiple legacy variations that mean everything won’t always be the same.

<o:p></o:p></p>
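<p class="MsoNormal">As a rough illustration of the point about legacy variations, here is a minimal Python sketch. It assumes that Python's built-in "shift_jis" and "cp932" codecs are fair stand-ins for two members of the Shift-JIS family; it is not taken from any compiler or from the proposal, and the byte sequence is just one convenient example.</p>

<pre>
```python
# Illustration only: "shift_jis" and "cp932" are two Shift-JIS variants
# that ship with Python, used here as stand-ins for legacy variations.
data = b"\x81\x60"  # a two-byte sequence that both variants accept

as_shift_jis = data.decode("shift_jis")
as_cp932 = data.decode("cp932")

# The same bytes map to different code points depending on the variant.
print(f"shift_jis: U+{ord(as_shift_jis):04X}")
print(f"cp932:     U+{ord(as_cp932):04X}")
print("identical:", as_shift_jis == as_cp932)
```
</pre>

<p class="MsoNormal">A declaration that only says “shift-jis” would not tell a tool which of these mappings the author had in mind.</p>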

<p class="MsoNormal"><o:p> </o:p></p>

<p class="MsoNormal">-Shawn<o:p></o:p></p>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b>From:</b> Tom Honermann &lt;tom@honermann.net&gt; <br>

<b>Sent:</b> Friday, October 16, 2020 6:23 AM<br>

<b>To:</b> Shawn Steele &lt;Shawn.Steele@microsoft.com&gt;; J Decker &lt;d3ck0r@gmail.com&gt;<br>

<b>Cc:</b> sg16@lists.isocpp.org<br>

<b>Subject:</b> Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature<o:p></o:p></p>

</div>

</div>

<p class="MsoNormal"><o:p> </o:p></p>

<div>

<p class="MsoNormal">On 10/14/20 3:21 PM, Shawn Steele wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal">How are you going to #include differently encoded source files?  I don’t see anything in this document that would make it possible to #include a file in a different encoding.  It’s unclear to me how your proposed document could be utilized

 to enable the scenario you’re interested in.<o:p></o:p></p>

</blockquote>

<p>My intention is to present various options for WG21 to consider along with a recommendation.  The options that have been identified so far are listed below.  A combination of some of these options is also a possibility.<o:p></o:p></p>

<ol start="1" type="1">

<li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">

Use of a BOM to indicate UTF-8 encoded source files.  This matches existing practice for the Microsoft compiler.<o:p></o:p></li><li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">

Use of a #pragma.  This matches <a href="https://www.ibm.com/support/knowledgecenter/en/SSLTBW_2.3.0/com.ibm.zos.v2r3.cbclx01/zos_pragma_filetag.htm">

existing practice</a> for the IBM compiler.<o:p></o:p></li><li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">

Use of a "magic" or "semantic" comment.  This matches <a href="https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations">

existing practice</a> in Python.<o:p></o:p></li><li class="MsoNormal" style="mso-margin-top-alt:auto;mso-margin-bottom-alt:auto;mso-list:l0 level1 lfo1">

Use of filesystem metadata.  This is an option for some compilers and is being considered for Clang on z/OS.<o:p></o:p></li></ol>
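<p>To make options 1 and 3 concrete, here is a minimal Python sketch of how a tool might pick an encoding for a source file. Everything in it is an assumption for illustration: the function name <code>detect_source_encoding</code>, the <code>//</code> comment spelling modeled on Python's encoding declaration, and the <code>cp1252</code> default standing in for an implementation-defined default. It also assumes the file is in an ASCII-compatible encoding, so it would not work as written for EBCDIC.</p>

<pre>
```python
import codecs
import re

# Hypothetical pattern modeled on Python's encoding declaration; an actual
# C++ spelling would have to be agreed on by WG21.
DECLARATION = re.compile(rb"coding[:=]\s*([-\w.]+)")


def detect_source_encoding(raw: bytes, default: str = "cp1252") -> str:
    """Return the encoding a source file claims, or the assumed default."""
    # Option 1: a UTF-8 BOM acts as an encoding signature.
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Option 3: look for a declaration in a comment on the first two lines.
    for line in raw.splitlines()[:2]:
        if line.lstrip().startswith(b"//"):
            match = DECLARATION.search(line)
            if match:
                return match.group(1).decode("ascii")
    # Neither marker found: fall back to the implementation default.
    return default


bom_result = detect_source_encoding(codecs.BOM_UTF8 + b"int x;\n")
comment_result = detect_source_encoding(b"// -*- coding: utf-8 -*-\nint x;\n")
plain_result = detect_source_encoding(b"int x;\n")
print(bom_result, comment_result, plain_result)
```
</pre>

<p>Because the decision is made per file, a scheme along these lines would let a UTF-8 file and a file in another encoding appear in the same translation unit.</p>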

<p>The goal of this paper is to clarify guidance in the Unicode standard in order to better inform and justify a recommendation.  If the UTC were to provide a strong recommendation either for or against use of a BOM in UTF-8 files, that would be a point either

 in favor or in opposition to option 1 above.  As is, based on my reading and a number of the responses I've seen, the guidance is murky.<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">For mixed-encoding behavior the only thing I could imagine is adding some sort of preprocessor #codepage or something to the standard.  (Which would again take a while to reach critical mass.)<o:p></o:p></p>

</blockquote>

<p>Yes, deployment will take time in any case.  A goal would be to choose an option that can be used as an extension for previous C++ standards.  This may rule out option 2 above since some compilers diagnose use of an unrecognized pragma.<o:p></o:p></p>

<p>Tom.<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">-Shawn<o:p></o:p></p>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b>From:</b> Tom Honermann <a href="mailto:tom@honermann.net">

&lt;tom@honermann.net&gt;</a> <br>

<b>Sent:</b> Tuesday, October 13, 2020 9:47 PM<br>

<b>To:</b> Shawn Steele <a href="mailto:Shawn.Steele@microsoft.com">&lt;Shawn.Steele@microsoft.com&gt;</a>; J Decker

<a href="mailto:d3ck0r@gmail.com">&lt;d3ck0r@gmail.com&gt;</a><br>

<b>Cc:</b> <a href="mailto:sg16@lists.isocpp.org">sg16@lists.isocpp.org</a><br>

<b>Subject:</b> Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature<o:p></o:p></p>

</div>

</div>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<p class="MsoNormal">On 10/13/20 5:19 PM, Shawn Steele wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal">IMO this document doesn’t solve your problem.  Encouraging use of UTF-8 in C++ source code is a goal that most compilers, source code authors, etc. are totally on board with.<o:p></o:p></p>

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">The source is already in an indeterminate state.  The desired end state is to have UTF-8 source code (without BOM), which is typically supported.  The difficulty is therefore getting from point A to point B.  As far as “Use Unicode” goes,

 there’s no issue, but trying to specify BOM as a protocol doesn’t really solve the problem, particularly in complex environments.<o:p></o:p></p>

</blockquote>

<p class="MsoNormal">I think there is a misunderstanding.  The intent of the paper is to provide rationale for the existing discouragement for use of a BOM in UTF-8 while acknowledging that, in some cases, it may remain useful.  My intent is to discourage use

 of a BOM for UTF-8 encoded source files - thereby arguing against standardizing the behavior exhibited by Microsoft Visual C++ today.<br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">If the compiler doesn’t handle BOM as expected, then you’ll get errors.  This can be further complicated by preprocessors, #include, resources, etc.  If “specifying BOM behavior in Unicode” could help solve the problem, then all of the

 tooling used by everyone would have to be updated to handle that (new) requirement.  If you could get everyone on the same page, they’d all use UTF-8, so you wouldn’t need to update the tooling.  If you don’t need to update the tooling, you wouldn’t need to

 update the best practices for BOMs.<o:p></o:p></p>

</blockquote>

<p>This paper does not propose "specifying BOM behavior in Unicode".  If you feel that it does, please read it again and let me know what leads you to believe that it does.<o:p></o:p></p>

<p>The tooling isn't the problem.  The problem is the existing source code that is not UTF-8 encoded or that is UTF-8 encoded with a BOM.  The deployment challenge is with those existing source files.  Microsoft Visual C++ is going to continue consuming source

 files using the Active Code Page (ACP) and IBM compilers on EBCDIC platforms are going to continue consuming source files using EBCDIC code pages.  The goal is to provide a mechanism where a UTF-8 encoded source file can #include a source file in another encoding

 or vice versa.  Any solution for that will require tooling updates (and that is ok).<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">Personally, I’d prefer that cases like this ignore BOMs (or use them to switch to UTF-8), e.g., treat BOMs like whitespace.  But this isn’t a problem solvable by any recommendation from Unicode.<o:p></o:p></p>

</blockquote>

<p class="MsoNormal">When consuming text as UTF-8, I agree that ignoring a BOM is usually the right thing to do and would be the right thing to do when consuming source code.<br>

<br>

<br>

<o:p></o:p></p>
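<p class="MsoNormal">As a small illustration of what ignoring a BOM looks like in practice, here is a Python sketch using the standard "utf-8-sig" codec. The sample source text is made up for the example.</p>

<pre>
```python
import codecs

source_text = "int main() {}\n"  # made-up sample source
with_bom = codecs.BOM_UTF8 + source_text.encode("utf-8")
without_bom = source_text.encode("utf-8")

# "utf-8-sig" drops one leading BOM if present; otherwise it decodes
# exactly like "utf-8", so both inputs yield the same text.
ignored = with_bom.decode("utf-8-sig")
unchanged = without_bom.decode("utf-8-sig")
print(ignored == unchanged)

# Plain "utf-8" keeps the BOM as a leading U+FEFF character instead.
kept = with_bom.decode("utf-8")
print(repr(kept[0]))
```
</pre>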

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">As you noted, many systems provide mechanisms for indicating that code is UTF-8 or compiling with UTF-8, regardless of BOM.<o:p></o:p></p>

</blockquote>

<p class="MsoNormal">Yes, but there is no standard solution, not even a de facto one, for consuming differently encoded source files in the same translation unit.<br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">A rather large codebase I’ve been working with has been working to remove encoding confusion, and it’s a big task

<span style="font-family:"Segoe UI Emoji",sans-serif">😁</span><o:p></o:p></p>

</blockquote>

<p>Yes, yes it is.<o:p></o:p></p>

<p>Tom.<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal"> <o:p></o:p></p>

<p class="MsoNormal">-Shawn<o:p></o:p></p>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<div style="border:none;border-top:solid #E1E1E1 1.0pt;padding:3.0pt 0in 0in 0in">

<p class="MsoNormal"><b>From:</b> Unicode <a href="mailto:unicode-bounces@unicode.org">

&lt;unicode-bounces@unicode.org&gt;</a> <b>On Behalf Of </b>Tom Honermann via Unicode<br>

<b>Sent:</b> Tuesday, October 13, 2020 1:47 PM<br>

<b>To:</b> J Decker <a href="mailto:d3ck0r@gmail.com">&lt;d3ck0r@gmail.com&gt;</a>; Unicode List

<a href="mailto:unicode@unicode.org">&lt;unicode@unicode.org&gt;</a><br>

<b>Cc:</b> <a href="mailto:sg16@lists.isocpp.org">sg16@lists.isocpp.org</a><br>

<b>Subject:</b> Re: [SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature<o:p></o:p></p>

</div>

</div>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<p class="MsoNormal">On 10/12/20 8:09 PM, J Decker via Unicode wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<div>

<p class="MsoNormal">On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode &lt;<a href="mailto:unicode@unicode.org">unicode@unicode.org</a>&gt; wrote:<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

<div>

<div>

<p class="MsoNormal">On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:<o:p></o:p></p>

</div>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<p class="MsoNormal">One concern I have, that might lead into rationale for the current discouragement,

<o:p></o:p></p>

<div>

<p class="MsoNormal">is that I would hate to see a best practice that pushes a BOM into ASCII files.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">One of the nice properties of UTF-8 is that a valid ASCII file (still very common) is<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">also a valid UTF-8 file.  Changing best practice would encourage updating those<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">files to be no longer ASCII.<o:p></o:p></p>

</div>

</blockquote>
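<p>The ASCII property described above can be checked with a short Python sketch. The sample source line is made up for illustration.</p>

<pre>
```python
import codecs

ascii_source = b"int main() { return 0; }\n"  # made-up sample source

# Pure ASCII bytes decode identically as ASCII and as UTF-8.
same = ascii_source.decode("ascii") == ascii_source.decode("utf-8")

# Prepending a UTF-8 BOM adds the bytes EF BB BF, which are not ASCII.
with_bom = codecs.BOM_UTF8 + ascii_source
try:
    with_bom.decode("ascii")
    still_ascii = True
except UnicodeDecodeError:
    still_ascii = False

print("ASCII and UTF-8 agree:", same)
print("still ASCII after adding a BOM:", still_ascii)
```
</pre>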

<p>Thanks, Alisdair.  I think that concern is implicitly addressed by the suggested resolutions, but perhaps that can be made more clear.  One possibility would be to modify the "protocol designer" guidelines to address the case where a protocol's default encoding

 is ASCII based and to specify that a BOM is only required for UTF-8 text that contains non-ASCII characters.  Would that be helpful?<o:p></o:p></p>

</div>

</blockquote>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">'and to specify that a BOM is only required for UTF-8': this should NEVER be 'required' or 'must'; it shouldn't even be 'suggested'.  Fortunately, a BOM is just a ZWNBSP, so it's certainly a case of 'may start with' such and such.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">These days the standard 'everything IS UTF-8' works really well, except in Firefox, where the charset is required to be specified for JS scripts (but that's a bug in Firefox).<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">EBCDIC should be converted on the edge to internal ASCII, since, thankfully, this is a niche application and everything thinks in ASCII or some derivative thereof.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">A byte order mark is irrelevant to UTF-8, since the bytes are always in the correct order.<o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">I have run into several editors that have insisted on emitting a BOM for UTF-8 when a file is initially promoted from ASCII, but subsequently deleting it doesn't bother anything.<o:p></o:p></p>

</div>

</div>

</div>

</blockquote>

<p class="MsoNormal">I mostly agree.  Please note that the paper suggests use of a BOM only as a last resort.  The goal is to further discourage its use with rationale.<br>

<br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">I am curious though, what was the actual problem you ran into that makes you even consider this modification? 

<o:p></o:p></p>

</div>

</div>

</div>

</blockquote>

<p>I'm working on improving support for portable C++ source code.  Today, there is no character encoding that is supported by all C++ implementations (not even ASCII).  I'd like to make UTF-8 that commonly supported character encoding.  For backward compatibility

 reasons, compilers cannot change their default source code character encoding to UTF-8.<o:p></o:p></p>

<p>Most C++ applications are created from components that have different release schedules and that are maintained by different organizations.  Synchronizing a conversion to UTF-8 across dependent projects isn't feasible, nor is converting all of the source

 files used by an application to UTF-8 as simple as just running them through 'iconv'.  Migration to UTF-8 will therefore require an incremental approach for at least some applications, though many are likely to find success by simply invoking their compiler

 with the appropriate -everything-is-utf8 option since most source files are ASCII.<o:p></o:p></p>

<p>Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding signature and allows differently encoded source files to be used in the same translation unit.  Support for differently encoded source files in the same translation unit is the feature that will

 be needed to enable incremental migration.  Normative discouragement (with rationale) for use of a BOM by the Unicode standard would be helpful to explain why a solution other than a BOM (perhaps something like

<a href="https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations">

Python's encoding declaration</a>) should be standardized in favor of the existing practice demonstrated by Microsoft's solution.<o:p></o:p></p>

<p>Tom.<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<div>

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">J<o:p></o:p></p>

</div>

<blockquote style="border:none;border-left:solid #CCCCCC 1.0pt;padding:0in 0in 0in 6.0pt;margin-left:4.8pt;margin-top:5.0pt;margin-right:0in;margin-bottom:5.0pt">

<div>

<p>Tom.<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<div>

<p class="MsoNormal">AlisdairM<o:p></o:p></p>

<div>

<p class="MsoNormal"><br>

<br>

<br>

<br>

<o:p></o:p></p>

<blockquote style="margin-top:5.0pt;margin-bottom:5.0pt">

<div>

<p class="MsoNormal">On Oct 10, 2020, at 14:54, Tom Honermann via SG16 &lt;<a href="mailto:sg16@lists.isocpp.org" target="_blank">sg16@lists.isocpp.org</a>&gt; wrote:<o:p></o:p></p>

</div>

<p class="MsoNormal"> <o:p></o:p></p>

<div>

<div>

<p>Attached is a draft proposal for the Unicode standard that intends to clarify the current recommendation regarding use of a BOM in UTF-8 text.  This is a follow-up to

<a href="https://corp.unicode.org/pipermail/unicode/2020-June/008713.html" target="_blank">

discussion on the Unicode mailing list</a> back in June.<o:p></o:p></p>

<p>Feedback is welcome.  I plan to <a href="https://www.unicode.org/pending/docsubmit.html" target="_blank">

submit</a> this to the UTC in a week or so pending review feedback.<o:p></o:p></p>

<p>Tom.<o:p></o:p></p>

</div>

<p class="MsoNormal">&lt;Unicode-BOM-guidance.pdf&gt;-- <br>

SG16 mailing list<br>

<a href="mailto:SG16@lists.isocpp.org" target="_blank">SG16@lists.isocpp.org</a><br>

<a href="https://lists.isocpp.org/mailman/listinfo.cgi/sg16" target="_blank">https://lists.isocpp.org/mailman/listinfo.cgi/sg16</a><o:p></o:p></p>

</div>

</blockquote>

</div>

<p class="MsoNormal"> <o:p></o:p></p>

</div>

<p class="MsoNormal"><br>

<br>

<br>

<br>

<o:p></o:p></p>

</blockquote>

<p> <o:p></o:p></p>

</div>

</blockquote>

</div>

</div>

</blockquote>

<p> <o:p></o:p></p>

</blockquote>

<p> <o:p></o:p></p>

</blockquote>

<p><o:p> </o:p></p>

</div>

</body>

</html>