[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Tom Honermann tom at honermann.net
Tue Oct 13 15:46:43 CDT 2020


On 10/12/20 8:09 PM, J Decker via Unicode wrote:
>
>
> On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode 
> <unicode at unicode.org <mailto:unicode at unicode.org>> wrote:
>
>     On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>>     One concern I have, that might lead into rationale for the
>>     current discouragement,
>>     is that I would hate to see a best practice that pushes a BOM
>>     into ASCII files.
>>     One of the nice properties of UTF-8 is that a valid ASCII file
>>     (still very common) is
>>     also a valid UTF-8 file.  Changing best practice would encourage
>>     updating those
>>     files to be no longer ASCII.
>
>     Thanks, Alisdair.  I think that concern is implicitly addressed by
>     the suggested resolutions, but perhaps that can be made more
>     clear.  One possibility would be to modify the "protocol designer"
>     guidelines to address the case where a protocol's default encoding
>     is ASCII based and to specify that a BOM is only required for
>     UTF-8 text that contains non-ASCII characters.  Would that be helpful?
>
>
> 'and to specify that a BOM is only required for UTF-8 ' this should 
> NEVER be 'required' or 'must', it shouldn't even be 'suggested'; 
> fortunately BOM is just a ZWNBSP, so it's certainly a 'may' start with 
> a such and such.
> These days the standard 'everything IS utf-8' works really well, 
> except in firefox where the charset is required to be specified for JS 
> scripts (but that's a bug in that)
> EBCDIC should be converted on the edge to internal ascii, since, 
> thankfully, this is a niche application and everything thinks in ASCII 
> or some derivative thereof.
> Byte Order Mark is irrelevant to utf-8 since bytes are ordered in the 
> correct order.
> I have run into several editors that have insisted on emitting a BOM for 
> UTF8 when initially promoted from ASCII, but subsequently deleting it 
> doesn't bother anything.
I mostly agree.  Please note that the paper suggests use of a BOM only 
as a last resort.  The goal is to further discourage its use with rationale.
>
> I am curious though, what was the actual problem you ran into that 
> makes you even consider this modification?

I'm working on improving support for portable C++ source code. Today, 
there is no character encoding that is supported by all C++ 
implementations (not even ASCII).  I'd like to make UTF-8 that commonly 
supported character encoding.  For backward compatibility reasons, 
compilers cannot change their default source code character encoding to 
UTF-8.

Most C++ applications are created from components that have different 
release schedules and that are maintained by different organizations.  
Synchronizing a conversion to UTF-8 across dependent projects isn't 
feasible, nor is converting all of the source files used by an 
application to UTF-8 as simple as just running them through 'iconv'.  
Migration to UTF-8 will therefore require an incremental approach for at 
least some applications, though many are likely to find success by 
simply invoking their compiler with the appropriate -everything-is-utf8 
option since most source files are ASCII.
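
As a rough illustration of why a blanket UTF-8 option is expected to work 
for most files, here is a minimal Python sketch that sorts raw source 
bytes into three groups.  The function name classify_source_bytes is 
hypothetical and is not part of any compiler or tool mentioned here; the 
sketch only assumes that a pure ASCII file decodes to the same text under 
ASCII and UTF-8.

```python
def classify_source_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for the raw bytes of a source file.

    Hypothetical helper for illustration only.
    """
    if data.isascii():
        # Pure ASCII is also valid UTF-8 and decodes to the same text,
        # so such files are unaffected by a switch to UTF-8.
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: some other encoding, which this sketch
        # does not try to identify.
        return "other"
    return "utf-8"
```

Files reported as 'other' are the ones that would need conversion, or some 
per-file indication of their encoding, during an incremental migration.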

Microsoft Visual C++ recognizes a UTF-8 BOM as an encoding signature and 
allows differently encoded source files to be used in the same 
translation unit.  Support for differently encoded source files in the 
same translation unit is the feature that will be needed to enable 
incremental migration.  Normative discouragement (with rationale) by the 
Unicode standard of the use of a BOM would help explain why a 
solution other than a BOM (perhaps something like Python's encoding 
declaration 
<https://docs.python.org/3/reference/lexical_analysis.html#encoding-declarations>) 
should be standardized instead of the existing practice demonstrated by 
Microsoft's solution.
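
To make the two kinds of per-file marker concrete, here is a minimal 
Python sketch that looks for either a UTF-8 BOM or a Python-style 
encoding declaration in the first two lines of a file.  The function name 
detect_encoding_marker is hypothetical.  The pattern follows Python's 
documented declaration syntax, including its '#' comment form; what an 
equivalent declaration would look like in C++ source is not specified 
here, and this is not a description of how any compiler behaves.

```python
import codecs
import re

# Pattern from Python's documented encoding-declaration syntax (PEP 263).
_DECLARATION = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding_marker(data: bytes):
    """Return (kind, encoding) for a file's raw bytes.

    kind is 'bom', 'declaration', or 'none'.  Hypothetical helper for
    illustration only.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The UTF-8 BOM is the byte sequence EF BB BF at the start of the file.
        return ("bom", "utf-8")
    # Python only honors a declaration on the first or second line.
    for line in data.splitlines()[:2]:
        match = _DECLARATION.match(line)
        if match:
            return ("declaration", match.group(1).decode("ascii"))
    return ("none", None)
```

A BOM can only say "this is UTF-8", whereas a declaration names the 
encoding explicitly and stays visible in the text of the file.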

Tom.

>
> J
>
>     Tom.
>
>>
>>     AlisdairM
>>
>>>     On Oct 10, 2020, at 14:54, Tom Honermann via SG16
>>>     <sg16 at lists.isocpp.org <mailto:sg16 at lists.isocpp.org>> wrote:
>>>
>>>     Attached is a draft proposal for the Unicode standard that
>>>     intends to clarify the current recommendation regarding use of a
>>>     BOM in UTF-8 text.  This is a follow-up to discussion on the
>>>     Unicode mailing list
>>>     <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html>
>>>     back in June.
>>>
>>>     Feedback is welcome.  I plan to submit
>>>     <https://www.unicode.org/pending/docsubmit.html> this to the UTC
>>>     in a week or so pending review feedback.
>>>
>>>     Tom.
>>>
>>>     <Unicode-BOM-guidance.pdf>
>>>     --
>>>     SG16 mailing list
>>>     SG16 at lists.isocpp.org <mailto:SG16 at lists.isocpp.org>
>>>     https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>
>>
>
