[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Mon Oct 12 09:02:49 CDT 2020

Great, here is the change I'm making to address this:

    Protocol designers:

      * If possible, mandate use of UTF-8 without a BOM; diagnose the
        presence of a BOM in consumed text as an error, and produce text
        without a BOM.
      * Otherwise, if possible, mandate use of UTF-8 with or without a
        BOM; accept and discard a BOM in consumed text, and produce text
        without a BOM.
      * Otherwise, if possible, use UTF-8 as the default encoding with
        use of other encodings negotiated using information other than a
        BOM; accept and discard a BOM in consumed text, and produce text
        without a BOM.
      * Otherwise, require the presence of a BOM to differentiate UTF-8
        encoded text in both consumed and produced text *unless the
        absence of a BOM would result in the text being interpreted as
        an ASCII-based encoding and the UTF-8 text contains no non-ASCII
        characters (the exception is intended to avoid the addition of a
        BOM to ASCII text thus rendering such text as non-ASCII)*. This
        approach should be reserved for scenarios in which UTF-8 cannot
        be adopted as a default due to backward compatibility concerns.
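
As a rough illustration only (this is not part of the proposed text), the
guidelines above could be sketched in Python along these lines. The names
Policy, consume, and produce are hypothetical, as is the choice of latin-1
as the ASCII-based legacy default; a real protocol would substitute its own.
The second and third bullets are folded into one policy here on the
assumption that any out-of-band negotiation has already selected UTF-8.

```python
import codecs
from enum import Enum

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


class Policy(Enum):
    # Hypothetical names, one per guideline (bullets 2 and 3 share one).
    FORBID_BOM = 1    # bullet 1: UTF-8 only, a BOM is an error
    TOLERATE_BOM = 2  # bullets 2 and 3: accept and discard a BOM
    REQUIRE_BOM = 3   # bullet 4: a BOM distinguishes UTF-8 from a legacy default


def consume(data: bytes, policy: Policy, legacy_encoding: str = "latin-1") -> str:
    """Decode consumed bytes according to the chosen policy."""
    has_bom = data.startswith(BOM)
    if policy is Policy.FORBID_BOM:
        if has_bom:
            raise ValueError("BOM not permitted by this protocol")
        return data.decode("utf-8")
    if policy is Policy.TOLERATE_BOM:
        # Accept and discard a leading BOM, if any.
        return (data[len(BOM):] if has_bom else data).decode("utf-8")
    # REQUIRE_BOM: without a BOM the text is taken to be in the legacy
    # encoding (assumed here to be ASCII-based).
    if has_bom:
        return data[len(BOM):].decode("utf-8")
    return data.decode(legacy_encoding)


def produce(text: str, policy: Policy) -> bytes:
    """Encode produced text as UTF-8, adding a BOM only where required."""
    encoded = text.encode("utf-8")
    # The exception in bullet 4: pure ASCII text is emitted without a BOM
    # so that it remains valid ASCII.
    if policy is Policy.REQUIRE_BOM and not text.isascii():
        return BOM + encoded
    return encoded
```

Under these assumptions, produce("abc", Policy.REQUIRE_BOM) leaves ASCII text
untouched, while text containing non-ASCII characters gets the BOM prefix.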

Tom.

On 10/12/20 8:40 AM, Alisdair Meredith wrote:
> That addresses my main concern.  Essentially, best practice (for 
> UTF-8) would be no BOM unless the document contains code points that 
> require multiple code units to express.
>
> AlisdairM
>
>> On Oct 11, 2020, at 23:22, Tom Honermann <tom at honermann.net> wrote:
>>
>> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>>> One concern I have, that might lead into rationale for the current 
>>> discouragement,
>>> is that I would hate to see a best practice that pushes a BOM into 
>>> ASCII files.
>>> One of the nice properties of UTF-8 is that a valid ASCII file 
>>> (still very common) is
>>> also a valid UTF-8 file.  Changing best practice would encourage 
>>> updating those
>>> files to be no longer ASCII.
>>
>> Thanks, Alisdair.  I think that concern is implicitly addressed by 
>> the suggested resolutions, but perhaps that can be made more clear.  
>> One possibility would be to modify the "protocol designer" guidelines 
>> to address the case where a protocol's default encoding is ASCII 
>> based and to specify that a BOM is only required for UTF-8 text that 
>> contains non-ASCII characters.  Would that be helpful?
>>
>> Tom.
>>
>>>
>>> AlisdairM
>>>
>>>> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 
>>>> <sg16 at lists.isocpp.org> wrote:
>>>>
>>>> Attached is a draft proposal for the Unicode standard that intends 
>>>> to clarify the current recommendation regarding use of a BOM in 
>>>> UTF-8 text. This is follow up to discussion on the Unicode mailing 
>>>> list 
>>>> <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html> 
>>>> back in June.
>>>>
>>>> Feedback is welcome.  I plan to submit 
>>>> <https://www.unicode.org/pending/docsubmit.html> this to the UTC in 
>>>> a week or so pending review feedback.
>>>>
>>>> Tom.
>>>>
>>>> <Unicode-BOM-guidance.pdf>
>>>> --
>>>> SG16 mailing list
>>>> SG16 at lists.isocpp.org <mailto:SG16 at lists.isocpp.org>
>>>> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>>>
>>>
>>
>
