[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Tue Oct 13 16:06:28 CDT 2020

On 10/13/20 4:42 PM, Shawn Steele wrote:
>
> My assertion is that if the application cannot change to UTF-8 due to 
> legacy considerations, that the subtleties of whether to use a BOM or 
> not also cannot be prescribed.  If the application could follow best 
> practices, it would use UTF-8.  Since it cannot use UTF-8, therefore 
> it can’t follow any prescribed behavior.  Therefore anything beyond 
> “Use Unicode!” is merely suggestions.  Terminology like “require” 
> implies a false sense of rigor that these applications can’t follow in 
> practice.
>
This is why the prescription remains abstract:

  * If possible, use something other than a BOM.
  * As a last resort, use a BOM.

I am effectively proposing that as a best practice.

> Eg:  Presume I have a text editor that has been used in some context 
> for some time.  If I’m told “use UTF-8”, that’s cool, I could try to 
> do that, but if I cannot, then I’m in an exceptional path.  Unicode 
> could suggest that I consider behavior for BOMs (such as ignoring them 
> if present), however I’m already stuck in my legacy behavior, so 
> there’s a limit to what my application can do.
>
This scenario fits the advice above.  The "use something other than a 
BOM" could mean adding a command line option, adding a menu option, 
remembering what encoding was used for that file last time, performing a 
heuristic analysis (that may or may not include the presence of a BOM in 
its calculation), prompting the user, etc...
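
As a rough illustration of that ordering, here is a minimal Python sketch. The function name, its parameters, and the cp1252 legacy default are all hypothetical; the point is only that out-of-band information is consulted first and the BOM is the final signal checked before falling back to the legacy default.

```python
import codecs
from typing import Optional


def choose_encoding(
    data: bytes,
    explicit: Optional[str] = None,    # hypothetical: command line or menu option
    remembered: Optional[str] = None,  # hypothetical: encoding used for this file last time
    legacy_default: str = "cp1252",    # assumed legacy default; yours may differ
) -> str:
    """Pick an encoding name, treating the BOM as the last resort signal."""
    if explicit:
        return explicit
    if remembered:
        return remembered
    # Last resort: a leading UTF-8 BOM. "utf-8-sig" is Python's codec that
    # strips the BOM while decoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return legacy_default
```

A user prompt or a content heuristic could be slotted in ahead of the BOM check in the same way.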
>
> However, if Unicode says “if you see a BOM, then you must use UTF-8”, 
> then users of my legacy application that is difficult to change, may 
> have expectations of the application that don’t match reality.  They 
> could even enter bugs like “The app isn’t recognizing data being 
> tagged with BOMs.”  Or “your system isn’t compliant, so we can’t 
> license it.”  If the app could properly handle UTF-8, it would have been 
> captured by the first requirements and we wouldn’t even be having this 
> part of the conversation.  Since they can’t handle UTF-8, trying to 
> enforce it through the BOM isn’t going to add much.
>
No part of this proposal states "if you see a BOM, then you must use 
UTF-8".  It only suggests guidelines; requirements are imposed by 
protocols as deemed appropriate by the protocol designers.
>
> IMO it’s better that everyone involved understand that this legacy app 
> that can’t handle UTF-8 by default isn’t necessarily going to behave 
> per any set expectations and likely has legacy behaviors that users 
> may need to deal with.
>
> Granted, the difference between “requiring,” and “suggesting” or 
> “recommending”, may be subtle, however those subtleties can sometimes 
> cause unnecessary pain.
>
> I don’t mind mandating UTF-8 without BOM if possible.  I don’t really 
> mind mandating that BOMs be ignored if “without BOM” isn’t reasonable 
> to mandate.
>
> After that though, it’s trying to create a higher order protocol for 
> codepage detection.  BOM isn’t a great way to identify UTF-8 data.  
> (It’s probably more effective to decode it as UTF-8.  If it decodes 
> properly, then it’s likely UTF-8.  With a certainty of about as many 
> “nines” as you have bytes of input.  Linguistically appropriate 
> strings that fail that test are rare.)
>
We are agreed on these points.
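
For what it's worth, the "decode it and see" test is only a few lines in most languages. A minimal Python sketch follows; the function name is made up for illustration.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

Note that pure ASCII input always passes, so the test only discriminates once non-ASCII bytes are present. For example, "café" encoded as Latin-1 ends in a lone 0xE9 byte and is rejected, while its UTF-8 encoding is accepted.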

Tom.

> -Shawn
>
> *From:* Tom Honermann <tom at honermann.net>
> *Sent:* Tuesday, October 13, 2020 1:04 PM
> *To:* Shawn Steele <Shawn.Steele at microsoft.com>; Alisdair Meredith 
> <alisdairm at me.com>
> *Cc:* sg16 at lists.isocpp.org; Unicode Mail List <unicode at unicode.org>
> *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of a 
> BOM as a UTF-8 encoding signature
>
> On 10/12/20 4:54 PM, Shawn Steele wrote:
>
>     I’m having trouble with the attempt to be this prescriptive.
>
>     These make sense:  “Use Unicode!”
>
>       * If possible, mandate use of UTF-8 without a BOM; diagnose the
>         presence of a BOM in consumed text as an error, and produce
>         text without a BOM.
>       * Alternatively, swallow the BOM if present.
>
>     After that the situation is clearly hopeless.  Applications should
>     Use Unicode, eg: UTF-8, and clearly there are cases happening
>     where that isn’t happening.  Trying to prescribe that negotiation
>     should therefore happen, or that BOMs should be interpreted or
>     whatever is fairly meaningless at that point.  Given that the
>     higher-order guidance of “Use Unicode” has already been ignored,
>     at this point it’s garbage-in, garbage-out.  Clearly the
>     app/whatever is ignoring the “use unicode” guidance for some
>     legacy reason. If they could adapt, it should be to use UTF-8. 
>      It **might** be helpful to say something about a BOM likely
>     indicating UTF-8 text in otherwise unspecified data, but
>     prescriptive stuff is pointless, it’s legacy stuff that behaves in
>     a legacy fashion for a reason and saying they should have done it
>     differently 20 years ago isn’t going to help 😊
>
> There are applications that, for legacy reasons, are unable to change 
> their default encoding to UTF-8, but that also need to handle UTF-8 
> text.  It is not clear to me that such situations are hopeless or that 
> they cannot be improved.
>
> The prescription offered follows what you suggest.  The first three 
> cases are all of the "use Unicode!" variety.  The distinction 
> between the third and the fourth is to relegate use of a BOM as an 
> encoding signature to the last resort option.  The intent is to make 
> it clear, with stronger motivation than is currently present in the 
> Unicode standard, that use of a BOM in UTF-8 is not a best practice today.
>
> Tom.
>
>     -Shawn
>
>     *From:* Unicode <unicode-bounces at unicode.org> *On Behalf Of *Tom
>     Honermann via Unicode
>     *Sent:* Monday, October 12, 2020 7:03 AM
>     *To:* Alisdair Meredith <alisdairm at me.com>
>     *Cc:* sg16 at lists.isocpp.org; Unicode List <unicode at unicode.org>
>     *Subject:* Re: [SG16] Draft proposal: Clarify guidance for use of
>     a BOM as a UTF-8 encoding signature
>
>     Great, here is the change I'm making to address this:
>
>         Protocol designers:
>
>           * If possible, mandate use of UTF-8 without a BOM; diagnose
>             the presence of a BOM in consumed text as an error, and
>             produce text without a BOM.
>           * Otherwise, if possible, mandate use of UTF-8 with or
>             without a BOM; accept and discard a BOM in consumed text,
>             and produce text without a BOM.
>           * Otherwise, if possible, use UTF-8 as the default encoding
>             with use of other encodings negotiated using information
>             other than a BOM; accept and discard a BOM in consumed
>             text, and produce text without a BOM.
>           * Otherwise, require the presence of a BOM to differentiate
>             UTF-8 encoded text in both consumed and produced
>             text *unless the absence of a BOM would result in the text
>             being interpreted as an ASCII-based encoding and the UTF-8
>             text contains no non-ASCII characters (the exception is
>             intended to avoid the addition of a BOM to ASCII text thus
>             rendering such text as non-ASCII)*. This approach should
>             be reserved for scenarios in which UTF-8 cannot be adopted
>             as a default due to backward compatibility concerns.
>
>     Tom.
>
>     On 10/12/20 8:40 AM, Alisdair Meredith wrote:
>
>         That addresses my main concern.  Essentially, best practice
>         (for UTF-8) would be no BOM unless the document contains code
>         points that require multiple code units to express.
>
>         AlisdairM
>
>
>
>
>             On Oct 11, 2020, at 23:22, Tom Honermann
>             <tom at honermann.net> wrote:
>
>             On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>
>                 One concern I have, that might lead into rationale for
>                 the current discouragement,
>
>                 is that I would hate to see a best practice that
>                 pushes a BOM into ASCII files.
>
>                 One of the nice properties of UTF-8 is that a valid
>                 ASCII file (still very common) is
>
>                 also a valid UTF-8 file.  Changing best practice would
>                 encourage updating those
>
>                 files to be no longer ASCII.
>
>             Thanks, Alisdair.  I think that concern is implicitly
>             addressed by the suggested resolutions, but perhaps that
>             can be made more clear.  One possibility would be to
>             modify the "protocol designer" guidelines to address the
>             case where a protocol's default encoding is ASCII based
>             and to specify that a BOM is only required for UTF-8 text
>             that contains non-ASCII characters.  Would that be helpful?
>
>             Tom.
>
>                 AlisdairM
>
>
>
>
>                     On Oct 10, 2020, at 14:54, Tom Honermann via SG16
>                     <sg16 at lists.isocpp.org> wrote:
>
>                     Attached is a draft proposal for the Unicode
>                     standard that intends to clarify the current
>                     recommendation regarding use of a BOM in UTF-8
>                     text.  This is a follow-up to discussion on the
>                     Unicode mailing list
>                     <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html>
>                     back in June.
>
>                     Feedback is welcome.  I plan to submit
>                     <https://www.unicode.org/pending/docsubmit.html>
>                     this to the UTC in a week or so pending review
>                     feedback.
>
>                     Tom.
>
>                     <Unicode-BOM-guidance.pdf>--
>                     SG16 mailing list
>                     SG16 at lists.isocpp.org <mailto:SG16 at lists.isocpp.org>
>                     https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>
>
>
>
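
For readers who prefer code, the four "protocol designer" options quoted above might be sketched roughly as follows in Python. All names here are hypothetical, and the sketch simplifies: the second and third options share one policy value (the negotiation of other encodings in the third option is not modeled), and the legacy encoding is assumed to be ASCII-based (cp1252 is used as a stand-in).

```python
import codecs
from enum import Enum


class BomPolicy(Enum):
    FORBID = 1   # option 1: UTF-8 only, a BOM in consumed text is an error
    DISCARD = 2  # options 2 and 3: accept and discard a BOM
    REQUIRE = 3  # option 4: a BOM differentiates UTF-8 from the legacy default


def consume(data: bytes, policy: BomPolicy, legacy_encoding: str = "cp1252") -> str:
    """Decode consumed bytes according to the chosen policy."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    if policy is BomPolicy.FORBID:
        if has_bom:
            raise ValueError("BOM not permitted by this protocol")
        return data.decode("utf-8")
    if policy is BomPolicy.DISCARD:
        return body.decode("utf-8")
    # REQUIRE: only a BOM marks the text as UTF-8; otherwise use the legacy default.
    return body.decode("utf-8") if has_bom else data.decode(legacy_encoding)


def produce(text: str, policy: BomPolicy) -> bytes:
    """Encode produced text; only REQUIRE ever emits a BOM."""
    encoded = text.encode("utf-8")
    # The ASCII exception: pure ASCII text stays BOM-free so it remains valid ASCII.
    if policy is BomPolicy.REQUIRE and not text.isascii():
        return codecs.BOM_UTF8 + encoded
    return encoded
```

With `BomPolicy.REQUIRE`, `produce("abc", ...)` yields plain ASCII bytes, while text containing "é" is prefixed with the three-byte signature.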
