[SG16] Draft proposal: Clarify guidance for use of a BOM as a UTF-8 encoding signature

Mon Oct 12 19:09:56 CDT 2020

On Sun, Oct 11, 2020 at 8:24 PM Tom Honermann via Unicode <
unicode at unicode.org> wrote:

> On 10/10/20 7:58 PM, Alisdair Meredith via SG16 wrote:
>
> One concern I have, that might lead into rationale for the current
> discouragement, is that I would hate to see a best practice that pushes
> a BOM into ASCII files. One of the nice properties of UTF-8 is that a
> valid ASCII file (still very common) is also a valid UTF-8 file.
> Changing best practice would encourage updating those files to be no
> longer ASCII.
>
> Thanks, Alisdair.  I think that concern is implicitly addressed by the
> suggested resolutions, but perhaps that can be made more clear.  One
> possibility would be to modify the "protocol designer" guidelines to
> address the case where a protocol's default encoding is ASCII based and to
> specify that a BOM is only required for UTF-8 text that contains non-ASCII
> characters.  Would that be helpful?
>

'and to specify that a BOM is only required for UTF-8': this should NEVER
be 'required' or 'must'; it shouldn't even be 'suggested'. Fortunately, the
BOM is just a ZWNBSP (U+FEFF), so it's certainly fine to say that text 'may'
start with one.
These days the standard assumption that 'everything IS UTF-8' works really
well, except in Firefox, where the charset has to be specified for JS
scripts (but that's a bug in Firefox).
EBCDIC should be converted at the edge to ASCII internally, since,
thankfully, it is a niche application and everything else thinks in ASCII
or some derivative thereof.
A byte order mark is irrelevant to UTF-8, since its bytes always come in
the same order.
I have run into several editors that insisted on emitting a BOM for UTF-8
when a file was first promoted from ASCII, but deleting it afterwards
doesn't bother anything.
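
To make the points above concrete, here is a minimal Python sketch. It
assumes nothing beyond the standard library; the helper name
strip_utf8_bom is hypothetical and only for illustration. It shows that an
ASCII file is already valid UTF-8, that the UTF-8 BOM is just U+FEFF
encoded as three bytes, and that removing it leaves the text unchanged.

```python
import codecs

# codecs.BOM_UTF8 is b'\xef\xbb\xbf', i.e. U+FEFF (ZWNBSP) encoded as UTF-8.
BOM = codecs.BOM_UTF8


def strip_utf8_bom(data: bytes) -> bytes:
    """Hypothetical helper: drop a leading UTF-8 BOM if one is present."""
    return data[len(BOM):] if data.startswith(BOM) else data


ascii_text = "plain ASCII text\n"
ascii_bytes = ascii_text.encode("ascii")

# A pure-ASCII file is byte-for-byte identical to its UTF-8 encoding.
assert ascii_bytes == ascii_text.encode("utf-8")

# Prepending a BOM adds three non-ASCII bytes, so the file stops being ASCII.
with_bom = BOM + ascii_bytes

# A plain UTF-8 decode keeps the BOM as a leading U+FEFF character;
# Python's "utf-8-sig" codec skips it instead.
kept = with_bom.decode("utf-8")
skipped = with_bom.decode("utf-8-sig")
```

Deleting the BOM, as with the editors mentioned above, simply restores the
original ASCII bytes: strip_utf8_bom(with_bom) equals ascii_bytes.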

I am curious though, what was the actual problem you ran into that makes
you even consider this modification?

J

> Tom.
>
>
> AlisdairM
>
> On Oct 10, 2020, at 14:54, Tom Honermann via SG16 <sg16 at lists.isocpp.org>
> wrote:
>
> Attached is a draft proposal for the Unicode standard that intends to
> clarify the current recommendation regarding use of a BOM in UTF-8 text.
> This is follow up to discussion on the Unicode mailing list
> <https://corp.unicode.org/pipermail/unicode/2020-June/008713.html> back
> in June.
>
> Feedback is welcome.  I plan to submit
> <https://www.unicode.org/pending/docsubmit.html> this to the UTC in a
> week or so pending review feedback.
>
> Tom.
> <Unicode-BOM-guidance.pdf>--
> SG16 mailing list
> SG16 at lists.isocpp.org
> https://lists.isocpp.org/mailman/listinfo.cgi/sg16
>