What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?

Martin J. Dürst duerst at it.aoyama.ac.jp
Sat Jun 6 19:40:08 CDT 2020



On 07/06/2020 00:43, Doug Ewell via Unicode wrote:
> Eli Zaretskii wrote:
> 
>>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>>> actually represent non-ASCII text.

Well, yes, but if you exploit the fact that 7-bit ISO 2022 encodings 
contain ESC characters with specific character sequences thereafter, 
whereas UTF-8 text doesn't, that case should be easy to handle, too.
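
As a rough sketch of that check (not taken from any particular detector): the
function name and the regular expression below are illustrative assumptions.
The pattern follows the general ISO 2022 shape of a designation escape: ESC,
one of the intermediate bytes used for designating character sets, optional
further intermediate bytes, then a final byte. Strictly speaking ESC is also a
legal character in UTF-8, so this is a heuristic, not a proof.

```python
import re

# ESC, then '$' (multi-byte set) or one of ( ) * + - . / (94/96-character set),
# then optional further intermediate bytes (0x20-0x2F) and a final byte (0x30-0x7E).
_ISO2022_DESIGNATION = re.compile(rb"\x1b[$()*+\-./][\x20-\x2f]*[\x30-\x7e]")


def looks_like_iso2022(data: bytes) -> bool:
    """Heuristic: True if data is purely 7-bit and contains a designation escape."""
    if any(b >= 0x80 for b in data):
        return False  # 8-bit bytes present, so this is not a 7-bit ISO 2022 stream
    return _ISO2022_DESIGNATION.search(data) is not None
```

Under these assumptions, ISO-2022-JP data such as b"\x1b$B$3$s\x1b(B" is
flagged, while plain ASCII and terminal colour codes like b"\x1b[31m" are not.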


>>> I mentioned that later....  But there is a lot of content for
>>> interchange that are single/double byte (8 bit) rather than requiring
>>> escape sequences.  The 2022 encodings seem rarer, though it may
>>> depend on your data source.
>>
>> I agree that ISO 2022 is rare these days, but rarity doesn't help when
>> you need to be accurate in decoding, because mistaking one encoding
>> for another produces horribly incorrect results, and users complain
>> vociferously when that happens.
> 
> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.
> 
> Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic. I bet it would fail spectacularly for Mongolian, for example.

I agree. What's difficult is distinguishing the various non-UTF-8
encodings among themselves. Compared to that, identifying something as
UTF-8 is much easier. It's not 100% foolproof, in particular not for very
short pieces of non-ASCII text (just a word or so), but it gets better
very, very fast the more non-ASCII text you have.


Regards,   Martin.

