What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
Martin J. Dürst
duerst at it.aoyama.ac.jp
Sat Jun 6 19:40:08 CDT 2020
On 07/06/2020 00:43, Doug Ewell via Unicode wrote:
> Eli Zaretskii wrote:
>
>>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>>> actually represent non-ASCII text.
Well, yes, but if you exploit the fact that 7-bit ISO 2022 encodings
contain ESC characters with specific character sequences thereafter,
whereas UTF-8 text doesn't, that case should be easy to handle, too.
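
To illustrate the ESC check, here is a minimal Python sketch. The function
name looks_like_iso2022 and the regular expression are my own assumptions
for illustration. The pattern only covers the '$' and '(' designation
sequences used by ISO-2022-JP, ISO-2022-KR and ISO-2022-CN, not every
escape sequence that ISO 2022 allows.

```python
import re

# ESC, then '$' or '(', optional intermediate bytes (0x20-0x2F), then a
# final byte (0x30-0x7E). Assumed subset of ISO 2022 designations only.
ISO2022_DESIGNATION = re.compile(rb"\x1b[$(][\x20-\x2f]*[\x30-\x7e]")


def looks_like_iso2022(data: bytes) -> bool:
    """Return True if data is pure 7-bit and contains a charset designation."""
    if any(b >= 0x80 for b in data):
        # 7-bit ISO 2022 encodings never use bytes with the high bit set.
        return False
    return ISO2022_DESIGNATION.search(data) is not None


if __name__ == "__main__":
    sample = "こんにちは".encode("iso2022_jp")
    print(looks_like_iso2022(sample))            # Japanese text in ISO-2022-JP
    print(looks_like_iso2022(b"plain ASCII"))    # no ESC sequence at all
```

Running this check before the UTF-8 test avoids treating such data as
plain ASCII.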
>>> I mentioned that later.... But there is a lot of content for
>>> interchange that are single/double byte (8 bit) rather than requiring
>>> escape sequences. The 2022 encodings seem rarer, though it may
>>> depend on your data source.
>>
>> I agree that ISO 2022 is rare these days, but rarity doesn't help when
>> you need to be accurate in decoding, because mistaking one encoding
>> for another produces horribly incorrect results, and users complain
>> vociferously when that happens.
>
> If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.
>
> Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic. I bet it would fail spectacularly for Mongolian, for example.
I agree. What's difficult is distinguishing the various non-UTF-8
encodings among themselves. Compared to that, identifying something as
UTF-8 is much easier. It's not 100% foolproof, in particular not for very
short pieces of non-ASCII text (just a word or so), but it gets better
very, very fast the more non-ASCII text you have.
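
As a rough sketch of what I mean (Python, with a hypothetical helper
named utf8_evidence that is not from any library), a strict decode
either succeeds or fails, and the number of non-ASCII characters that
decoded cleanly gives a feeling for how much evidence there is:

```python
from typing import Tuple


def utf8_evidence(data: bytes) -> Tuple[bool, int]:
    """Return (is_valid_utf8, number_of_non_ascii_characters).

    The count is only a crude confidence hint: the more non-ASCII
    characters decode cleanly, the less likely the match is accidental.
    """
    try:
        text = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return (False, 0)
    return (True, sum(1 for ch in text if ord(ch) > 0x7F))


if __name__ == "__main__":
    # Real UTF-8 passes.
    print(utf8_evidence("Dürst".encode("utf-8")))
    # The same word in CP1252 has a lone 0xFC byte, which is invalid UTF-8.
    print(utf8_evidence("Dürst".encode("cp1252")))
    # A very short CP1252 string can still be valid UTF-8 by accident:
    # the bytes C3 A9 also decode as a single "é".
    print(utf8_evidence("Ã©".encode("cp1252")))
```

The last case is the kind of short-text false positive mentioned above.
With longer text, every further non-ASCII sequence would also have to be
well-formed by accident, which is why the check improves so quickly.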
Regards, Martin.