What is the Unicode guidance regarding the use of a BOM as a UTF-8 encoding signature?
doug at ewellic.org
Sat Jun 6 10:43:34 CDT 2020
Eli Zaretskii wrote:
>>> That strategy would fail with 7-bit ISO 2022 based encodings, no?
>>> They look like plain 7-bit ASCII (which will not fail UTF-8), but
>>> actually represent non-ASCII text.
>> I mentioned that later.... But there is a lot of content for
>> interchange that is single/double byte (8 bit) rather than requiring
>> escape sequences. The 2022 encodings seem rarer, though it may
>> depend on your data source.
> I agree that ISO 2022 is rare these days, but rarity doesn't help when
> you need to be accurate in decoding, because mistaking one encoding
> for another produces horribly incorrect results, and users complain
> vociferously when that happens.
If you need to deal with an arbitrary set of encodings, such as CP1255 and CP1256 and 7-bit ISO 2022-based encodings, instead of just CP1252 versus UTF-8 as Karl stated, then auto-detection won't work without a fair amount of natural language context. Otherwise, the text really has to be tagged.
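To make the distinction concrete, here is a minimal Python sketch of the "try UTF-8, else assume CP1252" strategy and the two ways it breaks down once more encodings are in play. The function name `decode_utf8_or_cp1252` and the sample strings are illustrative assumptions of mine, not anything from the thread or from a particular library; the codec names are the standard Python ones.

```python
from typing import Tuple


def decode_utf8_or_cp1252(data: bytes) -> Tuple[str, str]:
    """Hypothetical helper: strict UTF-8 first, CP1252 as the fallback."""
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # CP1252 leaves a few byte values undefined, so replace rather than raise.
        return data.decode("cp1252", errors="replace"), "cp1252"


# Two-way case: legacy 8-bit text is almost never valid UTF-8, so this works.
from_utf8 = decode_utf8_or_cp1252("café".encode("utf-8"))
from_cp1252 = decode_utf8_or_cp1252("café".encode("cp1252"))

# 7-bit ISO 2022 case: every byte is below 0x80, so strict UTF-8 decoding
# succeeds, but the result is escape sequences and ASCII, not the real text.
jp_bytes = "日本語".encode("iso2022_jp")
jp_text, jp_guess = decode_utf8_or_cp1252(jp_bytes)

# Multiple 8-bit code pages: the same bytes decode without error under both
# CP1255 (Hebrew) and CP1256 (Arabic), so validity alone cannot choose.
hebrew_bytes = "שלום".encode("cp1255")
as_cp1255 = hebrew_bytes.decode("cp1255")
as_cp1256 = hebrew_bytes.decode("cp1256")
```

In the last two cases nothing raises an error, which is exactly the problem: only knowledge of the expected language, or an explicit tag, can tell the candidate decodings apart.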
Long ago I wrote some code that detected Russian text in any of six popular (at the time) Cyrillic encodings, and it seldom got it wrong, but I have no idea how it would do for other, especially non-Slavic, languages written in Cyrillic. I bet it would fail spectacularly for Mongolian, for example.
Doug Ewell | Thornton, CO, US | ewellic.org