ICU encoding name alias conflicts

Harriet Riddle harjitmoe at outlook.com
Sun Nov 21 17:03:34 CST 2021


Hello.

Long infodump ahead, but there are several things going on here.

① Some of these are different mappings for the same encoding, e.g. ibm-33722_P120-1999 versus ibm-33722_P12A_P12A-2009_U2. This is because the mapping of legacy character sets, JIS X 0208 and a subset of JIS X 0212 in this case, isn't always universally agreed upon between vendors (MINUS SIGN versus FULLWIDTH HYPHEN-MINUS, EM DASH versus HORIZONTAL BAR, WAVE DASH versus FULLWIDTH TILDE versus TILDE OPERATOR, et cetera), to say nothing of the REVERSE SOLIDUS / YEN SIGN / WON SIGN brouhaha.
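One such disagreement can be seen concretely with the short sketch below. It uses Python's built-in `shift_jis` and `cp932` codecs, which are Python's own tables rather than ICU's; they are assumed here only as stand-ins for a JIS-style mapping and Microsoft's mapping. The two byte sequences are the Shift_JIS codes for the wave dash and the minus sign, and the helper name `describe` is made up for illustration.

```python
import unicodedata

# Shift_JIS byte sequences whose Unicode mapping differs between vendors.
SAMPLES = {
    "wave dash": b"\x81\x60",
    "minus sign": b"\x81\x7c",
}


def describe(raw: bytes, codec: str) -> str:
    """Decode one character with the given codec and return 'U+XXXX NAME'."""
    ch = raw.decode(codec)
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


for label, raw in SAMPLES.items():
    # Same bytes, two codecs: the resulting code points differ.
    print(label, "|", describe(raw, "shift_jis"), "|", describe(raw, "cp932"))
```

The same bytes decode to different code points depending on whose table is used, which is why a converter needs a separate mapping for each vendor's variant.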

As a sidenote, IBM-33722 is the subset of IBM-954 (IBM's version of EUC-JP) that can be converted to IBM-942, similarly to how IBM-5050 is the subset of IBM-954 that can be converted to IBM-932, which is a subset of IBM-942 without the single-byte extensions (hence IBM-5050 is aliased to its superset IBM-33722). Why both aren't just aliased to IBM-954 is beyond me.

Further sidenote: both IBM-954 and the OSF/TUG eucJP-open encode the subset of the IBM Extensions section from IBM-932 that doesn't have standard codepoints in JIS X 0212 to an extension range in empty space in JIS X 0212; however, these schemes collide with one another. In practice, it is NEC's scheme (which encodes the subset of the IBM Extensions section that doesn't have standard codepoints in NEC Row 13 to empty space in JIS X 0208) that gets used more often, in both EUC-JP and Shift_JIS, even when the IBM Extensions themselves are also included (as in Windows code page 932).

② A pervasive problem with legacy character encoding names is that Microsoft and IBM often use different definitions for a given code page number. For instance, code page 932 was modified by Microsoft to use a newer JIS X 0208 edition and add NEC extensions as well as the existing IBM extensions (IBM-932 was also updated with the newer JIS X 0208 repertoire, but without the codepoint swaps of kyuujitai with corresponding extended shinjitai between levels 1 and 2 that JIS X 0208 made in 1983, and excluding additions which duplicated the existing IBM extensions). Microsoft's code page 932 was later adopted by IBM as code page 943. Hence some labels are inherently ambiguous.

Likewise, IBM code page 949 and Windows code page 949 are both supersets of EUC-KR, but the similarities end there: the Windows one is Unified Hangul Code, while IBM's adds its own extensions outside of the EUC range to fully support the repertoires of IBM-933 and IBM-934. IBM's 1363 is Windows-949, although IBM and Microsoft don't entirely agree on the mapping.

IBM's code page 950 and Windows code page 950 are both subsets of Big5-ETEN, but IBM includes only the part of the ETEN extensions that Microsoft doesn't, both treating the other range as user-defined; IBM-1373 corresponds to Windows-950.

Code page 936 is the most egregious: on Windows it formerly referred to EUC-CN and latterly to GBK, while IBM's definition seemingly refers to Shift_GB (or something very similar), though IBM-936 is heavily deprecated and is omitted by ICU.

IBM-874 and Windows-874 are also different, otherwise-unrelated, extensions of TIS-620, the national standard which would, with a minor revision, become ISO-8859-11.

③ IBM distinguishes between CPGIDs and CCSIDs, which essentially share one namespace. A CPGID identifies a fixed-width plane whose repertoire may grow (unless the plane is full). A CCSID specifies a repertoire (it can be a growing one, but must say so explicitly) and can be variable-width, combining multiple planes within a higher-level scheme (such as ISO-2022-JP, general EUC, stateful EBCDIC, or lead-byte-masked variable-width). Microsoft makes no such distinction and calls both code page numbers.

Hence, IBM-5348 (CCSID 5348) is the current version of Windows-1252, with a larger specified repertoire than IBM-1252 (CCSID 1252), which is the version of Windows-1252 before the Euro Sign Update (which also added a few characters besides the Euro sign)—but CPGID 1252 refers to the whole thing (with the maximal CCSID of 5348).

Similarly, IBM-5471 is Big5-HKSCS (2001) and IBM-1375 is Big5-HKSCS Growing, in practice meaning Big5-HKSCS (2008) as seen from its inclusion of 0x877A through 0x87DF—both are variable-width so neither is a CPGID (the pure double-byte CPGID for HKSCS is 1374).

Often updates or extensions to, or conversely subsets of, an existing CCSID get assigned CCSIDs amounting to an increment of the existing one by a multiple of 4096 (hence 1257 versus 5353 versus 9449).
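The arithmetic can be checked directly against the identifiers mentioned in this thread. What follows is a minimal sketch: the helper names are made up for illustration, and treating "same remainder modulo 4096" as a family test is only a heuristic, since unrelated CCSIDs could happen to coincide.

```python
STEP = 4096  # increment between related CCSIDs, as described above


def family_base(ccsid: int) -> int:
    """Smallest CCSID reachable by subtracting multiples of 4096 (heuristic)."""
    return ccsid % STEP


def generation(ccsid: int) -> int:
    """How many multiples of 4096 lie between the base and this CCSID."""
    return ccsid // STEP


def same_family(a: int, b: int) -> bool:
    """True when two CCSIDs differ by a multiple of 4096."""
    return family_base(a) == family_base(b)


# Identifiers taken from this message and the quoted table.
for ccsid in (1257, 5353, 9449, 1252, 5348, 954, 5050, 33722, 1375, 5471):
    print(f"{ccsid} = {family_base(ccsid)} + {generation(ccsid)} * {STEP}")
```

By this reckoning 5050 and 33722 both sit on top of 954 (one and eight steps up respectively), and 5471 sits one step above 1375.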

I think those three explanations cover everything.
—Har.

________________________________
From: Unicode <unicode-bounces at corp.unicode.org> on behalf of Tom Honermann via Unicode <unicode at corp.unicode.org>
Sent: 16 November 2021 00:20
To: SG16 <sg16 at lists.isocpp.org>; UnicoDe List <unicode at corp.unicode.org>; icu-support at lists.sourceforge.net <icu-support at lists.sourceforge.net>
Subject: ICU encoding name alias conflicts


I conducted an audit of all of the encoding names recognized by ICU with the goal of identifying any cases where comparison under the COMP_NAME loose matching algorithm specified in P1885 <https://wg21.link/p1885> would lead to a conflict in selecting an ICU converter. The good news is that no conflicts were identified that can be attributed to the loose matching algorithm. However, I found that the same alias is used for different encodings in multiple cases as described in the table below. These can be verified with ICU Converter Explorer <https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.
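For readers unfamiliar with loose matching, the sketch below follows the charset alias matching rule of UTS #22 (section 1.4), which COMP_NAME is understood to be based on. The function name `loose_key` is illustrative only, and P1885's exact wording may differ in detail.

```python
import re


def loose_key(name: str) -> str:
    """Reduce an encoding name to a key for loose comparison (UTS #22 style).

    1. Drop everything except ASCII letters and digits.
    2. Lowercase the letters.
    3. Going left to right, drop each '0' that is not preceded by a digit.
    """
    alnum = re.sub(r"[^A-Za-z0-9]", "", name).lower()
    kept = []
    for ch in alnum:
        if ch == "0" and not (kept and kept[-1].isdigit()):
            continue  # leading zero of a number: ignored
        kept.append(ch)
    return "".join(kept)


def loosely_equal(a: str, b: str) -> bool:
    """Two names match loosely when their keys are identical."""
    return loose_key(a) == loose_key(b)


print(loosely_equal("UTF-8", "u.t.f-008"))
print(loosely_equal("UTF-8", "utf-80"))
print(loosely_equal("KS_C_5601-1987", "csKSC56011987"))
```

Note that names such as KS_C_5601-1987 and csKSC56011987 stay distinct under this rule, consistent with the finding that the conflicts come from the alias data rather than from the matching.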

I did not scrape the ICU Converter Explorer page to perform the audit. The data I worked off of was produced with ICU 70.1 by running uconv -l --canon and then massaging the output.
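A rough sketch of that kind of audit is shown below. It is not the script actually used. It assumes each line of the listing has the convrtrs.txt shape, that is, a converter name followed by aliases, each optionally followed by a `{ TAG TAG* }` group. The sample lines are a hand-written illustration built from two rows of the table, and `parse_line` and `find_conflicts` are hypothetical names.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\{[^}]*\}|\S+")  # a "{ ... }" tag group or a bare name

# Hand-written illustration only, not real uconv output.
SAMPLE = """\
ibm-943_P15A-2003 cp932 { WINDOWS }
ibm-942_P12A-1999 cp932
windows-950-2000 windows-950 { WINDOWS }
ibm-1373_P100-2002 windows-950
"""


def parse_line(line):
    """Return (converter, [(name, [tags])]) or None for an empty line."""
    entries = []
    for tok in TOKEN.findall(line.split("#", 1)[0]):  # strip comments
        if tok.startswith("{"):
            if entries:  # tags apply to the name just before them
                entries[-1][1].extend(t.rstrip("*") for t in tok.strip("{}").split())
        else:
            entries.append((tok, []))
    return (entries[0][0], entries) if entries else None


def find_conflicts(lines, key=str.lower):
    """Map each alias key claimed by more than one converter to its claimants."""
    owners = defaultdict(lambda: defaultdict(set))
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        converter, entries = parsed
        for name, tags in entries:
            owners[key(name)][converter].update(tags or ["Untagged"])
    return {a: dict(by) for a, by in owners.items() if len(by) > 1}


for alias, claimants in find_conflicts(SAMPLE.splitlines()).items():
    print(alias, dict(claimants))
```

The `key` parameter defaults to plain case folding; a looser comparison function could be passed instead to check whether loose matching introduces any further collisions.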

Each row of the table describes a conflict between two ICU encodings, named in the leftmost and rightmost columns respectively. The inner columns list the specific aliases that conflict and the provider each corresponds to.

For at least some of these, one has to wonder if the ICU data is simply incorrect. Cases that only involve a conflict with an untagged alias are illustrated in gray so that the others stand out.

Can anyone offer an explanation for these conflicts? Do these reflect defects in ICU (particularly for the cases where the untagged aliases disagree)?

ICU encoding
    Encoding alias (provider)
    Encoding alias (provider)
ICU encoding

ibm-943_P15A-2003
    cp932 (Windows)
    cp932 (Untagged)
ibm-942_P12A-1999

ibm-943_P130-1999
    ibm-943 (IBM)
    ibm-943 (Java)
    ibm-943 (Untagged)
ibm-943_P15A-2003

ibm-943_P130-1999
    Shift_JIS (Untagged)
    Shift_JIS (Windows)
    Shift_JIS (Java)
    Shift_JIS (IANA)
    Shift_JIS (MIME)
ibm-943_P15A-2003

ibm-33722_P120-1999
    ibm-33722 (IBM)
    ibm-33722 (Java)
    ibm-33722 (Untagged)
ibm-33722_P12A_P12A-2009_U2

ibm-33722_P120-1999
    ibm-5050 (IBM)
    ibm-5050 (Untagged)
ibm-33722_P12A_P12A-2009_U2

windows-950-2000
    windows-950 (Windows)
    windows-950 (Untagged)
ibm-1373_P100-2002

ibm-5471_P100-2006
    Big5-HKSCS (Untagged)
    Big5-HKSCS (Java)
    Big5-HKSCS (IANA)
ibm-1375_P100-2008

windows-936-2000
    windows-936 (Windows)
    windows-936 (Java)
    windows-936 (IANA)
    windows-936 (Untagged)
ibm-1386_P100-2001

ibm-949_P11A-1999
    ibm-949 (Untagged)
    ibm-949 (IBM)
    ibm-949 (Java)
ibm-949_P110-1999

ibm-1363_P11B-1998
    KS_C_5601-1987 (IANA)
    KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P11B-1998
    KSC_5601 (IANA)
    KSC_5601 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P11B-1998
    5601 (Untagged)
    5601 (Java)
ibm-970_P110_P110-2006_U2

ibm-1363_P110-1997
    ibm-1363 (IBM)
    ibm-1363 (Untagged)
ibm-1363_P11B-1998

windows-949-2000
    windows-949 (Windows)
    windows-949 (Java)
    windows-949 (Untagged)
ibm-1363_P11B-1998

windows-949-2000
    KS_C_5601-1987 (Windows)
    KS_C_5601-1987 (Java)
ibm-970_P110_P110-2006_U2

windows-949-2000
    KS_C_5601-1989 (Windows)
    KS_C_5601-1989 (IANA)
ibm-1363_P11B-1998

windows-949-2000
    KSC_5601 (Windows)
    KSC_5601 (MIME)
    KSC_5601 (Java)
ibm-970_P110_P110-2006_U2

windows-949-2000
    csKSC56011987 (Windows)
    csKSC56011987 (IANA)
ibm-1363_P11B-1998

windows-949-2000
    korean (Windows)
    korean (IANA)
ibm-1363_P11B-1998

windows-949-2000
    iso-ir-149 (Windows)
    iso-ir-149 (IANA)
ibm-1363_P11B-1998

ibm-874_P100-1995
    TIS-620 (Java)
    TIS-620 (IANA)
    TIS-620 (Windows)
windows-874-2000

ibm-1250_P100-1995
    windows-1250 (Untagged)
    windows-1250 (Windows)
    windows-1250 (Java)
    windows-1250 (IANA)
ibm-5346_P100-1998

ibm-1251_P100-1995
    windows-1251 (Untagged)
    windows-1251 (Windows)
    windows-1251 (Java)
    windows-1251 (IANA)
ibm-5347_P100-1998

ibm-1252_P100-2000
    windows-1252 (Untagged)
    windows-1252 (Windows)
    windows-1252 (Java)
    windows-1252 (IANA)
ibm-5348_P100-1997

ibm-1253_P100-1995
    windows-1253 (Untagged)
    windows-1253 (Windows)
    windows-1253 (Java)
    windows-1253 (IANA)
ibm-5349_P100-1998

ibm-1254_P100-1995
    windows-1254 (Untagged)
    windows-1254 (Windows)
    windows-1254 (Java)
    windows-1254 (IANA)
ibm-5350_P100-1998

ibm-5351_P100-1998
    windows-1255 (Untagged)
    windows-1255 (Windows)
    windows-1255 (Java)
    windows-1255 (IANA)
ibm-9447_P100-2002

ibm-5352_P100-1998
    windows-1256 (Untagged)
    windows-1256 (Windows)
    windows-1256 (Java)
    windows-1256 (IANA)
ibm-9448_X100-2005

ibm-5353_P100-1998
    windows-1257 (Untagged)
    windows-1257 (Windows)
    windows-1257 (Java)
    windows-1257 (IANA)
ibm-9449_P100-2002

ibm-1258_P100-1997
    windows-1258 (Untagged)
    windows-1258 (Windows)
    windows-1258 (Java)
    windows-1258 (IANA)
ibm-5354_P100-1998

Tom.