ICU encoding name alias conflicts

Tom Honermann tom at honermann.net
Mon Nov 15 18:20:14 CST 2021


I conducted an audit of all of the encoding names recognized by ICU with 
the goal of identifying any cases where comparison under the COMP_NAME 
loose matching algorithm specified in P1885 <https://wg21.link/p1885> 
would lead to a conflict in selecting an ICU converter. The good news is 
that no conflicts were identified that can be attributed to the loose 
matching algorithm. However, I found that the same alias is used for 
different encodings in multiple cases as described in the table below. 
These can be verified with ICU Converter Explorer 
<https://icu4c-demos.unicode.org/icu-bin/convexp?s=UTR22&s=IBM&s=WINDOWS&s=JAVA&s=IANA&s=MIME&s=-&s=ALL&ShowUnavailable=>.
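For readers unfamiliar with loose matching, here is a minimal Python sketch. It assumes COMP_NAME behaves like the charset alias matching rules in UTS #22: drop everything except ASCII letters and digits, fold case, and drop zeros that do not follow a digit. The function name comp_name is hypothetical, and the exact wording in P1885 and the behavior of ICU's own implementation may differ in edge cases.

```python
def comp_name(name: str) -> str:
    """Reduce an encoding name to a loose-matching key (UTS #22 style, assumed)."""
    key = []
    in_number = False  # True once a digit run has started with a digit 1-9
    for ch in name:
        if not (ch.isascii() and ch.isalnum()):
            continue  # drop everything except ASCII letters and digits
        if ch == "0" and not in_number:
            continue  # drop zeros that do not follow a digit
        in_number = ch.isdigit()
        key.append(ch.lower())  # fold case
    return "".join(key)


if __name__ == "__main__":
    for candidate in ("UTF-8", "utf8", "u.t.f-008", "utf-80", "ut8"):
        print(candidate, "->", comp_name(candidate))
```

Two names match loosely when their keys compare equal, so a conflict attributable to loose matching would be two distinct aliases of different converters that reduce to the same key.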

I did not scrape the ICU Converter Explorer page to perform the audit. 
The data I worked from was produced with ICU 70.1 by running uconv -l 
--canon and then massaging the output.

Each row of the table describes a conflict between two ICU encodings, 
which are named in the leftmost and rightmost columns respectively. The 
inner columns list the specific aliases that conflict and the provider 
each one corresponds to.
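Rows of this kind can be derived mechanically. The Python sketch below assumes the massaged data has already been reduced to (ICU converter, alias, provider) triples; that shape, the SAMPLE list, and the find_conflicts helper are illustrative assumptions, not the script actually used for the audit. The sample triples are taken from a few rows of the table.

```python
from collections import defaultdict

# (ICU converter, alias, provider) triples; a small sample taken from the table.
SAMPLE = [
    ("ibm-943_P15A-2003", "cp932", "Windows"),
    ("ibm-942_P12A-1999", "cp932", "Untagged"),
    ("ibm-943_P130-1999", "Shift_JIS", "Untagged"),
    ("ibm-943_P15A-2003", "Shift_JIS", "Windows"),
    ("ibm-943_P15A-2003", "Shift_JIS", "IANA"),
    ("ibm-874_P100-1995", "TIS-620", "IANA"),
    ("windows-874-2000", "TIS-620", "Windows"),
]


def find_conflicts(triples, key=str.lower):
    """Return aliases that map to more than one converter.

    `key` normalizes an alias before grouping; the default is a plain
    case fold, and a loose-matching function could be passed instead.
    """
    by_alias = defaultdict(lambda: defaultdict(set))
    for converter, alias, provider in triples:
        by_alias[key(alias)][converter].add(provider)
    return {
        alias: {conv: sorted(provs) for conv, provs in convs.items()}
        for alias, convs in by_alias.items()
        if len(convs) > 1  # same alias claimed by different converters
    }


if __name__ == "__main__":
    for alias, convs in find_conflicts(SAMPLE).items():
        print(alias, convs)
```

An alias listed under several providers for the same converter is not reported; only aliases that resolve to different converters are.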

For at least some of these, one has to wonder if the ICU data is simply 
incorrect. Cases that only involve a conflict with an untagged alias are 
illustrated in gray so that the others stand out.

Can anyone offer an explanation for these conflicts? Do these reflect 
defects in ICU (particularly for the cases where the untagged aliases 
disagree with the tagged ones)?

ICU encoding | Encoding alias (provider) | Encoding alias (provider) | ICU encoding

ibm-943_P15A-2003 | cp932 (Windows) | cp932 (Untagged) | ibm-942_P12A-1999
ibm-943_P130-1999 | ibm-943 (IBM); ibm-943 (Java) | ibm-943 (Untagged) | ibm-943_P15A-2003
ibm-943_P130-1999 | Shift_JIS (Untagged) | Shift_JIS (Windows); Shift_JIS (Java); Shift_JIS (IANA); Shift_JIS (MIME) | ibm-943_P15A-2003
ibm-33722_P120-1999 | ibm-33722 (IBM); ibm-33722 (Java) | ibm-33722 (Untagged) | ibm-33722_P12A_P12A-2009_U2
ibm-33722_P120-1999 | ibm-5050 (IBM) | ibm-5050 (Untagged) | ibm-33722_P12A_P12A-2009_U2
windows-950-2000 | windows-950 (Windows) | windows-950 (Untagged) | ibm-1373_P100-2002
ibm-5471_P100-2006 | Big5-HKSCS (Untagged) | Big5-HKSCS (Java); Big5-HKSCS (IANA) | ibm-1375_P100-2008
windows-936-2000 | windows-936 (Windows); windows-936 (Java); windows-936 (IANA) | windows-936 (Untagged) | ibm-1386_P100-2001
ibm-949_P11A-1999 | ibm-949 (Untagged) | ibm-949 (IBM); ibm-949 (Java) | ibm-949_P110-1999
ibm-1363_P11B-1998 | KS_C_5601-1987 (IANA) | KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998 | KSC_5601 (IANA) | KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P11B-1998 | 5601 (Untagged) | 5601 (Java) | ibm-970_P110_P110-2006_U2
ibm-1363_P110-1997 | ibm-1363 (IBM) | ibm-1363 (Untagged) | ibm-1363_P11B-1998
windows-949-2000 | windows-949 (Windows); windows-949 (Java) | windows-949 (Untagged) | ibm-1363_P11B-1998
windows-949-2000 | KS_C_5601-1987 (Windows) | KS_C_5601-1987 (Java) | ibm-970_P110_P110-2006_U2
windows-949-2000 | KS_C_5601-1989 (Windows) | KS_C_5601-1989 (IANA) | ibm-1363_P11B-1998
windows-949-2000 | KSC_5601 (Windows); KSC_5601 (MIME) | KSC_5601 (Java) | ibm-970_P110_P110-2006_U2
windows-949-2000 | csKSC56011987 (Windows) | csKSC56011987 (IANA) | ibm-1363_P11B-1998
windows-949-2000 | korean (Windows) | korean (IANA) | ibm-1363_P11B-1998
windows-949-2000 | iso-ir-149 (Windows) | iso-ir-149 (IANA) | ibm-1363_P11B-1998
ibm-874_P100-1995 | TIS-620 (Java); TIS-620 (IANA) | TIS-620 (Windows) | windows-874-2000
ibm-1250_P100-1995 | windows-1250 (Untagged) | windows-1250 (Windows); windows-1250 (Java); windows-1250 (IANA) | ibm-5346_P100-1998
ibm-1251_P100-1995 | windows-1251 (Untagged) | windows-1251 (Windows); windows-1251 (Java); windows-1251 (IANA) | ibm-5347_P100-1998
ibm-1252_P100-2000 | windows-1252 (Untagged) | windows-1252 (Windows); windows-1252 (Java); windows-1252 (IANA) | ibm-5348_P100-1997
ibm-1253_P100-1995 | windows-1253 (Untagged) | windows-1253 (Windows); windows-1253 (Java); windows-1253 (IANA) | ibm-5349_P100-1998
ibm-1254_P100-1995 | windows-1254 (Untagged) | windows-1254 (Windows); windows-1254 (Java); windows-1254 (IANA) | ibm-5350_P100-1998
ibm-5351_P100-1998 | windows-1255 (Untagged) | windows-1255 (Windows); windows-1255 (Java); windows-1255 (IANA) | ibm-9447_P100-2002
ibm-5352_P100-1998 | windows-1256 (Untagged) | windows-1256 (Windows); windows-1256 (Java); windows-1256 (IANA) | ibm-9448_X100-2005
ibm-5353_P100-1998 | windows-1257 (Untagged) | windows-1257 (Windows); windows-1257 (Java); windows-1257 (IANA) | ibm-9449_P100-2002
ibm-1258_P100-1997 | windows-1258 (Untagged) | windows-1258 (Windows); windows-1258 (Java); windows-1258 (IANA) | ibm-5354_P100-1998

Tom.
