RE: Unicode is universal, so how come that universality doesn’t apply to digits?
doug at ewellic.org
Wed Dec 16 09:40:15 CST 2020
What I don't understand here is why this is being framed implicitly as a Unicode problem, or an XML problem, or a general law of nature ("why can’t a Bengali-speaking person use the Bengali digits"), instead of as an inherent limitation of that particular library function in that particular programming language.
One could easily extend strtol() to accept a string of characters with a General_Category of "Nd", and use the Numeric_Value property of each character to get its numeric value instead of subtracting 48 (ASCII '0').
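A minimal sketch of that idea, in C: the function and table names here (digit_value, parse_digits, nd_ranges) are hypothetical, and a real implementation would consult the full Unicode Character Database for General_Category and Numeric_Value rather than the two hard-coded ranges shown. The sketch also takes an array of code points rather than a UTF-8 string, to keep the decoding question out of the way.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, hard-coded stand-in for a General_Category == "Nd"
 * lookup. A real implementation would use UCD data (or a library
 * that embeds it) covering every decimal-digit block. */
typedef struct { uint32_t first; uint32_t last; } digit_range;

static const digit_range nd_ranges[] = {
    { 0x0030, 0x0039 },  /* ASCII digits 0-9   */
    { 0x09E6, 0x09EF },  /* Bengali digits ০-৯ */
};

/* Numeric_Value of a decimal digit, or -1 if cp is not in the table.
 * Within each Nd block, the value is cp minus the block's first
 * code point -- the generalization of subtracting ASCII '0'. */
static int digit_value(uint32_t cp) {
    for (size_t i = 0; i < sizeof nd_ranges / sizeof nd_ranges[0]; i++)
        if (cp >= nd_ranges[i].first && cp <= nd_ranges[i].last)
            return (int)(cp - nd_ranges[i].first);
    return -1;
}

/* strtol-like accumulation over n code points, stopping at the
 * first non-digit. */
long parse_digits(const uint32_t *cps, size_t n) {
    long value = 0;
    for (size_t i = 0; i < n; i++) {
        int d = digit_value(cps[i]);
        if (d < 0)
            break;
        value = value * 10 + d;
    }
    return value;
}
```

With this, the Bengali string ১২৩ (U+09E7 U+09E8 U+09E9) parses to 123 exactly as "123" does.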
Of course, in order to do that, the Unicode properties General_Category and Numeric_Value must be available to the conversion function. The C language and its standard libraries are optimized for speed and size, and are still chosen to this day when speed and size are at a premium. Operating only on ASCII '0' through '9' and subtracting ASCII '0' to get the numeric value is much faster and lighter-weight than table lookup. ICU probably provides a method to do this in C.
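For contrast, the ASCII-only fast path described above is essentially the following loop (ascii_to_long is an illustrative name, not a standard function):

```c
/* The classic lightweight digit loop: a range check and a
 * subtraction per character, valid only for ASCII '0'..'9'. */
long ascii_to_long(const char *s) {
    long value = 0;
    while (*s >= '0' && *s <= '9')
        value = value * 10 + (*s++ - '0');
    return value;
}
```

No table lookups, no property data: this is why it wins on speed and size, and also exactly why it stops cold at the first Bengali digit.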
A good follow-up question for me is why the heavier-weight C# and .NET Framework (Core, Standard) also don't support non-ASCII digits in the Convert.ToInt32() method, even when the string of digits is all from the same script (unlike your mixed Bengali/Oriya example), and even when the appropriate locale is specified as a parameter. C# compiles to intermediate language that is JIT-compiled by the .NET runtime, and has huge libraries available to it, including all of the Unicode character properties, so the "speed and size" constraints don't apply as much.
But this is still a characteristic of the code libraries, not a Unicode problem.
Doug Ewell, CC, ALB | Thornton, CO, US | ewellic.org