Re: Question about “Uppercase” in DerivedCoreProperties.txt

Philippe Verdy verdy_p at wanadoo.fr
Sun Nov 9 00:19:24 CST 2014


glibc is no more broken than any other C library implementing toupper and
tolower from the legacy "ctype" standard library. These are old APIs that
are still widely used and still have valid contexts where they are simple and
safe to use. But they are not meant to convert text.

The i18n data just shows the mappings used for tolower, toupper (and
totitle), but it is clearly not enough to implement strtolower and strtoupper,
which require more rules (notably 1-to-2 or 2-to-1 mappings, plus support
for normalisation/composition/decomposition and recognizing canonical
equivalents, in all possible reorderings, and more data for contextual
rules such as the final form of sigma). Such data may not be easily
expressible in some cases in such a tabular format, and could be
implemented by locale-specific code, for example to handle some dictionary
lookups (as required with some Asian scripts for word breaking, and
implicitly needed for the Korean script, whose normalisation is not handled
by table lookups but algorithmically, by code only within the normalizer).
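
To make the difference concrete, here is a minimal Python sketch contrasting
a per-character 1:1 mapping with full-string case conversion. It says nothing
about how glibc is implemented: simple_upper, simple_lower, per_char_upper and
per_char_lower are hypothetical helpers that only approximate a
toupper/tolower-style mapping by rejecting any result that is not a single
code point. str.upper, str.lower and unicodedata.normalize are the standard
Python functions.

```python
import unicodedata

def simple_upper(ch):
    # Hypothetical 1:1 mapping: leave ch unchanged when its full
    # uppercase form is not exactly one code point.
    up = ch.upper()
    return up if len(up) == 1 else ch

def simple_lower(ch):
    # Same idea for lowercase; each character is mapped without context.
    low = ch.lower()
    return low if len(low) == 1 else ch

def per_char_upper(text):
    return "".join(simple_upper(c) for c in text)

def per_char_lower(text):
    return "".join(simple_lower(c) for c in text)

# 1-to-2 mapping: sharp s has no single-code-point uppercase here.
print(per_char_upper("straße"), "straße".upper())

# Contextual rule: a capital sigma at the end of a word needs the
# final form, which a context-free per-character mapping cannot choose.
print(per_char_lower("ΟΔΟΣ"), "ΟΔΟΣ".lower())

# Canonical equivalents: the precomposed and decomposed spellings of
# e-acute differ as code point sequences after uppercasing, but compare
# equal once both results are normalised to NFC.
composed, decomposed = "\u00e9", "e\u0301"
same = (unicodedata.normalize("NFC", composed.upper())
        == unicodedata.normalize("NFC", decomposed.upper()))
print(same)
```

The per-character helpers never look at neighbouring characters and never
change the string length, which is exactly why they cannot get these cases
right.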

I don't see anything wrong with the existing glibc "i18n" data. Glibc would be
wrong, however, if it *only* used tolower/toupper to implement
strtolower/strtoupper (but this is what was still done in the past, since
the creation of the "standard" C library on Unix and even later on DOS,
MacOS, Windows and most other systems... before the creation of Unicode and
its development to support more languages, scripts, and orthographic
systems.)

Modern i18n libraries (for various programming languages) contain more
advanced APIs for correct case mappings on full strings (including
M-to-N mappings, contextual rules and support of canonical equivalences),
and these APIs no longer assume that the output string will be the same
length as the input or that only 1:1 mappings will be performed on each
character (even if this is still what is done when using the "C" root
locale, which works only for a few languages and only with simple texts using
restricted alphabets without all the possible Unicode extensions, needed
now to support not only the native language but also many proper names and
"foreign" toponyms, texts containing small citations in another
language, or any multilingual document).
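
The point about output length can be checked with a short Python sketch.
upper_growth is a hypothetical helper name; it relies only on the built-in
str.upper, which applies full (not 1:1) case mappings, and counts lengths in
code points rather than bytes.

```python
def upper_growth(text):
    # Return (input length, output length) in code points for a
    # full-string uppercase conversion.
    return len(text), len(text.upper())

# Plain ASCII keeps its length; the other inputs grow.
for sample in ("abc", "straße", "\ufb01", "\u0390"):
    print(repr(sample), upper_growth(sample))
```

A caller that allocates the output buffer from the input length, as code
written around toupper usually does, would truncate or overflow on the last
three inputs.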

2014-11-09 1:45 GMT+01:00 Christopher Vance <cjsvance at gmail.com>:

> So glibc is broken. This doesn't make it a Unicode problem.
>
>