Re: Question about “Uppercase” in DerivedCoreProperties.txt

Philippe Verdy verdy_p at wanadoo.fr
Sat Nov 8 17:50:07 CST 2014


Do not try to get consistent results with only a character-to-character
mapping. It does not work for all letters, because sometimes you need 1->2
or 2->1 mappings (not all composable characters exist in precombined forms,
and sometimes the combination must be split into its canonical decomposed
equivalent before mapping the base character) or other mappings.
toupper() and tolower() should not be used for anything other than
mapping number-like sequences (e.g. to convert hexadecimal numbers).
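
To make the 1->2 and decomposed cases concrete, here is a minimal sketch in Python (chosen only for illustration; the thread itself is about C). The helper per_char_upper() is hypothetical: it approximates a toupper()-style table by discarding any mapping whose result is longer than one code point. The built-in str.upper() and str.lower() apply the full Unicode case mappings.

```python
def per_char_upper(text):
    """Approximate a toupper()-style loop: one code point in, one out."""
    out = []
    for ch in text:
        mapped = ch.upper()
        # A 1:1 table cannot hold a multi-character result, so the
        # character is left unchanged in that case (an assumption made
        # here to mimic a simple mapping table).
        out.append(mapped if len(mapped) == 1 else ch)
    return "".join(out)


word = "straße"
print(per_char_upper(word))  # ß has no single-character uppercase here
print(word.upper())          # full mapping: ß -> SS, so the length grows

# U+01F0 (j with caron) has no precomposed capital: the result is
# "J" followed by the combining caron U+030C.
print([hex(ord(c)) for c in "\u01f0".upper()])

# Context matters too: a word-final capital sigma lowercases to the
# final form U+03C2, not to U+03C3.
print([hex(ord(c)) for c in "ΟΔΟΣ".lower()])
```

The per-character loop cannot produce "STRASSE" at all, because its output always has the same number of code points as its input.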

Use strupper() and strlower() (or equivalent functions that do not allocate
memory but write to a given buffer or stream, and similar functions in
languages other than C/C++) to perform mappings on full strings, so that the
string length can safely change.
- this is needed, for example, to convert city names or people's names to
capitals in a postal address, or to style a book title or chapter heading.
- it is also needed to perform case-insensitive searches (using "case
folding", which is different from converting to lowercase or to uppercase)
to match input, or to implement an input-completion UI that locates possible
matches within a known dictionary or input history.
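
As a rough illustration of the difference between case folding and lowercasing, the sketch below uses Python's built-in str.casefold(). The complete() helper and the sample history list are hypothetical, and a real implementation would probably also normalize the strings (e.g. to NFC or NFD) before comparing them.

```python
def complete(prefix, candidates):
    """Return the candidates that start with prefix, ignoring case."""
    key = prefix.casefold()
    return [c for c in candidates if c.casefold().startswith(key)]


# Hypothetical input history, used only for this example.
history = ["Straße 12", "STRASSE 14", "Strand"]

# Both spellings of the street name match the typed prefix.
print(complete("strass", history))

# Lowercasing alone does not make the two spellings equal...
print("Straße".lower() == "STRASSE".lower())
# ...but case folding does, because ß folds to "ss".
print("Straße".casefold() == "STRASSE".casefold())
```

Folding both the key and the candidates, rather than lowercasing them, is what lets the search work on strings whose length changes under case conversion.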


2014-11-08 10:22 GMT+01:00 Mike FABIAN <mfabian at redhat.com>:

> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > note that tolower() and toupper() can only work on a 1-character level; they
> > are not recommended for changing the case of plain text.
> >
> > For correct handling of locales, tolower and toupper should be replaced by
> > strtolower and strtoupper (or their aliases), which will be able to process
> > character clusters and contextual casing rules needed for a language or
> > orthographic style
>
> Yes, thank you for explaining this.
>
> But these details of upper and lower casing cannot be expressed in the
> “i18n” file of glibc:
>
> https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/locales/i18n
>
> For toupper and tolower, this file just has character -> character
> mapping tables, for example the “tolower” table contains only
>
>     (<U03A3>,<U03C3>)
>
> (i.e. mapping Σ U+03A3 -> σ U+03C3, never to the final sigma ς
> U+03C2).
>
> More correct, detailed information about upper and lower case must come
> from elsewhere, not from this “i18n” file in glibc.  Using only the
> information from this “i18n” file, not even the Greek sigma can be
> handled correctly.
>
> Pravin and I want to update this “i18n” file to the latest
> data from Unicode 7.0.0, doing it as correctly as possible within
> the limitations imposed by this file and the ISO C standard.
>
> --
> Mike FABIAN <mfabian at redhat.com>
> ☏ Office: +49-69-365051027, internal 8875027
> 睡眠不足はいい仕事の敵だ。
>