Question about “Uppercase” in DerivedCoreProperties.txt

Mike FABIAN mfabian at redhat.com
Fri Nov 7 05:32:05 CST 2014


Philippe Verdy <verdy_p at wanadoo.fr> wrote:

> this is a "feature" of the Greek alphabet that the lowercase iota subscript
> can be capitalized in two different ways : either as a subscript below the
> uppercase main letter, or as a standard iota capitalized. The subscript
> form is a combining character, but not the non-subscript form.

Now I understand why these are titlecase letters, as Laurentiu
explained:

Laurentiu> All of the characters you enumerated are titlecase letters
Laurentiu> (gc=Lt) rather than uppercase letters (gc=Lu),

U+1F80 ᾀ is roughly equivalent to ἀι and can be capitalized either as ἈΙ
or as ᾈ. Since ᾈ is roughly equivalent to Ἀι, I now understand why ᾈ is
classified as titlecase (gc=Lt).
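
The general categories of these characters can be checked with Python's
standard unicodedata module. This is only an illustrative sketch; the
helper name general_category is made up here, and the results depend on
the Unicode version bundled with the interpreter.

```python
import unicodedata

# Code points discussed in this thread.
samples = {
    0x1F80: "lowercase alpha with psili and iota subscript",
    0x1F88: "its titlecase counterpart",
    0x01C5: "Latin digraph Dž, titlecase form",
}

def general_category(cp):
    """Return the two-letter General_Category of a code point."""
    return unicodedata.category(chr(cp))

for cp, note in samples.items():
    print(f"U+{cp:04X} gc={general_category(cp)}  ({note})")
```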

Thank you very much, Philippe and Laurentiu, for explaining!

I stumbled on this question because I am trying to update the character
class data for glibc for Unicode 7.0.0.

glibc has character classes “upper” and “lower” but not “title”.

Bruno Haible’s program to generate the character class data from
UnicodeData.txt tries to enforce that every character which has
a “toupper” mapping *must* be in either “upper” or “lower”.

https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c;h=0c001b299d4601a375a1e814fd2ab06b0536b337;hb=HEAD#l660
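
The rule can be phrased as a consistency check over the generated
tables. The sketch below is only an illustration with a hypothetical
three-entry table; the function name and the data layout are invented
here and are not taken from gen-unicode-ctype.c.

```python
def unclassified_with_toupper(to_upper, upper, lower):
    """Return code points that have a toupper mapping but belong to
    neither the 'upper' nor the 'lower' class."""
    return sorted(
        cp for cp, mapped in to_upper.items()
        if mapped != cp and cp not in upper and cp not in lower
    )

# Hypothetical data: U+01C5 (titlecase Dž) maps to U+01C4 but has not
# been assigned to either class yet.
to_upper = {0x01C4: 0x01C4, 0x01C5: 0x01C4, 0x01C6: 0x01C4}
violations = unclassified_with_toupper(
    to_upper, upper={0x01C4}, lower={0x01C6}
)
print([f"U+{cp:04X}" for cp in violations])
```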

I think Bruno’s program does this because

ISO C 99 (ISO/IEC 9899 - Programming languages - C)
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf

contains:

> 7.4.2.2 The toupper function
> 
> [...]
> 
> If the argument is a character for which islower is true and there are
> one or more corresponding characters, as specified by the current
> locale, for which isupper is true, the toupper function returns one of
> the corresponding characters (always the same one for any given locale);
> otherwise, the argument is returned unchanged.

which seems to require that toupper should only do something for
characters where islower is true.
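
Read literally, the quoted paragraph makes islower a precondition for
toupper. Below is a minimal Python model of that reading, with a
hypothetical hand-written table standing in for the locale data. The
names SIMPLE_UPPER and c99_toupper are invented for illustration and are
not glibc APIs.

```python
# Hypothetical miniature "locale": the three forms of the DŽ digraph.
# U+01C4 DŽ (uppercase), U+01C5 Dž (titlecase), U+01C6 dž (lowercase).
SIMPLE_UPPER = {0x01C5: 0x01C4, 0x01C6: 0x01C4}

def c99_toupper(cp, lower, upper):
    """Model of C99 7.4.2.2: map only if cp is in 'lower' and the
    mapped character is in 'upper'; otherwise return cp unchanged."""
    target = SIMPLE_UPPER.get(cp, cp)
    if cp in lower and target in upper:
        return target
    return cp

# Titlecase Dž left out of "lower": its mapping is suppressed.
strict = c99_toupper(0x01C5, lower={0x01C6}, upper={0x01C4})
# Titlecase Dž also listed in "lower": the mapping takes effect.
relaxed = c99_toupper(0x01C5, lower={0x01C5, 0x01C6},
                      upper={0x01C4, 0x01C5})
print(hex(strict), hex(relaxed))
```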

Therefore, Bruno’s program puts titlecase characters like U+1F88 ᾈ
or U+01C5 Dž into *both* “upper” and “lower”, which does not seem
unreasonable given the limitations of C99.
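
A rough sketch of the effect of this workaround follows. It takes a
shortcut by keying on the General_Category reported by Python's
unicodedata module, which is an assumption made for brevity; it is not
how the actual generator, which reads UnicodeData.txt, is written. The
function name ctype_classes is hypothetical.

```python
import unicodedata

def ctype_classes(cp):
    """Return the C-style case classes for a code point, placing
    titlecase letters (gc=Lt) in both 'upper' and 'lower'."""
    gc = unicodedata.category(chr(cp))
    classes = set()
    if gc in ("Lu", "Lt"):
        classes.add("upper")
    if gc in ("Ll", "Lt"):
        classes.add("lower")
    return classes

for cp in (0x01C4, 0x01C5, 0x01C6, 0x1F80, 0x1F88):
    print(f"U+{cp:04X}", sorted(ctype_classes(cp)))
```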

So it looks like we have to keep using this approach because ISO C 99
requires it; we cannot use the “Uppercase” property from
DerivedCoreProperties.txt for this.

But the “Alphabetic” property from DerivedCoreProperties.txt can
probably be used to generate the “alpha” character class for glibc.
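
As a sketch of how such a class could be extracted, the following
parses text in the "range ; property # comment" layout used by
DerivedCoreProperties.txt. The function name parse_property and the
embedded sample are assumptions for illustration only; a real generator
would read the complete file for the targeted Unicode version.

```python
def parse_property(text, wanted):
    """Collect the code points that carry property `wanted`."""
    members = set()
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != wanted:
            continue
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        members.update(range(start, end + 1))
    return members

SAMPLE = """\
# Sample lines in the file's layout, for illustration only.
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
00AA          ; Alphabetic # Lo       FEMININE ORDINAL INDICATOR
0041..005A    ; Uppercase # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

alpha = parse_property(SAMPLE, "Alphabetic")
print(len(alpha), "code points in the sample 'alpha' class")
```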

I hope this is correct.

-- 
Mike FABIAN <mfabian at redhat.com>
☏ Office: +49-69-365051027, internal 8875027
Lack of sleep is the enemy of good work.

