Re: Question about “Uppercase” in DerivedCoreProperties.txt

Philippe Verdy verdy_p at wanadoo.fr
Fri Nov 7 07:57:37 CST 2014


Note that tolower() and toupper() work only at the level of a single
character, so they are not recommended for changing the case of plain
text. Their use should be limited to cases where letters can be safely
isolated from their context, for example when letters are handled as
numbers (e.g. section numbering).

For correct handling of locales, tolower and toupper should be replaced by
strtolower and strtoupper (or their aliases), which are able to process
character clusters and the contextual casing rules needed for a language or
orthographic style (such as monotonic and polytonic Greek, or specific
locales intended for medieval texts or old classic scriptures).
strtoupper and strtolower can then perform MORE mappings than tolower and
toupper, which use only simple mappings. With simple mappings, precombined
Greek letters with iota subscripts can only be converted by preserving the
iota subscript (for which islower() and isupper() are BOTH false when it is
encoded separately and not precombined).
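
As a rough sketch of the difference between a one-to-one mapping and a
full mapping, here is what Python 3 does with U+1F80. Python is only an
assumption chosen for illustration (its str methods apply the full Unicode
case mappings); it is not the C library API discussed above, and the
variable names are made up for this example.

```python
import unicodedata

# U+1F80 GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
small = "\u1f80"

# Full uppercase mapping: the iota subscript becomes a separate capital
# iota, so one code point maps to two.
full_upper = small.upper()

# Titlecase mapping: a single precombined code point that keeps the
# subscript below the capital letter.
title = small.title()

for label, text in (("small", small), ("upper", full_upper), ("title", title)):
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    cats = " ".join(unicodedata.category(ch) for ch in text)
    print(f"{label}: {text} [{points}] gc={cats}")
```

A function that must return exactly one character for one character in
cannot produce the two-code-point result, which is why only the
subscript-preserving style is reachable that way.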

When a Greek letter precombined with an iota subscript is found, the letter
case of this iota subscript should be ignored and only the letter case of
the base letter considered. This means that tolower() and toupper() can
map only one orthographic style: the style that preserves the subscript,
but not the classic Greek or modern monotonic style, which doesn't "know"
anything about this "medieval" extension of the Greek alphabet, which was
still in use at the beginning of the 1970s. Handling polytonic Greek with
tolower() and toupper(), or with islower() and isupper(), will therefore
not produce the correct result. For modern Greek, there's no use of this
iota subscript, so we are in the same situation as classic Greek (before
the Christian era), except that modern Greek still uses a few accents
(notably the "tonos", equivalent in Unicode to the acute accent, even if
its preferred placement with Greek capitals is before the letter rather
than above it, as its assigned combining class might suggest).

2014-11-07 12:32 GMT+01:00 Mike FABIAN <mfabian at redhat.com>:

> Philippe Verdy <verdy_p at wanadoo.fr> wrote:
>
> > this is a "feature" of the Greek alphabet that the lowercase iota
> > subscript can be capitalized in two different ways: either as a
> > subscript below the uppercase main letter, or as a standard iota
> > capitalized. The subscript form is a combining character, but not the
> > non-subscript form.
>
> Laurentiu> All of the characters you enumerated are titlecase letters
> Laurentiu> (gc=Lt) rather than uppercase letters (gc=Lu),
>
> U+1F80 ᾀ is something like ἀι and could be capitalized as ἈΙ or as ᾈ.
> ᾈ is something like Ἀι so I understand now that ᾈ can be considered as
> titlecase (gc=Lt).
>

Note that for modern Greek there's still a difficulty with the special
final form of lowercase sigma: it is effectively lowercase (islower should
return true), not titlecase, and toupper will map it to a standard capital
Sigma. But the reverse conversion can only convert the uppercase sigma to
a standard lowercase sigma, ignoring the final form. To handle the final
form correctly, don't use tolower() character by character; use
strtolower() with a decent library that supports contextual rules.

The same is true for the German ess-tsett, which is capitalized as a
double S, a mapping that is not reversible. An "uppercase" variant of
ess-tsett was recently added to Unicode, but it is still extremely rarely
used. It is extremely difficult to determine how to convert a double
capital S, and most libraries will only convert it to a double lowercase
s. Some locales deliberately decide not to alter the lowercase ess-tsett
with toupper or strtoupper; this is still correct if those libraries have
not been updated to use the capital ess-tsett now supported in more recent
versions of Unicode, but not found in any legacy encoding.
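
The sigma and ess-tsett cases can be sketched as follows, again assuming
Python 3 purely for illustration. lower_per_char is a hypothetical helper
that imitates a character-by-character tolower() loop; the whole-string
str.lower() call stands in for a context-aware strtolower().

```python
def lower_per_char(text: str) -> str:
    """Lowercase each character in isolation, as a tolower() loop would."""
    return "".join(ch.lower() for ch in text)


# Greek capitals OMICRON DELTA OMICRON SIGMA ("ΟΔΟΣ")
word = "\u039f\u0394\u039f\u03a3"

contextual = word.lower()        # the whole-string call sees the context
isolated = lower_per_char(word)  # each sigma is mapped without context

print(contextual, isolated)

# German ess-tsett: uppercasing to a double S cannot be reversed.
street = "Stra\u00dfe"
round_trip = street.upper().lower()
print(street.upper(), round_trip)

# The capital ess-tsett (U+1E9E) lowercases to U+00DF, but the default
# uppercase mapping of U+00DF is still the double S.
capital_sharp_s = "\u1e9e"
print(capital_sharp_s.lower(), "\u00df".upper())
```

Only the whole-string call yields the final sigma at the end of the word,
and the round trip through uppercase returns "strasse" rather than the
original spelling.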

We still have a difficulty with the ampersand "&" because it has been
encoded only as a symbol, on the assumption that in the most used locales
it is just used in isolation as an abbreviated form of a word. But in some
locales it was still considered a letter and used everywhere "et" could be
used, including in abbreviations like "etc." == "&c.", or in the middle of
words like "caret" == "car&" or "comm&tre" == "commettre". But the modern
use of the ampersand implies there's a word break before and after the
symbol, and we should have a separate encoding for "&" as a lowercase
ligature. We should even have an uppercase variant, like the German
ess-tsett, as there are glyphic variants of the ligature for uppercased
titles where the modern "&" ampersand does not fit very well, or where it
should be mapped to a non-ligatured "ET" letter pair. That mapping is
distinct from the mapping (with spaces around) to " ET " in French or to
" AND " in English, which is implied by the modern meaning of the current
symbol as a separate word by itself. With a distinct encoding of the
ligature, the common abbreviation "etc." ligatured as "&c." would
correctly map to uppercase "&C." with the uppercase ligature, or to "ETC."
without adding any space.

Note that "&" was even considered in some classic European alphabets as an
extra letter (with letter forms exhibiting more evidently its origin from
a ligatured "et"/"ET"), just like the German ess-tsett "ß", or the French
"œ"/"Œ" (distinguished semantically from the "oe"/"OE" letter pairs, which
allow a syllable break in the middle and allow titlecasing as "Oe": in
French the titlecased common term "Oeuf" is semantically and graphically
incorrect, it should be "Œuf" where "Œ" is fully uppercase in the ligature
and not mixed-case), or the Latin "æ"/"Æ" ligature (also used in other
classic European languages) or the Dutch ligature "ij"/"IJ".
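
To make the contrast concrete, here is a small sketch (Python 3 assumed for
illustration only; the variable names are invented) comparing how the
currently encoded ampersand and the encoded ligatures behave under case
mapping.

```python
import unicodedata

# General category and uppercase mapping of the ampersand and of the
# ligature letters: U+00DF, U+0153, U+00E6, U+0133.
for ch in "&\u00df\u0153\u00e6\u0133":
    print(f"U+{ord(ch):04X} gc={unicodedata.category(ch)} upper={ch.upper()!r}")

# Titlecasing the ligature keeps it whole; titlecasing the letter pair
# produces the mixed-case "Oe".
ligature_title = "\u0153uf".title()
pair_title = "oeuf".title()

# The ampersand as encoded today has no case, so only the "c" changes.
abbrev_upper = "&c.".upper()

print(ligature_title, pair_title, abbrev_upper)
```

The ampersand comes out with a punctuation category and no case mapping,
while the ligatures are lowercase letters with uppercase counterparts.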