German sharp S uppercase mapping
Daniel Buncic
daniel.buncic at uni-koeln.de
Tue Dec 3 02:01:54 CST 2024
On 03.12.2024 at 02:51, Asmus Freytag via Unicode wrote:
> Rather than getting hung up on details of parsing one particular
> part of one sentence, it would be more useful from Unicode's
> perspective if someone (Daniel?) could sum up in a short document
> based on this discussion where Unicode is behind the curve and to
> make sure the support in CLDR is up to actual current practice and
> not what it was 10 or 15 years ago.
Thank you very much for the idea. I could certainly sum up the
arguments of the discussion (though I’m too busy to do it right now, you
would have to have a few weeks’ patience), but I still haven’t
understood where in the CLDR such casing information is stored. There
are data subsets that have “casing” in the title, but they only say
whether the days of the week, month names, language names, etc. are
capitalized in a certain language. There is a field called “main
exemplars” that contains all the small letters (for German, including ß)
and another field called “index exemplars”, which for German does not
even include Ä, Ö, and Ü. I surmise that this is only meant for
numbering items using letters (where indeed you can have parts A, B, C,
etc. of a book, but you would never have a “part Ä”). I cannot find any
information saying something like a ↔ A, b ↔ B, etc.
For Turkish (https://www.unicode.org/cldr/charts/46/summary/tr.html),
the “main letters” in the very first line are given as
[a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z].
So there i and its capital counterpart İ are not separated by a space.
But for German (https://www.unicode.org/cldr/charts/46/summary/de.html),
the “main letters” are
[aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z],
where the missing space does not imply capitalization, so I guess
changing this list to “… s ßẞ t …” would not automatically inform people
that ß should be capitalized as ẞ.
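To illustrate the point, here is a minimal Python sketch that takes the two “main letters” strings exactly as quoted above and lists the groups written without a space. The helper names are hypothetical, and the code only inspects these two strings; it makes no claim about what the grouping means in CLDR.

```python
import unicodedata

# Exemplar strings copied from this message (CLDR 46 summary charts).
TURKISH_MAIN = "[a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z]"
GERMAN_MAIN = "[aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z]"


def multi_letter_groups(exemplars: str) -> list[str]:
    """Return the space-separated groups that contain more than one letter."""
    # Normalize to NFC so that each accented letter is a single code point.
    normalized = unicodedata.normalize("NFC", exemplars)
    groups = normalized.strip("[]").split()
    return [g for g in groups if len(g) > 1]


def mixes_case(group: str) -> bool:
    """True if the group contains both a lowercase and an uppercase letter."""
    has_lower = any(ch.islower() for ch in group)
    has_upper = any(ch.isupper() for ch in group)
    return has_lower and has_upper


for name, exemplars in (("tr", TURKISH_MAIN), ("de", GERMAN_MAIN)):
    for group in multi_letter_groups(exemplars):
        label = "mixes case" if mixes_case(group) else "same case"
        print(name, group, label)
```

The Turkish group iİ pairs a small with a capital letter, while the German groups aä, oö, and uü contain only small letters, so a missing space cannot by itself be read as a case mapping.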
In
https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf
on page 198 I find:
“Examples of case tailorings which are not covered by data in
SpecialCasing.txt include: […] Uppercasing of U+00DF ‘ß’ LATIN SMALL
LETTER SHARP S to U+1E9E LATIN CAPITAL LETTER SHARP S[.] The preferred
mechanism for defining tailored casing operations is the Unicode Common
Locale Data Repository (CLDR), https://cldr.unicode.org, where
tailorings such as these can be specified on a per-language basis, as
needed.” So the idea is already there. On page 295 the problem with ß
is addressed in detail, and right underneath it says, “Additional
language-specific or orthography-specific contexts and casing behavior
is specified in the Unicode Common Locale Data Repository (CLDR),
https://cldr.unicode.org.” So does this already exist? Or where does
it have to be added?
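For comparison, a minimal Python sketch of what such a tailoring would do. Python's built-in str.upper() applies the default, untailored full case mapping, under which ß becomes SS. The function upper_de_capital_sharp_s is a hypothetical helper written only for this illustration; it is not a CLDR or Python API.

```python
SMALL_SHARP_S = "\u00DF"    # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S


def upper_de_capital_sharp_s(text: str) -> str:
    """Uppercase text, mapping ß to ẞ instead of the default SS.

    Hypothetical helper for illustration only.
    """
    # Replace ß first; ẞ is already uppercase, so upper() leaves it alone.
    return text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S).upper()


print("Straße".upper())                    # default mapping: STRASSE
print(upper_de_capital_sharp_s("Straße"))  # tailored mapping: STRAẞE
print(CAPITAL_SHARP_S.lower())             # default lowercase of ẞ: ß
```

The lowercase direction ẞ → ß is already part of the default mappings; only the uppercase direction ß → ẞ needs a tailoring.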
Can anybody help?
Best wishes,
Daniel
--
Prof. Dr. Daniel Bunčić
===============================================================
Slavisches Institut der Universität zu Köln
Weyertal 137, D-50931 Köln
Telefon: +49 (0)221 470-90535
Sprechstunden: https://uni.koeln/ENZEB
E-Mail: daniel.buncic at uni-koeln.de = daniel at buncic.de
Threema: https://threema.id/8M375R5K
===============================================================
Homepage: http://daniel.buncic.de/
Academia: http://uni-koeln.academia.edu/buncic
ResearchGate: https://researchgate.net/profile/Daniel-Buncic-2
===============================================================