German sharp S uppercase mapping

Tue Dec 3 02:01:54 CST 2024

On 03.12.2024 at 02:51, Asmus Freytag via Unicode wrote:
> Rather than getting hung up on details of parsing one particular
> part of one sentence, it would be more useful from Unicode's
> perspective if someone (Daniel?) could sum up in a short document
> based on this discussion where Unicode is behind the curve and to
> make sure the support in CLDR is up to actual current practice and
> not what it was 10 or 15 years ago.

Thank you very much for the idea.  I could certainly sum up the 
arguments of the discussion (though I’m too busy to do it right now, you 
would have to have a few weeks’ patience), but I still haven’t 
understood where in the CLDR such casing information is stored.  There 
are data subsets that have “casing” in the title, but they only say 
whether the days of the week, month names, language names, etc. are 
capitalized in a certain language.  There is a field called “main 
exemplars” that contains all the small letters (for German, including ß) 
and another field called “index exemplars”, which for German does not 
even include Ä, Ö, and Ü.  I surmise that this is only meant for 
numbering items using letters (where indeed you can have parts A, B, C, 
etc. of a book, but you would never have a “part Ä”).  I cannot find any 
information saying something like a ↔ A, b ↔ B, etc.
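
For comparison, the default (untailored) case mappings can be inspected in Python, whose str.upper(), str.lower() and str.casefold() apply the Unicode default mappings and do not consult CLDR locale data. This is only an illustrative sketch; the helper name default_case_info is hypothetical.

```python
def default_case_info(ch: str) -> dict:
    """Return the default Unicode case mappings of a character as Python exposes them."""
    return {
        "upper": ch.upper(),    # full uppercase mapping (may expand to several characters)
        "lower": ch.lower(),    # full lowercase mapping
        "fold": ch.casefold(),  # case folding, used for caseless comparison
    }

sharp_s = default_case_info("\u00df")          # ß, LATIN SMALL LETTER SHARP S
capital_sharp_s = default_case_info("\u1e9e")  # ẞ, LATIN CAPITAL LETTER SHARP S

print(sharp_s)          # upper is "SS", not ẞ
print(capital_sharp_s)  # lower is ß, so the default mapping is one-directional
```

So by default ẞ lowercases to ß, but ß uppercases to SS; the mapping ß → ẞ is exactly the part that needs a tailoring.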

For Turkish (https://www.unicode.org/cldr/charts/46/summary/tr.html), 
the “main letters” in the very first line are given as

[a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z].

So there i and its capital counterpart İ are not separated by a space. 
But for German (https://www.unicode.org/cldr/charts/46/summary/de.html), 
the “main letters” are

[aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z],

where the missing space does not imply capitalization, so I guess 
changing this list to “… s ßẞ t …” would not automatically inform people 
that ß should be capitalized as ẞ.
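
The observation about the two lists can be checked mechanically. The following Python sketch takes the two “main letters” strings quoted above, finds every group written without a space, and reports the Unicode general category of each member (Ll = lowercase letter, Lu = uppercase letter). The function name multi_letter_groups is hypothetical, and the sketch makes no claim about why the chart groups letters this way.

```python
import unicodedata

TURKISH_MAIN = "a b c ç d e f g ğ h ı iİ j k l m n o ö p r s ş t u ü v y z"
GERMAN_MAIN = "aä b c d e f g h i j k l m n oö p q r s ß t uü v w x y z"

def multi_letter_groups(main_letters: str) -> dict:
    """Map each unspaced group of two or more letters to its members' general categories."""
    # Normalize to NFC so that each accented letter is a single code point.
    normalized = unicodedata.normalize("NFC", main_letters)
    groups = {}
    for group in normalized.split():
        if len(group) > 1:
            groups[group] = [unicodedata.category(ch) for ch in group]
    return groups

turkish_groups = multi_letter_groups(TURKISH_MAIN)
german_groups = multi_letter_groups(GERMAN_MAIN)
print(turkish_groups)  # the only group pairs a lowercase with an uppercase letter
print(german_groups)   # every group pairs two lowercase letters
```

In the German list all grouped letters are lowercase, so the grouping as such cannot be read as a case relation.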

In 
https://www.unicode.org/versions/Unicode16.0.0/UnicodeStandard-16.0.pdf 
on page 198 I find:
“Examples of case tailorings which are not covered by data in 
SpecialCasing.txt include: […] Uppercasing of U+00DF ‘ß’ LATIN SMALL 
LETTER SHARP S to U+1E9E LATIN CAPITAL LETTER SHARP S[.] The preferred 
mechanism for defining tailored casing operations is the Unicode Common 
Locale Data Repository (CLDR), https://cldr.unicode.org, where 
tailorings such as these can be specified on a per-language basis, as 
needed.”  So the idea is already there.  On page 295 the problem with ß 
is addressed in detail, and right underneath it says, “Additional 
language-specific or orthography-specific contexts and casing behavior 
is specified in the Unicode Common Locale Data Repository (CLDR), 
https://cldr.unicode.org.”  So does this already exist?  Or where does 
it have to be added?
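
To make concrete what such a tailoring would do, here is a minimal Python sketch of an uppercasing operation that maps ß to ẞ. It is not how CLDR or any existing library specifies this; the function name upper_german_capital_eszett is hypothetical, and the approach (substitute first, then apply the default mapping) is just one simple way to express the behavior.

```python
def upper_german_capital_eszett(text: str) -> str:
    """Uppercase text, mapping U+00DF (ß) to U+1E9E (ẞ) instead of the default "SS"."""
    # Substitute ß before calling upper(): the default mapping would expand it to "SS",
    # whereas ẞ is already uppercase and is left unchanged by upper().
    return text.replace("\u00df", "\u1e9e").upper()

word = "Stra\u00dfe"  # Straße
print(word.upper())                       # default mapping
print(upper_german_capital_eszett(word))  # tailored mapping
```

With the default mapping the word becomes STRASSE; with the tailored one it becomes STRAẞE.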

Can anybody help?

Best wishes,

Daniel

-- 
Prof. Dr. Daniel Bunčić
===============================================================
Slavisches Institut der Universität zu Köln
Weyertal 137, D-50931 Köln
Phone:         +49 (0)221  470-90535
Office hours:  https://uni.koeln/ENZEB
E-Mail:        daniel.buncic at uni-koeln.de = daniel at buncic.de
Threema:       https://threema.id/8M375R5K
===============================================================
Homepage:      http://daniel.buncic.de/
Academia:      http://uni-koeln.academia.edu/buncic
ResearchGate:  https://researchgate.net/profile/Daniel-Buncic-2
===============================================================