German sharp S uppercase mapping

Markus Scherer markus.icu at gmail.com
Tue Dec 3 09:56:32 CST 2024


On Tue, Dec 3, 2024 at 12:05 AM Daniel Buncic via Unicode
<unicode at corp.unicode.org> wrote:

> Thank you very much for the idea.  I could certainly sum up the
> arguments of the discussion (though I’m too busy to do it right now, you
> would have to have a few weeks’ patience), but I still haven’t
> understood where in the CLDR such casing information is stored.


CLDR has "transform" data for case mappings, but practical case mapping
functions, as implemented in libraries like ICU and ICU4X, are very
low-level, and hardcode exceptional cases.

I have implemented most of the ICU case mapping/case folding functions.
Their behavior mostly follows the Unicode Standard core spec and data files,
including what SpecialCasing.txt says, although most of that is handled in
code rather than read from data.
Over time, we have added and refined language-specific case mappings for
Dutch (IJ), Armenian (a ligature that has gotten reinterpreted), and modern
Greek ("drop accents" but with exceptions on exceptions).
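
As a minimal sketch of the default, language-independent behavior
described above: the example below uses Python's built-in string methods
as a stand-in, not ICU itself. The function name default_mappings is made
up for illustration.

```python
# Python's built-in str methods apply the Unicode default full case
# mappings, including the unconditional SpecialCasing.txt entry for ß.
# This is only an illustration; it is not how ICU is implemented.
SHARP_S = "\u00DF"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S


def default_mappings() -> dict:
    """Return the default mappings involving ß and ẞ (hypothetical helper)."""
    return {
        "upper_of_small": SHARP_S.upper(),          # expands to two letters
        "lower_of_capital": CAPITAL_SHARP_S.lower(),
        "casefold_of_small": SHARP_S.casefold(),
        "word": "Straße".upper(),
    }


if __name__ == "__main__":
    for name, value in default_mappings().items():
        print(name, value)
```

Note the asymmetry: ẞ lowercases to ß, but ß does not uppercase to ẞ by
default.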

There is a CLDR ticket for documenting all of this in the CLDR spec (UTS
#35).

For uppercasing ß to ẞ rather than SS, we have this ticket:
https://unicode-org.atlassian.net/browse/CLDR-17624
I have added some information there from this thread.

And that is why I am engaging in this thread and looking for evidence that
ẞ is replacing SS (and ß) in German all-uppercase text.
I am looking for a noticeable increase in relative frequency, not one-offs.

Another way to approach this, also discussed in that ticket, is to add some
kind of explicit option that lets one choose the uppercasing behavior of ß.
Given how low-level uppercase functions are and what limited inputs they
take, that is also not an easy problem.
It might in some ways be easier if the new behavior had become widespread
already, so that implementers could just change their code for most
contexts.
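
To make the idea of an explicit option concrete, here is a rough sketch
in Python. The function upper_de and its capital_sharp_s parameter are
hypothetical and do not correspond to any existing ICU or CLDR API; the
sketch also ignores everything else a real implementation would have to
consider (locale selection, title casing, and so on).

```python
SHARP_S = "\u00DF"          # ß LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # ẞ LATIN CAPITAL LETTER SHARP S


def upper_de(text: str, capital_sharp_s: bool = False) -> str:
    """Uppercase text, optionally mapping ß to ẞ instead of SS.

    Hypothetical helper for illustration only.
    """
    if capital_sharp_s:
        # Substitute before the default mapping runs, so that ß is not
        # expanded to SS. ẞ is already uppercase and is left unchanged.
        text = text.replace(SHARP_S, CAPITAL_SHARP_S)
    return text.upper()


if __name__ == "__main__":
    print(upper_de("Straße"))
    print(upper_de("Straße", capital_sharp_s=True))
```

The caller has to pass the choice in explicitly, which is exactly the
kind of extra input that low-level uppercase functions usually do not
take.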

Viele Grüße,
markus