German sharp S uppercase mapping

Tue Nov 26 15:18:18 CST 2024

On 11/26/2024 12:41 PM, Daniel Buncic via Unicode wrote:
> Dear Marius, dear Ivan, dear Peter, dear all,
>
> Thanks to Marius for the compromise idea that the ß → SS mapping could 
> remain in the standard table but ß → ẞ be handled as special casing 
> for German.  However, I wonder what language the standard table would 
> be there for then, given that ß is used in no other language but German.

Unicode's default casing table is essentially an "identifier-safe" 
casing table. Not only is it identifier-safe, it is also geared towards 
all situations where it needs to run unattended.

This is distinct from full-fidelity text processing. When 
text-processing functions support authoring (editing), there's an 
immediate quality control, and also, the same stability concerns related 
to both identifiers and unattended use do not apply.
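
As a rough illustration of what "identifier-safe" means in practice, 
here is a small sketch using Python's built-in string methods, which 
apply the untailored Unicode mappings regardless of locale. The helper 
name same_identifier is made up for this example.

```python
def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers caselessly via default Unicode case folding."""
    return a.casefold() == b.casefold()


# Default (untailored) uppercasing expands ß to SS, independent of locale.
print("straße".upper())                       # STRASSE

# Because folding is stable, both spellings keep matching as identifiers.
print(same_identifier("straße", "STRASSE"))   # True
```

The point is that the result never depends on who runs the code or 
where, which is what matters for unattended use.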

>
> (Or if the ß → SS rule was then only applied to those few older 
> non-German texts that did use ß, it would be wrong in most cases, as 
> in this Polish Bible from 1846: 
> https://books.google.de/books?id=W4xbAAAAMAAJ&hl=de.  Google Books, 
> certainly on the basis of some ß → ss rule, gives one of the words in 
> the title as “Wssystko”, but that does not make sense; the word 
> spelled “Wßystko” on the title page has to be transcribed as 
> “Wszystko” (‘Whole’), in the same way as e.g. the first word in the 
> heading of Genesis is spelled “PIERWSZE” (‘First’), not, of course, 
> “PIERWSSE”.)
>
> As to the interpretation of spelling rules, one has to know that 
> “auch” (‘also’) in normative dictionaries always separates a secondary 
> form from a preferred one.  Equal options are separated by “oder” 
> (‘or’) or merely by a comma or a slash.  In this light, see the change 
> from the previous version of the rule (§25 E3) to the current one:
I totally agree with the parsing of the sentence. It is quite clear 
that the way this statement is written implies the use of capital sharp 
S as the ordinary (or "unmarked") case, while "SS" can be used in 
addition (implicit in that is the suggestion that you might have a 
particular reason, such as compatibility with older usage, or things 
like identifiers).
>
> “Bei Schreibung mit Großbuchstaben schreibt man SS. Daneben ist auch 
> die Verwendung des Großbuchstabens ẞ möglich. Beispiel: Straße – 
> STRASSE – STRAẞE.”
> (‘When writing in capital letters, one writes SS. In addition to this, 
> the use of the capital letter ẞ is also possible: Straße – STRASSE – 
> STRAẞE.’ – 
> https://www.rechtschreibrat.com/DOX/rfdr_Regeln_2016_redigiert_2018.pdf, 
> p. 29)
>     ↓
> “Bei Schreibung mit Großbuchstaben ist neben der Verwendung des
> Großbuchstabens ẞ auch die Schreibung SS möglich: Straße – STRAẞE –
> STRASSE.”
> (‘When writing in capital letters, in addition to using the capital 
> letter ẞ, it is also possible to write SS: Straße – STRAẞE – STRASSE.’ 
> – 
> https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk_2024.pdf, 
> p. 48)
>
> Before, capital ẞ was classified as ‘also possible’, now SS is ‘also 
> possible’, and the order of the examples was also changed from 
> “STRASSE – STRAẞE” to “STRAẞE – STRASSE”.  If they had meant the 
> alternatives to be equal, they would have written something like “Bei 
> Schreibung mit Großbuchstaben kann man ẞ oder SS schreiben” (‘When 
> writing in capital letters, one can write ẞ or SS’).  It is correct 
> that the order by itself does not indicate a preference, but the 
> wording does.
>
I think your analysis is very conclusive on that aspect. The rules have 
definitely changed.
> Peter, can you give me an example of an implementation that would 
> crash if there was a new version of CaseFolding.txt or 
> SpecialCasing.txt? Wouldn’t a programmer either copy the data of the 
> file into their application so that it still works if the server 
> unicode.org is down? And then changing the original would have no 
> effect until the programmer decides to implement the change in their 
> application, but then it would be their responsibility to take care of 
> the effects of that change within their application.  Or in the worst 
> case, the application would download its data directly from, say, 
> https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, but then a 
> new version would just have to be stored under …/17.0.0/… and it would 
> not affect the application.  How can a new version of a file like this 
> directly “break existing implementations”? Probably I am 
> misunderstanding something here.

The mistake is to assume that people supporting text processing should 
be using the default table to begin with.

.NET has a concept of an "invariant culture", which you can specify 
instead of a locale-specific culture. It will violate some cultures' 
preferences, but will produce a dependable and stable result. This is 
pretty much what the Unicode "default" casing does.

Somebody should go over the text, look at all the descriptions of 
casing, and propose an update that more clearly explains that for any 
processing that should be correct (for any language) one needs to not 
use this identifier-safe casing mode, but use the correct casing table 
from CLDR.
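
To make the distinction concrete, here is a minimal sketch contrasting 
the two modes. It does not read any CLDR data; the function 
upper_german_tailored is a hypothetical stand-in for a language-specific 
casing rule, written only to show where such a rule would differ from 
the default.

```python
SHARP_S = "\u00df"          # ß  LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1e9e"  # ẞ  LATIN CAPITAL LETTER SHARP S


def upper_default(text: str) -> str:
    """Untailored uppercasing: stable, suitable for identifiers."""
    return text.upper()


def upper_german_tailored(text: str) -> str:
    """Hypothetical German tailoring: prefer capital ẞ over SS.

    This is an assumption for illustration, not an actual CLDR rule.
    """
    return text.replace(SHARP_S, CAPITAL_SHARP_S).upper()


print(upper_default("Straße"))          # STRASSE
print(upper_german_tailored("Straße"))  # STRAẞE
```

An editor would call the tailored function when the author has chosen 
German; anything that runs unattended would stay with the default.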

I don't know whether CLDR has its own default for something that's 
culturally a "least common denominator" but may deviate from the 
"identifier-safe" version. That would be something useful to use for all 
cases where you can't tailor for a single language, but still want to 
get most cases correct (with the exception of the few where several 
languages disagree on the casing for the same character).

It's all well and good to cite "stability requirements" but these 
requirements apply to specific scenarios. That doesn't mean we should 
throw them overboard, but that, instead, we need to be much clearer as 
to where they apply (and where insisting on them is meaningless).

Identifiers and unattended use with stable outcomes are one scenario.

Functions executed during manual edits of text are another.

Let's get some more clarity for what is what, and where to use UCD and 
where to use CLDR.

Someone will have to drive this by proposing actual language in detail 
(covering all discussions of case).

A./

>
> Best wishes,
>
> Daniel
>