German sharp S uppercase mapping

Sat Nov 30 11:16:44 CST 2024

On 11/27/2024 12:15 PM, Dominikus Dittes Scherkl via Unicode wrote:
> Am 26.11.24 um 17:59 schrieb Peter Constable via Unicode:
>> The case pair stability requirement applies to default data.

"Data" is ambiguous here, so let's unpack that a bit.

The /policy/ only applies to the default tables (data files) as 
published in the UCD.

It does not apply directly to any other "data", whether that refers to 
text or tables.

>> Conforming applications can still tailor case mapping behaviour using 
>> language-specific overrides. However, the default case pairs as 
>> defined in the SpecialCasing.txt and UnicodeData.txt files must 
>> remain stable so as not to break existing implementations of file 
>> systems or other identifier systems that depend on the default case 
>> pairs.
>
The /requirements/ that led to this policy apply to certain scenarios 
only (not that that makes them unimportant). A key one is caseless 
identifiers or case conversions of same (including file names). Here you 
need the strict guarantee that repeating the process on a different 
system / different version does not change the result.
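
To make the identifier scenario concrete, here is a minimal sketch. It 
assumes Python's built-in string methods, which follow the default 
(untailored) mappings of whatever Unicode version ships with the 
interpreter. The helper name same_identifier is made up for illustration.

```python
def same_identifier(a: str, b: str) -> bool:
    """Caseless match using default (untailored) case folding."""
    return a.casefold() == b.casefold()

# Default case folding maps both ß and ẞ to "ss", so all three
# spellings name the same identifier.
print(same_identifier("straße", "STRASSE"))  # True
print(same_identifier("straße", "STRAẞE"))   # True
```

If the default folding of ß ever changed, identifiers stored by an older 
system could stop matching on a newer one, which is the breakage the 
policy is meant to rule out.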

There's also the use of these data as the backbone onto which you apply 
tailorings. Again, stability is paramount, because changing the 
backbone would require changes (or at least review) of all tailorings. 
(Something that Unicode does not control).
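
A rough sketch of what "tailoring on top of a backbone" can look like, 
again in Python. The override table and the function to_upper are 
hypothetical and only illustrate the layering; they are not CLDR data, 
and the per-character loop is a simplification that ignores 
context-sensitive rules.

```python
# Hypothetical per-locale overrides, consulted before the default mapping.
# Illustrative only; not taken from CLDR.
UPPER_OVERRIDES = {
    "de": {"ß": "ẞ"},            # capital sharp s instead of "SS"
    "tr": {"i": "İ", "ı": "I"},  # dotted / dotless i
}

def to_upper(text: str, locale: str = "") -> str:
    """Uppercase with the default mapping, plus a locale override if any."""
    overrides = UPPER_OVERRIDES.get(locale, {})
    # str.upper() supplies the stable default for everything not overridden.
    return "".join(overrides.get(ch, ch.upper()) for ch in text)

print(to_upper("straße"))        # STRASSE
print(to_upper("straße", "de"))  # STRAẞE
```

Every entry in such a table is written against the behavior of the 
default, so a change in the default silently changes what the tailoring 
means.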

However, speaking of this as a "default" is confusing to readers who 
think in terms of text processing or authoring environments where a 
different set of requirements rules. Here, the proper "default" is the 
best implementation of a culturally appropriate case transform. And what 
is "best" can change over time. The fact that this is implemented as a 
tailoring on some underlying "default" is not something users need to 
consider.

Therefore the word "default" is also subject to misunderstanding, and we 
might do well to consider whether we've framed this correctly in the text.

> I think nowadays ẞ is preferred over SS, and _especially_ the default
> should be changed to use this, because if a text is automatically
> processed by e.g. functions like toUpper(), the old form is not
> invertible. 

This statement is a clear example of what I mean. From the perspective 
of a user, the minute you use something that's not an "immutable" 
identifier-safe implementation, you expect as your "default" a 
culturally appropriate tailoring.
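
The invertibility point in the quoted remark is easy to reproduce. A 
small sketch, assuming Python's built-in upper() and lower(), which use 
the default mappings (the helper name round_trip is mine):

```python
def round_trip(word: str) -> str:
    """Uppercase with the default mapping, then lowercase again."""
    return word.upper().lower()

# "Maße" (measures) and "Masse" (mass) collapse to the same string.
for word in ("Maße", "Masse"):
    print(word, "->", word.upper(), "->", round_trip(word))

# With the capital sharp s the distinction survives lowercasing.
print("MAẞE".lower())  # maße
```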

If you have a toUpper method that takes a locale identifier or object 
then you should not need to apply further tailorings, and how the 
behavior of the locale tracks changes to the rules as used in the 
culture is subject to a different debate (e.g. whether to define 
sub-locales for either the old or new rules or both, etc.).

The real question with casing is what do you do with text that uses 
"locale-adjacent" characters?

If a French customer has the name of a German supplier in a database, 
why should that use the old rules, while the same for a German customer 
should use the new rules?

Casing is one of the algorithms that need very little tailoring, unlike 
sorting, and therefore there ought to be a different level of "default", 
one which handles all characters with the same behavior, as long as two 
or more locales don't disagree on the casing (like, for example, dotless i).

This would not be an "immutable" version of the casing, but the most 
current "least common denominator" version, and which would be targeted 
to scenarios where otherwise locale-dependent tailorings would be used, 
except that this one would be a "multi-locale" variant.
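
One way to picture such a "multi-locale" variant, purely as a sketch: 
collect what each locale wants and keep only the mappings on which no 
two locales disagree. The table below is invented for illustration (it 
is not CLDR data), and the names LOCALE_UPPER, multi_locale_overrides 
and to_upper_multi are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: the uppercase mapping each locale wants for a few
# characters. Illustrative only; not taken from CLDR.
LOCALE_UPPER = {
    "de": {"ß": "ẞ", "i": "I"},
    "fr": {"i": "I"},
    "tr": {"i": "İ"},
}

def multi_locale_overrides(tables):
    """Keep only mappings on which all locales listing the character agree."""
    targets = defaultdict(set)
    for table in tables.values():
        for ch, upper in table.items():
            targets[ch].add(upper)
    return {ch: next(iter(ups)) for ch, ups in targets.items() if len(ups) == 1}

SHARED = multi_locale_overrides(LOCALE_UPPER)

def to_upper_multi(text: str) -> str:
    """Apply the shared overrides, falling back to the default mapping."""
    return "".join(SHARED.get(ch, ch.upper()) for ch in text)

print(SHARED)                      # {'ß': 'ẞ'}
print(to_upper_multi("straße"))    # STRAẞE
print(to_upper_multi("istanbul"))  # ISTANBUL
```

The sharp s makes it into the shared set because only one locale has an 
opinion on it, while "i" drops out because Turkish disagrees with the 
others, so it falls back to the default.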

> If the old form is intended, it is very easy to replace
> every occurrence of ẞ by SS, but in the other direction not every SS has
> to be replaced by ẞ, making it a time consuming manual task to change 
> back.
> And this problem was the reason why ẞ was introduced at all. After this
> introduction the main reason to use SS was that it was not officially
> allowed to use ẞ until 2017. Now the only reason to use SS as uppercase
> would be if old equipment is used, that doesn't provide the new letter.
> Luckily that is vanishing.

Not entirely true. Again, if you think of authoring text, I would agree. 
If you think of identifiers or file names, you had better not change your 
uppercasing. There are some things that look like text but for which 
strict stability is more important than cultural correctness.

*Conclusion:* the fact that we are having this discussion at all points 
to the need to look at our way of describing this topic and possible 
deficiencies in describing the full ecosystem. Bare-bones Unicode, 
including UCD, does not solve everything but apparently we are not doing 
enough to hand people off to CLDR, for example, when they are looking 
for locale-appropriate solutions. In other words, we can and should 
improve our presentation and positioning, but that would best be done in 
response to somebody taking the time to track down some of the key text 
passages (or file headers) and filing a problem report.

A./
