Identifier caseless matching without toNFKC_Casefold
Addison Phillips
addisoni18n at gmail.com
Tue Mar 12 08:44:46 CDT 2024
Hi Yuri,
The part of the W3C “Character Model” called “String Matching for the Web” illustrates the case in this section:
https://www.w3.org/TR/charmod-norm/#normalizationAndCasefold
You might find the rest of the document useful in your work as well.
Best regards,
Addison
Addison Phillips
Chair (W3C Internationalization WG)
Internationalization is not a feature.
It is an architecture.
From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Yuri Sukhov via Unicode
Sent: Monday, March 11, 2024 5:06 PM
To: unicode at corp.unicode.org
Subject: Identifier caseless matching without toNFKC_Casefold
Hi,
I'm implementing caseless matching for strings used as identifiers. I'm aware that the NFKC_Casefold mapping and the related toNFKC_Casefold() string transform are designed for such a scenario. Unfortunately, the language and libraries I'm using do not implement toNFKC_Casefold(), so I'm looking for an alternative approach.
My use case does not seem to require the removal of default ignorables; for now I'm only concerned with case and compatibility variations. It looks like the definition of a compatibility caseless match is what I need:
A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
However, I can't find a case where that extra cycle of folding and normalization makes a difference. It seems to me that the same result, a compatibility caseless match, can be achieved with a simpler approach:
NFC(toCasefold(NFKD(X)))
Basically, I think about it as 1) removing the compatibility variations by normalizing with decomposition, 2) then removing the case differences from this decomposed sequence, 3) and finally storing a folded string in a potentially shorter NFC form.
It looks like it checks all the boxes, and my - likely naive - testing shows that
NFC(toCasefold(NFKD(X))) = NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
I'm sure I'm missing something, and would appreciate an explanation of why or when this won't work.
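For anyone who wants to try the comparison themselves, here is a minimal Python sketch of the two candidate keys. It assumes that Python's str.casefold() is an acceptable stand-in for toCasefold() (it applies full case folding for whatever Unicode version the interpreter ships with). The function names standard_key, simple_key, standard_match, simple_match and disagreements are made up for this illustration. It is only a test harness and does not prove that the two approaches are equivalent.

```python
import unicodedata


def standard_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), the two-cycle definition."""
    first = unicodedata.normalize("NFD", s).casefold()
    second = unicodedata.normalize("NFKD", first).casefold()
    return unicodedata.normalize("NFKD", second)


def simple_key(s: str) -> str:
    """NFC(toCasefold(NFKD(s))), the proposed shortcut."""
    folded = unicodedata.normalize("NFKD", s).casefold()
    return unicodedata.normalize("NFC", folded)


def standard_match(x: str, y: str) -> bool:
    # Compatibility caseless match under the two-cycle definition.
    return standard_key(x) == standard_key(y)


def simple_match(x: str, y: str) -> bool:
    # Match under the shortcut; keys are only compared with each other,
    # so the final NFC versus NFKD form does not matter here.
    return simple_key(x) == simple_key(y)


def disagreements(pairs):
    """Return the pairs on which the two match predicates give different answers."""
    return [(x, y) for x, y in pairs if standard_match(x, y) != simple_match(x, y)]
```

Calling disagreements() on a list of string pairs returns the pairs where the two definitions differ. An empty result only means that no counterexample was found in that particular sample.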
Yuri