Identifier caseless matching without toNFKC_Casefold

Tue Mar 12 08:44:46 CDT 2024

Hi Yuri,

The part of the W3C “Character Model” called “String Matching for the Web” illustrates the case in this section:

https://www.w3.org/TR/charmod-norm/#normalizationAndCasefold

You might find the rest of the document useful in your work as well.

Best regards,

Addison

Addison Phillips

Chair (W3C Internationalization WG)

Internationalization is not a feature.

It is an architecture.

From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Yuri Sukhov via Unicode
Sent: Monday, March 11, 2024 5:06 PM
To: unicode at corp.unicode.org
Subject: Identifier caseless matching without toNFKC_Casefold

Hi,

I'm implementing caseless matching for strings used as identifiers. I'm aware that the NFKC_Casefold mapping and the related toNFKC_Casefold() string transform are designed for such a scenario. Unfortunately, the language and libraries I'm using do not implement toNFKC_Casefold(), so I'm looking for an alternative approach.

My use case does not seem to require the removal of default-ignorables; for now I'm only concerned with case and compatibility variations. It looks like the definition of the compatibility caseless match is what I need:

A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

However, I can't seem to find a case where that extra cycle of folding/normalization makes a difference. It seems to me that the same result - a compatibility caseless match - can be achieved with a simpler approach:

NFC(toCasefold(NFKD(X))) 

Basically, I think about it as 1) removing the compatibility variations by normalizing with decomposition, 2) then removing the case differences from this decomposed sequence, 3) and finally storing a folded string in a potentially shorter NFC form.
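The three steps above can be traced in a minimal Python sketch. It assumes that the standard library's unicodedata.normalize() and str.casefold() are acceptable stand-ins for NFKD/NFC and toCasefold(); the helper name simple_fold_steps is hypothetical and used only for illustration.

```python
import unicodedata


def simple_fold_steps(s: str) -> tuple[str, str, str]:
    """Return the intermediate results of NFC(toCasefold(NFKD(s)))."""
    # Step 1: decompose, which also removes compatibility variations.
    decomposed = unicodedata.normalize("NFKD", s)
    # Step 2: fold case on the decomposed sequence (str.casefold() is
    # assumed here to play the role of toCasefold()).
    folded = decomposed.casefold()
    # Step 3: recompose into the potentially shorter NFC form for storage.
    composed = unicodedata.normalize("NFC", folded)
    return decomposed, folded, composed


# U+212B ANGSTROM SIGN decomposes to "A" + U+030A COMBINING RING ABOVE.
steps = simple_fold_steps("\u212B")
print([[f"U+{ord(c):04X}" for c in step] for step in steps])
```

For this input the stored form is the single code point U+00E5, one code point shorter than the folded, decomposed sequence.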

It looks like it checks all the boxes, and my - likely naive - testing shows that

NFC(toCasefold(NFKD(X))) = NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
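For reference, here is a rough sketch of the kind of comparison involved, again assuming Python's unicodedata.normalize() and str.casefold() in place of the normalization forms and toCasefold(). The names full_key, simple_key, same_verdict and SAMPLES are hypothetical. One caveat: because the simpler form ends in NFC and the definition ends in NFKD, the two raw results can differ as strings, so the sketch compares match verdicts (keys built by the same function on both sides) rather than the keys themselves. A handful of samples like this cannot prove equivalence; it only fails to find a counterexample.

```python
import unicodedata


def full_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), per the definition."""
    nfd = unicodedata.normalize("NFD", s)
    first_cycle = unicodedata.normalize("NFKD", nfd.casefold())
    return unicodedata.normalize("NFKD", first_cycle.casefold())


def simple_key(s: str) -> str:
    """NFC(toCasefold(NFKD(s))), the simpler candidate."""
    return unicodedata.normalize("NFC", unicodedata.normalize("NFKD", s).casefold())


def same_verdict(x: str, y: str) -> bool:
    """True if both approaches agree on whether x and y match."""
    return (full_key(x) == full_key(y)) == (simple_key(x) == simple_key(y))


# A small, hand-picked sample: ligature, sharp s, fullwidth letter, accents.
SAMPLES = ["FILE", "\uFB01le", "Stra\u00DFe", "STRASSE",
           "\uFF21", "a", "\u00C9", "e\u0301"]

disagreements = [(x, y) for x in SAMPLES for y in SAMPLES
                 if not same_verdict(x, y)]
print(disagreements)

# The raw keys are not identical strings, even when the verdicts agree.
print(full_key("\u00C9") == simple_key("\u00C9"))
```

Extending SAMPLES to every assigned code point, and to multi-character sequences involving combining marks, would be a more meaningful check than this short list.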

I'm sure I'm missing something, and would appreciate an explanation of why or when this won't work.

Yuri
