Identifier caseless matching without toNFKC_Casefold

Addison Phillips addisoni18n at gmail.com
Tue Mar 12 08:44:46 CDT 2024


Hi Yuri,

 

The part of the W3C “Character Model” called “String Matching for the Web” discusses this case in the following section:

 

https://www.w3.org/TR/charmod-norm/#normalizationAndCasefold

 

You might find the rest of the document useful in your work as well.

 

Best regards,

 

Addison

 

Addison Phillips

Chair (W3C Internationalization WG)

 

Internationalization is not a feature.

It is an architecture.

From: Unicode <unicode-bounces at corp.unicode.org> On Behalf Of Yuri Sukhov via Unicode
Sent: Monday, March 11, 2024 5:06 PM
To: unicode at corp.unicode.org
Subject: Identifier caseless matching without toNFKC_Casefold

 

Hi,

 

I'm implementing caseless matching for strings used as identifiers. I'm aware that the NFKC_Casefold mapping and the related toNFKC_Casefold() string transform are designed for such a scenario. Unfortunately, the language and libraries I'm using do not implement toNFKC_Casefold(), so I'm looking for an alternative approach.

 

My use case does not seem to require the removal of default-ignorables; for now I'm only concerned with case and compatibility variations. It looks like the definition of a compatibility caseless match is what I need:

 

A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
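
For reference, the definition above can be written as a short Python sketch. It assumes that str.casefold() (full case folding) is an acceptable stand-in for toCasefold(), and its results depend on the Unicode version bundled with the interpreter. The function names are illustrative, not taken from the standard.

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), with str.casefold()
    assumed to stand in for toCasefold()."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def compat_caseless_match(x: str, y: str) -> bool:
    """True if x and y are a compatibility caseless match under the
    assumptions stated above."""
    return compat_caseless_key(x) == compat_caseless_key(y)


if __name__ == "__main__":
    # Full case folding maps U+00DF to "ss".
    print(compat_caseless_match("Stra\u00dfe", "STRASSE"))
    # Fullwidth "ABC" (U+FF21..U+FF23) against ASCII "abc".
    print(compat_caseless_match("\uff21\uff22\uff23", "abc"))
```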

 

However, I can't seem to find a case where that extra cycle of folding and normalization makes a difference. It seems to me that the same result, a compatibility caseless match, can be achieved with a simpler approach:

 

NFC(toCasefold(NFKD(X))) 

 

Basically, I think about it as 1) removing the compatibility variations by normalizing with decomposition, 2) then removing the case differences from this decomposed sequence, 3) and finally storing a folded string in a potentially shorter NFC form.

 

It looks like it checks all the boxes, and my - likely naive - testing shows that

 

NFC(toCasefold(NFKD(X))) = NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
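
A comparison of this kind can be sketched in Python as follows. Note that the left side ends in NFC and the right side ends in NFKD, so the two outputs differ in form for any string with a composable result (for example U+00C9); the sketch therefore brings the simple key to NFKD before comparing, so that only differences in the match relation are reported. As before, str.casefold() is assumed to stand in for toCasefold(), and all function names are illustrative.

```python
import unicodedata


def simple_key(s: str) -> str:
    """NFC(toCasefold(NFKD(s))), the simpler form proposed above."""
    return unicodedata.normalize("NFC", unicodedata.normalize("NFKD", s).casefold())


def full_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s))))), the form in the definition."""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)


def keys_agree(s: str) -> bool:
    """Compare both keys after bringing the simple key to NFKD."""
    return unicodedata.normalize("NFKD", simple_key(s)) == full_key(s)


def divergent_code_points(start: int = 0, stop: int = 0x110000) -> list[int]:
    """Code points in [start, stop) whose single-character strings yield
    different keys. Surrogates are skipped."""
    return [
        cp
        for cp in range(start, stop)
        if not 0xD800 <= cp <= 0xDFFF and not keys_agree(chr(cp))
    ]


if __name__ == "__main__":
    for cp in divergent_code_points():
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

A scan over single code points says nothing about multi-character sequences (combining marks after a base letter, for instance), so an empty result would not by itself show that the two forms always agree.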

 

I'm sure I'm missing something, and would appreciate an explanation of why or when this won't work.


Yuri
