Identifier caseless matching without toNFKC_Casefold

Yuri Sukhov yuri.sukhov at gmail.com
Mon Mar 11 19:06:07 CDT 2024


Hi,

I'm implementing caseless matching for strings used as identifiers. I'm
aware that the NFKC_Casefold mapping and the related toNFKC_Casefold() string
transform are designed for such a scenario. Unfortunately, the language and
libraries I'm using do not implement toNFKC_Casefold(), so I'm looking for
an alternative approach.

My use case does not seem to require the removal of default-ignorables; for
now I'm only concerned with case and compatibility variations. It looks
like the definition of a compatibility caseless match is what I need:

A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
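
A minimal sketch of this definition in Python, assuming that str.casefold()
is an acceptable stand-in for toCasefold() (it applies full case folding from
whichever Unicode version the interpreter ships with) and that
unicodedata.normalize() supplies the normalization forms. The function names
are hypothetical and chosen only for illustration.

```python
import unicodedata


def compat_caseless_key(s: str) -> str:
    """Return NFKD(toCasefold(NFKD(toCasefold(NFD(s)))))."""
    # Assumption: str.casefold() approximates toCasefold().
    s = unicodedata.normalize("NFD", s).casefold()   # first fold, on NFD input
    s = unicodedata.normalize("NFKD", s).casefold()  # second fold, after NFKD
    return unicodedata.normalize("NFKD", s)          # final NFKD


def compat_caseless_match(x: str, y: str) -> bool:
    """True if x and y have the same compatibility caseless key."""
    return compat_caseless_key(x) == compat_caseless_key(y)
```

For example, compat_caseless_match("\u3392", "MHZ") compares U+3392 SQUARE MHZ
against the plain letters "MHZ"; both sides reduce to "mhz".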

However, I can't seem to find a case where that extra cycle of
folding/normalization makes a difference. It seems to me that the same
result - a compatibility caseless match - can be achieved with a simpler
approach:

NFC(toCasefold(NFKD(X)))

Basically, I think about it as 1) removing the compatibility variations by
normalizing with decomposition, 2) then removing the case differences from
this decomposed sequence, 3) and finally storing a folded string in a
potentially shorter NFC form.
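
The three steps above can be sketched in Python as follows, again assuming
str.casefold() stands in for toCasefold(). The name simple_key is
hypothetical.

```python
import unicodedata


def simple_key(s: str) -> str:
    """Return NFC(toCasefold(NFKD(s)))."""
    decomposed = unicodedata.normalize("NFKD", s)  # 1) drop compatibility variations
    folded = decomposed.casefold()                 # 2) drop case differences
    return unicodedata.normalize("NFC", folded)    # 3) store in composed form
```

With this sketch, simple_key("\u00c9") and simple_key("E\u0301") both yield
the single code point U+00E9, because the final NFC step recomposes the
folded base letter and the combining acute accent.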

It looks like it checks all the boxes, and my - likely naive - testing
shows that

NFC(toCasefold(NFKD(X))) = NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
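
One way such a test could be set up is sketched below. Since one side ends in
NFC and the other in NFKD, the sketch brings the simpler key to NFKD before
comparing, so that only differences beyond the final normalization form are
reported. It assumes str.casefold() stands in for toCasefold(); all function
names are hypothetical, and the scan covers only whatever code points it is
given.

```python
import unicodedata


def standard_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(s)))))."""
    s = unicodedata.normalize("NFD", s).casefold()
    s = unicodedata.normalize("NFKD", s).casefold()
    return unicodedata.normalize("NFKD", s)


def simple_key(s: str) -> str:
    """NFC(toCasefold(NFKD(s)))."""
    folded = unicodedata.normalize("NFKD", s).casefold()
    return unicodedata.normalize("NFC", folded)


def find_mismatches(code_points):
    """Return the characters whose two keys differ after both are in NFKD."""
    mismatches = []
    for cp in code_points:
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        if unicodedata.normalize("NFKD", simple_key(ch)) != standard_key(ch):
            mismatches.append(ch)
    return mismatches
```

Calling find_mismatches(range(sys.maxunicode + 1)) (after import sys) would
scan every single code point known to the interpreter. A scan of single code
points does not exercise reordering of combining marks, so sequences of
several characters would need to be tested separately.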

I'm sure I'm missing something, and would appreciate an explanation of
why/when this won't work.

Yuri