Identifier caseless matching without toNFKC_Casefold
Yuri Sukhov
yuri.sukhov at gmail.com
Tue Mar 12 15:08:18 CDT 2024
Hi Addison,
Thank you for the link; the examples were very useful. The more I look
at them, the more convinced I become that the compatibility caseless
match transform
NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
is unnecessarily expensive.
The second (outer) casefold+normalization cycle can be avoided if we
perform the initial NFKD normalization *before* the first casefold. Doing
compatibility decomposition before the casefolding eliminates the
problem with the U+3392 character illustrated in example 19. And since it's
recommended to decompose before the initial casefold anyway (the
Greek ypogegrammeni/iota issue), NFKD normalization as the first step also
covers that case.
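The U+3392 case can be traced step by step. This is only a small Python
sketch, assuming str.casefold() is an acceptable stand-in for toCasefold and
unicodedata.normalize() for the normalization forms; the helper names are
made up for illustration.

```python
import unicodedata

def fold_then_nfkd(s: str) -> str:
    # One pass in the standard's order: NFD, casefold, then NFKD.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFKD", nfd.casefold())

def nfkd_then_fold(s: str) -> str:
    # Proposed order: compatibility decomposition first, then casefold.
    return unicodedata.normalize("NFKD", s).casefold()

square_mhz = "\u3392"  # SQUARE MHZ, compatibility-decomposes to "MHz"

# Folding first leaves U+3392 untouched, so NFKD exposes "MHz" with its
# capitals intact and a second casefold would be needed.
print(fold_then_nfkd(square_mhz))  # MHz
# Decomposing first lets a single casefold see the letters.
print(nfkd_then_fold(square_mhz))  # mhz
```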
As a result, the transform is reduced from 3 normalizations + 2 casefolds
to 2 normalizations and 1 casefold:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) --> NFC(toCasefold(NFKD(X)))
Am I missing any non-trivial cases? For the examples on the mentioned page,
as well as for the other "problematic" cases I've seen in other places,
this lighter transform produces the same output as the more expensive one
from the standard.
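For reference, the comparison can be sketched in Python roughly as follows.
It assumes str.casefold() approximates toCasefold; the function names and
the sample list are hypothetical. Since one transform ends in NFKD and the
other in NFC, the two keys are compared after bringing both to NFD.

```python
import unicodedata

def standard_key(s: str) -> str:
    """NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))"""
    step = unicodedata.normalize("NFD", s).casefold()
    step = unicodedata.normalize("NFKD", step).casefold()
    return unicodedata.normalize("NFKD", step)

def reduced_key(s: str) -> str:
    """NFC(toCasefold(NFKD(X)))"""
    folded = unicodedata.normalize("NFKD", s).casefold()
    return unicodedata.normalize("NFC", folded)

def same_up_to_canonical_equivalence(s: str) -> bool:
    # The keys end in different normalization forms (NFKD vs NFC),
    # so bring both to NFD before comparing.
    a = unicodedata.normalize("NFD", standard_key(s))
    b = unicodedata.normalize("NFD", reduced_key(s))
    return a == b

# Hypothetical sample strings; not an exhaustive test set.
samples = ["\u3392", "MHz", "Stra\u00dfe", "\u00c9cole"]
for s in samples:
    print(repr(s), same_up_to_canonical_equivalence(s))
```

A check like this only confirms agreement on the listed strings; looping
over every assigned code point would give better coverage, but still would
not prove the two transforms equivalent for all strings.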
Kind regards,
Yuri
On Tue, Mar 12, 2024 at 5:44 PM Addison Phillips <addisoni18n at gmail.com>
wrote:
> Hi Yuri,
>
>
>
> The part of the W3C “Character Model” called “String Matching for the Web”
> illustrates the case in this section:
>
>
>
> https://www.w3.org/TR/charmod-norm/#normalizationAndCasefold
>
>
>
> You might find the rest of the document useful in your work as well.
>
>
>
> Best regards,
>
>
>
> Addison
>
>
>
> Addison Phillips
>
> Chair (W3C Internationalization WG)
>
>
>
> Internationalization is not a feature.
>
> It is an architecture.
>
> *From:* Unicode <unicode-bounces at corp.unicode.org> *On Behalf Of *Yuri
> Sukhov via Unicode
> *Sent:* Monday, March 11, 2024 5:06 PM
> *To:* unicode at corp.unicode.org
> *Subject:* Identifier caseless matching without toNFKC_Casefold
>
>
>
> Hi,
>
>
>
> I'm implementing caseless matching for strings used as identifiers. I'm
> aware that the NFKC_Casefold mapping and the related toNFKC_Casefold() string
> transform are designed for such a scenario. Unfortunately, the language and
> libraries I'm using do not implement toNFKC_Casefold(), so I'm looking for
> an alternative approach.
>
>
>
> My use case does not seem to require the removal of default-ignorables;
> for now I'm only concerned with case and compatibility variations. It
> looks like the definition of the compatibility caseless match is what I
> need:
>
>
>
> A string X is a compatibility caseless match for a string Y if and only
> if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) =
> NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))
>
>
>
> However, I can't seem to find a case where that extra cycle of
> folding/normalization makes a difference. It seems to me that the same
> result - compatibility caseless match - can be achieved with a simpler
> approach:
>
>
>
> NFC(toCasefold(NFKD(X)))
>
>
>
> Basically, I think about it as 1) removing the compatibility variations by
> normalizing with decomposition, 2) then removing the case differences from
> this decomposed sequence, 3) and finally storing a folded string in a
> potentially shorter NFC form.
>
>
>
> It looks like it checks all the boxes, and my - likely naive - testing
> shows that
>
>
>
> NFC(toCasefold(NFKD(X))) = NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
>
>
>
> I'm sure I'm missing something, and would appreciate an explanation
> why/when this won't work.
>
>
> Yuri
>