Compatibility Casefold Equivalence

Carl via Unicode unicode at unicode.org
Tue Nov 27 01:46:06 CST 2018


Thanks for the reply. Responses inline:

> On November 24, 2018 at 5:33 PM Asmus Freytag via Unicode <unicode at unicode.org> wrote: 
>  
> 
> On 11/22/2018 11:58 AM, Carl via Unicode wrote: 
> > (It looks like my HTML email got scrubbed, sorry for the double post)
> > 
> > Hi,
> > 
> > 
> > In Chapter 3 Section 13, the Unicode spec defines D146:
> > 
> > 
> > "A string X is a compatibility caseless match for a string Y if and only if: NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))"
> > 
> > 
> > I am trying to understand the "if and only if" part of this.   Specifically, why is the outermost NFKD necessary?  Could it also be a NFKC normalization?   Is wrapping the outer NFKD in a NFC or NFKC on both sides of the equation okay?
> > 
> > 
> > My use case is that I am trying to store user-provided tags in a database.  I would like the tags to be deduplicated based on compatibility and caseless equivalence, which is how I ended up looking at D146.  However, because decomposition can result in much larger strings, I would prefer to keep the stored version in NFC or NFKC (I *think* this doesn't matter after doing the casefolding as described above).
> 
> 
> Carl,
> 
> 
> you may find that some of the complications are limited to a small number of code points. In particular, classical (polytonic) Greek has some gnarly behavior wrt case; and some compatibility characters have odd edge cases.
> 
> 

I suspected that the number of edge cases would be small, but I lack a way of enumerating them.  (i.e. I don't know what I don't know)
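One rough way to enumerate at least some of them is to walk every assigned code point and see where a single casefold/NFKD pass gives a different result than the full D146 transform. The sketch below assumes Python's str.casefold() is an acceptable stand-in for toCasefold, and that the unicodedata module's Unicode version is recent enough; the function names are made up for illustration. It only tests single code points, so it cannot find edge cases that arise from interactions within longer strings.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

def nfkd(s):
    return unicodedata.normalize("NFKD", s)

def d146(s):
    # Full transform from D146: NFKD(toCasefold(NFKD(toCasefold(NFD(s)))))
    return nfkd(nfkd(nfd(s).casefold()).casefold())

def single_pass(s):
    # Shortened transform with only one casefold/NFKD round.
    return nfkd(nfd(s).casefold())

def needs_second_pass():
    """Return code points whose single-pass result differs from D146."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        # Skip surrogates, unassigned, and private-use code points.
        if unicodedata.category(ch) in ("Cs", "Cn", "Co"):
            continue
        if single_pass(ch) != d146(ch):
            found.append(cp)
    return found

if __name__ == "__main__":
    hits = needs_second_pass()
    print(len(hits), "code points need the second casefold/NFKD pass")
    for cp in hits[:20]:
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '?')}")
```

For example, U+2122 TRADE MARK SIGN shows up in the list: it has no case folding of its own, but its compatibility decomposition is "TM", which still has to be folded to "tm".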

> I'm personally not a fan of allowing every single Unicode code point in things like usernames (or other types of identifiers). Especially, if including some code points makes the "general case" that much more complex, my personal recommendation would be to simply disallow / reject a small set of troublesome characters; especially if they aren't part of some widespread modern orthography. 
> 
> 
> While Unicode is about being able to digitally represent all written text, identifiers don't follow the same rules. The main reason why people often allow "anything" is because it's easy in terms of specification. Sometimes, you may not have control over what to accept; for example if tags are generated from headers in a document, it would require some transform to handle disallowed code points.
> 
> 

The identifiers doc was what I had originally planned on using, but some of the rules there are too much.  For example, IIUC variation selectors are not allowed (scrubbed?), which prevents use of some emoji sequences.  Also, the ID_Start and XID_Start properties are too strict (since I'm not using this in a programming language or otherwise secure environment), as they forbid leading numbers.  Hashtags are close to what I want, but again, they specify a leading "#".  

Really the problem for me is that I don't know what liberties I can take with restricting/allowing certain characters.  Being too restrictive might be culturally insensitive, but being too lax might open the system for abuse.   Would it be overkill to render the tag text to a picture, hash the picture, and store that instead?  It seems like it would force visually identical strings to the same set of bytes.


> Case is also only one of the types of duplication you may encounter. In many South and South East Asian scripts you may encounter cases where two sequences of characters, while different, will normally render identically. Arabic also has instances of that. Finally, you may ask yourself whether your system should treat simplified and traditional Chinese ideographs as separate or as a variant not unlike the way you treat case.
> 
> 

Ideally I would like the same kind of matching as my browser does when I press Ctrl-F.  If simplified and traditional Chinese match, that's probably good enough.  



> About storing your tag data: you can obviously store them as NFC, if you like: in that case, you will have to run the operations both on the stored and on the new tag.
> 
> 
> Finally, there are some cases where you can tell that two strings are identical without actually carrying out the full set of operations:
> 
> 
> Y = X
> 
> 
> NFC(Y) = NFC(X)
> 
> 
> and so on. (If these conditions are true, the full condition above must also be true). For example, let's apply 
> 
> NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
> 
> 
> on both sides of
> 
> 
> NFC(Y) = NFC(X)
> 
> 
> First:
> 
> 
> NFD(NFC(Y)) = NFD(NFC(X))
> 
> 
> Because the two sides are equal, applying toCasefold results in equal strings, and so on all the way to the outer NFKD.

As a minor followup, TR 15 section 7 says:

"NFKC(NFKD(x)) == NFKC(x)"

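which implies that the outer NFKD can be replaced with NFKC without changing which strings match (NFKD(NFKC(x)) == NFKD(x) also holds, so two strings are equal under one form exactly when they are equal under the other):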

NFKC(toCasefold(NFKD(toCasefold(NFD(X)))))
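
A minimal sketch of that transform as a key for deduplicating stored tags, again assuming Python's str.casefold() is close enough to toCasefold. The function name compat_casefold_key and the final_form parameter are hypothetical, not from any spec:

```python
import unicodedata

def compat_casefold_key(s: str, final_form: str = "NFKC") -> str:
    """Compute a comparison key following D146, with a selectable outer form.

    final_form="NFKD" is the transform exactly as written in D146;
    final_form="NFKC" is the composed variant, which is usually shorter.
    """
    step = unicodedata.normalize("NFD", s)
    step = step.casefold()
    step = unicodedata.normalize("NFKD", step)
    step = step.casefold()
    return unicodedata.normalize(final_form, step)

if __name__ == "__main__":
    # ANGSTROM SIGN: one code point with NFKC, two with NFKD.
    for form in ("NFKC", "NFKD"):
        key = compat_casefold_key("\u212b", form)
        print(form, [f"U+{ord(c):04X}" for c in key])
```

The key, not the original tag, is what would get a unique index in the DB; the user's original spelling can be stored alongside it for display.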


> 
> 
> In other words, you can stop the comparison at any point where the two sides are equal. From that point on, the outer operations cannot add anything.


That's a good point.  In my case, since one side of the equation will be stored in a DB, I believe I need to do the full transform.  That said, it would be useful for in-memory comparisons.
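
For the in-memory case, the early-exit idea could look roughly like the sketch below. It assumes Python's str.casefold() as a stand-in for toCasefold, and compat_caseless_match is a made-up name. Each step is a deterministic function applied to both sides, so once the two sides are equal they stay equal through the remaining steps:

```python
import unicodedata

def _nfd(s):
    return unicodedata.normalize("NFD", s)

def _nfkd(s):
    return unicodedata.normalize("NFKD", s)

# The D146 pipeline, innermost operation first.
_STEPS = (_nfd, str.casefold, _nfkd, str.casefold, _nfkd)

def compat_caseless_match(x: str, y: str) -> bool:
    """Compare per D146, stopping as soon as both sides are equal."""
    if x == y:
        return True
    for step in _STEPS:
        x, y = step(x), step(y)
        if x == y:
            return True
    # Still different after the outer NFKD: not a match.
    return False

if __name__ == "__main__":
    print(compat_caseless_match("Stra\u00dfe", "STRASSE"))
    print(compat_caseless_match("tag", "tags"))
```

Strings that do not match still go through all five steps, so the shortcut only helps when matches are common.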

> 
> 
> A./


