Normalization Generics (NFx, NFKx, NFxy)

Zach Lym indolering at gmail.com
Sat Dec 12 20:23:23 CST 2020


> The more general rule is that:
> NFC(X) = NFC(Y) if and only if NFD(X) = NFD(Y).
> I.e. you can always replace one canonical form with the other in
> equivalence comparisons. (As long as you apply the same one to both
> sides, of course, but which one is up to you.)

Yes, and a careful reading of the standard will show that this is the
case.  But we don't live in a world where people have time to read the
standard.

Oh dear, I included the wrong link in my citation!  It should have
been:
https://lwn.net/ml/linux-fsdevel/20190206084752.nwjkeiixjks34vao@pali/

At any rate, someone suggested using NFC, but this objection came up:

>> Is there any case where
>>    NFC(x) == NFC(y) && NFD(x) != NFD(y)   , or
>>    NFC(x) != NFC(y) && NFD(x) == NFD(y)
>
>This is good question. And I think we should get definite answer for it
>prior inclusion of normalization into kernel.

That question was simply never followed up on.  Kernel-side
normalization is a feature that was included after years of debate and
developed in an open process.  If even Linux can't get this one right,
then we need to do a better job of explaining Unicode.
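
For what it's worth, the answer to the quoted question is "no", and it
follows from two properties of the forms: NFC is the canonical
composition of NFD, and decomposing an NFC string gives back the NFD
string.  So equality under one form implies equality under the other.
It can also be spot-checked mechanically.  Below is a minimal sketch
using Python's standard unicodedata module; the helper name and the
sample strings are my own picks for illustration, and a spot check over
a handful of strings is of course not a proof.

```python
import unicodedata
from itertools import product


def canonical_forms_agree(x: str, y: str) -> bool:
    """True when NFC and NFD give the same equality verdict for x and y."""
    nfc_equal = unicodedata.normalize("NFC", x) == unicodedata.normalize("NFC", y)
    nfd_equal = unicodedata.normalize("NFD", x) == unicodedata.normalize("NFD", y)
    return nfc_equal == nfd_equal


# Hypothetical sample strings, chosen for illustration only.
SAMPLES = [
    "\u00e9",         # precomposed e with acute
    "e\u0301",        # e followed by combining acute
    "\u212b",         # ANGSTROM SIGN
    "\u00c5",         # A with ring above
    "A\u030a",        # A followed by combining ring above
    "q\u0307\u0323",  # combining marks in non-canonical order
    "q\u0323\u0307",  # the same marks in canonical order
    "\ufb01",         # fi ligature (compatibility decomposition only)
    "fi",
]

# Collect every pair on which the two canonical forms disagree.
mismatches = [
    (x, y)
    for x, y in product(SAMPLES, repeat=2)
    if not canonical_forms_agree(x, y)
]
```

On these samples the mismatches list comes out empty, which is what
the rule predicts.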

> > I would instead like to propose normalization form generics for use in
> > pseudo code definitions:
> >
> >     NFx = NFD|NFC
> >     NFKx = NFKD|NFKC
> >     NFxy = NFD|NFC|NFKD|NFKC
>
> I would prefer the last one to be:
> NF(K)x = NFD|NFC|NFKD|NFKC; or perhaps
> NF[K]x = NFD|NFC|NFKD|NFKC; to look a bit more like ABNF.

I don't care for NFxy either, but I strongly prefer sticking to C
programming conventions.
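
To make the proposal a bit more concrete, here is one possible reading
of the generics as sets of concrete forms, again sketched in Python
with the standard unicodedata module.  The function name
equivalent_under and the choice to raise on disagreement are my
assumptions, not part of the proposal or of any standard.

```python
import unicodedata

# Hypothetical encoding of the proposed generics as tuples of form names.
NFx = ("NFD", "NFC")
NFKx = ("NFKD", "NFKC")
NFxy = NFx + NFKx


def equivalent_under(generic, x, y):
    """Return the equality verdict shared by every form in the generic.

    Raises ValueError when the forms disagree, i.e. when the generic
    cannot stand in for "whichever form you like" on this pair.
    """
    verdicts = {
        unicodedata.normalize(form, x) == unicodedata.normalize(form, y)
        for form in generic
    }
    if len(verdicts) != 1:
        raise ValueError("forms in this generic disagree on the given pair")
    return verdicts.pop()
```

Under this reading NFx and NFKx always yield a single verdict, since
the composed and decomposed form in each pair are interchangeable for
comparisons.  NFxy does not: equivalent_under(NFx, "\ufb01", "fi") is
False and equivalent_under(NFKx, "\ufb01", "fi") is True, so the same
call with NFxy raises.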

