Normalization Generics (NFx, NFKx, NFxy)
Zach Lym
indolering at gmail.com
Fri Dec 11 22:14:08 CST 2020
I have been tracking down the rationale behind the normalization
choices in filesystems. One trouble spot for implementers is
interpreting strict logician terminology paired with imprecise pseudo
code. Take the definition of Unicode's caseless matching algorithm
[D145]:
> A string X is a canonical caseless match for a string Y if and only if:
> NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))
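For readers who prefer something executable, here is a minimal sketch of the D145 formula in Python. It assumes that `str.casefold()` is an acceptable stand-in for `toCasefold` and that the interpreter's bundled Unicode data is recent enough; the function name `canonical_caseless_match` is mine, not the standard's.

```python
import unicodedata


def canonical_caseless_match(x: str, y: str) -> bool:
    """Sketch of D145: NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))."""

    def key(s: str) -> str:
        # Inner NFD, then case folding (str.casefold() is assumed to
        # approximate toCasefold), then the outer NFD.
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)

    return key(x) == key(y)


# Precomposed vs. decomposed spellings that also differ in case.
print(canonical_caseless_match("\u00c5ngstr\u00f6m", "a\u030angstro\u0308m"))
```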
The W3C Canonical Case Fold Normalization algorithm claims to be
compatible with [D145], but uses NFC in the last step
[w3c-charmod-norm], leading to an apparent contradiction. Even though
Unicode explains that "case folding is closed under canonical
normalization", it took me a long time to find that passage and
convince myself that the W3C and Unicode matching algorithms are
equivalent. I am not alone: *Linux kernel hackers couldn't figure it
out either* [linux-norm]!
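A small experiment makes the equivalence easier to believe. The sketch below (the helper names `fold_key` and `forms_agree` are hypothetical, and `str.casefold()` again stands in for `toCasefold`) builds the comparison key with either NFD or NFC as the last step. The keys themselves differ, but the yes/no answer for every pair in a small hand-picked sample is the same. This is a spot check, not a proof.

```python
import unicodedata
from itertools import product


def fold_key(s: str, final_form: str) -> str:
    """NFD, case fold, then normalize to final_form ("NFD" or "NFC")."""
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize(final_form, folded)


def forms_agree(samples) -> bool:
    """True if NFD-final and NFC-final keys give the same match result
    for every ordered pair drawn from samples."""
    for a, b in product(samples, repeat=2):
        nfd_match = fold_key(a, "NFD") == fold_key(b, "NFD")
        nfc_match = fold_key(a, "NFC") == fold_key(b, "NFC")
        if nfd_match != nfc_match:
            return False
    return True


# Hand-picked sample: A-ring in several spellings, plus sharp-s variants.
samples = ["\u00c5", "A\u030a", "\u212b", "a\u030a", "\u00e5",
           "Stra\u00dfe", "STRASSE", "strasse", "strabe"]

# The two key strings are different for the same input ...
print(fold_key("\u00c5", "NFD") == fold_key("\u00c5", "NFC"))
# ... yet both variants agree on which pairs match.
print(forms_agree(samples))
```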
I was originally going to propose additions to D145's textual
description, cross-references to the implementation section, and
a discussion of W3C charmod-norm. However, I don't think this
would help as the text is already quite dense and most people will
just ignore everything outside the example anyway [minimalist-manual].
I would instead like to propose normalization form generics for use in
pseudo code definitions:
NFx = NFD|NFC
NFKx = NFKD|NFKC
NFxy = NFD|NFC|NFKD|NFKC
Freestanding `X`/`Y` variables should probably be replaced to
disambiguate them from the `NFx` nomenclature. `s1`/`s2` would work,
but `foo`/`bar` is less dense:
NFx(toCasefold(NFD(foo))) = NFx(toCasefold(NFD(bar)))
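To illustrate how the generic reads in practice, here is a rough Python sketch in which `NFx` becomes a parameter restricted to the canonical forms. The constants `NFX`, `NFKX`, `NFXY` and the function `caseless_match` are illustrative names only, and `str.casefold()` is assumed to approximate the case folding step.

```python
import unicodedata

# The proposed generics, written as tuples of concrete form names.
NFX = ("NFD", "NFC")
NFKX = ("NFKD", "NFKC")
NFXY = NFX + NFKX


def caseless_match(foo: str, bar: str, nfx: str = "NFD") -> bool:
    """NFx(fold(NFD(foo))) = NFx(fold(NFD(bar))) for any NFx in NFX."""
    if nfx not in NFX:
        raise ValueError(f"expected one of {NFX}, got {nfx!r}")

    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize(nfx, folded)

    return key(foo) == key(bar)


# Any member of NFx may be substituted in the outer step.
results = {form: caseless_match("\u00c5ngstr\u00f6m", "a\u030angstro\u0308m", form)
           for form in NFX}
print(results)  # {'NFD': True, 'NFC': True}
```

Passing a compatibility form such as "NFKC" raises `ValueError` here, which mirrors the point that `NFx` and `NFKx` are distinct generics.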
`NFx` does not currently appear within the Unicode standard itself,
but is used in the normalization annex [UAX15]. However,
**UAX15 defines `NFx` twice**, first as NFD|NFC|NFKD|NFKC and later on
as NFD|NFC. I think the proposed convention gets the most mileage out
of the nomenclature and is how I have seen `NFx` used in the real
world [linus].
Thank you!
-Zach Lym
[w3c-charmod-norm]: https://w3c.github.io/charmod-norm/#CanonicalFoldNormalizationStep
[linux-norm]: https://lwn.net/ml/linux-fsdevel/20190318202745.5200-10-krisman%40collabora.com
[minimalist-manual]: https://dl.acm.org/doi/10.1207/s15327051hci0302_2
[UAX15]: https://unicode.org/reports/tr15/
[linus]: https://lore.kernel.org/linux-fsdevel/CAHk-=wiFtZL5rK3T-HQPm0oG4vekDJEKS47P8BbzHSXt_6SHuA@mail.gmail.com/