Compatibility normalization (was: RE: Unicode encoding philosophy)

Thu Oct 12 06:32:18 CDT 2023

> On 12 Oct 2023, at 00:37, Doug Ewell via Unicode <unicode at corp.unicode.org> wrote:
> 
> Kent Karlsson wrote:
> 
>>> Letʼs consider an equation that youʼll probably recognize, font
>>> support willing: 𝐸 = 𝑚𝑐².  Thanks to the power of Unicode, we could
>>> use it in the same plain‐text document as, say, ℰ = 𝐦𝕔² while
>>> keeping both
>> 
>> That's not really a proper way of representing math expressions.
>> For one thing, compatibility normalisation would ruin them (true,
>> one is not supposed to apply that, which I agree with, but it
>> sometimes is anyway).
> 
> I see this claim from time to time, and not only from Kent: we must not use character (sequence) X, or must not use it in contrast with character (sequence) Y which is compatibility-equivalent to X, because some random, unknown process might surreptitiously apply NFKC or NFKD to the text, obliterating the distinction.
> 
> Can Kent, or anyone else, please identify a *specific* program or process that does this?
> 
> If there are no attested, real-world examples of processes actually applying NFKC or NFKD behind the user’s back (which would indeed be evil), I’m likely to write this off as an urban myth.

It would be absolutely wonderful if it could (now) be written off, perhaps not as an urban myth, but as old bugs. There have been even worse cases, such as removing “accents” from e.g. åäö (ICU even has support for such a mapping). Just today, I saw a brand new(!) message where a single apostrophe (not the ASCII one) had somehow been automatically replaced by three(!) question marks, and likewise for some bullet point character (I don’t know which one it was originally). So, while this is not NFKD/NFKC, that kind of “downgrading” change to text still happens.
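
To make the kinds of damage discussed here concrete, below is a minimal Python sketch using only the standard unicodedata module. The helper names (nfkc, strip_accents, bytewise_ascii_downgrade) are made up for illustration. The last helper shows just one plausible way an apostrophe could turn into three question marks (each UTF-8 byte replaced separately); that mechanism is an assumption, not something established about the message in question.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Apply compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def strip_accents(text: str) -> str:
    """Decompose, then drop combining marks (the 'accent removal' downgrade)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def bytewise_ascii_downgrade(text: str) -> str:
    """Hypothetical lossy step: replace every non-ASCII UTF-8 byte with '?'."""
    return "".join(chr(b) if b < 0x80 else "?" for b in text.encode("utf-8"))


# The two equations from the quoted message, written with escapes:
# italic E, italic m, italic c, superscript two
eq_italic = "\U0001D438 = \U0001D45A\U0001D450\u00B2"
# script E, bold m, double-struck c, superscript two
eq_mixed = "\u2130 = \U0001D426\U0001D554\u00B2"

print(nfkc(eq_italic))                     # E = mc2
print(nfkc(eq_mixed))                      # E = mc2
print(strip_accents("\u00E5\u00E4\u00F6"))  # aao
print(bytewise_ascii_downgrade("\u2019"))   # ???
```

After NFKC both equations become the same string, and the superscript is flattened to an ordinary digit, which is exactly the loss of distinction under discussion.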

Now, I do not have access to, let alone the ability to test, all the software in the world…

/Kent K

> --
> Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
> 