Compatibility normalization

Thu Oct 12 12:53:26 CDT 2023

On 12/10/2023 12:32, Kent Karlsson via Unicode wrote:
> It would be absolutely wonderful if it could (now) be written off, 
> perhaps not as urban myth, but as old bugs. There have been even worse 
> cases, removing ”accents” on e.g. åäö (ICU even has support for such a 
> mapping). 

I believe that's a "best-fit" mapping, such as those used by Microsoft 
Windows.[1]  The format of the files in that directory is a bit 
idiosyncratic and doesn't match any of the usual formats for 
legacy-encoding-to-Unicode mapping files (particularly evident for the 
CJK ones); I presume that Microsoft basically supplied the source files 
from which the Windows code pages themselves are built.  
ICU's UCM format has built-in support for one-way mappings in either 
direction (Unicode-to-legacy or legacy-to-Unicode); the ICU project has 
UCMs generated for all of the Windows code pages[2], including those not 
included in the |MAPPINGS/VENDORS| collection on unicode.org.

To be clear, best-fit conversion mappings have nothing to do with NFKD 
(or NFKC) normalisation /per se/, although NFKD normalisation in 
particular can certainly be used to aid generating them.  Note also that 
/any/ Unicode character not supported by the legacy encoding in question 
will either be best-fitted or substituted (with e.g. a question mark, 
katakana interpunct, geta mark, etc), irrespective of whether it has a 
compatibility decomposition.
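
To illustrate the distinction, here is a minimal Python sketch (not how 
ICU or Windows actually build their tables; the helper name nfkd_strip 
is made up for this example).  It contrasts plain substitution with an 
NFKD-assisted fallback, using ASCII as the stand-in legacy encoding:

```python
import unicodedata

def nfkd_strip(text: str) -> str:
    """Hypothetical helper: NFKD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# a-ring (U+00E5), fi ligature (U+FB01), right single quotation mark (U+2019)
sample = "\u00e5 \ufb01 \u2019"

# Plain substitution: every character ASCII cannot represent becomes "?".
substituted = sample.encode("ascii", errors="replace")

# NFKD-assisted fallback: helps for U+00E5 and U+FB01, but U+2019 has no
# decomposition at all, so it is still substituted.
assisted = nfkd_strip(sample).encode("ascii", errors="replace")

print(substituted)  # b'? ? ?'
print(assisted)     # b'a fi ?'
```

The last character shows the point made above: whether a character gets 
substituted depends on the target encoding, not on whether it has a 
compatibility decomposition.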

(As a side note: if one /must/ map 
some text with diacritics onto text in ISO Basic Latin letters (ASCII 
letters) for purposes beyond just fuzzy matching, it is usually better 
to use (with awareness of the language in use) an appropriate 
transcription scheme rather than just removing all diacritics; see 
German DIN 91379 for European languages[3], Vietnamese Telex[4], Gwoyeu 
Romatzyh for Mandarin tones[5], Revised Romanisation for Korean 
vowels[6], etc.)
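
As a rough sketch of the difference, assuming German text, the snippet 
below compares blanket diacritic removal with a small language-aware 
table.  The table is purely illustrative (a handful of common German 
conventions) and is not the normative DIN 91379 mapping; both function 
names are made up for this example.

```python
import unicodedata

# Illustrative table only: common German conventions, NOT the DIN 91379 table.
GERMAN_TABLE = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # a/o/u with diaeresis
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # capital forms
    "\u00df": "ss",                                  # sharp s
}

def strip_diacritics(text: str) -> str:
    """Blanket removal: decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def transcribe_german(text: str) -> str:
    """Language-aware mapping: look up each precomposed letter in the table."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(GERMAN_TABLE.get(ch, ch) for ch in composed)

name = "J\u00f6rg M\u00fcller"
print(strip_diacritics(name))   # Jorg Muller
print(transcribe_german(name))  # Joerg Mueller
```

A table like this is only appropriate when the language is known; the 
same letters call for different handling in, say, Swedish or Turkish.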

[1] https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/

[2] https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm

[3] 
https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)

[4] https://en.wikipedia.org/wiki/Telex_(input_method)

[5] https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh

[6] https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean

> Just today, I saw a brand new(!) message where single apostrophe (not 
> the ASCII one) somehow had been automatically replaced by three(!) 
> question marks, likewise for some bullet point character (don’t know 
> which one it was originally). So, while not NFKD/NFKC, that kind 
> of ”downgrading” changes to text still happen.

U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8, so a decoder which 
substitutes each undecodable byte separately will produce three question 
marks.  Again, this has nothing to do with normalisation, and would 
affect any non-ASCII character regardless of whether it has a 
compatibility decomposition.
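
The byte count is easy to check.  The sketch below is only a guess at 
the kind of failure involved (UTF-8 bytes being read as ASCII with 
per-byte substitution); the actual software that mangled the message is 
unknown.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK encoded as UTF-8.
data = "\u2019".encode("utf-8")
print(data)       # b'\xe2\x80\x99'
print(len(data))  # 3

# Decoding those bytes as ASCII replaces each non-ASCII byte separately,
# giving three U+FFFD replacement characters...
replaced = data.decode("ascii", errors="replace")

# ...which many systems then display or store as question marks.
question_marks = replaced.replace("\ufffd", "?")
print(question_marks)  # ???
```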

—Har.