Compatibility normalization
Harriet Riddle
harjitmoe at outlook.com
Thu Oct 12 12:53:26 CDT 2023
On 12/10/2023 12:32, Kent Karlsson via Unicode wrote:
> It would be absolutely wonderful if it could (now) be written off,
> perhaps not as urban myth, but as old bugs. There have been even worse
> cases, removing ”accents” on e.g. åäö (ICU even has support for such a
> mapping).
I believe that's a "best-fit" mapping, such as those used by Microsoft
Windows.[1] The format of the files in that directory is a bit
idiosyncratic and doesn't match any of the usual formats for
legacy-encoding-to-Unicode files (particularly evident for the CJK
ones); I'm inclined to presume that Microsoft basically supplied the
source files which the Windows code pages themselves are built from.
ICU's UCM format has built-in support for one-way mappings in either
direction (Unicode-to-legacy or legacy-to-Unicode); the ICU project has
UCMs generated for all of the Windows code pages[2], including those not
included in the |MAPPINGS/VENDORS| collection on unicode.org.
To be clear, best-fit conversion mappings have nothing to do with NFKD
(or NFKC) normalisation /per se/, although NFKD normalisation in
particular can certainly be used to help generate them. Note also that
/any/ Unicode character not supported by the legacy encoding in question
will either be best-fitted or substituted (with e.g. a question mark,
katakana interpunct, geta mark, etc.), irrespective of whether it has a
compatibility decomposition.
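
To illustrate the distinction, here is a minimal Python sketch of a naive
NFKD-assisted fallback. The function name naive_best_fit is hypothetical,
and this is not how Windows or ICU derive their tables; it only shows that
characters without any decomposition (such as U+2019 or ø) end up
substituted, while those with one can be best-fitted.

```python
import unicodedata


def naive_best_fit(text, encoding="ascii", substitute="?"):
    """Hypothetical helper: NFKD-assisted fallback, not a real best-fit table."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)  # already representable in the target encoding
            continue
        except UnicodeEncodeError:
            pass
        # Decompose, then drop combining marks (the "accent removal" step).
        decomposed = unicodedata.normalize("NFKD", ch)
        candidate = "".join(c for c in decomposed if not unicodedata.combining(c))
        try:
            candidate.encode(encoding)
            usable = bool(candidate)
        except UnicodeEncodeError:
            usable = False
        # No usable decomposition: fall back to the substitution character.
        out.append(candidate if usable else substitute)
    return "".join(out)


# U+FB01 (fi ligature), U+00E5 (å), U+2019 (apostrophe), U+00F8 (ø)
print(naive_best_fit("\ufb01 \u00e5 \u2019 \u00f8"))  # prints: fi a ? ?
```

A real best-fit table would typically map U+2019 to the ASCII apostrophe by
hand, which is exactly the kind of entry that normalisation alone cannot
supply.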
(As a side note: if one /must/ map some text with diacritics onto text in
ISO Basic Latin letters (ASCII letters) for purposes beyond just fuzzy
matching, it is usually better to use (with awareness of the language in
use) an appropriate transcription scheme rather than just removing all
diacritics; see German DIN 91379 for European languages[3], Vietnamese
Telex[4], Gwoyeu Romatzyh for Mandarin tones[5], Revised Romanisation for
Korean vowels[6], etc.)
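
A small Python sketch of the difference between stripping diacritics and
transcribing. The table below is an illustrative German-only assumption
(the familiar ae/oe/ue/ss convention), not the normative DIN 91379 mapping,
and the function names are hypothetical.

```python
import unicodedata

# Illustrative German-only table (an assumption, not the DIN 91379 mapping).
GERMAN_TRANSCRIPTION = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def transcribe_german(text):
    """Apply the illustrative table to precomposed (NFC) text."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(GERMAN_TRANSCRIPTION.get(ch, ch) for ch in composed)


print(strip_diacritics("Müller"))    # prints: Muller
print(transcribe_german("Müller"))   # prints: Mueller
print(transcribe_german("Straße"))   # prints: Strasse
```

Note that stripping leaves ß untouched, since it has no decomposition at
all, so the result would still not be ASCII.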
[1] https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/
[2] https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm
[3] https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)
[4] https://en.wikipedia.org/wiki/Telex_(input_method)
[5] https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh
[6] https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean
> Just today, I saw a brand new(!) message where single apostrophe (not
> the ASCII one) somehow had been automatically replaced by three(!)
> question marks, likewise for some bullet point character (don’t know
> which one it was originally). So, while not NFKD/NFKC, that kind
> of ”downgrading” changes to text still happen.
U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8, so three question marks
are what one would expect if each byte were substituted separately. Again,
this has nothing to do with normalisation, and it would impact any
non-ASCII character regardless of whether it has a compatibility
decomposition.
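
One plausible way to arrive at three question marks can be sketched in
Python. The failure mode shown (UTF-8 bytes decoded as ASCII, then
re-encoded with substitution) is an assumption for illustration; the
actual software involved is unknown.

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK
utf8_bytes = apostrophe.encode("utf-8")
print(utf8_bytes)  # prints: b'\xe2\x80\x99'

# Assumed failure mode: the bytes are decoded as ASCII, so each of the
# three non-ASCII bytes becomes its own replacement character (U+FFFD).
misdecoded = utf8_bytes.decode("ascii", errors="replace")

# Re-encoding to ASCII then substitutes "?" for each replacement character.
downgraded = misdecoded.encode("ascii", errors="replace").decode("ascii")
print(downgraded)  # prints: ???

# A bullet (U+2022) is also three bytes in UTF-8, so it fares the same way.
bullet_bytes = "\u2022".encode("utf-8")
print(len(bullet_bytes))  # prints: 3
```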
—Har.