<!DOCTYPE html><html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<div class="moz-cite-prefix">On 12/10/2023 12:32, Kent Karlsson via
Unicode wrote:<br>
</div>
<blockquote type="cite" cite="mid:7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se">
<font class="" color="#5856d6">It would be absolutely wonderful if
it could (now) be written off, perhaps not as urban myth, but as
old bugs. There have been even worse cases, removing ”accents”
on e.g. åäö (ICU even has support for such a mapping). </font></blockquote>
<p><br>
</p>
<p>I believe that's a "best-fit" mapping, such as those used by
Microsoft Windows.[1] The format of the files in that directory
is a bit idiosyncratic and doesn't match any of the usual formats of
legacy-encoding-to-Unicode files (particularly evident for the CJK
ones); I'm inclined to presume that Microsoft basically supplied
the source files which the Windows code pages themselves are built
from. ICU's UCM format has built-in support for one-way mappings
in either direction (Unicode-to-legacy or legacy-to-Unicode); the
ICU project has UCMs generated for all of the Windows code
pages[2], including those not included in the <code>MAPPINGS/VENDORS</code>
collection on unicode.org.</p>
<p>To be clear, best-fit conversion mappings have nothing to do with
NFKD (or NFKC) normalisation <i>per se</i>, although NFKD
normalisation in particular can certainly be used to aid
generating them. Note also that <em>any</em> Unicode character
not supported by the legacy encoding in question will either be
best-fitted or substituted (with e.g. a question mark, katakana
interpunct, geta mark, etc.), irrespective of whether it has a
compatibility decomposition.</p>
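<p>A minimal sketch of that distinction in Python, assuming the
standard <code>unicodedata</code> module and the built-in codecs
(which substitute but do not best-fit); the helper names
<code>strip_marks</code> and <code>to_legacy</code> are purely
illustrative:</p>
<pre>
```python
import unicodedata


def strip_marks(text: str) -> str:
    """Naive 'accent removal': NFKD-decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_legacy(text: str, codec: str = "ascii") -> bytes:
    """Encode with substitution; Python's codecs do no best-fitting."""
    return text.encode(codec, errors="replace")


# U+00E5 decomposes canonically, U+FB01 has a compatibility decomposition,
# U+2019 and U+20AC have no decomposition at all. The encoder substitutes
# all four alike, because none of them exists in the target encoding.
for ch in ("\u00e5", "\ufb01", "\u2019", "\u20ac"):
    print(f"U+{ord(ch):04X}", to_legacy(ch), repr(strip_marks(ch)))
```
</pre>
<p>Whether a character survives the conversion depends only on the
repertoire of the target encoding; the decomposition merely offers a
possible source of candidates when somebody builds a best-fit table.</p>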
<p>(As a side note: if one <em>must</em>
map some text with diacritics onto text in ISO Basic Latin letters
(ASCII letters) for purposes beyond just fuzzy matching, it is
usually better to use (with awareness of the language in use) an
appropriate transcription scheme rather than just removing all
diacritics; see German DIN 91379 for European languages[3],
Vietnamese Telex[4], Gwoyeu Romatzyh for Mandarin tones[5],
Revised Romanisation for Korean vowels[6], etc.)<br>
</p>
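<p>To illustrate the difference between stripping and transcribing,
here is a rough Python sketch. The <code>GERMAN</code> table is a
deliberately tiny, hypothetical stand-in for a real scheme (it is not
the DIN 91379 table), and the function names are my own:</p>
<pre>
```python
import unicodedata

# Hypothetical, deliberately tiny table of conventional German
# transcriptions; a real scheme such as DIN 91379 covers far more.
GERMAN = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
          "Ä": "Ae", "Ö": "Oe", "Ü": "Ue"}


def strip_diacritics(text: str) -> str:
    """Decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transcribe_german(text: str) -> str:
    """Apply the language-specific table first, then strip what is left."""
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(GERMAN.get(ch, ch) for ch in composed)
    return strip_diacritics(replaced)


name = "Müller-Straße"
print(strip_diacritics(name))
print(transcribe_german(name))
```
</pre>
<p>Note that plain stripping leaves ß untouched, since it has no
decomposition, and turns ü into u, which is not how German text is
conventionally written without umlauts.</p>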
<p><br>
</p>
<p>[1]
<a class="moz-txt-link-freetext" href="https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/">https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/</a>
</p>
<p>[2]
<a class="moz-txt-link-freetext" href="https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm">https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm</a></p>
<p>[3]
<a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)">https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)</a></p>
<p>[4] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Telex_(input_method)">https://en.wikipedia.org/wiki/Telex_(input_method)</a></p>
<p>[5] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh">https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh</a></p>
<p>[6] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean">https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean</a><br>
</p>
<p><br>
</p>
<blockquote type="cite" cite="mid:7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se"><font class="" color="#5856d6">Just today, I saw a brand new(!)
message where single apostrophe (not the ASCII one) somehow had
been automatically replaced by three(!) question marks, likewise
for some bullet point character (don’t know which one it was
originally). So, while not NFKD/NFKC, that kind of ”downgrading” changes to
text still happen.</font></blockquote>
<p><br>
</p>
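<p>U+2019 (right single quotation mark) is three bytes (0xE2 0x80
0x99) in UTF-8, which would account for three question marks if
something that doesn't understand UTF-8 substituted each byte
separately. Again, this has nothing to do with normalisation, and it
would affect any non-ASCII character regardless of whether it has a
compatibility decomposition.</p>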
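<p>One plausible way to end up with exactly three question marks can
be reproduced in Python. This is only a guess at the kind of fault
involved (UTF-8 bytes read as ASCII, then written out as ASCII again),
not a diagnosis of the software in question, and <code>mangle</code>
is a made-up name:</p>
<pre>
```python
def mangle(text: str) -> str:
    """Simulate an ASCII-only step in the middle of a UTF-8 pipeline."""
    utf8_bytes = text.encode("utf-8")
    # Assumed fault: the receiver decodes as ASCII, so every byte
    # above 0x7F is replaced individually with U+FFFD.
    misdecoded = utf8_bytes.decode("ascii", errors="replace")
    # The text is then downgraded to ASCII, turning each U+FFFD into "?".
    return misdecoded.encode("ascii", errors="replace").decode("ascii")


print("\u2019".encode("utf-8").hex(" "))  # the three UTF-8 bytes
print(mangle("don\u2019t"))               # apostrophe, three bytes
print(mangle("\u2022 item"))              # bullet, also three bytes
print(mangle("\u00e9"))                   # e-acute, two bytes
```
</pre>
<p>The number of question marks tracks the UTF-8 byte length of the
character, which is why an apostrophe and a bullet both come out as
three.</p>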
<p>—Har.<br>
</p>
</body>
</html>