<!DOCTYPE html><html><head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

  </head>

  <body>

    <div class="moz-cite-prefix">On 12/10/2023 12:32, Kent Karlsson via

      Unicode wrote:<br>

    </div>

    <blockquote type="cite" cite="mid:7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se">

      

      <font class="" color="#5856d6">It would be absolutely wonderful if

        it could (now) be written off, perhaps not as urban myth, but as

        old bugs. There have been even worse cases, removing ”accents”

        on e.g. åäö (ICU even has support for such a mapping). </font></blockquote>

    <p><br>

    </p>

    <p>I believe that's a "best-fit" mapping, such as those used by

      Microsoft Windows.[1]  The format of the files in that directory

      is a bit idiosyncratic and doesn't match any of the usual formats

      for legacy-encoding-to-Unicode mapping files (particularly evident for the CJK

      ones); I'm inclined to presume that Microsoft basically supplied

      the source files which the Windows code pages themselves are built

      from.  ICU's UCM format has built-in support for one-way mappings

      in either direction (Unicode-to-legacy or legacy-to-Unicode); the

      ICU project has UCMs generated for all of the Windows code

      pages[2], including those not included in the <code>MAPPINGS/VENDORS</code>

      collection on unicode.org.</p>

    <p>To be clear, best-fit conversion mappings have nothing to do with

      NFKD (or NFKC) normalisation <i>per se</i>, although NFKD

      normalisation in particular can certainly be used to aid

      generating them.  Note also that <em>any</em> Unicode character

      not supported by the legacy encoding in question will either be

      best-fitted or substituted (with e.g. a question mark, katakana

      interpunct, geta mark, etc), irrespective of whether it has a

      compatibility decomposition.</p>
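
    <p>A minimal Python sketch of that last point, assuming ISO-8859-1 as
      the legacy encoding and Python's built-in codec, which only
      substitutes and has no best-fit table. The helper names are made up
      for illustration.</p>

```python
import unicodedata


def has_compat_decomposition(ch):
    # NFKD differs from NFD only when a compatibility mapping applies.
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)


def to_legacy(text, encoding="iso-8859-1"):
    # errors="replace" substitutes a question mark for anything the
    # target encoding cannot represent; no best-fit mapping is attempted.
    return text.encode(encoding, errors="replace")


samples = {
    "fi ligature": "\ufb01",   # has a compatibility decomposition
    "euro sign": "\u20ac",     # has no decomposition at all
    "a with ring": "\u00e5",   # representable in ISO-8859-1
}

for name, ch in samples.items():
    print(name, has_compat_decomposition(ch), to_legacy(ch))
```

    <p>The ligature and the euro sign both come out as a question mark,
      although only the former has a compatibility decomposition.</p>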

    <p>(As a side note: if one <em>must</em>

      map some text with diacritics onto text in ISO Basic Latin letters

      (ASCII letters) for purposes beyond just fuzzy matching, it is

      usually better to use (with awareness of the language in use) an

      appropriate transcription scheme rather than just removing all

      diacritics; see German DIN 91379 for European languages[3],

      Vietnamese Telex[4], Gwoyeu Romatzyh for Mandarin tones[5],

      Revised Romanisation for Korean vowels[6], etc.)<br>

    </p>
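
    <p>To illustrate the difference, here is a rough Python sketch
      comparing blanket diacritic removal with a language-aware table. The
      table is a hypothetical, deliberately tiny subset of the conventional
      German transcription; it is not the normative DIN 91379 mapping, and
      the function names are invented for this example.</p>

```python
import unicodedata


def strip_diacritics(text):
    # Decompose, then drop every combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical, deliberately tiny table for German only.
GERMAN_TABLE = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
    "\u00df": "ss",
})


def transcribe_german(text):
    # Normalise to NFC first so precomposed letters match the table keys.
    return unicodedata.normalize("NFC", text).translate(GERMAN_TABLE)


for word in ("M\u00fcller", "Stra\u00dfe"):
    print(word, strip_diacritics(word), transcribe_german(word))
```

    <p>Stripping turns Müller into Muller and leaves the ß in Straße
      untouched, whereas the table yields Mueller and Strasse.</p>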

    <p><br>

    </p>

    <p>[1]

      <a class="moz-txt-link-freetext" href="https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/">https://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/</a>

    </p>

    <p>[2]

      <a class="moz-txt-link-freetext" href="https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm">https://github.com/unicode-org/icu-data/tree/main/charset/data/ucm</a></p>

    <p>[3]

<a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)">https://en.wikipedia.org/wiki/DIN_91379#Normative_mapping_of_Latin_letters_to_basic_letters_(search_form)</a></p>

    <p>[4] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Telex_(input_method)">https://en.wikipedia.org/wiki/Telex_(input_method)</a></p>

    <p>[5] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh">https://en.wikipedia.org/wiki/Gwoyeu_Romatzyh</a></p>

    <p>[6] <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean">https://en.wikipedia.org/wiki/Revised_Romanization_of_Korean</a><br>

    </p>

    <p><br>

    </p>

    <blockquote type="cite" cite="mid:7813AC68-0CF2-46ED-87AA-BDD4C00BD80F@bahnhof.se"><font class="" color="#5856d6">Just today, I saw a brand new(!)

        message where single apostrophe (not the ASCII one) somehow had

        been automatically replaced by three(!) question marks, likewise

        for some bullet point character (don’t know which one it was

        originally). So, while not NFKD/NFKC, that kind of ”downgrading” changes to

          text still happen.</font></blockquote>

    <p><br>

    </p>

    <p>U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8, so three question

      marks suggest that something substituted each byte separately. Again,

      this has nothing to do with normalisation, and it would affect any

      non-ASCII character regardless of whether it has a compatibility

      decomposition.</p>
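
    <p>The byte count is easy to check in Python. The filter below is a
      hypothetical stand-in for whatever mangled the message (the real
      culprit is unknown), and U+2022 is only a guess at the bullet
      character involved.</p>

```python
apostrophe = "\u2019"
utf8_bytes = apostrophe.encode("utf-8")


def naive_ascii_filter(data):
    # Hypothetical byte-wise "downgrade": every byte outside ASCII
    # becomes a question mark, one per byte.
    return "".join(chr(b) if b in range(128) else "?" for b in data)


print(utf8_bytes.hex(" "))
print(naive_ascii_filter(utf8_bytes))
print(naive_ascii_filter("\u2022 item".encode("utf-8")))
```

    <p>Any character that takes three bytes in UTF-8 would come out as
      three question marks under such a filter, whatever its decomposition
      properties.</p>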

    <p>—Har.<br>

    </p>

  </body>

</html>