<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class=""><br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 12 Oct 2023, at 19:53, Harriet Riddle via Unicode <<a href="mailto:unicode@corp.unicode.org" class="">unicode@corp.unicode.org</a>> wrote:</div><div class=""><div class=""></div></div></blockquote><br class=""><blockquote type="cite" class=""><div class=""><div class=""><div class="moz-cite-prefix">  Note also that <em class="">any</em> Unicode character

      not supported by the legacy encoding in question will either be

      best-fitted or substituted (with e.g. a question mark, katakana

      interpunct, geta mark, etc.), irrespective of whether it has a

      compatibility decomposition.</div></div></div></blockquote><div><br class=""></div>There is a special-purpose character intended for exactly that case: SUBSTITUTE. It is available in just about every encoding that is not too ancient in computing terms (even EBCDIC), except the craziest ones. For unclear reasons, Unicode has a duplicate of that character, REPLACEMENT CHARACTER (U+FFFD), with the disadvantage that this copy is available only in Unicode encodings. SUBSTITUTE should be displayed in a way that makes it clear that it is the SUBSTITUTE character: it should neither look like some ”ordinary” character nor go undisplayed (though ECMA-48 does not say so explicitly).</div><div><br class=""></div><div><i class="">8.3.148 SUB - SUBSTITUTE

Notation: (C0)

Representation: 01/10

SUB is used in the place of a character that has been found to be invalid or in error. SUB is intended to

be introduced by automatic means.</i> </div><div><br class=""></div><div>”Invalid” here would include ”not available in the target encoding”.</div><div><br class=""></div><div>”Best fit” is very much in the eye of the beholder. A fallback mapping that a programmer (or ”system”, if you want to be person-neutral) considers great may not seem so to the users (the readers of the resulting text).<br class=""><blockquote type="cite" class=""><div class=""><div class=""><p class="">(As a sidenote, however: it is also worth noting that, if one <em class="">must</em>

      map some text with diacritics onto text in ISO Basic Latin letters

      (ASCII letters) for purposes beyond just fuzzy matching,</p></div></div></blockquote>I would hesitate to say that such a mapping is appropriate even for fuzzy matching. In an ideal world, ”fuzzy matching” should cover common spelling variants and common spelling mistakes (in the language in question). That often does <b class="">not</b> include ”diacritics removal”, which may easily result in a semantic change (or at least a horrible misspelling).<br class=""><blockquote type="cite" class=""><div class=""><p class="">U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8—</p></div></blockquote>Yes… But it is not at all clear that this is why I saw that result (I do not know what maneuvers the author performed on that piece of text). I have seen a single apostrophe (likely introduced ”by automatic means”) replaced with a single question mark (still not great) plenty of times.</div><div><br class=""></div><div>/Kent K</div><div><br class=""></div><br class=""></body></html>