Compatibility normalization

Kent Karlsson kent.b.karlsson at bahnhof.se
Thu Oct 12 15:47:01 CDT 2023



> On 12 Oct 2023, at 19:53, Harriet Riddle via Unicode <unicode at corp.unicode.org> wrote:

>   Note also that any Unicode character not supported by the legacy encoding in question will either be best-fitted or substituted (with e.g. a question mark, katakana interpunct, geta mark, etc), irrespective of whether it has a compatibility decomposition.

There is a special-purpose character for exactly that case: SUBSTITUTE (U+001A). It is available in just about all encodings that are not too ancient, computer-wise (even EBCDIC), except the craziest ones. For unclear reasons, Unicode has a duplicate of that character, REPLACEMENT CHARACTER (U+FFFD), with the disadvantage that the copy is only available in Unicode encodings. SUBSTITUTE should be displayed in a way that makes it clear that it is the SUBSTITUTE character: it should neither look like some ”ordinary” character nor be left undisplayed (though ECMA-48 does not say so explicitly).

8.3.148 SUB - SUBSTITUTE
Notation: (C0)
Representation: 01/10
SUB is used in the place of a character that has been found to be invalid or in error. SUB is intended to be introduced by automatic means.

”Invalid” here would include ”not available in the target encoding”.
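
As a rough illustration of that reading, here is a Python sketch in which the encoder inserts SUB (U+001A, i.e. 01/10) for every character the target encoding lacks. The handler name "sub" and the helper encode_with_sub are made up for this example; only codecs.register_error is a standard-library hook, and this is one possible way to do it, not a description of what any particular converter does.

```python
import codecs

SUB = "\x1a"  # SUBSTITUTE, the C0 control at 01/10 (U+001A)

def _sub_handler(err):
    # Hypothetical error handler: emit one SUB per unencodable character.
    if isinstance(err, UnicodeEncodeError):
        return (SUB * (err.end - err.start), err.end)
    raise err

# "sub" is an assumed name chosen for this sketch, not a built-in handler.
codecs.register_error("sub", _sub_handler)

def encode_with_sub(text, encoding):
    # Encode text, substituting SUB for anything missing from the encoding.
    return text.encode(encoding, errors="sub")

# U+00EF exists in Latin-1; U+2019 and U+30FB do not.
print(encode_with_sub("na\u00efve \u2019\u30fb", "latin-1"))
```

Unlike a question mark, the SUB byte cannot be confused with a character that was actually present in the original text.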

”Best-fit” is very much in the eye of the beholder. Even if a programmer (or ”system”, if you want to be person-neutral) thinks a particular fallback mapping is great, the users (the readers of the resulting text) may well disagree.
> (As a sidenote, however: it is also worth noting that, if one must map some text with diacritics onto text in ISO Basic Latin letters (ASCII letters) for purposes beyond just fuzzy matching,
> 
I would hesitate to say that such a mapping is appropriate even for fuzzy matching. In an ideal world, ”fuzzy matching” should cover common spelling variants and common spelling mistakes (in the language in question). That often does not include ”diacritics removal”, which can easily result in a semantic change (or at least a horrible misspelling).
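
A small Python sketch of the kind of ”diacritics removal” meant here, assuming the common approach of decomposing to NFD and dropping combining marks. The function name strip_diacritics is made up for this example.

```python
import unicodedata

def strip_diacritics(text):
    # Decompose to NFD, then drop nonspacing combining marks (category Mn).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Swedish: "får" (sheep) becomes "far" (father).
print(strip_diacritics("f\u00e5r"))
# German: "schön" (beautiful) becomes "schon" (already).
print(strip_diacritics("sch\u00f6n"))
```

In both cases the output is a different, correctly spelled word, so a match on the stripped form is not a near-miss of the original but a hit on something else.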
> U+2019 is three bytes (0xE2 0x80 0x99) in UTF-8—
> 
Yes… But it is not at all clear that this is why I saw that result (I do not know what maneuvers the author performed on that piece of text). I have seen plenty of cases where a single apostrophe (likely introduced ”by automatic means”) was replaced with a single question mark (still not great).
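
To make the one-versus-three distinction concrete, here is a Python sketch that contrasts the two routes. It only illustrates the mechanics and assumes nothing about what actually happened to the text in question.

```python
apostrophe = "\u2019"  # RIGHT SINGLE QUOTATION MARK

# The UTF-8 form is the three bytes E2 80 99.
utf8 = apostrophe.encode("utf-8")
print(utf8)

# Encoder-side substitution: one character in, one question mark out.
as_ascii = apostrophe.encode("ascii", errors="replace")
print(as_ascii)

# Decoder-side confusion: the three bytes read as a single-byte
# encoding come out as three characters, not one.
as_cp1252 = utf8.decode("cp1252")
print(as_cp1252, len(as_cp1252))

# Reading the same bytes as ASCII with replacement also gives three marks.
as_replaced = utf8.decode("ascii", errors="replace")
print(len(as_replaced))
```

A single question mark is therefore what one would expect from substitution at encoding time, while a misread of the UTF-8 bytes would normally leave three marks.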

/Kent K

