Compatibility decomposables that are not compatibility characters

Ken Whistler kenwhistler at sonic.net
Thu Feb 17 19:32:57 CST 2022


In general, it is a good idea not to try to parse the discussion of 
compatibility characters too closely. That whole section of the core 
specification was written to help clarify the ambiguous, careless way 
that people were tending to wave around the term "compatibility 
character" in earlier days of the standard.

It is unfortunate that we ended up with the term "compatibility" used 
for a specific set of decomposition types baked into the data files and 
as a normative part of the Unicode Normalization Algorithm, but there we 
are. It just means that people need to be careful now when they invoke 
the *other* sense of "compatibility character" -- the shorthand usage 
for which is approximately "useless dreck we didn't really want to 
include in the standard but had to for one reason or another." That 
second use overlaps a lot with characters that formally have 
"compatibility decompositions", but the two sets are not the same -- 
hence the need for the explanation.

On 2/17/2022 5:52 AM, Giacomo Catenazzi via Unicode wrote:
> So it depends on how do you interpret U+00A0. As you write, you may 
> consider essential distinction in HTML, so it may not be a 
> compatibility character. On the other hand, a typesetter may interpret 
> U+00A0 as U+0020. Such person will decide to break or not the space 
> according the context (he know language rules and style, e.g. not to 
> break number with units, "Ms." with the name, etc.). So the context, 
> but not the character makes the distinction.

U+00A0 is a widely used, clearly necessary character. If it hadn't 
already been in significant character sets incorporated into the 
earliest drafts of the Unicode repertoire, the Unicode architects almost 
certainly would have invented it and added it in.

Now, from a certain point of view, characters added to Unicode 1.0 
because they were already encoded in ISO 8859-1 ("Latin-1") were added 
"for compatibility" with that earlier character set. That seems pretty 
obvious, because, for good reasons, U+00A0..U+00FF were all added to 
Unicode in the exact same order and code values as for Latin-1. You 
don't get much more compatible than that! But at the time, nobody was 
really arguing that those were <airquote>compatibility 
characters</airquote>. It was assumed that we had to have all the 
Latin-1 characters in the standard. That was considered a no-brainer at 
the time. None were "useless dreck". In fact, the big argument then was 
about the accented Latin letters in the range U+00C0..U+00FF, which 
ended up with *canonical* decompositions into their base letter + accent 
combinations. So those were canonical decomposables, and not 
compatibility decomposables, although quite arguably, they were encoded 
"for compatibility" with Latin-1.

See how slippery this gets?
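
For readers who want to check the formal side of this against the data, 
here is a minimal sketch using Python's standard unicodedata module. The 
helper name decomposition_kind is made up for illustration, and the 
results depend on the Unicode version bundled with your Python. It only 
reports the formal decomposition type from the data files; it says 
nothing about the informal "useless dreck" sense of the term.

```python
import unicodedata


def decomposition_kind(ch):
    """Classify a character by the Decomposition_Mapping in the UCD.

    Returns "none" if there is no decomposition, "canonical" if the
    mapping has no tag, or the compatibility tag itself (for example
    "<noBreak>" or "<compat>") otherwise.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        # Compatibility mappings start with a formatting tag.
        return mapping.split()[0]
    return "canonical"


# A few Latin-1 characters: NBSP, two accented letters, and AE.
for cp in (0x00A0, 0x00C0, 0x00E9, 0x00C6):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {decomposition_kind(ch)}")
```

The accented letters come out as canonical decomposables, U+00A0 as a 
compatibility decomposable with the <noBreak> tag, and U+00C6 with no 
decomposition at all, even though all four arrived together from Latin-1.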

By contrast, the archetypal examples at the time of "useless dreck" that 
were added as "compatibility characters" were the various ligatures in 
the Arabic Presentation Forms-A block and the Alphabetic Presentation 
Forms block. Those were all considered "compatibility characters" at the 
time, and were even quarantined in a range then known as the 
"Compatibility Area" in the code space.

>
> But your extra cases are more interesting.
> U+2000 is canonical equivalent to U+2002 (EN QUAD vs EN SPACE). These 
> not just have a compatibility decomposable character, but in my 
> opinion they are also just compatibility characters: there are exactly 
> the same character (there are included just because an error/wrong 
> interpretation of existing documents). The same for U+2001.
>
> I would consider U+2002 to U+200A without U+2007 also as compatibility 
> characters (and Unicode Database considers them as compatibility 
> decomposable characters). Probably Unicode do the same, because they 
> have the type "<compat>".
>
> It is just U+2007 (not just because like U+00A0 has a <NoBreak> 
> instead of <compat>) that make me think. For me, this is just a 
> decimal digit zero which it is not printed, so it has own merits: it 
> is not a separation, but a meaningful character. (context: tables). 
> Different people may have different opinions.

The fixed-width spaces in the 2000 block of punctuation have their own 
interesting history. The fact that they were added in Unicode 1.0 means 
that they were not part of the forced merger with the 10646 repertoire in 
1992 that led to the Arabic ligatures and the like. Instead, they 
derived largely from the pre-existing XCCS (Xerox) character set, but 
some of them appeared also in other early character sets. In Unicode 1.0 
they had no decompositions -- nothing did. The decompositions were first 
added in Unicode 1.1, and at that point they were all tagged as "<font 
variant> [0020]". That was the beginning of the realization that most of 
the fixed-width space characters didn't really belong in plain text for 
interchange, but instead were artifacts of printing technology.

The addition of the *canonical* decompositions for 2000 and 2001 was a 
Unicode 2.0 innovation, when it became clear that nobody could come up 
with a convincing distinction between an "EM QUAD" as a space character 
and an "EM SPACE" as a space character.
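
The current state of these decompositions can be inspected with a short 
sketch, again assuming Python's standard unicodedata module (the variable 
name space_mappings is only illustrative). It lists the mapping for each 
of U+2000..U+200A and shows the practical effect: the canonical mappings 
apply under every normalization form, while the tagged ones apply only 
under NFKC and NFKD.

```python
import unicodedata

# Decomposition mappings for the spaces U+2000..U+200A, as recorded in
# the Unicode Character Database bundled with this Python build.
space_mappings = {
    cp: unicodedata.decomposition(chr(cp)) for cp in range(0x2000, 0x200B)
}

for cp, mapping in space_mappings.items():
    print(f"U+{cp:04X} {unicodedata.name(chr(cp)):<20} {mapping}")

# Canonical (untagged) mapping: even NFC replaces EN QUAD with EN SPACE.
en_quad_nfc = unicodedata.normalize("NFC", "\u2000")

# Compatibility (tagged) mapping: NFC leaves EN SPACE alone, while NFKC
# folds it to an ordinary U+0020 SPACE.
en_space_nfc = unicodedata.normalize("NFC", "\u2002")
en_space_nfkc = unicodedata.normalize("NFKC", "\u2002")

print(f"NFC(U+2000)  = U+{ord(en_quad_nfc):04X}")
print(f"NFC(U+2002)  = U+{ord(en_space_nfc):04X}")
print(f"NFKC(U+2002) = U+{ord(en_space_nfkc):04X}")
```

Note that U+2007 carries the <noBreak> tag rather than <compat>, the same 
tag as U+00A0, which is the detail Giacomo points out above.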

Nowadays most people would agree that there would be little reason to 
put any of those other than 200B ZWSP and 2007 FIGURE SPACE into a plain 
text stream. The rest of the fixed-width space characters are basically 
"useless dreck", but the interesting distinction here is that they 
didn't start out being considered to be compatibility characters, but 
rather graduated to that status as people came to appreciate the fact 
that there weren't valid reasons to use them in modern Unicode text 
representation. They aren't bad enough to be formally deprecated, but 
they live in a kind of limbo of useless stuff you'd be better off 
without, along with scads of other such artifacts in the standard.

--Ken
