Compatibility decomposables that are not compatibility characters
kenwhistler at sonic.net
Thu Feb 17 19:32:57 CST 2022
In general, it is a good idea not to try to parse the discussion of
compatibility characters too closely. That whole section of the core
specification was written to help clarify the ambiguous, careless way
that people were tending to wave around the term "compatibility
character" in earlier days of the standard.
It is unfortunate that we ended up with the term "compatibility" used
for a specific set of decomposition types baked into the data files and
as a normative part of the Unicode Normalization Algorithm, but there we
are. It just means that people need to be careful now when they invoke
the *other* sense of "compatibility character" -- the shorthand usage
for which is approximately "useless dreck we didn't really want to
include in the standard but had to for one reason or another." That
second use overlaps a lot with characters that formally have
"compatibility decompositions", but the two sets are not the same --
hence the need for the explanation.
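
For anyone who wants to see the formal sense concretely, here is a
minimal sketch using Python's standard unicodedata module. The helper
name decomposition_kind is made up for illustration. It only reports
the decomposition type recorded in the data files, and says nothing
about whether a character is a "compatibility character" in the
informal sense.

```python
import unicodedata


def decomposition_kind(ch: str) -> str:
    """Classify a character by the decomposition mapping in the data files.

    Hypothetical helper for illustration; not part of any standard API.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a leading tag such as <compat> or
    # <noBreak>; canonical mappings have no tag.
    return "compatibility" if mapping.startswith("<") else "canonical"


for ch in ("\u00A0", "\u00C0", "\uFB01", "A"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), decomposition_kind(ch))
```

Note that U+00A0 comes out as "compatibility" here purely because of
its tagged mapping, which is exactly the formal sense and not the
"useless dreck" sense.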
On 2/17/2022 5:52 AM, Giacomo Catenazzi via Unicode wrote:
> So it depends on how do you interpret U+00A0. As you write, you may
> consider essential distinction in HTML, so it may not be a
> compatibility character. On the other hand, a typesetter may interpret
> U+00A0 as U+0020. Such person will decide to break or not the space
> according the context (he know language rules and style, e.g. not to
> break number with units, "Ms." with the name, etc.). So the context,
> but not the character makes the distinction.
U+00A0 is a widely used, clearly necessary character. If it hadn't
already been in significant character sets incorporated into the
earliest drafts of the Unicode repertoire, the Unicode architects almost
certainly would have invented it and added it in.
Now, from a certain point of view, characters added to Unicode 1.0
because they were already encoded in ISO 8859-1 ("Latin-1") were added
"for compatibility" with that earlier character set. That seems pretty
obvious, because, for good reasons, U+0000..U+00FF were all added to
Unicode in the exact same order and code values as for Latin-1. You
don't get much more compatible than that! But at the time, nobody was
really arguing that those were <airquote>compatibility
characters</airquote>. It was assumed that we had to have all the
Latin-1 characters in the standard. That was considered a no-brainer at
the time. None were "useless dreck". In fact, the big argument then was
about the accented Latin letters in the range U+00C0..U+00FF, which
ended up with *canonical* decompositions into their base letter + accent
combinations. So those were canonical decomposables, and not
compatibility decomposables, although quite arguably, they were encoded
"for compatibility" with Latin-1.
See how slippery this gets?
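
The difference shows up directly in normalization. The following is a
rough sketch, again assuming Python's unicodedata module, with the
codepoints helper invented for display purposes. NFD applies only
canonical decompositions, while NFKD applies the compatibility ones as
well.

```python
import unicodedata

e_acute = "\u00C9"  # LATIN CAPITAL LETTER E WITH ACUTE, from Latin-1
nbsp = "\u00A0"     # NO-BREAK SPACE, also from Latin-1


def codepoints(s: str) -> list[str]:
    """Hypothetical display helper: render a string as U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


# Canonical decomposition: the accented letter splits under NFD.
print(codepoints(unicodedata.normalize("NFD", e_acute)))

# NBSP has only a <noBreak> compatibility mapping, so NFD leaves it
# alone and only NFKD folds it to an ordinary space.
print(codepoints(unicodedata.normalize("NFD", nbsp)))
print(codepoints(unicodedata.normalize("NFKD", nbsp)))
```

Both characters came in "for compatibility" with Latin-1, yet they land
on opposite sides of the formal canonical/compatibility line.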
By contrast, the archetypal examples at the time of "useless dreck" that
were added as "compatibility characters" were the various ligatures in
the Arabic Presentation Forms-A block and the Alphabetic Presentation
Forms block. Those were all considered "compatibility characters" at the
time, and were even quarantined in a range then known as the
"Compatibility Area" in the code space.
> But your extra cases are more interesting.
> U+2000 is canonical equivalent to U+2002 (EN QUAD vs EN SPACE). These
> not just have a compatibility decomposable character, but in my
> opinion they are also just compatibility characters: there are exactly
> the same character (there are included just because an error/wrong
> interpretation of existing documents). The same for U+2001.
> I would consider U+2002 to U+200A without U+2007 also as compatibility
> characters (and Unicode Database considers them as compatibility
> decomposable characters). Probably Unicode do the same, because they
> have the type "<compat>".
> It is just U+2007 (not just because like U+00A0 has a <NoBreak>
> instead of <compat>) that make me think. For me, this is just a
> decimal digit zero which it is not printed, so it has own merits: it
> is not a separation, but a meaningful character. (context: tables).
> Different people may have different opinions.
The fixed-width spaces in the 2000 block of punctuation have their own
interesting history. The fact that they were added in Unicode 1.0 means
that they were not part of the forced merger with 10646 repertoire in
1992 that led to the Arabic ligatures and the like. Instead, they
derived largely from the pre-existing XCCS (Xerox) character set, but
some of them appeared also in other early character sets. In Unicode 1.0
they had no decompositions -- nothing did. The decompositions were first
added in Unicode 1.1, and at that point they were all tagged as
"<font variant>". That was the beginning of the realization that most of
the fixed-width space characters didn't really belong in plain text for
interchange, but instead were artifacts of printing technology.
The addition of the *canonical* decompositions for 2000 and 2001 was a
Unicode 2.0 innovation, when it became clear that nobody could come up
with a convincing distinction between an "EM QUAD" as a space character
and an "EM SPACE" as a space character.
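
The current state of the data for these spaces can be listed with a
short sketch like the one below. It assumes Python's unicodedata
module, and the function name space_decompositions is made up for
illustration. It reports the present-day mappings and tags, not the
tags used in Unicode 1.1.

```python
import unicodedata


def space_decompositions() -> dict[str, str]:
    """Hypothetical helper: decomposition mappings for U+2000..U+200B."""
    return {
        f"U+{cp:04X}": unicodedata.decomposition(chr(cp))
        for cp in range(0x2000, 0x200C)
    }


for code, mapping in space_decompositions().items():
    # An empty mapping means the character has no decomposition at all.
    print(code, mapping or "(none)")

# Because U+2000 has a canonical (untagged) mapping, even NFC replaces it.
en_quad_nfc = unicodedata.normalize("NFC", "\u2000")
print(f"U+{ord(en_quad_nfc):04X}")
```

The two quads show untagged (canonical) mappings, U+2007 shows
<noBreak>, and the other fixed-width spaces show <compat>.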
Nowadays most people would agree that there would be little reason to
put any of those other than 200B ZWSP and 2007 FIGURE SPACE into a plain
text stream. The rest of the fixed width space characters are basically
"useless dreck", but the interesting distinction here is that they
didn't start out being considered to be compatibility characters, but
rather graduated to that status as people came to appreciate the fact
that there weren't valid reasons to use them in modern Unicode text
representation. They aren't bad enough to be formally deprecated, but
they live in a kind of limbo of useless stuff you'd be better off
without, along with scads of other such artifacts in the standard.