Compatibility decomposables that are not compatibility characters
Giacomo Catenazzi
cate at cateee.net
Thu Feb 17 07:52:32 CST 2022
Hello Monica,
On 17.02.2022 13:18, Monica Merchant via Unicode wrote:
> However, I'm confused by the second example. In particular, I'm not sure
> if no-break space (*U+00A0*) and the fixed-width space characters
> (*U+2000-U+200A*) are compatibility characters or not. They are
> described as "serving essential functions", which I read as meaning that
> they would have been encoded even if it weren't for round-tripping, in
> which case they would not be considered as compatibility characters. Is
> this correct? If so, are they essential because they facilitate the
> typesetting of text-based markup like HTML (where formatting must be
> specified in plain text)? No-break space is also essential in that it is
> used to display standalone non-spacing marks (pg 267
> <https://www.unicode.org/versions/Unicode14.0.0/ch06.pdf>).
>
I read the section this way: the three examples before your
example 1 and example 2 describe compatibility characters
that are not compatibility decomposable characters. The standard then
describes two examples of characters that have a compatibility
decomposition but are not compatibility characters.
Note that on page 26 we have:
vvvv
There is no formal listing of all compatibility characters in the
Unicode Standard. This follows from the nature of the definition of
compatibility characters. It is a judgement call as to whether any
particular character would have been accepted for encoding if it had not
been required for interoperability with a particular standard. Different
participants in character encoding often disagree about the
appropriateness of encoding particular characters, and sometimes there
are multiple justifications for encoding a given character.
^^^^
So it depends on how you interpret U+00A0. As you write, you may
consider it an essential distinction in HTML, so it may not be a
compatibility character. On the other hand, a typesetter may treat
U+00A0 as U+0020. Such a person will decide whether or not to break at
the space according to the context (they know the language rules and
style, e.g. not to separate a number from its unit, or "Ms." from the
name). So it is the context, not the character, that makes the
distinction.
But your extra cases are more interesting.
U+2000 is canonically equivalent to U+2002 (EN QUAD vs EN SPACE). These
are not just compatibility decomposable characters; in my opinion
they are also plain compatibility characters: they are exactly the same
character (they were included only because of an error or a wrong
interpretation of existing documents). The same holds for U+2001, which
is canonically equivalent to U+2003 (EM QUAD vs EM SPACE).
I would also consider U+2002 to U+200A, except U+2007, to be
compatibility characters (and the Unicode Character Database treats them
as compatibility decomposable characters). Probably Unicode does the
same, because they have the decomposition type "<compat>".
It is just U+2007 that makes me think (and not only because, like
U+00A0, it has <noBreak> instead of <compat>). For me, this is just a
decimal digit zero that is not printed, so it has its own merits: it is
not a separator but a meaningful character (context: tables). Different
people may have different opinions.
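If you want to check the decomposition types yourself, here is a small
sketch using Python's standard unicodedata module. The helper name
classify() is my own invention for this example, and the result depends
on the Unicode version bundled with your Python. It only reports what
the decomposition field of the database says; it cannot decide whether a
character is a "compatibility character", which remains a judgement
call.

```python
import unicodedata

def classify(cp):
    """Return (kind, mapping) from the character's decomposition field.

    kind is "canonical" when the field carries no <tag>, the tag name
    (e.g. "compat", "noBreak") when it does, and "none" when the
    character has no decomposition at all.
    """
    raw = unicodedata.decomposition(chr(cp))
    if not raw:
        return ("none", "")
    if raw.startswith("<"):
        # Tagged mappings look like "<compat> 0020".
        tag, _, mapping = raw.partition("> ")
        return (tag[1:], mapping)
    return ("canonical", raw)

# U+00A0 plus the fixed-width spaces U+2000..U+200A from this thread.
for cp in [0x00A0, *range(0x2000, 0x200B)]:
    kind, mapping = classify(cp)
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X}  {name:<20}  {kind:<10}  {mapping}")
```

The two quads should come out as "canonical", U+00A0 and U+2007 as
"noBreak", and the remaining spaces as "compat".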
giacomo
>
>
> Thank you,
>
> Monica
>
>