Compatibility decomposables that are not compatibility characters

Giacomo Catenazzi cate at cateee.net
Thu Feb 17 07:52:32 CST 2022


Hello Monica,

On 17.02.2022 13:18, Monica Merchant via Unicode wrote:

> However, I'm confused by the second example. In particular, I'm not sure 
> if no-break space (*U+00A0*) and the fixed-width space characters 
> (*U+2000-U+200A*) are compatibility characters or not. They are 
> described as "serving essential functions", which I read as meaning that 
> they would have been encoded even if it weren't for round-tripping, in 
> which case they would not be considered as compatibility characters. Is 
> this correct? If so, are they essential because they facilitate the 
> typesetting of text-based markup like HTML (where formatting must be 
> specified in plain text)? No-break space is also essential in that it is 
> used to display standalone non-spacing marks (pg 267 
> <https://www.unicode.org/versions/Unicode14.0.0/ch06.pdf>).
> 

I read the section this way: the three examples before your example 1 
and example 2 describe compatibility characters that are not 
compatibility decomposable characters. The standard then describes two 
examples of the opposite case: characters that have a compatibility 
decomposition without being compatibility characters.

Note that on page 26 we have:

vvvv
There is no formal listing of all compatibility characters in the 
Unicode Standard. This follows from the nature of the definition of 
compatibility characters. It is a judgement call as to whether any 
particular character would have been accepted for encoding if it had not 
been required for interoperability with a particular standard. Different 
participants in character encoding often disagree about the 
appropriateness of encoding particular characters, and sometimes there 
are multiple justifications for encoding a given character.
^^^^

So it depends on how you interpret U+00A0. As you write, you may 
consider it an essential distinction in HTML, so it may not be a 
compatibility character. On the other hand, a typesetter may interpret 
U+00A0 as U+0020. Such a person will decide whether or not to break at 
the space according to the context (they know the language rules and 
style, e.g. not to separate a number from its unit, or "Ms." from the 
name). So it is the context, not the character, that makes the 
distinction.

But your extra cases are more interesting.
U+2000 is canonically equivalent to U+2002 (EN QUAD vs EN SPACE). It is 
not just a decomposable character; in my opinion it is also simply a 
compatibility character: the two are exactly the same character (both 
are included only because of an error or wrong interpretation of 
existing documents). The same goes for U+2001.
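
A quick way to check the equivalence, as a small sketch assuming 
Python's standard unicodedata module (the helper name normalize_all is 
mine, and the results depend on the Unicode version shipped with your 
Python build):

```python
import unicodedata

def normalize_all(ch):
    """Return the four normalization forms of ch as 'U+XXXX' strings."""
    return {
        form: " ".join(f"U+{ord(c):04X}" for c in unicodedata.normalize(form, ch))
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    }

# EN QUAD and EM QUAD: the canonical forms (NFC/NFD) already replace them
# with EN SPACE / EM SPACE; the compatibility forms go on to U+0020.
for ch in ("\u2000", "\u2001"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), normalize_all(ch))
```

Because the mapping is canonical, the quads do not survive even NFC or 
NFD, while the other fixed-width spaces only disappear under NFKC/NFKD.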

I would also consider U+2002 to U+200A, except U+2007, to be 
compatibility characters (and the Unicode Character Database treats 
them as compatibility decomposable characters). Unicode probably takes 
the same view, because their decompositions have the type "<compat>".

It is only U+2007 that makes me think (and not just because, like 
U+00A0, it has <noBreak> instead of <compat>). For me, it is simply a 
decimal digit zero that is not printed, so it has its own merits: it is 
not a separator, but a meaningful character (context: tables). 
Different people may have different opinions.
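
For reference, the decomposition tags mentioned above can be listed 
from the character database. This is only a sketch, again assuming 
Python's unicodedata module; decomposition_tag is a helper name I made 
up, and it simply splits the raw decomposition field:

```python
import unicodedata

def decomposition_tag(ch):
    """Return (kind, mapping) for ch, where kind is 'none', 'canonical',
    or the compatibility tag such as '<compat>' or '<noBreak>'."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none", ""
    parts = field.split()
    if parts[0].startswith("<"):
        # Tagged entry: a compatibility decomposition.
        return parts[0], " ".join(parts[1:])
    # Untagged entry: a canonical decomposition.
    return "canonical", field

# U+00A0 plus the fixed-width spaces U+2000..U+200A.
for cp in [0x00A0] + list(range(0x2000, 0x200B)):
    ch = chr(cp)
    kind, mapping = decomposition_tag(ch)
    print(f"U+{cp:04X} {unicodedata.name(ch):<20} {kind:<10} {mapping}")
```

The listing only shows what the database records (canonical, <compat> 
or <noBreak>); whether a character counts as a "compatibility 
character" remains the judgement call described in the quoted text.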

giacomo


> 
> 
> Thank you,
> 
> Monica
> 
> 

