a character for an unknown character

Marcel Schneider charupdate at orange.fr
Tue Dec 27 10:03:44 CST 2016

On 27/12/16 01:11, Richard Wordingham wrote:
> On Sun, 25 Dec 2016 19:31:28 +0200
> "Jukka K. Korpela"  wrote:
> > If some graphic symbol is by convention used to represent a lacuna,
> > then the issue, as regards to Unicode, is simply whether that symbol
> > exists as an encoded character or whether there is need to add that
> > graphic symbol to Unicode. But it would be a matter of encoding
> > graphic characters (irrespectively of their meaning in some content),
> > not about encoding abstract ideas like “an unrecognized character”.
> Unicode encodes pictograms, directives and abstract characters, not
> glyphs. There are few, if any characters, that have no semantics,
> though several characters can be ambiguous and context-sensitive as to
> what semantics they occur. If it was just a matter of appearance,
> then U+26C6 RAIN would be the character to use. It has the graphic
> used for characters in damaged inscriptions.

As far as my current understanding of Unicode goes, I believe that the
“encode abstract characters, not glyphs” principle has a counterpart
that makes Unicode characters polysemic by design, as follows from
TUS 3.3, D2. This compromise led to abandoning the initially considered
extensive disunification policy in favor of reasonable unifications that
offered an acceptable benefit-cost ratio, as Mark Davis explained on this list:


TUS 3.2, C4 and C5 (Conformance Requirements: Interpretation) seem to me
to specify that the meanings of a given character are free and may be
defined by any human convention, provided that they donʼt conflict with
the Unicode character properties of that character.

> Of course, there is one character that is already widely used in this
> rôle - U+003F QUESTION MARK. Some of its Unicode properties are not
> suitable, and its informal 'unknown character' semantic conflicts with
> its rôle as a punctuation mark.

Indeed, this use of QUESTION MARK is a plague that messes up almost
every Unicode string dropped into an ANSI-encoded document.
The only reason I can see for its use is that, among the ASCII characters,
it is the one that comes closest to the intended meaning.
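
To illustrate the point, here is a minimal Python sketch of how the
substitution typically happens. It assumes that “ANSI” means Windows-1252
and that the converter uses a “replace” error policy; the sample string is
made up for the illustration.

```python
# Sample text (made up): "naïve", GREEK SMALL LETTER ALPHA, RAIN,
# and a genuine QUESTION MARK.
text = "na\u00efve \u03b1 \u26c6 ?"

# Assumption: the "ANSI" code page is Windows-1252 and unencodable
# characters are replaced rather than rejected.
lossy = text.encode("cp1252", errors="replace")
restored = lossy.decode("cp1252")

print(lossy)     # b'na\xefve ? ? ?'
print(restored)  # naïve ? ? ?
```

After the round trip, the two substituted question marks cannot be told
apart from the one that was really in the text.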

RAIN seems to me the best fit for the usage under discussion, and I canʼt
see any problem in using it with these semantics. If Iʼm wrong, how about this:

> If I understand correctly, these issues are already addressed by the
> Leiden Conventions. Why do they not suffice?

I believe that they work well in historical texts that donʼt themselves use
the specified metalanguage characters. The Leiden Conventions could be
established because brackets and parentheses arenʼt found in old sources.
Perhaps modern sources that do use these characters are never damaged in a
way that calls for this kind of restoration.

On the other hand, editors might wish to avoid mixing ASCII characters into
original scripts, so the RAIN pictograph may be neutral enough.
If so, the Leiden Conventions could possibly be extended to include it.
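
For anyone who wants to compare the two candidates, here is a small Python
sketch that looks up the name and General Category of each. It relies only
on the standard unicodedata module; the helper name describe is made up for
the illustration, and it does not cover every property that might matter
(line breaking, for instance, is not exposed by that module).

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

rain = describe("\u26c6")
question = describe("?")

print(rain)      # ('U+26C6', 'RAIN', 'So')
print(question)  # ('U+003F', 'QUESTION MARK', 'Po')
```

RAIN is classed as a symbol (So) while QUESTION MARK is punctuation (Po),
which is at least consistent with RAIN being the more neutral choice.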

