a character for an unknown character

Richard Wordingham richard.wordingham at ntlworld.com
Mon Dec 26 18:05:22 CST 2016

On Sun, 25 Dec 2016 19:31:28 +0200
"Jukka K. Korpela" <jkorpela at cs.tut.fi> wrote:

> When it is not certain what character there is in some text to be 
> encoded, there is a wide range of possible situations. For example,
> it might be a thing like “there is letter U or letter V, probably the 
> latter” or “there is some Latin letter but no hint of what it might
> be” or even “there is an alphanumerical character” (though I find it 
> difficult to imagine such a situation). Such things can hardly be 
> described using new characters; rather, they need to be expressed
> using verbal descriptions (which are about the encoded text, not part
> of it) or some formal notations or both.

This does not appear to be the situation we are being asked about.  I
suspect the context is rather that of a document damaged by fire or

> If some graphic symbol is by convention used to represent a lacuna,
> then the issue, as regards to Unicode, is simply whether that symbol
> exists as an encoded character or whether there is need to add that
> graphic symbol to Unicode. But it would be a matter of encoding
> graphic characters (irrespectively of their meaning in some content),
> not about encoding abstract ideas like “an unrecognized character”.

Unicode encodes pictograms, directives and abstract characters, not
glyphs.  There are few characters, if any, that have no semantics,
though several characters are ambiguous, with semantics that depend
on context.  If it were just a matter of appearance,
then U+26C6 RAIN would be the character to use.  It has the graphic
used for characters in damaged inscriptions.

Of course, there is one character that is already widely used in this
rôle - U+003F QUESTION MARK.  Some of its Unicode properties are not
suitable, and its informal 'unknown character' semantic conflicts with
its rôle as a punctuation mark.
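
For anyone who wants to check the properties in question, here is a
minimal sketch using Python's standard unicodedata module.  It only
looks at the character name and General Category of the two characters
mentioned above; the helper name describe() is made up for this
illustration, and other properties (line breaking, for instance) are
not covered.

```python
import unicodedata

def describe(ch):
    """Return (code point, character name, General Category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# U+003F is classed as punctuation (Po), which is part of why it sits
# awkwardly as an 'unknown character' marker.
question = describe("\u003F")

# U+26C6 is classed as a symbol (So).
rain = describe("\u26C6")

print(question)
print(rain)
```

Text-processing code that treats Po characters as punctuation would
therefore handle a '?' standing for a lost letter quite differently
from the letters around it.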

If I understand correctly, these issues are already addressed by the
Leiden Conventions.  Why do they not suffice?

