a character for an unknown character

Jukka K. Korpela jkorpela at cs.tut.fi
Sun Dec 25 11:31:28 CST 2016


21.12.2016, 4:29, Martin Mueller wrote:

> Is there a Unicode character that says “I represent an alphanumerical
> character, but I don’t know which”.

I think including such a “character” in Unicode would not fit the
idea of Unicode as a system for encoding plain text characters. You
seem to be asking for a symbol that is not a graphic or control
character but information about uncertainty regarding a character in a
data stream. So I think this does not fall into the category of plain
text, and the information should be expressed at a higher protocol
level, e.g. in markup or as out-of-band information.

When it is not certain what character there is in some text to be 
encoded, there is a wide range of possible situations. For example, it 
might be a thing like “there is letter U or letter V, probably the 
latter” or “there is some Latin letter but no hint of what it might be” 
or even “there is an alphanumerical character” (though I find it 
difficult to imagine such a situation). Such things can hardly be 
described using new characters; rather, they need to be expressed using 
verbal descriptions (which are about the encoded text, not part of it) 
or some formal notations or both.
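
As a rough illustration of keeping such descriptions outside the encoded
text, here is a minimal Python sketch. The names Uncertain and render,
the field names, and the sample words are all hypothetical, made up for
this example; they are not part of any existing markup scheme or of the
TCP conventions.

```python
from dataclasses import dataclass


@dataclass
class Uncertain:
    """One unreadable character position, described outside the text itself."""
    candidates: tuple = ()  # possible readings, e.g. ("U", "V"); empty if unknown
    preferred: str = ""     # the likelier reading, if there is one
    note: str = ""          # verbal description, which is about the text, not part of it


def render(items, placeholder="?"):
    """Flatten a transcription to plain text, for display only."""
    out = []
    for item in items:
        if isinstance(item, Uncertain):
            # Show the preferred reading when there is one, else a display placeholder.
            out.append(item.preferred or placeholder)
        else:
            out.append(item)
    return "".join(out)


# A transcription is a sequence of known text runs and Uncertain records.
transcription = [
    "SER",
    Uncertain(candidates=("U", "V"), preferred="V",
              note="letter U or letter V, probably the latter"),
    "ANT ",
    Uncertain(note="some Latin letter but no hint of what it might be"),
    "OY",
]

print(render(transcription))  # prints "SERVANT ?OY"
```

The placeholder here exists only in the rendering step; the stored data
carries the uncertainty as structured information rather than as a
character.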

> This is a very common problem in
> the transcription of historical texts where you have lacunas. Often, the
> extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by
> <gap/> elements with attributes about this or that. This is efficient
> when it comes to pages, very inefficient when it comes to individual
> characters.

Efficient in what sense? Saving bytes can hardly be an issue here. And 
if various attributes are needed to describe the case, then it would 
become awkward to try to do the same with encoded characters (or 
“characters”, Unicode code points).

> In the TCP project, various code points from the Geometrical were used
> to represent lacunae. The black circle (\u25cf) has been used as the
> character for a missing character. This is OK and unambiguous in its
> context.

If some graphic symbol is by convention used to represent a lacuna, then
the issue, as regards Unicode, is simply whether that symbol exists
as an encoded character or whether there is need to add that graphic
symbol to Unicode. But it would be a matter of encoding graphic
characters (irrespective of their meaning in some content), not of
encoding abstract ideas like “an unrecognized character”.
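
To see what Unicode itself records about the black circle mentioned in
the quoted text, one can query the character database, for example with
Python's standard unicodedata module. This is only a sketch; the helper
name describe is made up for the example.

```python
import unicodedata

BLACK_CIRCLE = "\u25cf"


def describe(ch):
    """Return the code point, Unicode name, and general category of ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


print(describe(BLACK_CIRCLE))  # prints ('U+25CF', 'BLACK CIRCLE', 'So')
```

The name and the category "So" (Symbol, other) identify a graphic
symbol; nothing in these properties says "missing character". That
meaning exists only in the convention of the project that uses it.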

> But would be nice to have a special character for just that
> purpose

Various symbols are used in different contexts to indicate situations 
like “there is a written symbol that cannot be recognized as a specific 
character”. Perhaps there should be a universal convention about this, 
but it is unrealistic to expect that to happen. The Unicode Standard can 
hardly standardize such things. And if there were such a universal 
symbol, it would surely have been encoded in Unicode—not because of its 
meaning, but because of its consistent use as a character in plain text.

So I think the conclusion is that you should use established 
conventions, if they exist, about using some symbol for such situations, 
or define a convention as needed. You should not expect the character to 
be recognized in this special meaning without such a higher-level 
convention.

There’s a theoretical (?) problem with this. Let us assume that you 
decide to use a particular character to represent “unknown character” in 
your data, when working with some type of written texts. What happens 
when you encounter, in the study of those texts, a graphic symbol that is
best identified as the character you decided to use in that special 
meaning? Well, I think you can decide to solve that problem if it ever 
appears.
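
One simple way to notice the problem when it does appear is to scan text
that is known to contain no marked lacunae for literal occurrences of
the chosen symbol. The following Python sketch assumes U+25CF as the
placeholder, as in the TCP example; the function name find_collisions is
hypothetical.

```python
def find_collisions(source_text, placeholder="\u25cf"):
    """Return the offsets where the placeholder occurs literally in source_text.

    source_text is assumed to contain no marked lacunae, so every hit is a
    real occurrence of the symbol that would clash with the convention.
    """
    return [i for i, ch in enumerate(source_text) if ch == placeholder]


print(find_collisions("no such symbol here"))    # prints []
print(find_collisions("item \u25cf one \u25cf"))  # prints [5, 11]
```

An empty result does not prove that the clash can never happen; it only
shows that it has not happened in the text examined so far.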

Yucca



