a character for an unknown character

Janusz S. Bień jsbien at mimuw.edu.pl
Tue Dec 27 09:21:53 CST 2016

On Sun, Dec 25 2016 at 18:31 CET, jkorpela at cs.tut.fi writes:
> 21.12.2016, 4:29, Martin Mueller wrote:
>> Is there a Unicode character that says “I represent an alphanumerical
>> character, but I don’t know which”.
> I think including such a “character” in Unicode would not fit into the
> idea of Unicode as a system for encoding plain text
> characters. You seem to be asking for a symbol that is not a graphic
> or control character but information about uncertainty regarding a
> character in a data stream. So I think this does not fall into the
> category of plain text, and the information should be expressed at a
> higher protocol level, e.g. in markup or as out-of-band information.

The situation you describe is not the situation we are talking about, at
least as far as I am concerned.

A historical corpus of course uses markup; in our case it is the
TEI-based XML Corpus Encoding Standard. We are dealing not with data
streams, but with plain Unicode texts in XML. Words/tokens are just
Unicode strings indexed by the search engine. An unreadable letter in a
word/token has to be encoded somehow without breaking the segmentation
and searching. The best way seems to be to use a special character to
represent it.
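
To illustrate the segmentation concern, here is a minimal Python sketch.
It assumes a naive tokenizer that treats maximal runs of "word"
characters as tokens; the function names and the sample text are
hypothetical, and this is not the search engine used for the corpus. The
placeholder is the black circle (U+25CF) mentioned elsewhere in this
thread.

```python
import re

GAP = "\u25cf"  # BLACK CIRCLE, used here as the "unreadable letter" placeholder


def tokenize(text):
    """Naive segmentation: every maximal run of \\w characters is a token."""
    return re.findall(r"\w+", text)


def tokenize_with_gap(text, gap=GAP):
    """Same segmentation, but the placeholder is allowed inside a token."""
    return re.findall(r"[\w" + re.escape(gap) + r"]+", text)


# Hypothetical sample: one unreadable letter inside the second word.
sample = "the k" + GAP + "ng of England"

print(tokenize(sample))           # the damaged word is split in two
print(tokenize_with_gap(sample))  # the damaged word stays one token
```

With the naive tokenizer the placeholder acts as a separator, because
U+25CF is not a word character; the second variant has to name the
placeholder explicitly to keep the token intact.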

> When it is not certain what character there is in some text to be
> encoded, there is a wide range of possible situations. For example, it
> might be a thing like “there is letter U or letter V, probably the
> latter” or “there is some Latin letter but no hint of what it might
> be” or even “there is an alphanumerical character” (though I find it
> difficult to imagine such a situation). Such things can hardly be
> described using new characters;

You are of course right, but has anybody proposed such an idea?

> rather, they need to be expressed
> using verbal descriptions (which are about the encoded text, not part
> of it) or some formal notations or both.

Again you are right, but it does not seem relevant to the problem.

>> This is a very common problem in
>> the transcription of historical texts where you have lacunas. Often, the
>> extent of the lacuna is known, and the alphabet is known as well. The
>> EEBO TCP transcriptions of English texts before 1700 are good examples.
>> They are SGML transcriptions, where missing stuff is represented by
>> <gap/> elements with attributes about this or that. This is efficient
>> when it comes to pages, very inefficient when it comes to individual
>> characters.
> Efficient in what sense?

It is not clear to me either.

> Saving bytes can hardly be an issue here. And
> if various attributes are needed to describe the case, then it would
> become awkward to try to do the same with encoded characters (or
> “characters”, Unicode code points).
>> In the TCP project, various code points from the Geometric Shapes
>> block were used to represent lacunae. The black circle (\u25cf) has
>> been used as the character for a missing character. This is OK and
>> unambiguous in its context.
> If some graphic symbol is by convention used to represent a lacuna,
> then the issue, as regards Unicode, is simply whether that symbol
> exists as an encoded character or whether there is need to add that
> graphic symbol to Unicode. But it would be a matter of encoding
> graphic characters (irrespective of their meaning in some content),
> not about encoding abstract ideas like “an unrecognized character”.

I don't think it is so simple. Besides the meaning a character has in
some content, it also has a Unicode-specific meaning in the form of its
properties, e.g. being a letter.
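
A small Python sketch of what such properties look like in practice,
using the standard `unicodedata` module. The helper name `describe` is
hypothetical; the point is only that the black circle mentioned in this
thread carries the General Category of a symbol, not of a letter.

```python
import unicodedata


def describe(ch):
    """Return code point, character name, General Category and isalpha()."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        ch.isalpha(),
    )


# An ordinary letter versus the black circle used as a placeholder.
for ch in ("a", "\u25cf"):
    print(describe(ch))
```

Software that relies on these properties (for example to decide what
counts as part of a word) will therefore treat a placeholder of category
So differently from a letter of category Ll.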

>> But it would be nice to have a special character for just that
>> purpose
> Various symbols are used in different contexts to indicate situations
> like “there is a written symbol that cannot be recognized as a
> specific character”.

Can you provide some examples of these various symbols?

> Perhaps there should be a universal convention about this, but it is
> unrealistic to expect that to happen.  The Unicode Standard can hardly
> standardize such things. And if there were such a universal symbol, it
> would surely have been encoded in Unicode—not because of its meaning,
> but because of its consistent use as a character in plain text.

You are right that there is no such universal symbol at the moment, but
your other claims are, IMHO, controversial.

> So I think the conclusion is that you should use established
> conventions, if they exist, about using some symbol for such
> situations, or define a convention as needed. You should not expect
> the character to be recognized in this special meaning without such a
> higher-level convention.

We have already defined a convention and can live with it :-) But why
not improve it?

> There’s a theoretical (?) problem with this. Let us assume that you
> decide to use a particular character to represent “unknown character”
> in your data, when working with some type of written texts. What
> happens when you encounter, in the study of those texts, a graphic
> symbol that is best identified as the character you decided to use in
> that special meaning? Well, I think you can decide to solve that
> problem if it ever appears.

What about the distinction between character and glyph? :-)

Best regards


Prof. dr hab. Janusz S. Bien -  Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
