a character for an unknown character

Asmus Freytag asmusf at ix.netcom.com
Tue Dec 27 23:33:32 CST 2016


On 12/27/2016 8:03 AM, Marcel Schneider wrote:
> On 27/12/16 01:11, Richard Wordingham wrote:
>> On Sun, 25 Dec 2016 19:31:28 +0200
>> "Jukka K. Korpela"  wrote:
> […]
>>> If some graphic symbol is by convention used to represent a lacuna,
>>> then the issue, as regards to Unicode, is simply whether that symbol
>>> exists as an encoded character or whether there is need to add that
>>> graphic symbol to Unicode. But it would be a matter of encoding
>>> graphic characters (irrespectively of their meaning in some content),
>>> not about encoding abstract ideas like “an unrecognized character”.
>> Unicode encodes pictograms, directives and abstract characters, not
>> glyphs. There are few, if any characters, that have no semantics,
>> though several characters can be ambiguous and context-sensitive as to
>> what semantics they occur. If it was just a matter of appearance,
>> then U+26C6 RAIN would be the character to use. It has the graphic
>> used for characters in damaged inscriptions.
> As far as my todayʼs understanding of Unicode goes, I believe that the
> “not encode glyphs but abstract characters” principle has a counterpart
> that makes Unicode characters polysemic by design, as results from
> TUS 3.3, D2. This compromise led to abandon the initially considered
> extensive disunification policy in favor of reasonable unifications that
> provided a correct benefit-cost ratio, Mark Davis explained on this List:
>
> http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0145.html
>
> TUS 3.2, C4 and C5 (Conformance Requirements: Interpretation) seems to me
> to be specifying that the meanings of a given character are free and may be
> defined by any human convention, provided that they donʼt conflict with
> the Unicode character properties of that character.

(Most) character properties can be adjusted, so the statement above
would need to be drawn much more narrowly.

The generic issue that Unicode runs into is that there are things like
"letters" that have well-defined identities (the letter A), but, perhaps
because of that, have a very wide-ranging set of real images; some of
the fanciful ones may bear scant relation to the archetypal shape.
However, because they are members of bounded and extremely well-known
sets (alphabets), users are tolerant of artistic license. In addition,
they are generally used in longer contexts (words) where their identity
is reaffirmed, independent of their shape, by occurring in the expected
juxtapositions (and mostly not occurring in other, unexpected ones).

However, the conventions for where and when to use one of these letters
are not fixed, and neither are their phonetic equivalents.

Contrast that with many marks. The really common ones, like the period,
are well-known enough that fonts can substitute small squares or other
shapes without impeding their use in normal text. However, outside
standard sentence punctuation, they can be re-used for many other
purposes. Some such uses, like the Swedish use of ":" in the middle of
an abbreviation, may be unusual enough to not readily be catered to by
all text-processing software (e.g. in word-segmentation).
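To make the word-segmentation point concrete, here is a minimal Python
sketch. The sample phrase and both function names (naive_words,
colon_aware_words) are assumptions made up for this illustration and do
not come from any real text-processing library; the second function is
just one possible tailoring, not a recommended algorithm.

```python
import re

def naive_words(text):
    # Every run of word characters is a word, so ":" always splits.
    return re.findall(r"\w+", text)

def colon_aware_words(text):
    # Hypothetical tailoring: keep ":" when it sits directly between
    # two runs of word characters, as in the Swedish "S:t" (Sankt).
    return re.findall(r"\w+(?::\w+)*", text)

sample = "S:t Eriks kyrka"  # illustrative phrase only

print(naive_words(sample))        # ['S', 't', 'Eriks', 'kyrka']
print(colon_aware_words(sample))  # ['S:t', 'Eriks', 'kyrka']
```

The tailored pattern also keeps things like "10:30" together, which may
or may not be wanted; a colon followed by a space, as in ordinary
sentence punctuation, still ends the word.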

Nevertheless, the same thing applies as with letters: where and when to
use one of these marks is not fixed as part of their encoding, and
neither are their functions.

Many other "simple" marks (lines, circles, triangles, hooks, and
squares, or groups of them) are likewise subject to frequent reuse. Some
of them may have been incorrectly encoded more than once. As with the
standard punctuation marks, both their precise shapes and their precise
functions are subject to stylistic or other conventions.

When it comes to marks (or symbols) of less generic or more complex
shapes, the presumption that the mark only has "one" shape may be more
common, and examples of the mark being repurposed may be less common.
Not being as common, fewer readers will recognize all stylistic
variations as being "the same thing". A variant form will be more likely
to be understood as a related, but not identical symbol. That in turn
fuels the misperception that Unicode somehow encodes symbols based on a
single conventional usage.

A./



