What constitutes an abstract character?

Corentin corentin.jabot at gmail.com
Mon Jun 15 11:34:25 CDT 2020

On Mon, 15 Jun 2020 at 08:44, Asmus Freytag via Unicode <unicode at unicode.org> wrote:

> On 6/14/2020 5:47 PM, Fuqiao Xue via Unicode wrote:
> Hi Corentin,
> The term "abstract character" is ambiguous and can have multiple
> definitions. Depending on what you need, it can refer to visual (i.e.,
> grapheme), logical (i.e., code point), or byte-level (i.e., code unit)
> representation of a given piece of text.
> An abstract character is related to a code point by the character
> encoding. See definitions D7 and D10-D12 in Section 3.4 Characters and
> Encoding. (http://www.unicode.org/versions/latest/ch03.pdf#G2212)
> It is never a "code unit" or a "byte-level" thing. It is also not the code
> point.
> It is the thing that is being assigned a code point. (D11: "Encoded
> character: An association (or mapping) between an abstract character and
> a code point." -- the definition should really have an added "or code
> point sequence". Unicode finesses that by saying that sequences never
> encode an abstract character directly, but they can be used to "represent
> it", see comment on D7. That formally makes encoding a 1:1 process, but
> muddies the waters a bit on what we should consider an 'abstract
> character'. For example, it means that all "building blocks" of any
> sequences must be seen as abstract characters themselves.)
> Now the abstract character A-diaeresis (Ä) is encoded by a single code
> point and also has a canonically equivalent representation by a combining
> sequence. In effect, the whole sequence "encodes" a single abstract
> character, but that is formally not how Unicode defines it.
> A diaeresis is a recognizable item of the writing system; if used as an
> umlaut, it tends to act as a decoration of a character that is more-or-less
> seen as a new entity (particularly in Swedish) and less a modified letter
> A. If used as a diaeresis, it acts more like a punctuation mark that has a
> function of its own (forcing separate pronunciation). Even though it's
> graphically applied to a vowel, it can be understood as its own abstract
> character.
> Treating the diaeresis as its own independent abstract character makes
> logical and not just formal sense. That may not be the case equally for all
> types of diacritical marks. However, since they can all be named, and thus
> arguably exist as their own concepts at least on a descriptive level, it
> becomes effectively a non-problem.
> The way combining marks are treated in other scripts, they can all be on
> different points of the scale as logically independent entities, and some
> are even on different points of the scale in terms of graphically combining
> (they may be graphically indistinguishable from regular spacing letters).
> To recap, an "abstract" character is a conceptual character, something
> that forms the atom of a writing system (smallest divisible particle) as
> viewed from the process of encoding, which associates with it a single code
> point. "Abstract" characters may exist that are not encoded; and some of
> them can be analyzed as series of smaller abstract characters, and thus be
> represented as code point sequences.
> Some abstract characters are more like small molecules; they can be
> encoded as such, or they can also have a more atomic sequence that
> represents them. The rationale for allowing this dual nature is
> historical compatibility, not logical necessity, hence the model is in some
> ways not "pure" (just practical).
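
The "small molecule" case above can be sketched in a few lines. This is only an illustration, not part of the Unicode Standard's text: it assumes Python's standard `unicodedata` module and uses A-diaeresis as the example. It shows that the precomposed code point and the combining sequence are different code point sequences, yet normalization treats them as canonically equivalent. It also shows how the code point count and the UTF-8 code unit count differ for each representation.

```python
import unicodedata

precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS (one code point)
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS (two code points)

# The two strings are different code point sequences...
different_sequences = precomposed != decomposed

# ...but they are canonically equivalent: normalization maps one onto the other.
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed

# (number of code points, number of UTF-8 code units) for each representation.
counts = {
    "precomposed": (len(precomposed), len(precomposed.encode("utf-8"))),
    "decomposed": (len(decomposed), len(decomposed.encode("utf-8"))),
}

print(different_sequences, nfd_matches, nfc_matches)
print(counts)
```

Neither the code point count nor the code unit count says how many abstract characters are involved; that remains a matter of how the writing system is analyzed.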

Thanks for this detailed reply, this is exactly the answer I was looking for.

> A./
> PS: while the character model document tries to unravel the implications
> of the Unicode Encoding model for W3C standards, it's not a substitute
> for the original definitions of how the Unicode Standard understands and
> defines the encoding process.
> FYI - W3C developed a Character Model document, which includes some
> guidelines on "characters" and may be useful to you: https://www.w3.org/TR/charmod/
> Cheers,
> Fuqiao
> On Mon, 15 Jun 2020 at 8:01, Corentin via Unicode <unicode at unicode.org> wrote:
> Hello
> While trying to define suitable semantics for the lexing of C++, we seem to fail to agree on the definition of abstract characters.
> Notably:
> - Would diacritical marks considered in isolation be considered abstract characters?
> - What about Hangul jamo and other marks that are not found in isolation in their respective scripts, variation selectors, etc.?
> I guess another way to phrase my question is: does every assigned code point represent on its own an abstract character?
> My understanding is that this is not the case, but I am eager to be enlightened.
> Thanks,
> Corentin
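
The kinds of code points asked about in the question above can be inspected in isolation. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the three sample code points (U+0308, U+1100, U+FE0F) are chosen here for illustration and are not from the thread. It only shows that each one carries its own name and properties, which is the descriptive sense in which the reply says such marks "can all be named". It does not by itself settle whether each is an abstract character.

```python
import unicodedata

# Hypothetical sample set: one diacritic, one Hangul jamo, one variation selector.
samples = {
    "diacritic": "\u0308",
    "jamo": "\u1100",
    "variation selector": "\uFE0F",
}

# For each code point, record (name, general category, canonical combining class).
info = {
    label: (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )
    for label, ch in samples.items()
}

for label, (name, category, ccc) in info.items():
    print(f"U+{ord(samples[label]):04X} {name} gc={category} ccc={ccc}")
```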