What constitutes an abstract character?

Mon Jun 15 09:22:26 CDT 2020

Thanks for the hint Asmus - what you said makes sense and is very
useful information in addition to the definitions in The Unicode
Standard Section 3.4. I forgot that the term "abstract character" is
defined in TUS, sorry.

Fuqiao

On Mon, Jun 15, 2020 at 14:44, Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
> On 6/14/2020 5:47 PM, Fuqiao Xue via Unicode wrote:
>
> Hi Corentin,
>
> The term "abstract character" is ambiguous and can have multiple
> definitions. Depending on what you need, it can refer to visual (i.e.,
> grapheme), logical (i.e., code point), or byte-level (i.e., code unit)
> representation of a given piece of text.
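For reference, two of the levels named above (code points and code units) can be counted directly in Python. This is only a minimal sketch: the sample string is an illustrative assumption, and counting units does not by itself say which level, if any, corresponds to an "abstract character".

```python
# Illustrative sample: "A" followed by U+0308 COMBINING DIAERESIS,
# which a reader perceives as one letter.
text = "A\u0308"

# A Python str is a sequence of code points.
code_points = len(text)

# Code units depend on the encoding form.
utf8_units = len(text.encode("utf-8"))            # 8-bit code units (bytes)
utf16_units = len(text.encode("utf-16-le")) // 2  # 16-bit code units

print(code_points, utf8_units, utf16_units)
```

Counting user-perceived characters (grapheme clusters) needs segmentation rules that the Python standard library does not provide, so it is left out here.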
>
> An abstract character is related to a code point by the character encoding. See definitions D7 and D10-D12 in Section 3.4 Characters and Encoding. (http://www.unicode.org/versions/latest/ch03.pdf#G2212)
>
> It is never a "code unit" or a "byte-level" thing. It is also not the code point.
>
> It is the thing that is being assigned a code point. (D11: "Encoded character: An association (or mapping) between an abstract character and a code point." -- the definition should really have an added "or code point sequence". Unicode finesses that by saying that sequences never encode an abstract character directly, but they can be used to "represent it", see comment on D7. That formally makes encoding a 1:1 process, but muddies the waters a bit on what we should consider an 'abstract character'. For example, it means that all "building blocks" of any sequences must be seen as abstract characters themselves.)
>
> Now the abstract character A-diaeresis (Ä) is encoded by a single code point and also has a canonically equivalent representation by a combining sequence. In effect, the whole sequence "encodes" a single abstract character, but that is formally not how Unicode defines it.
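The two representations of Ä described above can be checked with Python's standard `unicodedata` module. This is a minimal sketch; the variable names are illustrative and not part of the original message.

```python
import unicodedata

precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS (one code point)
decomposed = "A\u0308"   # LATIN CAPITAL LETTER A + COMBINING DIAERESIS (two code points)

# As raw code point sequences the two strings differ...
same_raw = precomposed == decomposed

# ...but they are canonically equivalent: normalizing to NFC or NFD
# maps each form onto the other.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)

print(same_raw, nfc == precomposed, nfd == decomposed)
print([unicodedata.name(c) for c in nfd])
```

Comparing text after normalizing both sides to the same form is the usual way to treat the two spellings as the same character.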
>
> A diaeresis is a recognizable item of the writing system; if used as an umlaut, it tends to act as a decoration of a character that is more-or-less seen as a new entity (particularly in Swedish) and less as a modified letter A. If used as a diaeresis, it acts more like a punctuation mark that has a function of its own (forcing separate pronunciation). Even though it's graphically applied to a vowel, it can be understood as its own abstract character.
>
> Treating the diaeresis as its own independent abstract character makes logical and not just formal sense. That may not be the case equally for all types of diacritical marks. However, since they can all be named, and thus arguably exist as their own concepts at least on a descriptive level, it becomes effectively a non-problem.
>
> Depending on how combining marks are treated in other scripts, they can fall at different points on the scale of logical independence, and some also fall at different points on the scale of graphical combining (they may be graphically indistinguishable from regular spacing letters).
>
> To recap, an "abstract" character is a conceptual character, something that forms the atom of a writing system (smallest divisible particle) as viewed from the process of encoding, which associates with it a single code point. "Abstract" characters may exist that are not encoded; and some of them can be analyzed as series of smaller abstract characters, and thus be represented as code point sequences.
>
> Some abstract characters are more like small molecules; they can be encoded as such, or they can also have a more atomic sequence that represents them. The rationale for allowing this dual nature is historical compatibility, not logical necessity, hence the model is in some ways not "pure" (just practical).
>
> A./
>
> PS: while the character model document tries to unravel the implications of the Unicode Encoding model for W3C standards, it's not a substitute for the original definitions of how the Unicode Standard understands and defines the encoding process.
>
> FYI - W3C developed a Character Model document, which includes some
> guidelines on "characters" and may be useful to you:
> https://www.w3.org/TR/charmod/
>
> Cheers,
>
> Fuqiao
>
> On Mon, Jun 15, 2020 at 8:01, Corentin via Unicode <unicode at unicode.org> wrote:
>
> Hello
> While trying to define suitable semantics for the lexing of C++, we seem to fail to agree on the definition of abstract characters.
>
> Notably:
> - Would diacritical marks, taken in isolation, be considered abstract characters?
> - What about Hangul jamo and other marks that are not found in isolation in their respective scripts, variation selectors, etc.?
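The kinds of code points listed above can be inspected with Python's standard `unicodedata` module. This is a minimal sketch: the four sample code points are illustrative choices, and the properties shown describe code points only, so they do not by themselves settle whether each one is an abstract character.

```python
import unicodedata

# Illustrative samples: a base letter, a combining diacritic,
# a Hangul jamo, and a variation selector.
samples = ["\u0041", "\u0308", "\u1100", "\uFE0F"]

# Collect (name, general category, canonical combining class) per code point.
info = {
    f"U+{ord(c):04X}": (
        unicodedata.name(c),
        unicodedata.category(c),
        unicodedata.combining(c),
    )
    for c in samples
}

for code_point, props in info.items():
    print(code_point, *props)
```

The combining diacritic and the variation selector share general category Mn (nonspacing mark), while the jamo is classed as a letter (Lo), so the property data alone does not separate "characters that stand alone" from those that do not.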
>
> I guess another way to phrase my question is: does every assigned codepoint represent on its own an abstract character?
>
> My understanding is that this is not the case, but I am eager to be enlightened.
>
> Thanks,
>
> Corentin
>
>