Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap)

Jens Maurer Jens.Maurer at gmx.net
Tue Apr 19 02:25:57 CDT 2022


On 19/04/2022 00.24, Marius Spix via Unicode wrote:
>
> On Mon, 18 Apr 2022 21:10:58 +0200
> Jens Maurer via Unicode wrote:
>
>> On 18/04/2022 20.47, Doug Ewell via Unicode wrote:
>>> Hans Åberg wrote:
>> "In both the source and execution basic character sets,
>> the value of each character after 0 in the above list
>> of decimal digits shall be one greater than the value
>> of the previous."
>>
>> Note the use of the term "basic character set".
>
> Also note that SHALL be does not mean MUST be.

Let me take exception to that statement.  Any standard describes
things that may conform to that standard or not.  Anyone can do
whatever they want, but if you want to claim conformance to some
standard for your thing, you need to satisfy all its SHALL
prescriptions.  So, if your implementation of C does not satisfy
the rule about contiguous encoding of (Latin) digits, yours is
simply not an implementation of C conforming to ISO 9899.

(Whether you're allowed to even call it "C" in that case is a
related, but different question.)
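
As an illustration of why that rule matters in practice, here is a
minimal sketch of the familiar digit-to-value idiom.  The function
names are mine, purely for illustration; the only thing the code
relies on is the contiguity of '0' through '9' quoted above.

```c
#include <assert.h>

/* Hypothetical self-check: each digit after '0' is one greater than
   the previous one, as the quoted rule requires. */
void check_digit_contiguity(void)
{
    const char digits[] = "0123456789";
    for (int i = 1; i < 10; ++i)
        assert(digits[i] == digits[i - 1] + 1);
}

/* Hypothetical helper: value of a basic Latin digit, or -1 for any
   other character.  No assumption about ASCII is made. */
int digit_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return -1;
}

/* Hypothetical helper: parse a leading run of decimal digits.
   Overflow is not handled in this sketch. */
unsigned parse_decimal(const char *s)
{
    unsigned n = 0;
    while (digit_value(*s) >= 0) {
        n = n * 10u + (unsigned)digit_value(*s);
        ++s;
    }
    return n;
}
```

Note that no comparable guarantee exists for letters, so the same
trick with 'a' through 'z' is not portable.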

>   For example, the basic
> character set SHALL include certain characters like “[”, “]”, “{” or
> “}”, but whenever they do not exist in the current character set, C
> allows replacing them with digraphs and trigraphs.

The "basic character set" that the C and C++ standards talk about
is an abstract set of characters, unrelated to any specific
encoding.  To ease programming in C and C++, alternative spellings
were introduced for characters that might not be easily accessible
on some keyboards.  Conceptually, trigraphs are replaced while
reading the individual characters of your source file, while
digraphs are just alternative tokens, recognized during lexing.
(The different treatment matters in string literals, for
instance: trigraphs are replaced in string literals, digraphs
are not.)

A programmer can use trigraphs and digraphs even if the current
character set fully supports all the basic characters of C.
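
A small sketch of the difference, with function names that are
mine and purely illustrative.  Whether the trigraph in the last
function is actually replaced depends on how the compiler is
invoked (GCC, for example, wants -trigraphs or a strict ISO mode),
so that function reports what happened instead of assuming it.

```c
#include <string.h>

/* Digraphs are alternative spellings of tokens:
   <: :> <% %> stand for [ ] { }. */
int sum_with_digraphs(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>

/* Inside a string literal nothing is tokenized, so the two
   characters of the digraph stay as they are. */
size_t digraph_in_string_length(void)
{
    return strlen("<:");
}

/* A trigraph is replaced before string literals are recognized.
   If replacement is active, the literal below holds one character
   plus the terminator (size 2); otherwise it holds three
   characters plus the terminator (size 4). */
int trigraphs_replaced(void)
{
    return sizeof("??=") == 2;
}
```

Here sum_with_digraphs() yields 6 and digraph_in_string_length()
yields 2 regardless of compiler mode; only trigraphs_replaced()
varies.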

>   C++ also adds
> alternative tokens (like “and” or “xor” instead of “&&” or “^”).
> Trigraphs are not supported in C++17 anymore, which breaks
> backward compatibility.

Those audiences who care about trigraphs have been assured that
their compilers can continue to support trigraphs as a conforming
extension, to limit the breakage in practice.
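
As for the alternative tokens mentioned above: in C++ they are
built into the language, while C offers the same spellings as
macros in <iso646.h>.  A minimal sketch in C (the function names
are hypothetical):

```c
#include <iso646.h>

/* "and" expands to &&, so this is a logical test. */
int both_nonzero(int a, int b)
{
    return a and b;
}

/* "xor" expands to ^, so this is a bitwise operation. */
int bitwise_xor(int a, int b)
{
    return a xor b;
}
```

For instance, both_nonzero(1, 2) is 1 and bitwise_xor(6, 3) is 5.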

> C also expects that the backslash (\, ASCII codepoint 0x5C) is used for
> escape sequences in string literals, but some users of Shift JIS
> encoding use the Yen sign (¥), which shares the same codepoint 0x5C.

That sounds like the encoding expected by the compiler is different
from the encoding used for screen display.

Jens


