Basic Latin digits, not everything else (was: RE: How the C programming language bridges the man-machine gap)

Steve Downey sdowney at gmail.com
Tue Apr 19 08:25:58 CDT 2022


In C and C++, a byte is primarily the smallest unit of addressable
memory. The property being described is that the abstract characters in the
basic sets have single-byte encodings, and can therefore be encoded in
strings of consecutive chars.
There are also other character types. wchar_t was introduced to support
Unicode, and is specified to be able to encode all code points in a single
unit. Unfortunately, it does not, because Microsoft introduced it when 16
bits was sufficient, and that is now baked into too many places.
We have since introduced char8_t, char16_t, and char32_t, which encode
UTF-8, UTF-16, and UTF-32 respectively, and are at least 8, 16, and 32 bits,
but might be larger due to memory architecture requirements.
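
As a rough illustration of the sizes discussed above, here is a minimal C
sketch. It assumes a C11 implementation that provides <uchar.h>; the helper
names bits_in and check_char_widths are made up for this example. char8_t is
left out because it only arrives in C with C23.

```c
#include <assert.h>
#include <limits.h>   /* CHAR_BIT: number of bits in a C byte */
#include <stddef.h>   /* size_t, wchar_t */
#include <uchar.h>    /* char16_t, char32_t (C11) */

/* Width in bits of an object that occupies size_in_bytes C bytes. */
static size_t bits_in(size_t size_in_bytes)
{
    return size_in_bytes * CHAR_BIT;
}

/* Asserts the minimum widths mentioned above and returns the width of
   wchar_t in bits, which is implementation-defined. */
static size_t check_char_widths(void)
{
    assert(sizeof(char) == 1);               /* a char is one byte by definition */
    assert(CHAR_BIT >= 8);                   /* a byte has at least 8 bits */
    assert(bits_in(sizeof(char16_t)) >= 16); /* "at least", possibly wider */
    assert(bits_in(sizeof(char32_t)) >= 32);
    return bits_in(sizeof(wchar_t));
}
```

Calling check_char_widths() from main and printing the result shows how wide
wchar_t is on a given platform. The value differs between implementations,
so no particular output is assumed here.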

C and C++ also support multi-byte character encodings in char strings.
UTF-8 matches the requirements, unsurprisingly, as that was one of the
design goals for it. The various CJKV encodings are also supported through
multi-byte encodings.
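
To make the multi-byte point concrete, here is a small C sketch that counts
code points in a UTF-8 char string. It assumes well-formed UTF-8 and does no
validation; the function name utf8_code_points is hypothetical. The non-ASCII
characters are written as hex escapes so the example does not depend on the
source file encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts code points in a NUL-terminated UTF-8 string by counting every
   byte that is not a continuation byte (bit pattern 10xxxxxx).
   Assumes the input is well-formed UTF-8. */
static size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the string "a\xC3\xA9" (the letter a followed by U+00E9), strlen reports
3 chars while utf8_code_points reports 2 code points: one abstract character
occupies two consecutive chars.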

The requirement for consecutive encodings for 0-9 strictly applies only to
the basic character set, and in the C++23 standard we will be making it
clear that it is the Latin digits that are encoded that way. That's for
portability. If the literal encoding placed some other digits in the
single-byte range, and the Latin digits elsewhere, the transliteration
between source and results would produce broken code.
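
The kind of code that relies on this guarantee looks roughly like the
following C sketch. The function names digit_value and parse_decimal are made
up for illustration, and overflow handling is deliberately omitted.

```c
#include <assert.h>

/* Returns 0..9 for the Latin digits '0'..'9', or -1 for any other char.
   The subtraction is only correct because the standard requires the
   encodings of '0' through '9' to be consecutive and increasing. */
static int digit_value(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Parses the leading run of decimal digits in s.
   Illustrative only: no overflow checking. */
static unsigned parse_decimal(const char *s)
{
    unsigned value = 0;
    int d;

    assert(s != 0);
    while ((d = digit_value(*s)) >= 0) {
        value = value * 10u + (unsigned)d;
        ++s;
    }
    return value;
}
```

If the literal encoding did not keep the Latin digits consecutive, c - '0'
would silently yield wrong values, which is the breakage described above.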

On Tue, Apr 19, 2022, 03:46 Hans Åberg via Unicode <unicode at corp.unicode.org>
wrote:

>
> > On 18 Apr 2022, at 23:42, Doug Ewell via Unicode <
> unicode at corp.unicode.org> wrote:
> >
> >> If the values used do not fit into an octet, one must use a larger
> >> byte, and such have been used in the past, but not nowadays, I think.
> >> But large enough to carry all the Unicode values in a byte might be a
> >> possibility. An expert on C might tune in.
> >
> > Is this related in some way to the topic that was under discussion?
>
> If one wants to use Unicode values that do not fit into an octet, the C
> byte must be enlarged.
>
>
>
>