Basic Unicode character/string support absent even in modern C++

Aleksey Tulinov aleksey.tulinov at gmail.com
Wed Apr 22 12:52:51 CDT 2020


C is agnostic: there is not much difference between "0123456789"
and "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"; to the compiler it's just a bit stream (or byte stream).

#include <stdio.h>

int main() {
  /* with a UTF-8 source and execution charset, the literal's bytes pass through unchanged */
  printf("Hello World; This is a test file. " "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡" "\n");
  return 0;
}

$ gcc -Wall -Wextra -pedantic test.c
$ ./a.out
Hello World; This is a test file. 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡

Of course, if you want to manipulate strings like this, or count the
number of characters in a string, then there has to be some concept of
encoding, and then a concept of character. Then some concept of locale
too, because "Dz" isn't always two characters: in one language it's two
characters, while in another it's a single character, and so on.
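For example, strlen() in C counts bytes (UTF-8 code units), not characters; a minimal sketch, assuming the source and execution character sets are UTF-8:

#include <stdio.h>
#include <string.h>

int main() {
  /* ten mathematical digits; each one is 4 bytes in UTF-8 */
  const char *digits = "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡";
  printf("%zu\n", strlen(digits)); /* prints 40, the byte count, not 10 */
  return 0;
}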

Wed, 22 Apr 2020 at 20:02, J Decker via Unicode <unicode at unicode.org>:

>
>
> On Wed, Apr 22, 2020 at 9:01 AM Shriramana Sharma via Unicode <
> unicode at unicode.org> wrote:
>
>> On Wed, Apr 22, 2020 at 5:09 PM Piotr Karocki via Unicode
>> <unicode at unicode.org> wrote:
>> >
>> > >> Or a better C++ compiler, which understands Unicode source files.
>> > > My latest GCC and Clang don't have any problem with the source files.
>> > > The limitation is with the standard libraries which don't provide the
>> > > required functionality.
>>
>
> I was rather astonished to learn that C is really just ASCII.  How very
> limiting.
> Although, C/C++ surely have libraries that deal with such things?  I have
> one for C, so I know that it's at least possible.  Text is kept internally
> as UTF-8 code unit arrays with a known length, so I can include '\0' in a
> string.
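>
> A minimal sketch of one way to carry text like that (hypothetical names,
> not the actual library): a UTF-8 buffer with an explicit byte length, so
> embedded '\0' bytes are just data:
>
> #include <stddef.h>
>
> /* hypothetical: UTF-8 text with an explicit length, not NUL-terminated */
> struct u8str {
>   char  *data;   /* UTF-8 code units */
>   size_t length; /* number of bytes in data */
> };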
>
> I would LOVE it if C could magically substitute the string terminator with
> the byte 0xFF instead of 0x00.  I noticed that every byte in valid UTF-8
> always has at least one bit clear, so 0xFF never occurs.
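>
> A minimal sketch of that idea (hypothetical, not standard C): since the
> byte 0xFF never occurs in well-formed UTF-8, it can serve as an in-band
> terminator:
>
> #include <stddef.h>
>
> /* hypothetical: length of a 0xFF-terminated UTF-8 string;
>    embedded 0x00 bytes are then ordinary data */
> size_t ff_strlen(const unsigned char *s) {
>   size_t n = 0;
>   while (s[n] != 0xFF)
>     n++;
>   return n;
> }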
>
>
>>
>> note: candidate function not viable: no known conversion from
>> 'std::u16string' (aka 'basic_string<char16_t>') to 'XXX' for 2nd
>> argument
>>
>> And I realize I don't need to use 16-bit for Latin chars, but of
>> course I'm using Indic chars in my actual program.
>>
>> Anyhow, as I posted earlier, converting it to UTF-8 works just fine,
>> but it would be good if there were some mechanism so that one doesn't
>> have to do that manually, which would make the learning curve easier
>> for new learners.
>>
>>
> and I just fwrite( logstring, length, 1, stdout ); and get wide character
> and Unicode support if the terminal supports it, or if the editor reading
> the redirected output supports it... (where logstring is something I
> printed into with something like vsnprintf() )...
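>
> A minimal self-contained sketch of that approach (format into a byte
> buffer, then write the bytes out unmodified):
>
> #include <stdio.h>
>
> int main() {
>   char logstring[256];
>   /* snprintf copies the literal's UTF-8 bytes into the buffer as-is */
>   int length = snprintf(logstring, sizeof logstring, "digits: %s\n", "𝟘𝟙𝟚𝟛");
>   if (length > 0)
>     fwrite(logstring, (size_t)length, 1, stdout);
>   return 0;
> }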
>
> 🙃😶 and I copy and paste things a lot... I don't have a good entry method
> for codes.
>
> This is my favorite thing to keep around for testing:   console.log( "Hello
> World; This is a test file." + "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡" ); because all of
> those digits are above the BMP (0x10000+), so they end up as surrogate pairs.
>
> But that's some JS... but, then again, my C libraries are quite happy to
> take the UTF-8 strings from JS and regenerate them, just as a string of
> bytes.
> I have a rude text-XOR masking routine that generates valid codepoints,
> but can result in 0; normally you can still use strlen etc. to
> deal with strings; so I don't see why C++ strings would have so much more
> difficulty (other than not supporting them in text files; but, then again,
> that's what preprocessors are for, I suppose).
>
>
>>
>> --
>> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸
>>
>>
