Basic Unicode character/string support absent even in modern C++

J Decker d3ck0r at gmail.com
Wed Apr 22 11:54:13 CDT 2020


On Wed, Apr 22, 2020 at 9:01 AM Shriramana Sharma via Unicode <
unicode at unicode.org> wrote:

> On Wed, Apr 22, 2020 at 5:09 PM Piotr Karocki via Unicode
> <unicode at unicode.org> wrote:
> >
> > >> Or better C++ compiler, which understand Unicode source files.
> > > My latest GCC and Clang don't have any problem with the source files.
> > > The limitation is with the standard libraries which don't provide the
> > > required functionality.
>

I was rather astonished to learn that C is really just ASCII.  How very
limiting.
Still, C and C++ surely have libraries that deal with such things?  I have
one for C, so I know that it's at least possible.  Text is kept internally
as arrays of UTF-8 code units with a known length, so I can include '\0' in
a string.
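
A minimal sketch of what "code unit array with a known length" could look
like. The type and function names (utf8_text, utf8_text_from,
utf8_text_free) are hypothetical and only illustrate the idea; this is not
the actual library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type: UTF-8 code units plus an explicit byte length. */
typedef struct {
    size_t len;   /* number of bytes (code units), not code points */
    char  *data;  /* may contain embedded '\0' bytes */
} utf8_text;

/* Copy len bytes into a new utf8_text.
 * On allocation failure the result has data == NULL and len == 0. */
static utf8_text utf8_text_from(const char *bytes, size_t len)
{
    utf8_text t = { 0, NULL };
    assert(bytes != NULL || len == 0);
    t.data = malloc(len + 1);  /* +1 for a courtesy terminator */
    if (t.data == NULL)
        return t;
    if (len > 0)
        memcpy(t.data, bytes, len);
    t.data[len] = '\0';        /* convenience only; len is authoritative */
    t.len = len;
    return t;
}

/* Release the buffer and reset the fields. */
static void utf8_text_free(utf8_text *t)
{
    free(t->data);
    t->data = NULL;
    t->len = 0;
}
```

With this layout, utf8_text_from("a\0b", 3) keeps all three bytes, while
strlen() on the same buffer stops at the first '\0'.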

I would LOVE it if C could magically replace the string terminator 0x00
with the byte 0xFF.  I noticed that every byte of a valid UTF-8 encoding
has at least one bit off, so 0xFF never appears in valid UTF-8.
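
The 0xFF observation can be checked by brute force. The sketch below is a
plain UTF-8 encoder plus a loop over every code point; the function names
(utf8_encode, utf8_never_uses_ff) are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode scalar value as UTF-8.
 * Returns the number of bytes written, or 0 if cp is not encodable. */
static size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;  /* surrogates are not scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Returns 1 if no encodable code point produces a 0xFF byte, else 0. */
static int utf8_never_uses_ff(void)
{
    unsigned char buf[4];
    for (uint32_t cp = 0; cp <= 0x10FFFF; cp++) {
        size_t n = utf8_encode(cp, buf);
        for (size_t i = 0; i < n; i++)
            if (buf[i] == 0xFF)
                return 0;
    }
    return 1;
}
```

The largest lead byte this encoder can emit is 0xF4 (for U+100000 and up),
and continuation bytes top out at 0xBF, which is why 0xFF is free.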


>
> note: candidate function not viable: no known conversion from
> 'std::u16string' (aka 'basic_string<char16_t>') to 'XXX' for 2nd
> argument
>
> And I realize I don't need to use 16-bit for Latin chars, but of
> course I'm using Indic chars in my actual program.
>
> Anyhow, as I posted earlier, converting it to UTF-8 just works fine,
> but it would be good if there's some mechanism that one doesn't have
> to do that manually, making the learning curve for new learners
> easier.
>
>
And I just fwrite( logstring, length, 1, stdout ); and get wide-character
and Unicode support if the terminal supports it, or if the editor reading
the redirected output supports it... (where logstring is something I
printed into with, e.g., vsnprintf() )...
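
Roughly, the vsnprintf() + fwrite() pattern looks like the sketch below.
The helper name log_utf8 and the 1024-byte buffer are assumptions for the
example, not what my code actually uses.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format into a fixed buffer, then write the bytes
 * verbatim. Returns the number of bytes written, or -1 on error. */
static int log_utf8(FILE *out, const char *fmt, ...)
{
    char logstring[1024];
    va_list args;
    int length;

    assert(out != NULL && fmt != NULL);

    va_start(args, fmt);
    length = vsnprintf(logstring, sizeof logstring, fmt, args);
    va_end(args);

    if (length < 0)
        return -1;
    if ((size_t)length >= sizeof logstring) {
        /* Output was truncated; note this may cut a multi-byte
         * UTF-8 sequence in half. */
        length = (int)sizeof logstring - 1;
    }
    /* The bytes go out untouched; no locale or wide-char conversion. */
    if (length > 0 && fwrite(logstring, (size_t)length, 1, out) != 1)
        return -1;
    return length;
}
```

Calling log_utf8(stdout, "%s\n", text) with UTF-8 bytes in text passes them
straight through; whether they display correctly is up to the terminal or
editor.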

🙃😶 and I copy and paste things a lot... I don't have a good entry method
for codes.

This is my favorite thing to keep around to test:
console.log( "Hello World; This is a test file." + "𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡" );
because all of those digits are above U+FFFF (0x10000+), so each one needs
a surrogate pair in UTF-16.
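
For reference, this is the standard UTF-16 surrogate arithmetic, as a small
sketch; the function name utf16_surrogates is just for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary-plane code point (U+10000..U+10FFFF)
 * into its UTF-16 high and low surrogates. */
static void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v;
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    v = cp - 0x10000;                          /* 20 significant bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

For 𝟘 (U+1D7D8) this gives the pair 0xD835 0xDFD8, which is what JS stores
internally for that character.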

But that's some JS... but then again, my C libraries are quite happy to
take the UTF-8 strings from JS and regenerate them, just as a string of
bytes.
I have a rude text-x0r masking routine that generates valid code points,
but can result in 0; normally you can still use things like strlen() to
deal with strings, so I don't see why C++ strings would have so much more
difficulty (other than not supporting them in text files; but then again,
that's what preprocessors are for, I suppose).
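
To illustrate the strlen() point: it works on UTF-8 as long as there is no
embedded 0, but it counts bytes. A sketch of counting code points instead,
assuming the input is valid, NUL-terminated UTF-8 (the function name is
hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated string, assuming valid UTF-8.
 * Continuation bytes have the form 10xxxxxx and are skipped. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For "A" followed by 𝟘 (bytes 41 F0 9D 9F 98), strlen() reports 5 while this
function reports 2.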


>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸
>
>