Basic Unicode character/string support absent even in modern C++

Thu Apr 23 14:54:49 CDT 2020

On 4/22/20 12:54 PM, J Decker via Unicode wrote:
>
>
> On Wed, Apr 22, 2020 at 9:01 AM Shriramana Sharma via Unicode 
> <unicode at unicode.org> wrote:
>
>     On Wed, Apr 22, 2020 at 5:09 PM Piotr Karocki via Unicode
>     <unicode at unicode.org> wrote:
>     >
>     > >> Or better C++ compiler, which understand Unicode source files.
>     > > My latest GCC and Clang don't have any problem with the source
>     > > files.
>     > > The limitation is with the standard libraries which don't
>     > > provide the required functionality.
>
>
> I was rather astonished to learn that C is really just ASCII.  How 
> very limiting.

C and C++ are not ASCII.  The standards specify abstract basic source 
and execution character repertoires; implementations define which 
character sets are actually used.  In practice, these character sets are 
ASCII or EBCDIC based.  The standards also provide an escape sequence, 
the universal-character-name (\uXXXX or \UXXXXXXXX), that can specify 
any Unicode character.  Implementations map extended characters onto 
these escape sequences in order to support source files that contain 
characters outside the basic character repertoires.  For example, 
\u00e1 is a valid identifier and means the same thing as á written in a 
source file encoding that the implementation supports.
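
As a rough illustration of the escape sequences mentioned above, the 
following sketch uses universal-character-names in string literals 
rather than in identifiers.  The names a_acute_utf16, 
double_zero_utf32, utf16_code_units and check_literals are made up for 
this example; U+1D7D8 is the first digit of the test string quoted 
further down in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A universal-character-name names a code point independently of the
// encoding of the source file itself.
constexpr char16_t a_acute_utf16[] = u"\u00e1";          // U+00E1
constexpr char32_t double_zero_utf32[] = U"\U0001D7D8";  // U+1D7D8

// In a UTF-16 literal, U+00E1 is a single code unit.
static_assert(a_acute_utf16[0] == 0x00E1, "U+00E1 is one UTF-16 code unit");
// In a UTF-32 literal, every code point is a single code unit.
static_assert(double_zero_utf32[0] == 0x1D7D8, "U+1D7D8 is one UTF-32 code unit");
// In a UTF-8 literal, U+00E1 takes two code units plus the terminator.
static_assert(sizeof(u8"\u00e1") == 3, "U+00E1 is two UTF-8 code units");

// Hypothetical helper: number of UTF-16 code units in a string.
inline std::size_t utf16_code_units(const std::u16string& s) {
    return s.size();
}

// Code points above U+FFFF need a surrogate pair in UTF-16.
inline void check_literals() {
    assert(utf16_code_units(u"\U0001D7D8") == 2);
}
```

The static_assert lines are evaluated at compile time, so a compiler 
that accepts the file has already confirmed the code unit counts.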

> Although; C/C++ surely have libraries that deal with such things?  I 
> have one for C, so I know that it's at least possible.  Text is kept 
> internally as utf8 code unit arrays with a known length, so I can 
> include '\0' in a string.

Yes, there are libraries.  ICU is the most well-known.
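
Independently of any particular library, the known-length storage 
described in the quoted text can be sketched with std::string, which 
records its size separately from the bytes it holds.  This is only an 
illustration under that assumption; it is neither ICU's API nor the 
quoted author's library, and make_text_with_nul and show_lengths are 
names invented for the example.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Build a UTF-8 string from a byte array plus an explicit length, so
// that the embedded '\0' is kept as ordinary data.
inline std::string make_text_with_nul() {
    // 'a', NUL, then U+00E1 encoded as the two UTF-8 bytes 0xC3 0xA1.
    const char bytes[] = {'a', '\0', '\xC3', '\xA1'};
    return std::string(bytes, sizeof bytes);
}

inline void show_lengths() {
    const std::string text = make_text_with_nul();
    // The stored length covers all four bytes.
    assert(text.size() == 4);
    // A terminator-based scan stops at the embedded NUL.
    assert(std::strlen(text.c_str()) == 1);
}
```

The difference between size() and strlen() is the reason a 
length-carrying representation is needed once '\0' may appear inside 
the text.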

Tom.

>
> I would LOVE if C could magically substitute the string terminator 
> with the byte 0xFF instead of 0x00.  I noticed that utf8 valid 
> encodings must always have at least 1 bit off.
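
The observation about 0xFF can be checked with a small sketch, 
assuming a hand-written encoder (encode_utf8 and utf8_never_uses_0xff 
are hypothetical helpers, not part of any library discussed here).  It 
encodes every Unicode scalar value and confirms that no resulting byte 
is 0xFF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode scalar value as UTF-8.
// Precondition: cp <= 0x10FFFF and cp is not a surrogate.
inline std::string encode_utf8(std::uint32_t cp) {
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx + three continuation bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Returns true if no scalar value produces a 0xFF byte when encoded.
inline bool utf8_never_uses_0xff() {
    for (std::uint32_t cp = 0; cp <= 0x10FFFF; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF) continue;  // skip surrogates
        for (unsigned char byte : encode_utf8(cp)) {
            if (byte == 0xFF) return false;
        }
    }
    return true;
}
```

Every lead byte has a zero bit after its run of ones and every 
continuation byte starts with the bits 10, which is why the check 
succeeds.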
>
>
>     note: candidate function not viable: no known conversion from
>     'std::u16string' (aka 'basic_string<char16_t>') to 'XXX' for 2nd
>     argument
>
>     And I realize I don't need to use 16-bit for Latin chars, but of
>     course I'm using Indic chars in my actual program.
>
>     Anyhow, as I posted earlier, converting it to UTF-8 just works fine,
>     but it would be good if there's some mechanism that one doesn't have
>     to do that manually, making the learning curve for new learners
>     easier.
>
>
> and I just fwrite( logstring, length, 1, stdout ); and get wide 
> character and unicode support if the terminal supports it, or if the 
> editor reading the redirected output supports it... (where logstring 
> is some thing I printed into with like vsnprintf() )...
>
> 🙃😶 and I copy and paste things a lot... I don't have a good entry 
> method for codes.
>
> This is my favorite thing to keep around to test  console.log( "Hello 
> World; This is a test file."+"𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡"); Because all of 
> those are even surrogate pairs (0x10000+).
> But that's some JS...but, then again, my C libraries are quite happy 
> to take the utf8 strings from JS and regenerate them, just as a string 
> of bytes.
> I have a rude text-x0r masking routine that generates valid 
> codepoints, but can result in 0; normally you can even still use like 
> strlen etc to deal with strings; so I don't see why C++ strings would 
> have so much more difficulty (other than not supporting them in text 
> files; but, then again, that's what preprocessors are for, I suppose).
>
>
>
>     --
>     Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸
>
