Basic Unicode character/string support absent even in modern C++

Tom Honermann tom at honermann.net
Thu Apr 23 14:43:22 CDT 2020


On 4/22/20 11:54 AM, Shriramana Sharma via Unicode wrote:
> On Wed, Apr 22, 2020 at 5:09 PM Piotr Karocki via Unicode
> <unicode at unicode.org> wrote:
>>>> Or a better C++ compiler, which understands Unicode source files.
>>> My latest GCC and Clang don't have any problem with the source files.
>>> The limitation is with the standard libraries which don't provide the
>>> required functionality.
>>   But you wrote that you got messages from the compiler, not at runtime. And
>> an error from the compiler has nothing to do with errors in any libraries,
>> standard or not, as the code has not been executed yet.
> ??? The error is given by the compiler because the stdlib doesn't
> provide the necessary functionality, which is what I was lamenting. For
> a simple program:
>
> #include <string>
> #include <iostream>
> int main() { std::cout << "abcd\n"; }
>
> This works fine, but:
>
> int main() { std::cout << u"abcd\n"; }
>
> just prints out a hex value which is probably the pointer, and changing that to:

C++20 fixed this surprising and undesirable behavior when P1423 
<https://wg21.link/p1423> was adopted (see the proposal section of that 
paper and option 7).  The above code is now ill-formed in C++20.
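
As a rough illustration of the difference between the three cases discussed here, the following minimal sketch probes at compile time whether `os << value` is well-formed for a std::ostream. The names `probe` and `streams_to_ostream` are hypothetical helpers invented for this example; they are not part of the standard library or of P1423.

```cpp
#include <ostream>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical detection helper: the first overload is viable only when
// `os << value` compiles for a std::ostream& and a value of type T.
template <typename T>
auto probe(int)
    -> decltype(std::declval<std::ostream&>() << std::declval<T>(),
                std::true_type{});

// Fallback, chosen when the expression above is ill-formed.
template <typename T>
std::false_type probe(...);

template <typename T>
using streams_to_ostream = decltype(probe<T>(0));

// Narrow strings have a matching inserter.
static_assert(streams_to_ostream<const char*>::value,
              "const char* can be streamed");

// std::u16string has no inserter for a char-based stream.
static_assert(!streams_to_ostream<std::u16string>::value,
              "std::u16string cannot be streamed");

// const char16_t*: before C++20 this is true, because the pointer converts
// to const void* and its address is printed. With a C++20 library that
// implements P1423 the overload is deleted, so this becomes false.
constexpr bool char16_pointer_streams =
    streams_to_ostream<const char16_t*>::value;
```

The value of `char16_pointer_streams` depends on the language mode and library version in use, so it is left unchecked here.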

>
> int main() { std::cout << std::u16string(u"abcd"); }
>
> writes out *94* lines ending with an innocuous:
>
> “1 error generated.”
>
> all complaining about:
>
> note: candidate function not viable: no known conversion from
> 'std::u16string' (aka 'basic_string<char16_t>') to 'XXX' for 2nd
> argument
>
> And I realize I don't need to use 16-bit for Latin chars, but of
> course I'm using Indic chars in my actual program.
>
> Anyhow, as I posted earlier, converting it to UTF-8 just works fine,
> but it would be good if there's some mechanism that one doesn't have
> to do that manually, making the learning curve for new learners
> easier.

From a portability standpoint, converting it to UTF-8 doesn't actually 
work fine.  Doing so might produce the behavior you want on platforms 
you care about, but it won't on all of them.  In particular, it won't 
work reliably on Windows (where the locale-dependent encoding [*] is 
never UTF-8 and where the console is unable to display characters 
outside the BMP) or on z/OS (where the locale-dependent encoding is 
EBCDIC-based).
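
For reference, the manual conversion being discussed might look roughly like the sketch below. The function names `append_utf8` and `utf16_to_utf8` are hypothetical, chosen for this example only, and replacing unpaired surrogates with U+FFFD is an assumption about the desired error handling rather than anything the standard library prescribes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: append one Unicode code point to `out` as UTF-8.
inline void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: convert UTF-16 to UTF-8.
// Assumption: unpaired surrogates become U+FFFD (replacement character).
inline std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Writing `std::cout << utf16_to_utf8(text)` then emits UTF-8 bytes. Whether those bytes are displayed correctly still depends on the terminal and the locale-dependent encoding, which is exactly the portability problem on Windows and z/OS.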

Tom.

[*]: Ok, very recent Windows releases have ways of making the locale use 
UTF-8, but it is still an experimental feature.

>
>
>
> --
> Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा 𑀰𑁆𑀭𑀻𑀭𑀫𑀡𑀰𑀭𑁆𑀫𑀸
>
