UAX 31 for C++ Identifiers

Steve Downey sdowney at gmail.com
Fri Jun 19 22:22:35 CDT 2020


On Fri, Jun 19, 2020 at 10:44 PM Asmus Freytag via Unicode
<unicode at unicode.org> wrote:
>
> In source code, having ambiguous identifiers may not be worse than C-style obfuscation.
>

Until the latest release (10.1), GCC rejected much of the Unicode
that the standard allows when it appeared as UTF-8 input, even in
places where it would accept \u universal-character-names. So this
all becomes easier now. As a Standard, we should have handled this
better earlier, but the second best time is now. The XID_ properties
make this a lot more palatable w.r.t. stability, though, and I'm not
going to second-guess people from 10 or 20 or more years ago too
much. Ambiguity in external identifiers is already ill-formed, no
diagnostic required, which means broken but in ways that compilers
can't treat as undefined.
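
To make the XID_ rule concrete, here is a minimal sketch of a UAX 31
style check: the first code point must be a start character (or the
underscore, which C++ also allows), and every later code point must
be a continue character. The helper names and the `Range` type are
hypothetical, and the tables are a tiny hand-picked subset of the
real XID_Start / XID_Continue data, so treat this as an illustration
rather than a usable classifier.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: an inclusive range of code points.
struct Range { char32_t lo, hi; };

// ASSUMPTION: small illustrative subset of XID_Start, not the full table.
const Range kXidStartSubset[] = {
    {U'A', U'Z'}, {U'a', U'z'},
    {0x00C0, 0x00D6}, {0x00D8, 0x00F6}, {0x00F8, 0x02C1},
};

// ASSUMPTION: small illustrative subset of the code points that are in
// XID_Continue but not XID_Start (digits, '_', middle dot, combining marks).
const Range kXidContinueOnlySubset[] = {
    {U'0', U'9'}, {U'_', U'_'}, {0x00B7, 0x00B7}, {0x0300, 0x036F},
};

template <std::size_t N>
bool in_ranges(char32_t c, const Range (&table)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        if (c >= table[i].lo && c <= table[i].hi) return true;
    }
    return false;
}

// C++ additionally accepts '_' as the first character of an identifier.
bool is_start(char32_t c) {
    return c == U'_' || in_ranges(c, kXidStartSubset);
}

bool is_continue(char32_t c) {
    return is_start(c) || in_ranges(c, kXidContinueOnlySubset);
}

// True if `s` is non-empty, begins with a start character, and every
// following code point is a continue character.
bool is_identifier(const std::u32string& s) {
    if (s.empty() || !is_start(s.front())) return false;
    for (std::size_t i = 1; i < s.size(); ++i) {
        if (!is_continue(s[i])) return false;
    }
    return true;
}
```

With these tables, `is_identifier(U"caf\u00E9")` is accepted, while a
string that begins with a combining mark such as U+0301 is rejected,
because combining marks are continue-only characters.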

>
> But with module names, etc. you may run into security issues if naming allows / facilitates spoofing.
>
I, and other people doing tools, both won and lost this battle
already. Module names in source do not correspond with anything
physical. `import some.module` connects you to whatever exported
`some.module` by magic as far as the standard is concerned. We're
working on the actual mechanics as a Technical Report, and compiler
vendors are participating and aren't, as far as I can tell, more
insane than the average infrastructure engineer. So I have hope.
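
As a rough illustration of both points (the name is purely logical,
and look-alike names are distinct names), the sketch below models the
name-to-artifact mapping as a plain lookup table. Everything here is
hypothetical: `ModuleMap`, `resolve`, and the artifact file name are
made up for the example and do not describe how any compiler or build
system actually resolves `import`.

```cpp
#include <map>
#include <string>

// Hypothetical registry: logical module name -> build artifact.
using ModuleMap = std::map<std::string, std::string>;

// Returns the artifact registered for `name`, or an empty string if
// nothing exported that exact name.
std::string resolve(const ModuleMap& modules, const std::string& name) {
    ModuleMap::const_iterator it = modules.find(name);
    return it == modules.end() ? std::string() : it->second;
}

// "some.module" spelled with Latin letters only.
const std::string kLatinName = "some.module";

// Looks the same in many fonts, but the first 'o' is CYRILLIC SMALL
// LETTER O (U+043E), encoded in UTF-8 as the two bytes 0xD0 0xBE.
const std::string kLookAlikeName =
    std::string("s") + "\xD0\xBE" + "me.module";
```

If only `kLatinName` is registered, `resolve` finds nothing for
`kLookAlikeName`: the two strings differ byte for byte even though
they may render identically, which is the spoofing risk raised above.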

Mapping anything to file paths is fraught beyond belief, and there are
many experienced engineers providing war stories and parades of
horribles, although I'd personally like to have more stories to tell.

The entire disconnect between logical and physical is actually
hopeful, in a way that `#include <ha/hahahahaha.h>` isn't, even though
we have a lot of understanding of how the latter maps to filesystem
searches.

That is the province of WG21/SG15, which I also participate in.

I suspect that trying to fix up anything with #include is infeasible,
since it's currently the wild west, any change will break something,
and C++ depends in practice on system-provided headers that at best
conform to old C standards.

Thanks!

-SMD

