UAX 31 for C++ Identifiers

Fri Jun 19 21:16:29 CDT 2020

I'm the lead author for a proposal to rework C++ identifiers in line with
the current recommendation of UAX 31. The current version is available at
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p1949r4.html.

The core of the proposal is to replace the current allowlist with
XID_Start and XID_Continue, with LOW LINE added to the start set. The
summary:

The allowed Unicode code points in identifiers include many that are
unassigned or unnecessary, and others that are actually counter-productive.
By adopting the recommendations of UAX #31, Unicode Identifier and Pattern
Syntax, C++ will be easier to work with in international environments and
less prone to accidental problems.

This proposal does not address some potential security concerns, so-called
homoglyph attacks, where letters that appear the same may be treated as
distinct. Methods of defense against such attacks are complex and evolving,
and requiring mitigation strategies would impose substantial implementation
burden.

This proposal also recommends adoption of Unicode normalization form C
(NFC) for identifiers to ensure that when compared, identifiers intended to
be the same will compare as equal. Legacy encodings are generally naturally
in NFC when converted to Unicode. Most tools will, by default, produce NFC
text.

Some unusual scripts require the use of characters as joiners that are not
allowed by UAX #31; these will no longer be available in identifiers in C++.

As a side-effect of adopting the identifier characters from UAX #31, using
emoji in or as identifiers becomes ill-formed.
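
To illustrate the NFC point in the summary, here is a minimal sketch in Python (not C++, and not part of the paper) using the standard unicodedata module. The helper name same_identifier and the sample spellings are my own assumptions for illustration. It shows two spellings that render identically but only compare equal after both are normalized to NFC.

```python
import unicodedata

def same_identifier(a: str, b: str) -> bool:
    """Compare two identifier spellings after normalizing both to NFC.

    Hypothetical helper for illustration; a compiler would more likely
    require NFC input than normalize on the fly.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# Two spellings of the same word that render identically.
composed = "caf\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

raw_equal = composed == decomposed                 # plain code point comparison
nfc_equal = same_identifier(composed, decomposed)  # comparison after NFC

# The composed spelling is already in NFC; the decomposed one is not.
composed_is_nfc = unicodedata.is_normalized("NFC", composed)
decomposed_is_nfc = unicodedata.is_normalized("NFC", decomposed)
```

Without normalization the two spellings would name different entities, which is the accidental problem the NFC recommendation is meant to avoid.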

The most important open question is what we are losing by using the basic
XID_Start XID_Continue* pattern. Apparently there are natural languages
that require code points outside that set in order to write some words. How
much of a problem is that, and are there solutions that do not require
complex script analysis of potential identifiers?
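
To make the question concrete, here is a rough sketch in Python rather than C++. It leans on str.isidentifier, which as far as I know follows the same XID_Start XID_Continue* shape with "_" allowed at the start. It is a stand-in, not the proposed C++ rule, and the Unicode version it uses depends on the Python build. The function names and sample words are my own assumptions for illustration.

```python
import unicodedata

def is_xid_identifier(s: str) -> bool:
    """Approximate XID_Start XID_Continue* (plus '_' at the start)."""
    return s.isidentifier()

def rejected_code_points(s: str) -> list:
    """List the code points in s that the basic pattern would reject."""
    bad = []
    for i, ch in enumerate(s):
        # The first position must be a start character; later positions
        # are tested as a continuation after a known-good start ('a').
        ok = ch.isidentifier() if i == 0 else ("a" + ch).isidentifier()
        if not ok:
            bad.append(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
    return bad

# A Persian word spelled with U+200C ZERO WIDTH NON-JOINER in the middle.
persian = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"
emoji = "\U0001F600"

for candidate in ["na\u00efve", "_tmp", "2fast", persian, emoji]:
    print(repr(candidate), is_xid_identifier(candidate),
          rejected_code_points(candidate))
```

The Persian word is rejected only because of the joiner in the middle, which is the kind of loss I am asking about. Allowing it safely seems to need context rules rather than a plain per-code-point set.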

Secondarily, what would an excellent conformance statement look like? I'm
proposing an annex to the C++ standard discussing the conformance points
and how we are or are not meeting them, so as to have clarity on how and
why.

There are also open questions about emoji. A large number are currently
allowed, but that seems to be mostly a side effect of the allowlist
including unassigned code points. Has there been discussion of a standard
profile that would allow emoji in identifiers? I realize this has
substantial overlap with script checking and the security paper. C++
identifiers are sort of half over the fence: ZWJ is allowed, but gender
modifiers aren't, and neither choice was made with emoji in mind. The
feedback I've gotten is that we, the C++ committee, would really like not
to own this problem, even if members participate in solving it.

Thanks!

-SMD (wg21/sg16)