Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Wed Jun 6 06:55:07 CDT 2018

On Mon, Jun 4, 2018 at 10:49 PM, Manish Goregaokar via Unicode
<unicode at unicode.org> wrote:
> The Rust community is considering adding non-ascii identifiers, which follow
> UAX #31 (XID_Start XID_Continue*, with tweaks).

UAX #31 is rather light on documenting its rationale.

I realize that XML is a different case from Rust: the Rust compiler is
something a programmer runs locally, whereas control over XML
documents and XML processors, especially over time, is significantly
less coupled.

Still, the experience from XML and HTML suggests that, if non-ASCII is
to be allowed in identifiers at all, restricting the value space of
identifiers a priori easily ends up restricting too much. HTML went
with the approach of collecting everything up to the next ASCII code
point that's a delimiter in HTML, while keeping the actual vocabulary
to ASCII. (The exception is Custom Elements: a later check for names
that are eligible for Custom Element treatment mainly achieves
compatibility with XML, and its seemingly arbitrary restrictions are
inherited from XML, but there is no such check for what the parser can
actually put in the document tree.)
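
To make the HTML approach concrete, here is a rough sketch of scanning
a tag name the way I describe it: an ASCII letter starts the name, and
everything up to the next ASCII delimiter belongs to it. The function
name scan_tag_name is made up for illustration, and the sketch
simplifies the real tokenizer (for example, it ignores NULL handling).

```python
# ASCII code points that end a tag name (whitespace, "/" and ">").
TAG_NAME_DELIMITERS = {"\t", "\n", "\f", " ", "/", ">"}


def scan_tag_name(text, start=0):
    """Return the tag name beginning at text[start], or None if the
    first code point is not an ASCII letter (hypothetical helper)."""
    if start >= len(text):
        return None
    first = text[start]
    if not ("a" <= first <= "z" or "A" <= first <= "Z"):
        return None
    end = start
    # Collect everything, ASCII or not, up to the next ASCII delimiter.
    while end < len(text) and text[end] not in TAG_NAME_DELIMITERS:
        end += 1
    # Only ASCII uppercase letters are lowercased; other code points
    # are kept exactly as they appear.
    return "".join(
        ch.lower() if "A" <= ch <= "Z" else ch for ch in text[start:end]
    )


print(ascii(scan_tag_name("Div class='x'>")))
print(ascii(scan_tag_name("x\u2029y>")))
print(ascii(scan_tag_name("\u00e9l\u00e9ment>")))
```

The second call shows a paragraph separator ending up inside a name,
and the third shows that a non-ASCII first code point does not start
a name at all in this sketch.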

XML 1.0 codified for element and attribute names what then was the
understanding of the topic that UAX #31 now covers and made other
cases a hard failure. Later, it turned out that XML originally ruled
out too much and the whole mess that was XML 1.1 and XML 1.0 5th ed.
resulted from trying to relax the rules.

Considering that ruling out too much can be a problem later, but just
treating anything above ASCII as opaque hasn't caused trouble (that I
know of) for HTML other than compatibility issues with XML's stricter
stance, why should a programming language, if it opts to support
non-ASCII identifiers in an otherwise ASCII core syntax, implement the
complexity of UAX #31 instead of allowing everything above ASCII in
identifiers? In other words, what problem does making a programming
language conform to UAX #31 solve?
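
As a rough illustration of the alternative I'm asking about, the
sketch below accepts any code point above U+007F in an identifier and
applies the usual rules only to ASCII. The function
is_permissive_identifier is hypothetical, not something from the Rust
proposal. Python's str.isidentifier() is used only as a convenient
stand-in for an XID_Start/XID_Continue-style check, on the assumption
that it is close enough for comparison.

```python
def is_permissive_identifier(name: str) -> bool:
    """Hypothetical rule: ASCII letters, digits and '_' behave as
    usual (no leading digit); every code point above U+007F is
    accepted anywhere, with no Unicode property lookup."""
    if not name:
        return False

    def ok(ch: str, first: bool) -> bool:
        if ord(ch) > 0x7F:
            return True  # opaque: anything above ASCII is allowed
        if ch == "_" or "a" <= ch <= "z" or "A" <= ch <= "Z":
            return True
        return not first and "0" <= ch <= "9"

    return all(ok(ch, i == 0) for i, ch in enumerate(name))


# Compare the permissive rule with the property-based check.
for candidate in ["na\u00efve", "a\u2029b", "1abc", "a b"]:
    print(
        ascii(candidate),
        is_permissive_identifier(candidate),
        candidate.isidentifier(),
    )
```

The two checks agree on the ordinary cases and differ on the name
containing U+2029 PARAGRAPH SEPARATOR, which is exactly the kind of
case the question is about.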

Allowing anything above ASCII will lead to some cases that obviously
don't make sense, such as declaring a function whose name is a
paragraph separator, but why is it important to prohibit that kind of
thing when prohibiting things risks prohibiting too much, as happened
with XML, and people just don't mint identifiers that aren't practical
to them? Is there some important badness prevention concern that
applies to programming languages more than it applies to HTML? The key
thing here in terms of considering if badness is _prevented_ isn't
what's valid HTML but what the parser can actually put in the DOM, and
the HTML parser can actually put any non-ASCII code point in the DOM
as an element or attribute name (after the initial ASCII code point).

(The above question is orthogonal to normalization. I do see the value
of normalizing identifiers to NFC or requiring them to be in NFC to
begin with. I'm inclined to consider NFKC as a bug in the Rust
proposal.)
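
For readers who want to see the difference between the two forms, a
small sketch using Python's unicodedata module follows. The sample
strings are my own picks for illustration. NFC only composes
canonically equivalent sequences, while NFKC additionally folds
compatibility characters such as ligatures and fullwidth letters into
their plain counterparts.

```python
import unicodedata

# Sample strings chosen for illustration only.
samples = {
    "decomposed e-acute": "e\u0301",
    "fi ligature": "\ufb01le",
    "fullwidth A": "\uff21",
}

for label, text in samples.items():
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    # ascii() makes the exact code points visible in the output.
    print(label, ascii(text), "NFC:", ascii(nfc), "NFKC:", ascii(nfkc))
```

Under NFC the ligature and the fullwidth letter remain distinct from
"fi" and "A"; under NFKC they do not.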
-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/