Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Henri Sivonen via Unicode unicode at unicode.org
Fri Jun 8 04:07:48 CDT 2018


On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <hsivonen at hsivonen.fi> wrote:
> Considering that ruling out too much can be a problem later, but just
> treating anything above ASCII as opaque hasn't caused trouble (that I
> know of) for HTML other than compatibility issues with XML's stricter
> stance, why should a programming language, if it opts to support
> non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> complexity of UAX #31 instead of allowing everything above ASCII in
> identifiers? In other words, what problem does making a programming
> language conform to UAX #31 solve?

After refreshing my memory of XML history, I realize that mentioning
XML does not helpfully illustrate my question despite the mention of
XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
ignore the XML part.

Trying to rephrase my question more clearly:

Let's assume that we are designing a computer-parseable syntax where
tokens consisting of user-chosen characters can't occur next to each
other and, instead, always have some syntax-reserved characters
between them. That is, I'm talking about syntaxes that look like this
(could be e.g. Java):

    ab.cd();

Here, ab and cd are tokens with user-chosen characters, whereas the
space (the indent), the period, the parentheses, and the semicolon are
syntax-reserved. We know that ab and cd are distinct tokens, because
there is a period between them, and we know that the opening
parenthesis ends the cd token.
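
To make this concrete, a permissive tokenizer of the kind I have in
mind could look roughly like the sketch below (Rust, purely
illustrative; the reserved set is a toy one I made up for this
example). It collects any maximal run of non-reserved characters into
a user-chosen token:

    // Illustrative sketch only: a hypothetical reserved set for a
    // toy Java-like syntax.
    fn is_syntax_reserved(c: char) -> bool {
        c.is_ascii_whitespace() || matches!(c, '.' | '(' | ')' | ';')
    }

    // Collect maximal runs of non-reserved characters as user-chosen
    // tokens; emit non-whitespace reserved characters as their own
    // one-character tokens and drop whitespace.
    fn tokenize(input: &str) -> Vec<&str> {
        let mut tokens = Vec::new();
        let mut start = None;
        for (i, c) in input.char_indices() {
            if is_syntax_reserved(c) {
                if let Some(s) = start.take() {
                    tokens.push(&input[s..i]);
                }
                if !c.is_ascii_whitespace() {
                    tokens.push(&input[i..i + c.len_utf8()]);
                }
            } else if start.is_none() {
                start = Some(i);
            }
        }
        if let Some(s) = start {
            tokens.push(&input[s..]);
        }
        tokens
    }

With this approach, tokenize("ab.cd();") yields ab, ., cd, (, ), and
; without the tokenizer needing to know anything about Unicode
character properties.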

To illustrate what I'm explicitly _not_ talking about, I'm not talking
about a syntax like this:

    αβ⊗γδ

Here, αβ and γδ are user-chosen variable names, ⊗ is a user-chosen
operator, and the distinction between the different kinds of
user-named tokens has to be known somehow in order to tell that there
are three distinct tokens: αβ, ⊗, and γδ.

My question is:

When designing a syntax where tokens consisting of user-chosen
characters can't occur next to each other without some syntax-reserved
characters between them, what advantages are there in limiting the
user-chosen characters according to UAX #31 as opposed to treating any
character that is not syntax-reserved as a character that can occur in
user-named tokens?
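
To spell out what the UAX #31 side of the comparison means in code:
it amounts to a per-token check roughly like the sketch below (Rust
again; the unicode-xid crate exposes the XID_Start and XID_Continue
properties, but the function itself is my illustration, and the
underscore allowance is the common profile extension rather than part
of the UAX #31 default):

    use unicode_xid::UnicodeXID; // crates.io crate "unicode-xid"

    // Roughly the UAX #31 default identifier syntax: an XID_Start
    // character followed by zero or more XID_Continue characters.
    // ('_' is a common profile addition, not part of the default.)
    fn is_uax31_identifier(token: &str) -> bool {
        let mut chars = token.chars();
        match chars.next() {
            Some(c) if c.is_xid_start() || c == '_' => {
                chars.all(|c| c.is_xid_continue())
            }
            _ => false,
        }
    }

The permissive alternative needs no such check at all: the
tokenizer's reserved set already delimits the tokens.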

I understand that taking the latter approach allows users to mint
tokens that on some aesthetic measure don't make sense (e.g. minting
tokens that consist of glyphless code points), but why is it important
to prescribe that this is prohibited as opposed to just letting users
choose not to mint tokens that are inconvenient for them to work with
given the behavior that their plain text editor gives to various
characters? That is, why is conforming to UAX #31 worth the risk of
prohibiting the use of characters that some users might want to use?
The successive introductions of XID after ID and of Extended Hashtag
Identifiers after XID indicate that over-restriction has been a
problem before.

Limiting user-minted tokens to UAX #31 does not appear to be necessary
for security purposes considering that HTML and CSS exist in a
particularly adversarial environment and get away with taking the
approach that any character that isn't a syntax-reserved character is
collected as part of a user-minted identifier. (Informally, both treat
non-ASCII characters the same as an ASCII underscore. HTML even treats
non-whitespace, non-U+0000 ASCII controls that way.)
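
(Sketched in the same vein as above, and simplified: the CSS Syntax
tokenizer's name rules boil down to roughly the predicates below. The
current spec refines the non-ASCII set somewhat, so treat this as the
coarse, historical shape rather than an exact transcription.)

    // Rough shape of CSS's name code point rules: letters, '_', and
    // anything at or above U+0080 can start a name; digits and '-'
    // can additionally continue one. (Simplified; the exact
    // non-ASCII set has been refined across spec revisions.)
    fn is_css_name_start(c: char) -> bool {
        c.is_ascii_alphabetic() || c == '_' || c >= '\u{80}'
    }

    fn is_css_name(c: char) -> bool {
        is_css_name_start(c) || c.is_ascii_digit() || c == '-'
    }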

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


