Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Mark Davis ☕️ via Unicode unicode at
Wed Jun 13 06:49:56 CDT 2018

> That is, why is conforming to UAX #31 worth the risk of prohibiting the
> use of characters that some users might want to use?

One could parse for certain sequences, putting characters into a number of
broad categories. Very approximately:

   - junk ~= [[:cn:][:cs:][:co:]]+
   - whitespace ~= [[:z:][:c:]-junk]+
   - syntax ~= [[:s:][:p:]] // broadly speaking, including both the
   language syntax & user-named operators
   - identifiers ~= [all-else]+
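
As a rough illustration of the four categories above, here is a minimal Python sketch that buckets code points by General_Category using the standard unicodedata module. The function names broad_class and split_runs are made up for this example, and the mapping is only as approximate as the list itself.

```python
import unicodedata
from itertools import groupby

def broad_class(ch):
    """Map one character to a broad bucket by General_Category (approximate)."""
    cat = unicodedata.category(ch)
    if cat in ("Cn", "Cs", "Co"):   # unassigned, surrogate, private use
        return "junk"
    if cat[0] in "ZC":              # separators, plus the remaining C* (Cc, Cf)
        return "whitespace"
    if cat[0] in "SP":              # symbols and punctuation
        return "syntax"
    return "identifier"             # all else: letters, marks, numbers

def split_runs(text):
    """Split text into (bucket, run) pairs; syntax characters stay separate."""
    runs = []
    for kind, chars in groupby(text, key=broad_class):
        chunk = "".join(chars)
        if kind == "syntax":
            runs.extend((kind, ch) for ch in chunk)
        else:
            runs.append((kind, chunk))
    return runs

print(split_runs("αβ⊗γδ"))
```

Note that under this crude split the ASCII underscore (General_Category Pc) lands in the syntax bucket, which is one reason the categories are only a starting point.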

UAX #31 specifies several different kinds of identifiers, and takes roughly
that approach for, although the
focus there is on immutability.

So an implementation could choose to follow that course, rather than the
more narrowly defined identifiers in Alternatively,
one can conform to the Default Identifiers but declare a profile that
expands the allowable characters. One could take a Swiftian approach
for example...
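
To make the "profile" option concrete, here is a small sketch in Python. It assumes Python's own str.isidentifier() as a stand-in for the default identifier check (Python's rule is based on XID_Start/XID_Continue and also allows an underscore), and the extra character sets are purely hypothetical, not taken from any real language's profile.

```python
# Hypothetical profile: characters allowed in addition to the default rule.
EXTRA_START = frozenset("$")
EXTRA_CONTINUE = frozenset("$-")

def is_profile_identifier(token, extra_start=EXTRA_START,
                          extra_continue=EXTRA_CONTINUE):
    """Default identifier check, widened by a declared set of extra characters."""
    if not token:
        return False
    # Swap each permitted extra character for "_", which the default rule
    # already accepts in any position, then defer to the default check.
    head = "_" if token[0] in extra_start else token[0]
    tail = "".join("_" if ch in extra_continue else ch for ch in token[1:])
    return (head + tail).isidentifier()

for candidate in ["αβ", "$price", "kebab-case", "-x", "a b"]:
    print(candidate, is_profile_identifier(candidate))
```

The point of the design is that the base rule stays the default one, and everything a language adds on top is listed explicitly in one place.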


On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <
unicode at> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <hsivonen at>
> wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning
> XML does not helpfully illustrate my question despite the mention of
> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
> ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where
> tokens consisting of user-chosen characters can't occur next to each
> other and, instead, always have some syntax-reserved characters
> between them. That is, I'm talking about syntaxes that look like this
> (could be e.g. Java):
>
>   ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters whereas space
> (the indent), period, parenthesis and the semicolon are
> syntax-reserved. We know that ab and cd are distinct tokens, because
> there is a period between them, and we know the opening parenthesis
> ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking
> about a syntax like this:
>
> αβ⊗γδ
>
> Here αβ and γδ are user-named variable names and ⊗ is a user-named
> operator and the distinction between different kinds of user-named
> tokens has to be known somehow in order to be able to tell that there
> are three distinct tokens: αβ, ⊗, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with the user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there from limiting the user-chosen
> characters according to UAX #31 as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?
>
> I understand that taking the latter approach allows users to mint
> tokens that on some aesthetic measure don't make sense (e.g. minting
> tokens that consist of glyphless code points), but why is it important
> to prescribe that this is prohibited as opposed to just letting users
> choose not to mint tokens that are inconvenient for them to work with
> given the behavior that their plain text editor gives to various
> characters? That is, why is conforming to UAX #31 worth the risk of
> prohibiting the use of characters that some users might want to use?
>
> The introduction of XID after ID and the introduction of Extended
> Hashtag Identifiers after XID is indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary
> for security purposes considering that HTML and CSS exist in a
> particularly adversarial environment and get away with taking the
> approach that any character that isn't a syntax-reserved character is
> collected as part of a user-minted identifier. (Informally, both treat
> non-ASCII characters the same as an ASCII underscore. HTML even treats
> non-whitespace, non-U+0000 ASCII controls that way.)
>
> --
> Henri Sivonen
> hsivonen at
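
For comparison, the alternative described in the quoted question (any character that is not syntax-reserved is collected into a user-named token) can be sketched as below. The RESERVED set and the tokenize function are assumptions for illustration only and do not reflect any actual HTML, CSS, or Java tokenizer.

```python
# Hypothetical set of syntax-reserved ASCII characters for a toy language.
RESERVED = frozenset(" \t\r\n.,;:(){}[]")

def tokenize(source):
    """Collect every non-reserved character, ASCII or not, into name tokens."""
    tokens = []
    current = []
    for ch in source:
        if ch in RESERVED:
            if current:  # a reserved character ends the current name
                tokens.append(("name", "".join(current)))
                current = []
            if not ch.isspace():  # reserved whitespace only separates
                tokens.append(("reserved", ch))
        else:
            current.append(ch)
    if current:
        tokens.append(("name", "".join(current)))
    return tokens

print(tokenize("  ab.cd();"))
print(tokenize("αβ⊗γδ"))
```

The second call yields a single name token, since nothing in this scheme separates a user-named operator from the names around it; that is the case the question explicitly sets aside.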