Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Henri Sivonen via Unicode
unicode at unicode.org
Thu Nov 22 04:12:16 CST 2018
On Wed, Jun 13, 2018 at 2:49 PM Mark Davis ☕️ wrote:
> > That is, why is conforming to UAX #31 worth the risk of prohibiting the use of characters that some users might want to use?
> One could parse for certain sequences, putting characters into a number of broad categories. Very approximately:
> junk ~= [[:cn:][:cs:][:co:]]+
> whitespace ~= [[:z:][:c:]-junk]+
> syntax ~= [[:s:][:p:]] // broadly speaking, including both the language syntax & user-named operators
> identifiers ~= [all-else]+
> UAX #31 specifies several different kinds of identifiers, and takes roughly that approach for http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the focus there is on immutability.
> So an implementation could choose to follow that course, rather than the more narrowly defined identifiers in http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively, one can conform to the Default Identifiers but declare a profile that expands the allowable characters. One could take a Swiftian approach, for example...
Thank you and sorry about my slow reply. Why is excluding junk important?
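To make the approximate categories quoted above concrete, here is a minimal sketch in Python. It assumes the bracketed expressions refer to Unicode General_Category values, and that "whitespace" means everything in Z or C that is not junk. The names `broad_category` and `segment` are hypothetical, and the results depend on the Unicode version shipped with the Python build.

```python
import unicodedata
from itertools import groupby


def broad_category(ch: str) -> str:
    """Map one code point to one of the four rough classes quoted above."""
    gc = unicodedata.category(ch)
    # junk ~= unassigned (Cn), surrogates (Cs), private use (Co)
    if gc in ("Cn", "Cs", "Co"):
        return "junk"
    # whitespace ~= separators (Z*) plus remaining "other" (Cc, Cf)
    if gc[0] in ("Z", "C"):
        return "whitespace"
    # syntax ~= symbols (S*) and punctuation (P*)
    if gc[0] in ("S", "P"):
        return "syntax"
    # identifiers ~= all else (letters, marks, numbers)
    return "identifier"


def segment(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing a broad class."""
    return [(kind, "".join(run)) for kind, run in groupby(text, key=broad_category)]


if __name__ == "__main__":
    for sample in ("ab.cd();", "αβ⊗γδ"):
        print(sample, segment(sample))
```

Note that under this rough split the underscore (General_Category Pc) lands in "syntax", so a real profile would need to carve out exceptions.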
> On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode wrote:
>> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen wrote:
>> > Considering that ruling out too much can be a problem later, but just
>> > treating anything above ASCII as opaque hasn't caused trouble (that I
>> > know of) for HTML other than compatibility issues with XML's stricter
>> > stance, why should a programming language, if it opts to support
>> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
>> > complexity of UAX #31 instead of allowing everything above ASCII in
>> > identifiers? In other words, what problem does making a programming
>> > language conform to UAX #31 solve?
>> After refreshing my memory of XML history, I realize that mentioning
>> XML does not helpfully illustrate my question despite the mention of
>> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
>> ignore the XML part.
>> Trying to rephrase my question more clearly:
>> Let's assume that we are designing a computer-parseable syntax where
>> tokens consisting of user-chosen characters can't occur next to each
>> other and, instead, always have some syntax-reserved characters
>> between them. That is, I'm talking about syntaxes that look like this
>> (could be e.g. Java):
>>   ab.cd();
>> Here, ab and cd are tokens with user-chosen characters whereas space
>> (the indent), period, parenthesis and the semicolon are
>> syntax-reserved. We know that ab and cd are distinct tokens, because
>> there is a period between them, and we know the opening parenthesis
>> ends the cd token.
>> To illustrate what I'm explicitly _not_ talking about, I'm not talking
>> about a syntax like this:
>>   αβ⊗γδ
>> Here αβ and γδ are user-named variable names and ⊗ is a user-named
>> operator and the distinction between different kinds of user-named
>> tokens has to be known somehow in order to be able to tell that there
>> are three distinct tokens: αβ, ⊗, and γδ.
>> My question is:
>> When designing a syntax where tokens with the user-chosen characters
>> can't occur next to each other without some syntax-reserved characters
>> between them, what advantages are there from limiting the user-chosen
>> characters according to UAX #31 as opposed to treating any character
>> that is not a syntax-reserved character as a character that can occur
>> in user-named tokens?
>> I understand that taking the latter approach allows users to mint
>> tokens that on some aesthetic measure don't make sense (e.g. minting
>> tokens that consist of glyphless code points), but why is it important
>> to prescribe that this is prohibited as opposed to just letting users
>> choose not to mint tokens that are inconvenient for them to work with
>> given the behavior that their plain text editor gives to various
>> characters? That is, why is conforming to UAX #31 worth the risk of
>> prohibiting the use of characters that some users might want to use?
>> The introduction of XID after ID and the introduction of Extended
>> Hashtag Identifiers after XID is indicative of over-restriction having
>> been a problem.
>> Limiting user-minted tokens to UAX #31 does not appear to be necessary
>> for security purposes considering that HTML and CSS exist in a
>> particularly adversarial environment and get away with taking the
>> approach that any character that isn't a syntax-reserved character is
>> collected as part of a user-minted identifier. (Informally, both treat
>> non-ASCII characters the same as an ASCII underscore. HTML even treats
>> non-whitespace, non-U+0000 ASCII controls that way.)
>> Henri Sivonen
>> hsivonen at hsivonen.fi
hsivonen at hsivonen.fi
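
For illustration, the alternative asked about in the quoted question (any character that is not syntax-reserved may occur in a user-named token) can be sketched as follows. The reserved set and the name `tokenize` are assumptions made up for this example; they are not taken from HTML, CSS, or any real language.

```python
# Hypothetical syntax-reserved characters: a few ASCII punctuation marks
# and ASCII whitespace. Everything else, including all non-ASCII, is
# treated as part of a user-named token.
RESERVED = set(" \t\n.();")


def tokenize(source: str) -> list[tuple[str, str]]:
    """Collect runs of non-reserved characters as user-named tokens."""
    tokens: list[tuple[str, str]] = []
    current: list[str] = []
    for ch in source:
        if ch in RESERVED:
            # A reserved character always ends the pending user token.
            if current:
                tokens.append(("name", "".join(current)))
                current = []
            # Whitespace only separates; other reserved characters are tokens.
            if not ch.isspace():
                tokens.append(("reserved", ch))
        else:
            current.append(ch)
    if current:
        tokens.append(("name", "".join(current)))
    return tokens


if __name__ == "__main__":
    print(tokenize("  ab.cd();"))
    print(tokenize("αβ⊗γδ"))
```

With this approach `αβ⊗γδ` comes out as a single user-named token, which is why syntaxes with user-named operators are outside the scope of the question. A token containing a glyphless code point such as U+200B is also accepted without complaint.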