Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Mark Davis ☕️ via Unicode unicode at unicode.org
Wed Jun 13 06:49:56 CDT 2018


> That is, why is conforming to UAX #31 worth the risk of prohibiting the
> use of characters that some users might want to use?

One could parse for certain sequences, putting characters into a number of
broad categories. Very approximately:

   - junk ~= [[:cn:][:cs:][:co:]]+
   - whitespace ~= [[:z:][:c:]-junk]+
   - syntax ~= [[:s:][:p:]] // broadly speaking, including both the
   language syntax & user-named operators
   - identifiers ~= [all-else]+
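
A rough sketch of that classification in Python (illustrative only: the
category sets below are my approximation of the buckets above, keyed off
unicodedata.category(); a real lexer would carve out the language's own
reserved syntax characters before falling through to these sets):

    import unicodedata

    # Approximate buckets, by General_Category:
    JUNK = {'Cn', 'Cs', 'Co'}                    # unassigned, surrogates, private use
    WHITESPACE = {'Zs', 'Zl', 'Zp', 'Cc', 'Cf'}  # separators, controls, format chars
    SYNTAX = {'Sm', 'Sc', 'Sk', 'So',            # symbols
              'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'}  # punctuation

    def classify(ch):
        cat = unicodedata.category(ch)
        if cat in JUNK:
            return 'junk'
        if cat in WHITESPACE:
            return 'whitespace'
        if cat in SYNTAX:
            return 'syntax'
        return 'identifier'  # everything else goes into user-named tokens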

UAX #31 specifies several different kinds of identifiers, and takes roughly
that approach for
http://unicode.org/reports/tr31/#Immutable_Identifier_Syntax, although the
focus there is on immutability.

So an implementation could choose to follow that course, rather than the
more narrowly defined identifiers in
http://unicode.org/reports/tr31/#Default_Identifier_Syntax. Alternatively,
one can conform to the Default Identifiers but declare a profile that
expands the allowable characters. One could take a Swiftian approach
<http://www.globalnerdy.com/2014/06/03/swift-fun-fact-1-you-can-use-emoji-characters-in-variable-constant-function-and-class-names/>,
for example...
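
Very roughly, such a profile could be sketched in Python like this (using
str.isidentifier(), whose rules are based on XID_Start/XID_Continue, as a
stand-in for a real UAX #31 default-identifier check; the extra set is a
purely hypothetical profile expansion):

    # Hypothetical profile: default-style identifiers plus a few extra characters.
    EXTRA_ALLOWED = {'\N{HEAVY BLACK HEART}', '\N{GRINNING FACE}'}

    def is_start(ch):
        # isidentifier() on a single character roughly tests "can start an identifier".
        return ch.isidentifier() or ch in EXTRA_ALLOWED

    def is_continue(ch):
        # Prefixing '_' makes isidentifier() act as a rough continue-character test.
        return ('_' + ch).isidentifier() or ch in EXTRA_ALLOWED

    def is_profile_identifier(token):
        return (len(token) > 0
                and is_start(token[0])
                and all(is_continue(c) for c in token[1:]))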

Mark

On Fri, Jun 8, 2018 at 11:07 AM, Henri Sivonen via Unicode <
unicode at unicode.org> wrote:

> On Wed, Jun 6, 2018 at 2:55 PM, Henri Sivonen <hsivonen at hsivonen.fi>
> wrote:
> > Considering that ruling out too much can be a problem later, but just
> > treating anything above ASCII as opaque hasn't caused trouble (that I
> > know of) for HTML other than compatibility issues with XML's stricter
> > stance, why should a programming language, if it opts to support
> > non-ASCII identifiers in an otherwise ASCII core syntax, implement the
> > complexity of UAX #31 instead of allowing everything above ASCII in
> > identifiers? In other words, what problem does making a programming
> > language conform to UAX #31 solve?
>
> After refreshing my memory of XML history, I realize that mentioning
> XML does not helpfully illustrate my question despite the mention of
> XML 1.0 5th ed. in UAX #31 itself. My apologies for that. Please
> ignore the XML part.
>
> Trying to rephrase my question more clearly:
>
> Let's assume that we are designing a computer-parseable syntax where
> tokens consisting of user-chosen characters can't occur next to each
> other and, instead, always have some syntax-reserved characters
> between them. That is, I'm talking about syntaxes that look like this
> (could be e.g. Java):
>
>     ab.cd();
>
> Here, ab and cd are tokens with user-chosen characters, whereas the space
> (the indent), the period, the parentheses, and the semicolon are
> syntax-reserved. We know that ab and cd are distinct tokens, because
> there is a period between them, and we know the opening parenthesis
> ends the cd token.
>
> To illustrate what I'm explicitly _not_ talking about, I'm not talking
> about a syntax like this:
>
> αβ⊗γδ
>
> Here αβ and γδ are user-named variable names and ⊗ is a user-named
> operator and the distinction between different kinds of user-named
> tokens has to be known somehow in order to be able to tell that there
> are three distinct tokens: αβ, ⊗, and γδ.
>
> My question is:
>
> When designing a syntax where tokens with user-chosen characters
> can't occur next to each other without some syntax-reserved characters
> between them, what advantages are there in limiting the user-chosen
> characters according to UAX #31, as opposed to treating any character
> that is not a syntax-reserved character as a character that can occur
> in user-named tokens?
>
> I understand that taking the latter approach allows users to mint
> tokens that on some aesthetic measure don't make sense (e.g. minting
> tokens that consist of glyphless code points), but why is it important
> to prescribe that this is prohibited as opposed to just letting users
> choose not to mint tokens that are inconvenient for them to work with
> given the behavior that their plain text editor gives to various
> characters? That is, why is conforming to UAX #31 worth the risk of
> prohibiting the use of characters that some users might want to use?
> The introduction of XID after ID and the introduction of Extended
> Hashtag Identifiers after XID are indicative of over-restriction having
> been a problem.
>
> Limiting user-minted tokens to UAX #31 does not appear to be necessary
> for security purposes considering that HTML and CSS exist in a
> particularly adversarial environment and get away with taking the
> approach that any character that isn't a syntax-reserved character is
> collected as part of a user-minted identifier. (Informally, both treat
> non-ASCII characters the same as an ASCII underscore. HTML even treats
> non-whitespace, non-U+0000 ASCII controls that way.)
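
(For concreteness, a minimal sketch of that lexing approach in Python, with a
purely hypothetical syntax-reserved set: every character outside that set and
outside whitespace, ASCII or not, is simply collected into the user-minted
token.)

    SYNTAX_RESERVED = set('.;()[]{}+-*/=<>,"\'')  # hypothetical for this sketch
    WHITESPACE = set(' \t\r\n')

    def tokenize(source):
        tokens, current = [], []
        for ch in source:
            if ch in SYNTAX_RESERVED or ch in WHITESPACE:
                if current:
                    tokens.append(('ident', ''.join(current)))
                    current = []
                if ch in SYNTAX_RESERVED:
                    tokens.append(('syntax', ch))
            else:
                current.append(ch)  # anything else is part of an identifier
        if current:
            tokens.append(('ident', ''.join(current)))
        return tokens

    # tokenize("ab.cd();") == [('ident', 'ab'), ('syntax', '.'), ('ident', 'cd'),
    #                          ('syntax', '('), ('syntax', ')'), ('syntax', ';')]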
>
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/
>
>