Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Alastair Houghton via Unicode
unicode at unicode.org
Wed Jun 6 04:29:53 CDT 2018
On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
> The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.
> Are there any cases where this will lead to inconsistencies? I.e. can the NFKC of a valid UAX 31 ident be invalid UAX 31?
> (In general, are there other problems folks see with this proposal?)
IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community. Everyone can *type* ASCII and everyone can read Latin characters (for reasonably wide values of “everyone”, at any rate… most computer users aren’t going to have a problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and there is no good fix or workaround for this.
Note that this is orthogonal to issues such as which language identifiers or comments are written in (indeed, there’s no problem with comments written in any script you please); the problem is that e.g. given a function
func الطول(s : String)
it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to call it. This isn’t true of e.g.
func pituus(s : String)
Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to type that.
Copy and paste is not always a good solution here, I might add; in bidi text in particular, copy and paste can have confusing results (and results that vary depending on the editor being used). There is also the issue of additional confusions that might be introduced; even if you stick to Latin scripts, this could be a problem sometimes (e.g. at small sizes, it’s hard to distinguish ă and ǎ or ȩ and ę), and of course there are Cyrillic and Greek characters that are indistinguishable from their Latin counterparts in most fonts. UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامهای; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.
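To make the confusable point concrete, here is a short Python sketch (Python chosen only because its stdlib unicodedata module is handy for this) showing that Latin “a” and Cyrillic “а” are distinct code points and stay distinct under NFKC, so normalization alone does nothing for cross-script confusables, while NFKC *does* fold genuine compatibility variants:

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

latin_a = "\u0061"     # LATIN SMALL LETTER A
cyrillic_a = "\u0430"  # CYRILLIC SMALL LETTER A

# Visually identical in most fonts, but different code points...
print(latin_a == cyrillic_a)                              # False

# ...and NFKC does not unify them, so these remain two identifiers.
print(nfkc("p" + latin_a + "yload") ==
      nfkc("p" + cyrillic_a + "yload"))                   # False

# By contrast, NFKC does unify compatibility variants, e.g. the
# "fi" ligature U+FB01 with the two-letter sequence "fi".
print(nfkc("\ufb01le") == "file")                         # True
```

This is why the equivalence question and the confusables question are separate problems: NFKC equivalence catches ligatures and width/compatibility variants, but cross-script lookalikes survive it untouched.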
You could avoid *some* of these issues by restricting the allowable scripts somehow (e.g. requiring that an identifier that had Latin characters could not also contain Cyrillic and so on) or perhaps by establishing additional canonical equivalences between similar looking characters (so that e.g. while a and а - or, more radically, ă and ǎ - might be different characters, you might nevertheless regard them as the same for symbol lookup). It might be worth looking at UTR #36 and maybe UTS #39, not so much from a security standpoint, but more because those documents already have to deal with the problem of confusables.
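As a rough illustration of the script-restriction idea, the sketch below uses the first word of each character’s Unicode name as a crude stand-in for its Script property (which Python’s stdlib does not expose; a real implementation would consult the UCD’s Scripts.txt, along the lines of the mixed-script detection in UTS #39). The function names here are hypothetical, not from any actual checker:

```python
import unicodedata

def rough_scripts(ident: str) -> set:
    """Crude heuristic: first word of each character's Unicode name.

    This is NOT the real Script property (e.g. it yields "DIGIT" for
    "1"); it only illustrates the shape of a mixed-script check.
    """
    scripts = set()
    for ch in ident:
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split()[0])
    return scripts

def looks_mixed(ident: str) -> bool:
    # Flag identifiers that draw letters from more than one of these
    # mutually confusable scripts.
    suspect = {"LATIN", "CYRILLIC", "GREEK"} & rough_scripts(ident)
    return len(suspect) > 1

print(looks_mixed("paypal"))       # False: all Latin
print(looks_mixed("p\u0430ypal"))  # True: Latin plus Cyrillic а
```

A language adopting this kind of rule would still need policy decisions about common cases (e.g. Japanese identifiers legitimately mix Han, Hiragana and Katakana), which is exactly the sort of thing UTS #39’s restriction levels try to codify.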
You could also recommend that people stick to ASCII unless there’s a good reason to do otherwise (and note that using non-ASCII characters might impact on their ability to collaborate with teams in other countries).
None of this is necessarily a reason *not* to support non-ASCII identifiers, but it *is* something to be cautious about. Right now, most programming languages operate as a lingua franca, with code written by a wide range of people, not all of whom speak English, but all of whom can collaborate together to a greater or lesser degree by virtue of the fact that they all understand and can write code. Going down this particular rabbit hole risks changing that, and not for the better; IMO it’s important to understand that when considering whether the trade-off of being able to use non-ASCII characters in identifiers is genuinely worth it.