Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Alastair Houghton via Unicode unicode at
Thu Jun 7 11:01:08 CDT 2018

On 7 Jun 2018, at 15:51, Frédéric Grosshans via Unicode <unicode at> wrote:
>> IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community.  Everyone can *type* ASCII and everyone can read Latin characters (for reasonably wide values of “everyone”, at any rate… most computer users aren’t going to have a problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and there is no good fix or workaround for this.
> Well, your “reasonable” value of everyone excludes many kids,

Every keyboard I’ve ever seen, including Chinese ones, is marked with ASCII characters as well. Typing ASCII on a machine in the Chinese locale might not be entirely straightforward, but entering Chinese characters, even on such a machine, takes significant training, and on a machine not set to Chinese locale it might even require the installation of additional software. It isn’t even the case, as I understand it, that all machines set to Chinese locales use the same input method, so being able to enter Chinese on one system doesn’t necessarily mean you’ll be able to do so on another. (I imagine it makes it easier to learn, once you’ve done it once, but still…)

I appreciate that the upshot of the Anglicised world of software engineering is that native English speakers have an advantage, and those for whom Latin isn’t their usual script are at a particular disadvantage, and I’m sure that seems unfair to many of us — but that doesn’t mean that allowing the use of other scripts everywhere, desirable as it is, is entirely unproblematic.

>> it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to call it.
> OK. Clearly, someone not knowing the Arabic alphabet will have difficulties with this one, but if one has good reason to think the targeted developer community is literate in Arabic and has a lower mastery of the Latin alphabet, it still may be a good idea.
> If I understand you correctly, an Arabic speaker should always transliterate the function name to ASCII,

That’s one option; or they could write it in Arabic, but they need to be aware of the consequences of doing so (and those they are working for or with also need to understand that); or they could choose some other language, perhaps one shared with other teams who are likely to work on the code. Imagine you outsourced development to a team that happened to be Arabic speaking, and they developed (let’s say) French language software for you, but later you wanted to bring development in house and found all the identifiers were in Arabic script, which made the code very difficult for your developers to work with. That isn’t exactly going to make your day, and if it isn’t a problem that anyone has mentioned, it might not have been obvious when you originally outsourced your development that you needed to make sure people weren’t going to do that.

>>  UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.
> In ASCII, identifiers with I, l, and 1 can be difficult to tell apart. And it is not an artificial problem: I once had some difficulties with an automatically generated login which was do11y, but I tried to type dolly, despite my familiarity with ASCII. So I guess this problem is not specific to the ASCII vs non-ASCII debate

It isn’t, though fonts used by programmers typically emphasise the differences between I, l and 1 as well as 0 and O, 5 and S and so on specifically to avoid this problem.
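
To make the Farsi example concrete, here is a minimal Python sketch of how the two identifiers quoted above differ. I have spelled the code points out by hand, assuming the final letter is U+06CC (Farsi yeh) and the separator is U+200C (zero width non-joiner); the helper name differing_code_points is mine, purely for illustration.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, invisible in most renderings

# Code points written as escapes so the invisible character is explicit.
# Assumption: the identifiers end in U+06CC (ARABIC LETTER FARSI YEH).
joined = "\u0646\u0627\u0645\u0647\u0627\u06cc"
separated = "\u0646\u0627\u0645\u0647" + ZWNJ + "\u0627\u06cc"


def differing_code_points(a, b):
    """Return the names of code points present in one string but not the other."""
    return sorted(unicodedata.name(ch) for ch in set(a) ^ set(b))


# NFKC has no mapping for U+200C, so both spellings survive normalisation
# as two distinct identifiers.
nfkc_joined = unicodedata.normalize("NFKC", joined)
nfkc_separated = unicodedata.normalize("NFKC", separated)

print(differing_code_points(joined, separated))
print(nfkc_joined == nfkc_separated)
```

The only difference between the two is a character with no visible glyph of its own, which is why telling them apart depends so heavily on the font.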

But please don’t misunderstand; I am not — and have not been — arguing against non-ASCII identifiers. We were asked whether there were any problems. These are problems (or perhaps we might call them “trade-offs”). We can debate the severity of them, and whether, and what, it’s worthwhile doing anything to mitigate any of them. What we shouldn’t do is sweep them under the carpet.

Personally I think a combination of documentation to explain that it’s worth thinking carefully about which script(s) to use, and some steps to consider certain characters to be equivalent even though they aren’t the same (and shouldn’t be the same even when normalised) might be a good idea. Is that really so controversial a position?
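
As a rough illustration of what I mean by treating characters as equivalent without making them the same, here is a minimal Python sketch. The LOOKALIKES table and the skeleton and confusable functions are hypothetical and hand-picked for this example; a real tool would presumably take its data from something like the confusables data in UTS #39 rather than from a list I made up.

```python
import unicodedata

# Hypothetical, hand-picked table for illustration only. Each key is folded
# to a representative look-alike; U+200C (ZWNJ) is simply dropped.
LOOKALIKES = {"1": "l", "I": "l", "0": "O", "5": "S", "\u200c": ""}


def skeleton(identifier):
    """Normalise to NFC, then fold each look-alike to its representative."""
    normalised = unicodedata.normalize("NFC", identifier)
    return "".join(LOOKALIKES.get(ch, ch) for ch in normalised)


def confusable(a, b):
    """True when two distinct identifiers share a skeleton."""
    return a != b and skeleton(a) == skeleton(b)


# The identifiers stay distinct under normalisation; only the warning check
# treats them as equivalent.
print(confusable("do11y", "dolly"))
print(unicodedata.normalize("NFKC", "do11y") == unicodedata.normalize("NFKC", "dolly"))
```

The point of the design is that the compiler still sees two different names; the skeleton comparison is only there so that a linter or code review tool can flag the pair for a human to look at.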

Kind regards,


