Can NFKC turn valid UAX 31 identifiers into non-identifiers?
Frédéric Grosshans via Unicode
unicode at unicode.org
Thu Jun 7 09:51:58 CDT 2018
On 06/06/2018 at 11:29, Alastair Houghton via Unicode wrote:
> On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
>> The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.
>> Are there any cases where this will lead to inconsistencies? I.e. can the NFKC of a valid UAX 31 ident be invalid UAX 31?
>> (In general, are there other problems folks see with this proposal?)
> IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community. Everyone can *type* ASCII and everyone can read Latin characters (for reasonably wide values of “everyone”, at any rate… most computer users aren’t going to have a problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and there is no good fix or workaround for this.
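On the NFKC question itself, here is a rough way to probe it from
Python. This is only a sketch under assumptions: it uses Python's
str.isidentifier() as a stand-in for the UAX #31 default rule
(XID_Start XID_Continue*), which is not the same thing as the tweaked
rule in the Rust proposal, and the helper name nfkc_keeps_identifier is
mine. The sample strings are a ligature, fullwidth forms, and the two
function names discussed in this thread.

```python
import unicodedata


def nfkc_keeps_identifier(name: str) -> bool:
    """Return True if `name` is an identifier and its NFKC form is one too.

    Assumption: str.isidentifier() approximates the UAX #31 default
    identifier syntax; it is not the rule from the Rust proposal.
    """
    normalized = unicodedata.normalize("NFKC", name)
    return name.isidentifier() and normalized.isidentifier()


samples = [
    "\ufb01le",      # starts with U+FB01 LATIN SMALL LIGATURE FI
    "\uff58\uff11",  # FULLWIDTH LATIN SMALL LETTER X, FULLWIDTH DIGIT ONE
    "الطول",
    "pituus",
]

for s in samples:
    folded = unicodedata.normalize("NFKC", s)
    print(repr(s), "->", repr(folded), nfkc_keeps_identifier(s))
```

A spot check like this cannot prove that no counterexample exists; it
only shows how one would look for one.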
Well, your “reasonable” value of “everyone” excludes many kids, and puts
social barriers to computer use in front of non-native Latin writers. If
the programme has no reason to be read and written by foreign
programmers, why not use native-language, native-alphabet identifiers?
Of course, as soon as you write a function named الطول, you consciously
restrict the developer community that has access to this programme. But
you also make your programme clearer to your Arabic-speaking
community. If said community is e.g. school teachers (or students) in an
Arabic-speaking country, it may be a good choice. I don’t see the
difference from choosing to write a book in one language or another.
> Note that this is orthogonal to issues such as which language identifiers [...] are written in [...];
It is indeed different, but not orthogonal.
> the problem is that e.g. given a function
> func الطول(s : String)
> it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to call it.
OK. Clearly, someone who does not know the Arabic alphabet will have
difficulties with this one, but if one has good reason to think the
targeted developer community is literate in Arabic and has a weaker
command of the Latin alphabet, it may still be a good idea.
If I understand you correctly, an Arabic speaker should always
transliterate the function name to ASCII, and there are many different
ways to do it (see e.g.
https://en.wikipedia.org/wiki/Romanization_of_Arabic). Should they name
their function altawil, altwl, or alt.wl? And when calling it later,
they have to remember their ad hoc ASCII Arabic orthography. I don’t
doubt that many, if not most, do so, but it adds an extra burden to
programming. It’s a bit like remembering whether your name should be
transliterated in Greek as Ηουγητον or Ουχτων, and using that for every
identifier you come across. A mitigation strategy is to name your
identifiers x1, x2, x3 and so on. The common wisdom is that this is a
bad idea, and programming teachers spend some time discouraging their
students from using such a strategy. However, many Chinese websites and
email addresses are of this form, because it is the only one clear
enough for a big fraction of the
> This isn’t true of e.g.
> func pituus(s : String)
> Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to type that.
Avoiding “special characters” can be annoying in Latin-based languages,
especially for beginners, and kids among them. The (too slow) adoption
of Unicode has already eased the difficulty of writing a “Hello world”
or “What’s your name” programme, but avoiding non-ASCII characters in
identifiers can be a bit esoteric for kids with a native language full
of them. (And by the way, several big French companies regularly send me
mail with my first name turned into mojibake, while their software is
presumably written by adults.)
> UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامهای; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.
In ASCII, identifiers with I, l, and 1 can be difficult to tell apart.
And it is not an artificial problem: I once had some difficulties with
an automatically generated login which was do11y, while I kept trying to
type dolly, despite my familiarity with ASCII. So I guess this problem
is not specific to the ASCII vs non-ASCII debate.
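
To make the do11y/dolly case concrete, here is a minimal sketch of a
look-alike check for ASCII identifiers. It is an illustration only: the
LOOKALIKES table is a tiny hand-picked assumption of mine, not the
confusables data from UTS #39, and the function names skeleton and
may_be_confused are hypothetical.

```python
# Hypothetical, hand-picked table of ASCII look-alikes. This is NOT the
# UTS #39 confusables data, just enough to illustrate the idea.
LOOKALIKES = str.maketrans({"1": "l", "I": "l", "0": "O"})


def skeleton(identifier: str) -> str:
    """Fold each look-alike character onto one representative."""
    return identifier.translate(LOOKALIKES)


def may_be_confused(a: str, b: str) -> bool:
    """True if two distinct identifiers fold to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


print(may_be_confused("do11y", "dolly"))   # True
print(may_be_confused("dolly", "dolly"))   # False: same identifier
print(may_be_confused("pituus", "dolly"))  # False
```

The same folding idea carries over to non-ASCII scripts, only with a
much larger table.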