Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Thu Jun 7 09:51:58 CDT 2018

On 06/06/2018 at 11:29, Alastair Houghton via Unicode wrote:
> On 4 Jun 2018, at 20:49, Manish Goregaokar via Unicode <unicode at unicode.org> wrote:
>> The Rust community is considering adding non-ascii identifiers, which follow UAX #31 (XID_Start XID_Continue*, with tweaks). The proposal also asks for identifiers to be treated as equivalent under NFKC.
>>
>> Are there any cases where this will lead to inconsistencies? I.e. can the NFKC of a valid UAX 31 ident be invalid UAX 31?
>>
>> (In general, are there other problems folks see with this proposal?)
> IMO the major issue with non-ASCII identifiers is not a technical one, but rather that it runs the risk of fragmenting the developer community.  Everyone can *type* ASCII and everyone can read Latin characters (for reasonably wide values of “everyone”, at any rate… most computer users aren’t going to have a problem). Not everyone can type Hangul, Chinese or Arabic (for instance), and there is no good fix or workaround for this.
Well, your “reasonably wide” value of “everyone” excludes many kids, and
puts social barriers on computer use in front of people whose native
script is not Latin. If the programme has no reason to be read and
written by foreign programmers, why not use identifiers in the native
language and alphabet? Of course, as soon as you write a function named
الطول, you consciously restrict the developer community that has access
to this programme. But you also make your programme clearer to your
Arabic-speaking community. If said community is e.g. school teachers (or
students) in an Arabic-speaking country, it may be a good choice. I
don’t see the difference from choosing to write a book in one language
or another.
> Note that this is orthogonal to issues such as which language identifiers [...] are written in [...];
It is indeed different, but not orthogonal.

> the problem is that e.g. given a function
>
>    func الطول(s : String)
>
> it isn’t obvious to a non-Arabic speaking user how to enter الطول in order to call it.
OK. Clearly, someone who does not know the Arabic alphabet will have
difficulties with this one, but if one has good reason to think the
targeted developer community is literate in Arabic and has a weaker
command of the Latin alphabet, it may still be a good idea.
If I understand you correctly, an Arabic speaker should always
transliterate the function name to ASCII, and there are many different
ways to do it (see e.g.
https://en.wikipedia.org/wiki/Romanization_of_Arabic). Should they name
their function altawil, altwl, alt.wl? And when calling it later, they
have to remember their ad hoc ASCII orthography for Arabic. I don’t
doubt that many, if not most, do it, but it can add an extra burden to
programming. It’s a bit like remembering whether your name should be
transliterated in Greek as Ηουγητον or Ουχτων, and using that for every
identifier you come across. A mitigation strategy is to name your
identifiers x1, x2, x3 and so on. The common knowledge is that this is a
bad idea, and programming teachers spend some time discouraging their
students from using such a strategy. However, many Chinese website and
email addresses are of this form, because it is the only one clear
enough for a big fraction of the population.

> This isn’t true of e.g.
>
>    func pituus(s : String)
>
> Even though “pituus” is Finnish, it’s still ASCII and everyone knows how to type that.

Avoiding “special characters” can be annoying in Latin-based languages,
especially for beginners, and kids among them. The (too slow) adoption
of Unicode has already eased the difficulty of writing “Hello world”
and “What’s your name” programmes, but avoiding non-ASCII characters in
identifiers can be a bit esoteric for kids with a native language full
of them. (And by the way, several big French companies regularly send me
mail with my first name turned into mojibake, while their software is
presumably written by adults.)

[...]

>   UAX #31 also manages (I suspect unintentionally?) to give a good example of a pair of Farsi identifiers that might be awkward to tell apart in certain fonts, namely نامهای and نامه‌ای; I think those are OK in monospaced fonts, where the join is reasonably wide, but at small point sizes in proportional fonts the difference in appearance is very subtle, particularly for a non-Arabic speaker.
In ASCII, identifiers with I, l, and 1 can be difficult to tell apart.
And it is not an artificial problem: I once had some difficulties with
an automatically generated login which was do11y but which I tried to
type as dolly, despite my familiarity with ASCII. So I guess this
problem is not specific to the ASCII vs non-ASCII debate.