Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Thu Jun 7 03:42:46 CDT 2018

> The proposal also asks for identifiers to be treated as equivalent under
> NFKC.

The guidance in UAX #31 may not be clear. It is not to replace identifiers
as typed in by the user with their NFKC equivalents. It is rather to
internally *identify* two identifiers (as typed in by the user) as being the
same. For example, Pascal had case-insensitive identifiers. That means
someone could type in

myIdentifier = 3;
MyIdentifier = 4;

And both of those would be references to the same internal entity. So cases
like SARA AM don't necessarily play into this.
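
A minimal sketch of that distinction in Python, using the standard
unicodedata module. The names nfkc_key and SymbolTable are made up for
illustration and are not taken from UAX #31 or from any compiler; the point
is only that NFKC is applied to the comparison key, while the spelling the
user typed is kept untouched.

```python
import unicodedata


def nfkc_key(name):
    """Return the internal comparison key for an identifier (hypothetical helper)."""
    return unicodedata.normalize("NFKC", name)


class SymbolTable:
    """Toy symbol table: matches names by NFKC key, keeps the typed spelling."""

    def __init__(self):
        # Maps NFKC key -> (spelling as first typed, current value).
        self._entries = {}

    def assign(self, name, value):
        key = nfkc_key(name)
        # Keep the first spelling the user typed; only the value is updated.
        spelling = self._entries[key][0] if key in self._entries else name
        self._entries[key] = (spelling, value)

    def lookup(self, name):
        return self._entries[nfkc_key(name)][1]

    def spelling(self, name):
        return self._entries[nfkc_key(name)][0]


table = SymbolTable()
table.assign("\ufb01le", 3)  # typed with U+FB01 LATIN SMALL LIGATURE FI
table.assign("file", 4)      # typed with plain ASCII "fi"
```

Both assignments land on the same internal entity, so looking up either
spelling yields 4, while table.spelling("file") still returns the ligature
form that was typed first.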

> IMO the major issue with non-ASCII identifiers is not a technical one,
> but rather that it runs the risk of fragmenting the developer community.

IMO, forcing everyone to stick to the limitations of ASCII for all
identifiers is unnecessary and often counterproductive.

First, programmers tend to think of "identifiers" as being specifically
"identifiers in programming languages" (and often "identifiers in
programming languages that I think are important"). Identifiers may occur in
much broader contexts, often being much closer to end users (e.g. spreadsheet
formulae), or in scripting languages, user identifiers, and so on.

Secondly, even with programming languages that are restricted to ASCII,
people can choose identifiers in code like the following, which would not
be obvious to many people.

var Stellenwert = Verteidigungsministerium_Konto.verarbeite(); // Asmus könnte realistischere Beispiele vorschlagen

For a given project, and for programming languages (as opposed to more
user-facing languages) the language to be used for variables, functions,
comments, &c. will often be English, to allow for broader participation.
But that should be a choice of the people involved. There are clearly many
cases where that restriction is not optimal for a given project, where not
all of the developers (and prospective developers) are fluent in English,
but do share another common language. Think of all the in-house development
in countries and organizations around the world.

And finally, it's not like you hear of huge problems from Java or Swift or
other programming languages because they support non-ASCII identifiers.

Mark

On Thu, Jun 7, 2018 at 9:36 AM, Richard Wordingham via Unicode <
unicode at unicode.org> wrote:

> On Tue, 5 Jun 2018 01:37:47 +0100
> Richard Wordingham via Unicode <unicode at unicode.org> wrote:
>
> > The decomposed
> > form that looks the same is นํ้า <U+0E19, U+0E4D, U+0E49, U+0E32>.
> > The problem is that for sane results, <tone mark, SARA AM> needs
> > special handling. This sequence is also often untypable - part of the
> > protection against Thai homographs.
>
> I've been misquoted on the Rust discussion topic - or the behaviour is
> more diverse than I was aware of.  On LibreOffice, with sequence
> checking not disabled, typing <U+0E19, U+0E4D> disables the input by
> typing of U+0E49 or U+0E32 immediately afterwards.  Another mechanism
> is for typing another vowel to replace the U+0E4D.  The problem here is
> that in standard Thai, U+0E4D may not be followed by another vowel or
> tone mark, so Wing Thuk Thi (WTT) rules cut in.  (They're also quite
> good at preventing one from typing Northern Khmer.)  In LibreOffice,
> typing the NFKC form <U+0E19, U+0E49, U+0E4D, U+0E32> is stopped at
> attempting to type U+0E4D, though one can get back to the original by
> typing U+0E33 instead.  To the rule checker, that is mission
> accomplished!
>
> Richard.
>
>