Can NFKC turn valid UAX 31 identifiers into non-identifiers?

Philippe Verdy via Unicode unicode at
Wed Jun 6 06:19:31 CDT 2018

It could be argued that "modern" languages could use unique identifiers for
their syntax or API independently of the name being rendered. The problem
is that translated names may collide in non-obvious ways and become
We've already seen the problems this caused in Excel with its translated
function names in some spreadsheets. Things get worse when the
spreadsheet itself does not contain a language identifier indicating the
language in which these identifiers are defined, so English-only
installations of Excel (without the MUI/LUI installed) cannot open or
correctly process spreadsheets created in other languages.
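
As a rough illustration of how translated names can collide, here is a
minimal Python sketch. The helper names (normalize, find_collisions) and
the translation table are hypothetical and are not taken from Excel; the
sketch simply assumes that names are compared after NFKC normalization and
case folding.

```python
import unicodedata
from collections import defaultdict


def normalize(name: str) -> str:
    """Fold a name to a comparison key: NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", name).casefold()


def find_collisions(translations: dict[str, str]) -> dict[str, list[str]]:
    """Map each comparison key to the canonical names that share it.

    `translations` maps a canonical (untranslated) identifier to its
    translated name. Only keys shared by two or more canonical names
    are returned.
    """
    seen = defaultdict(list)
    for canonical, translated in translations.items():
        seen[normalize(translated)].append(canonical)
    return {key: names for key, names in seen.items() if len(names) > 1}


# Hypothetical translation table: two distinct functions end up with
# translated names that differ only by case.
table = {"SUM": "SOMME", "TOTAL": "Somme", "IF": "SI"}
collisions = find_collisions(table)
```

With this table, `collisions` reports that SUM and TOTAL share one
translated name once case is ignored, which is the kind of non-obvious
clash a translator working on one name at a time would not notice.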

In practice, ASCII-only or ISO 8859-1-only identifiers work relatively well,
but there is always the problem of entering these identifiers. A solution
would be to allow identifiers to have an ASCII-only alias, even if such
aliases are not so friendly for the original authors. But I've not seen any
programming language or API that allows defining aliases for identifiers
with exactly the same semantics as the few translated ones that non-English
users would prefer to see and use. In C/C++ you may have aliases, but this
requires special support in the binary object or library format to allow
equivalent bindings and resolution.

For programming languages that are very close to the machine level
(assembly, C, C++), or for common libraries intended to be used worldwide,
in most cases these names are English-only or use "augmented English"
with approximate transliterations when they use borrowed words
(notably proper names) or invented words (company names, trademarks,
custom neologisms specific to an app or service, and a lot of acronyms).
These APIs or languages tend to create their own "jargon" with their own
definitions (which may be translated in their documentation).
Programmer comments, however, are very frequently written in any language
or script because they are not restricted by uniqueness or by name
resolution and binding mechanisms.
But newer scripting languages are now very liberal (notably
JavaScript/ECMAScript) and are somewhat easy to rebind to other names to
generate an "equivalent" library, except if the library needs to work
through reflection mechanisms and introspection. Scripting languages
designed to be used for user personalisation should however be user
friendly and need only be designed to work well with the language of the
initial author for his own usage (but cooperation will be limited on the
Internet, and if one wants to share his code, he will have to create some
basic translation or transliteration).
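
The caveat about reflection can be sketched in a few lines. The example
below uses Python rather than JavaScript, and the function name and its
French alias are hypothetical; it only shows that rebinding gives a second
name to the same object, while introspection still reports the original
name.

```python
def compute_sum(values):
    """Return the sum of an iterable of numbers."""
    return sum(values)


# Hypothetical translated alias: a second name bound to the same function object.
calculer_somme = compute_sum

# Calling through the alias behaves exactly like calling the original.
result = calculer_somme([1, 2, 3])

# Introspection still sees the name the function was defined with,
# so code relying on reflection does not see the translated name.
reflected_name = calculer_somme.__name__
```

Here `result` is 6 and `reflected_name` is still "compute_sum", so a
library that looks names up by reflection would need more than a simple
rebinding to appear translated.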

Most system-level APIs (filesystem or I/O, multiprocessing/multithreading,
networking) and data format options are specified using English terms only
(or near-English). The various IDEs however can make this language more
friendly by providing documentation searches, contextual helpers in the
editor itself, hinting popups, or various "machine learning" tools
(including "natural language" query wizards to help create and document the
technical language using the English-like jargon).

Most programming languages however do not define a lot of reserved keywords
(in English) and there's rarely a need to translate them (but I've seen
several programming languages whose keywords are also translated into a few
well-known languages, notably languages designed to be used by children or
to learn programming). Some of these languages do not use a plain-text
syntax but use graphic diagrams with symbols, arrows and boxes; programmers
navigate in the graphic layout or rearrange the layout to fit new items or
remove/combine them. An "advanced" view can then be used to present this
layout in plain text using partly translated terms. This is easier if
there's a clear syntactic separation between custom identifiers created by
users (not translated) and core keywords of the language. Generally this
separation uses quotation marks around custom identifiers, but this is not
even needed everywhere for data-oriented syntaxes like JSON, which does not
need any "reserved" identifier and reserves only some punctuation.

Anyway, all programming jobs require a basic proficiency in reading and
writing basic English correctly, and require acquiring a common
English-like technical jargon (that jargon does not have to be perfect
English; it is used as a de facto standard, which evolves too fast to be
correctly translated). This jargon is still NOT normal English, and using
it means that documentation should still be adapted/translated to better
English for native English readers. If you look at some well-known projects
in China, you'll see that many projects are documented and supported only
in Chinese, by programmers who have a very limited knowledge of English, so
their usage of English in the created technical jargon is linguistically
incorrect, but still correct for the technical needs. To translate/adapt
these programs to other languages, Chinese is the source of all
translations, and must be present in all translation files to map it to
English or any other language. Most people don't know how to type it; what
they do is copy-paste the existing Chinese-to-English translation files,
fix the English target, and then use that to create other translations
based on this English text. Finally, the resulting translation is tested in
the final target language and slightly modified to get a more uniform or
consistent terminology.

2018-06-06 11:49 GMT+02:00 Alastair Houghton via Unicode <
unicode at>:

> On 5 Jun 2018, at 07:09, Martin J. Dürst via Unicode <unicode at>
> wrote:
> >
> > Hello Rebecca,
> >
> > On 2018/06/05 12:43, Rebecca T via Unicode wrote:
> >
> >> Something I’d love to see is translated keywords; shouldn’t be hard
> >> with a line in the cargo.toml for a rudimentary lookup. Again, I’m of
> >> the opinion that an imperfect implementation is better than no attempt.
> >> I remember reading an article about a professor who translated the
> >> keywords in... maybe it was Python? And found their students were much
> >> more engaged with the material. Anecdotal, of course, but it’s stuck
> >> with me.
> >
> > It would be good to have a reference for this. I can certainly see the
> > point. But on the other hand, I have also heard that using keywords in
> > a foreign language makes it clear that there may be a difference
> > between the everyday use of the word and the specific formal meaning in
> > the programming language. Then, there's also the problem that just
> > translating keywords may work for languages with the same sentence
> > structure, but not for languages with a completely different sentence
> > structure. On top of that, keywords are just a start;
> > class/function/method names in libraries would have to be translated,
> > too, which would be much more work (especially if one wants to do a
> > good job).
>
> ALGOL68 was apparently localised (the standard explicitly supported that;
> it wasn’t an extension but rather something explicitly encouraged).
> AppleScript was also designed to be (French and Japanese syntaxes were
> defined), and I have an inkling that someone once told me that at least one
> translation had actually shipped, though the translated variants are now
> deprecated as far as I’m aware.
>
> Translated keywords are in some ways better than allowing non-ASCII
> identifiers, because they’re typically amenable to machine translation
> (indeed, in AppleScript, the scripts are not usually saved in ASCII anyway,
> but IIRC as a set of Apple Event Descriptors, so the “language” is just a
> matter for rendering to the user), which means that they don’t suffer from
> the problem of community fragmentation that non-ASCII identifiers *could*
> cause.
>
> Kind regards,
> Alastair.
> --