Philippe Verdy verdy_p at wanadoo.fr
Thu Jun 5 18:55:57 CDT 2014

IMHO, a programming language that accepts non-ASCII identifiers should
always normalize the identifiers it accepts before entering them into its
hashed symbol table.
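As a minimal sketch of that idea (the dictionary and function names here are hypothetical, and NFC is just one possible choice of normalization form):

```python
import unicodedata

# Hypothetical symbol table: identifiers are normalized (NFC here)
# before being used as hash keys, so canonically equivalent spellings
# such as U+00E9 and "e" + U+0301 resolve to the same symbol.
symbols = {}

def intern_identifier(ident, value):
    key = unicodedata.normalize("NFC", ident)
    symbols[key] = value
    return key

intern_identifier("caf\u00e9", 1)    # precomposed e-acute
intern_identifier("cafe\u0301", 2)   # e + combining acute accent
assert len(symbols) == 1             # both spellings hit one entry
```

Without the normalize step, the two spellings above would silently create two distinct symbols.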

And for this type of usage, we strongly need normalization to be stable,
but much more so than under the existing stability rules: normalization
stability is not guaranteed if the language accepts unassigned code
points that may be allocated later and will then normalize differently (the
normalization of unassigned code points just assumes a default combining
class of 0, under which reordering and recombination cannot occur, but once
code points pass from unassigned to assigned, this may no longer be true).

For this reason, a reasonable programming language should restrict itself
to only the characters assigned in a defined Unicode version and should not
accept characters that are unassigned in that version.
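One way to sketch such a check (the function name is hypothetical; note that `unicodedata` reflects whatever Unicode version the Python build bundles, reported by `unicodedata.unidata_version`):

```python
import unicodedata

def reject_unassigned(ident):
    # General category "Cn" means the code point is unassigned in the
    # Unicode version this Python build ships with.
    for ch in ident:
        if unicodedata.category(ch) == "Cn":
            raise ValueError(
                f"unassigned code point U+{ord(ch):04X} in identifier")
    return ident
```

For example, U+0378 is currently unassigned, so an identifier containing it would be rejected, while it might normalize differently once assigned in a future version.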

Alternatively, compiled programs should record the Unicode version used, to
make sure that later reusers of compiled programs will link properly to the
older compiled programs, by making sure that newer identifiers used in newer
programs can never match an identifier defined by an older compiled
program under a different normalization.

Programming languages should follow the practices used in IDNA for security
reasons. Extending the allowed subset should then be done with care: the
extension will be compatible *only* if the newly assigned characters added
to the extended subset have combining class 0 and are not listed in the
restricted recompositions. All other characters added in the extension will
not be compatible with older versions of the language. If the language
cannot check the Unicode version, or does not want to be incompatible with
past versions, it cannot safely extend its allowed subset for identifiers,
and notably cannot add any combining characters with a non-zero combining
class.
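The first of those two conditions can be sketched as below (a hypothetical helper; the second condition, absence from the composition-exclusion list, is not exposed by `unicodedata` and would need the UCD's CompositionExclusions.txt):

```python
import unicodedata

def extension_candidate(ch):
    # Only characters with canonical combining class 0 are even
    # candidates for a compatible extension of the identifier subset;
    # non-zero classes can reorder under normalization.
    return unicodedata.combining(ch) == 0

assert extension_candidate("a")            # ccc = 0
assert not extension_candidate("\u0301")   # combining acute, ccc = 230
```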

2014-06-05 19:24 GMT+02:00 Jeff Senn <senn at maya.com>:

> On Jun 5, 2014, at 12:41 PM, Hans Aberg <haberg-1 at telia.com> wrote:
> > On 5 Jun 2014, at 17:46, Jeff Senn <senn at maya.com> wrote:
> >
> >> That is: are identifiers merely sequences of characters or intended to
> be comparable as “Unicode strings” (under some sort of compatibility rule)?
> >
> > In computer languages, identifiers are normally compared only for
> equality, as it reduces lookup time complexity.
> Well in this case we are talking about parsing a source file and
> generating internal symbols, so the complexity of the comparison operation
> is a red herring.
> The real question is how does the source identifier get mapped into a
> (compiled) symbol.  (e.g. in C++ this is not an obvious operation)
> If your implication is that there should be no canonicalization (the
> string from the source is used as a sequence of characters only directly
> mapped to a symbol), then I predict sticky problems in the future.  The
> most obvious of which is that in some cases I will be able to change the
> semantics of the compiled program by (accidentally) canonicalizing the
> source text (an operation, I will point out, that is invisible to the user
> in many (most?) Unicode aware editors).
> _______________________________________________
> Unicode mailing list
> Unicode at unicode.org
> http://unicode.org/mailman/listinfo/unicode
