AW: Security concerns: OGHAM SPACE MARK

Tue Jul 21 16:55:05 CDT 2015

On Tue, Jul 21, 2015 at 12:46 David Starner [mailto:prosfilaes at gmail.com] wrote:

On Tue, Jul 21, 2015 at 2:14 AM Dreiheller, Albrecht <albrecht.dreiheller at siemens.com> wrote:
If the author really intends to deceive potential readers he will succeed.
Possibly. Code is hard. But the Ogham space is not a real threat; it's easy to search for and obviously a deliberate attempt to confuse.

My concern is not about the Ogham space, but about the free usage of non-Ascii in programming languages in general.
Just imagine: when you decide to open a door to public traffic in a busy city, with a security checkpoint, you wouldn't consider only
how to check a single person; instead, you have to consider how you would check thousands of people within one hour, if you don’t plan to
close the door again.
Therefore, consider a huge software system developed in, let's say, Serbia or Russia, using Cyrillic names throughout for classes and variables.
int ци́фра = чита́ть(пе́речень);  return ци́фра;
It might be a valuable system with some unique features, and you want to evaluate the source code before you buy it.
Or the community wants to adopt it as Open Source because it has some nice features.
Looking for a deliberate attempt to confuse within this code would be like looking for a needle in a haystack, since every line has non-Ascii in it.
Programming languages like JS should at least implement exclusion rules from the "Unicode Confusables Characters" list.
Have you looked at that list? 1 and l is one pair of confusables in that list, and while that is an incredibly classic confusable pair,
it's not one that's implementable in a programming language. а and a is another pair; but if you ban а, you've practically banned Cyrillic identifiers completely.
Of course, there are confusables within the Ascii range, but they have been well known for years, and are thus more likely to be detected.
Regarding your other example, some compilers warn if you have an assignment within an if-clause.
I used the term "exclusion rules", meaning a rule set based on the confusables list.
For example, the following code sequence
           int a;  {  int а;  a = 5;  }      (N.B. the second "а"  is Cyrillic)
could be banned by a rule saying
"It's not allowed to declare a variable that is DISTINCT from others (thus not hiding them) but which is CONFUSABLY SIMILAR to another variable in the same scope."
Another rule could demand "It's not allowed to mix two alphabets within one name".
This would not ban Cyrillic identifiers in general.
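The two rules above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how any existing compiler does it: the function names (skeleton, confusable_pairs, scripts_in, is_mixed_script) are hypothetical, the look-alike table is a tiny hand-picked sample rather than the real confusables list, and the script of a letter is guessed from the first word of its Unicode character name, because Python's standard library does not expose the Script property.

```python
import unicodedata

# Assumption: a tiny hand-picked sample of Cyrillic letters that look like
# Latin ones. A real checker would load the Unicode confusables data instead.
LOOKALIKE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0445": "x",  # CYRILLIC SMALL LETTER HA
}


def skeleton(name):
    """Replace every known look-alike by its Latin twin."""
    return "".join(LOOKALIKE_TO_LATIN.get(ch, ch) for ch in name)


def confusable_pairs(visible_names):
    """Rule 1: report distinct names in one scope that look the same."""
    first_seen = {}
    clashes = []
    for name in visible_names:
        key = skeleton(name)
        other = first_seen.get(key)
        if other is not None and other != name:
            clashes.append((other, name))
        else:
            first_seen.setdefault(key, name)
    return clashes


def scripts_in(name):
    """Heuristic: script = first word of each letter's Unicode name."""
    scripts = set()
    for ch in name:
        # Only letters count; digits, underscores and combining accents
        # are treated as belonging to no particular alphabet.
        if unicodedata.category(ch).startswith("L"):
            char_name = unicodedata.name(ch, "")
            if char_name:
                scripts.add(char_name.split()[0])
    return scripts


def is_mixed_script(name):
    """Rule 2: a single name must not mix two alphabets."""
    return len(scripts_in(name)) > 1
```

With this sketch, confusable_pairs(["a", "а"]) (Latin, then Cyrillic) reports the pair, and is_mixed_script("mybаnk") is true for the spelling with a Cyrillic "а", while a purely Cyrillic name such as ци́фра passes both checks.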
Otherwise such programming languages ought to be black-listed.
Black-listed? By whom? If you wish to make sure a set of code you control does not use non-ASCII characters, most source-control systems will let you reject such files from being checked in. If you want to reject JavaScript altogether, that is also your freedom. But of all the attacks leveled against JavaScript, I seriously doubt that this is the one that will bring it down.
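The kind of check such a check-in hook might run can be sketched in a few lines. This is an assumption-laden illustration only: the function name non_ascii_lines is hypothetical, and how it would be wired into a particular source-control system is not shown.

```python
def non_ascii_lines(text):
    """Return (line_number, line) for each line with a non-ASCII character."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not line.isascii()  # str.isascii() needs Python 3.7 or later
    ]
```

A hook would reject the check-in whenever this list is non-empty for any submitted file.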
With "black-listed" I meant "known to be unsafe" in some way.
In just the same way, domain-registration authorities would be "known to be unsafe" if they accepted or allowed domain names
like mybаnk.com alongside mybank.com, where one has a Latin "a" and the other has a Cyrillic "а" in it, thus ignoring the confusables list.
BTW, I don't want to attack JavaScript. It's pretty.

The fathers of ALGOL and other early languages racked their brains to avoid ambiguous semantics caused by poor syntax rules.
Today, when Unicode supersedes Ascii in some contexts, the challenges are different, but no less important.

Albrecht.
