Directionality controls for malicious code

Thu Dec 2 12:43:12 CST 2021

> Date: Thu, 2 Dec 2021 12:10:40 -0500
> Cc: Unicode <unicode at corp.unicode.org>
> From: Sławomir Osipiuk via Unicode <unicode at corp.unicode.org>
> 
> > Blindly showing these controls wherever they are should not happen,
> > either, because most of their uses are not malicious.
> 
> Yes, it should. This is not general prose intended to look nice.

Comments and strings in a program _are_ general prose.

> The "users" in this case are assumed to be a (relatively) specialist
> technical audience.

The vast majority of those professionals have no idea what the UBA
does, how the bidi control characters work, or what they are for.  So
their being specialists doesn't help in this matter.

> > There are many projects that require compiling without any warnings,
> > or that treat warnings as errors, and those won't compile with such
> > "draconian" compilers.
> 
> Which is why I mentioned that whitelisting, or some method of
> suppressing the warnings, i.e. an "I know what I'm doing" option,
> should also be added.

Many projects frown on such measures, and some outright prohibit them.
Try telling your QA person that you want to suppress warnings because
they annoy you.

> But it should not be the default behavior.

If it is not the default, chances are it will seldom or never be
turned on.

> I think you're overestimating the amount of projects this would
> actually cause problems for.

That depends on the audience for which you are writing programs.  In
some locales around the world the number of projects for which this
could be a problem is very large.

> > It's even against UAX#9, which says those
> > controls should be invisible.
> 
> That rule should be ignored when it is counterproductive in a
> specialist context.

You are in a Unicode forum, and you are arguing for ignoring its
rules?

> > We need a formal criterion that allows checking that a given span of
> > characters in logical order does not visually overflow those
> > characters that precede or succeed them.
> 
> Yes, this is ideal. The problem is that Unicode doesn't "understand"
> that string-terminating or comment-introducing characters in any given
> programming language should reset the directionality. That's why the
> solution must be at the same level that gives meaning to strings and
> comments (and variables, etc.) i.e. the programming language itself.

That's a worthy goal, but I think it isn't easy to achieve.  We could
instead employ a simpler, language-independent heuristic, based on
the bidi context of those control characters.  For example, if weak
characters of class EN or neutral characters of class ON have their
embedding level pushed too high (where "too high" depends on the base
paragraph direction), that is suspicious and can be flagged.
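
A rough sketch of such a heuristic, in Python, might look like the
following.  It is only an approximation under stated assumptions: it
counts the nesting depth of explicit embedding, override and isolate
controls instead of computing real UBA embedding levels, and the name
suspicious_positions and the max_depth threshold are made up for this
illustration.  The caller would pick max_depth according to the base
paragraph direction.

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters.
OPENERS = {"LRE", "RLE", "LRO", "RLO", "LRI", "RLI", "FSI"}
CLOSERS = {"PDF", "PDI"}


def suspicious_positions(line, max_depth=1):
    """Return indexes of EN/ON characters nested deeper than max_depth.

    Assumption: nesting depth of explicit controls stands in for the
    real embedding level; this is not a UBA implementation.
    """
    depth = 0
    flagged = []
    for index, char in enumerate(line):
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in OPENERS:
            depth += 1
        elif bidi_class in CLOSERS:
            # Stray terminators are ignored rather than going negative.
            depth = max(depth - 1, 0)
        elif bidi_class in ("EN", "ON") and depth > max_depth:
            flagged.append(index)
    return flagged
```

With the default threshold, a single level of embedding around digits
or punctuation passes silently, while an override wrapped around an
isolate (the kind of nesting used to reorder delimiters visually) gets
reported.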

> Yes. It makes perfect sense for control characters to be permitted
> only as escape sequences.

That could be a solution for strings, but not for comments.  And even
in strings, using escapes makes the strings much harder to read and
proofread.
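
To make the readability cost concrete, here is a small Python sketch
of what an escapes-only rule would do to a string.  The helper name
escape_bidi_controls is hypothetical, and the sketch only covers the
explicit embedding, override and isolate controls; the implicit marks
(LRM, RLM, ALM) carry ordinary strong bidi classes and are left alone
here.

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def escape_bidi_controls(text):
    """Replace each explicit bidi control with a visible \\uXXXX escape."""
    return "".join(
        f"\\u{ord(char):04X}"
        if unicodedata.bidirectional(char) in BIDI_CONTROL_CLASSES
        else char
        for char in text
    )
```

Every control in the string turns into six visible characters, and the
text around them is no longer displayed in the order the author
intended, which is exactly what makes proofreading harder.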