Directionality controls for malicious code

Eli Zaretskii eliz at gnu.org
Thu Dec 2 02:06:25 CST 2021


> Date: Thu, 2 Dec 2021 00:27:06 -0500
> From: Sławomir Osipiuk via Unicode <unicode at corp.unicode.org>
> 
> The burden of guarding against BiDi misuse should be on the programming languages and/or their compilers. I'm not sure why this hasn't been widely implemented yet. At minimum any BiDi controls within a source file should emit a warning during compilation, with compiler options available to error on any mixture of LTR and RTL text, or to whitelist specific files which are known to contain such a mixture with a valid cause, etc.

Such warnings should not be blindly emitted for bidi controls within
comments and strings, since that is human-readable text, where those
controls are completely legitimate.  At least the naïve warning for
any occurrence of these controls should be avoided in those cases,
because it is likely to be a false positive, especially when a program
is intended to use RTL scripts.  Many projects must compile without
any warnings, or treat warnings as errors, and those won't compile
with such "draconian" compilers.

Smart discovery of questionable usage of directional controls is
possible, and any warnings issued for comments and strings should
rely on it.  But it is harder to implement, and requires at least a
minimal understanding of UAX#9 and its use of explicit directional
controls.
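
As a rough illustration of one such "smart" check (a sketch under
stated assumptions, not how any particular compiler does it): instead
of flagging every control, flag only a comment or string whose
explicit controls are still open when the token ends, since those can
affect how the following code is displayed.  The helper name
unterminated_controls is hypothetical, and treating only "\n" and
U+2029 as paragraph separators is a simplification of UAX#9.

```python
# Explicit directional formatting characters defined by UAX#9.
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

EMBEDDING_OPENERS = {LRE, RLE, LRO, RLO}  # terminated by PDF
ISOLATE_OPENERS = {LRI, RLI, FSI}         # terminated by PDI


def unterminated_controls(span: str) -> list[str]:
    """Return the openers still open at the end of one comment or string.

    Hypothetical helper: `span` is assumed to be the text of a single
    token, already extracted by the language's own lexer.
    """
    stack: list[str] = []
    for ch in span:
        if ch in EMBEDDING_OPENERS or ch in ISOLATE_OPENERS:
            stack.append(ch)
        elif ch == PDF:
            # PDF ends the innermost embedding/override; it never ends an isolate.
            if stack and stack[-1] in EMBEDDING_OPENERS:
                stack.pop()
        elif ch == PDI:
            # PDI ends the innermost isolate and everything opened inside it.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i] in ISOLATE_OPENERS:
                    del stack[i:]
                    break
        elif ch in "\n\u2029":
            # Simplification: a paragraph boundary terminates all open controls.
            stack.clear()
    return stack
```

A token for which this returns an empty list keeps its directional
effects to itself, so RTL text in a comment or string that is properly
terminated would produce no warning under this scheme.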

> There is nothing that can be done at the Unicode level to cater to coding languages that the coding languages can't do themselves via their own specifications and tools. Indeed it is far more appropriate for BiDi warnings and prohibitions to be tailored to the syntax of each language. (E.g. it may be generally "okay" for a line containing only a comment to mix directionality, but not for a line containing both code and comment).

Yes.  But there's more to it than just syntax.  For example,
directional controls that push weak or neutral characters up by one
embedding level could be okay inside a comment, but if the embedding
level is pushed to higher levels, that is suspicious.  The problem is
that compilers which do implement such warnings generally emit them
whenever they see such codepoints, disregarding the context and
bidi-specific knowledge (because it's much easier), and the result is
completely unacceptable for programs that need to communicate in RTL
scripts.
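
The embedding-level heuristic can be sketched roughly as follows.
This is only an illustration: the names max_nesting and is_suspicious
and the default limit of 1 are assumptions, and the code counts how
deeply explicit controls are nested rather than computing the real
UAX#9 embedding levels (which also depend on the paragraph direction).
It also treats PDF and PDI as interchangeable closers, which is a
simplification.

```python
# Explicit directional openers: LRE, RLE, LRO, RLO, LRI, RLI, FSI.
OPENERS = "\u202a\u202b\u202d\u202e\u2066\u2067\u2068"
# Closers: PDF and PDI (treated alike here, which UAX#9 does not do).
CLOSERS = "\u202c\u2069"


def max_nesting(span: str) -> int:
    """Return the deepest nesting of explicit controls inside `span`."""
    depth = 0
    deepest = 0
    for ch in span:
        if ch in OPENERS:
            depth += 1
            deepest = max(deepest, depth)
        elif ch in CLOSERS and depth > 0:
            depth -= 1
    return deepest


def is_suspicious(span: str, limit: int = 1) -> bool:
    """Flag a comment or string only when nesting exceeds `limit`.

    The limit is a hypothetical tunable, not a value taken from UAX#9.
    """
    return max_nesting(span) > limit
```

With this, a comment that wraps one phrase in a single RLE ... PDF
pair is accepted, while one that stacks several embeddings, overrides
or isolates inside each other is reported.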

