Avoiding Source Code Spoofing

Eli Zaretskii eliz at gnu.org
Thu Mar 3 02:49:08 CST 2022


> Date: Wed, 2 Mar 2022 14:51:44 -0800
> From: announcements via announcements <announcements at corp.unicode.org>
> Cc: announcements <announcements at corp.unicode.org>
> 
> Unicode has convened a group of experts in programming languages,
> tooling, and security to provide guidance and recommendations on how
> to better handle international text in source code, as well as
> providing code to help implementations.

Neither this announcement nor the report to which it points says
where to send comments on the issues they raise, so I'm posting mine
here; apologies if that is inappropriate.

First, I think the report fails to distinguish between legitimate use
of RTL characters and controls, which arises simply because the
program code has strings and/or comments with RTL characters, and
malicious use, where the intent is to spoof and mislead the recipients
of the code.
Such a distinction is important, because use of bidi controls that is
legitimate in the former case is highly suspicious in the latter.  For
example, any source code where the inherent directionality of a strong
directional character was overridden, or where a weak/neutral
character has an embedding level that's too high, should be suspected
as potentially malicious.
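
As a rough illustration of the first heuristic above, here is a minimal
Python sketch.  It is not taken from the report or from any existing
tool; the function name overridden_strong_chars and the choice to report
only strong characters whose natural direction differs from the active
override are assumptions made for the example.  It tracks explicit
overrides, embeddings, and isolates on a simple stack, and it ignores
the depth limits and paragraph handling of the full UBA.

```python
import unicodedata

# Explicit bidi formatting controls.
LRE, RLE, PDF, LRO, RLO = "\u202a", "\u202b", "\u202c", "\u202d", "\u202e"
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def overridden_strong_chars(line):
    """Return (index, char, override) for each strong character in `line`
    whose inherent direction is reversed by an active LRO/RLO override."""
    # Each entry is (kind, override): kind is "embed" or "isolate";
    # override is "L", "R", or None when no override is in effect.
    stack = []
    hits = []
    for i, ch in enumerate(line):
        if ch in (LRO, RLO):
            stack.append(("embed", "L" if ch == LRO else "R"))
        elif ch in (LRE, RLE):
            stack.append(("embed", None))
        elif ch in (LRI, RLI, FSI):
            stack.append(("isolate", None))
        elif ch == PDF:
            # PDF closes only an embedding or override, never an isolate.
            if stack and stack[-1][0] == "embed":
                stack.pop()
        elif ch == PDI:
            # PDI closes the nearest isolate and anything opened inside it.
            if any(kind == "isolate" for kind, _ in stack):
                while stack.pop()[0] != "isolate":
                    pass
        else:
            override = stack[-1][1] if stack else None
            bidi_class = unicodedata.bidirectional(ch)
            if override and bidi_class in ("L", "R", "AL"):
                natural = "L" if bidi_class == "L" else "R"
                if natural != override:
                    hits.append((i, ch, override))
    return hits
```

With this sketch, a line such as 'access = "user' + RLO + 'nimda' + PDF
+ '"' reports the five Latin letters that follow the RLO, whereas an
ordinary Hebrew string literal with no controls reports nothing.  Text
inside an isolate nested in an override is not reported either, so a
check like this would have to be combined with others, such as the
embedding-level test mentioned above.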

Second, I don't see in the Proposed Plan any activity to collect input
from users and implementors of compilers, linters, and editors.
Without collecting such input, I see no way that the work group will
appreciate the real-life problems and issues that the developers and
users of these tools are facing, and that could easily lead to
recommendations that are hard or impossible to implement at least in
some of these tools, and/or which could be disconnected from the real
problems and practices.  For example, the idea of rendering bidi
formatting control as "chits" will not solve the reordering issue in
Emacs, where bidi reordering is performed _before_ the actual glyphs
to present characters on the glass are fully known.  More generally,
editors differ significantly in how they implement various features
that support editing of program source, such as syntax highlighting
and on-the-fly analysis of the source tokens; the recommendations must
take these into consideration to be useful.

Finally, I'm sorry to say, but the report is strongly biased in that
it focuses almost entirely on the issues caused by visual reordering
of bidirectional text and the bidi formatting controls in particular.
While it does mention other issues that yield confusing program code,
those few references read more like lip service than anything else.
OTOH, there's no real attempt to describe the legitimate needs of
program source code intended for RTL languages and scripts.  Without
such a description, and with only the problematic (let alone
malicious) uses of bidi characters discussed in this and many
referenced documents, this and future documents could lead to
lopsided conclusions, like "let's disallow those problematic
characters from program source code".  The risk is exacerbated by the
fact that many people don't really understand the UBA and the needs
of RTL scripts.  This isn't just theory: some compilers,
evidently alarmed by
the brouhaha around these issues, actually went ahead and started
flagging the use of some of these characters in program source code as
errors!  While such ridiculous (IMO) "solutions" in this or that tool
could be dismissed as folly on the part of their developers, a
document written and sanctioned by the Unicode Consortium which leads
to similar conclusions would be a disastrous development, one that
would significantly hamper the development of bidi-aware programming
tools and disadvantage their users who work in RTL language
environments.  I hope this is not how this (very important, IMO)
initiative will end.

