Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Thu Nov 22 05:08:30 CST 2018

Context: https://github.com/whatwg/encoding/issues/115

Unicode Security Considerations say:
"3.6.2 Some Output For All Input

Character encoding conversion must also not simply skip an illegal
input byte sequence. Instead, it must stop with an error or substitute
a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
or an escape sequence in the output. (See also Section 3.5 Deletion of
Code Points.) It is important to do this not only for byte sequences
that encode characters, but also for unrecognized or "empty"
state-change sequences. For example:
[...]
ISO-2022 shift sequences without text characters before the next shift
sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
require at least one character in a text segment between shift
sequences. Security software written to the formal specification may
not detect malicious text  (for example, "delete" with a
shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
(https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)

The WHATWG Encoding Standard builds this requirement into its
ISO-2022-JP decoder algorithm
(https://encoding.spec.whatwg.org/#iso-2022-jp-decoder) by means of
the "ISO-2022-JP output flag"
(https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag).
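
To make the behavior concrete, here is a minimal sketch of the idea,
not the spec algorithm itself. It handles only the ASCII and Roman
states, and the names decode_sketch and escape_pending are my own
illustrative choices rather than anything taken from the spec or from
encoding_rs. The point is only that an escape sequence arriving while
no byte has been processed since the previous escape sequence yields
U+FFFD.

```python
ESC = 0x1B

# Only two of the escape sequences are handled in this sketch.
ESCAPES = {b"(B": "ascii", b"(J": "roman"}


def decode_sketch(data: bytes) -> str:
    """Decode the ASCII/Roman subset, flagging empty segments."""
    out = []
    state = "ascii"
    # True while the most recent thing seen was an escape sequence.
    escape_pending = False
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            new_state = ESCAPES.get(data[i + 1:i + 3])
            if new_state is None:
                # Escape sequence outside this sketch's subset (or
                # malformed): emit an error, consume only the ESC byte.
                out.append("\uFFFD")
                escape_pending = False
                i += 1
                continue
            if escape_pending:
                # No content since the previous escape sequence.
                out.append("\uFFFD")
            escape_pending = True
            state = new_state
            i += 3
            continue
        escape_pending = False
        if byte > 0x7F or byte in (0x0E, 0x0F):
            out.append("\uFFFD")
        elif state == "roman" and byte == 0x5C:
            out.append("\u00A5")  # YEN SIGN
        elif state == "roman" and byte == 0x7E:
            out.append("\u203E")  # OVERLINE
        else:
            out.append(chr(byte))
        i += 1
    return "".join(out)


empty_segment = decode_sketch(b"a\x1b(J\x1b(Bb")
normal_segment = decode_sketch(b"\x1b(J\\\x1b(B")
```

In this sketch, empty_segment contains a U+FFFD between "a" and "b",
while normal_segment decodes to a lone yen sign with no U+FFFD.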

encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
WHATWG spec.

After Gecko switched to encoding_rs from an implementation that didn't
implement this U+FFFD generation behavior (uconv), a bug was logged in
the context of decoding Japanese email in Thunderbird:
https://bugzilla.mozilla.org/show_bug.cgi?id=1508136

Ken Lunde also recalls seeing such email:
https://github.com/whatwg/encoding/issues/115#issuecomment-440661403

The root problem seems to be that the requirement gives ISO-2022-JP
the unusual and surprising property that concatenating two ISO-2022-JP
outputs from a conforming encoder can result in a byte sequence that
is non-conforming as input to an ISO-2022-JP decoder.
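
The concatenation property can be illustrated with a short sketch.
It assumes Python's built-in iso2022_jp codec as a stand-in for "a
conforming encoder" and uses a hypothetical helper,
has_empty_segment, that looks for two escape sequences back to back.
The list of escape sequences in the pattern is my understanding of
what the Encoding Standard's decoder recognizes.

```python
import re

# ESC ( B, ESC ( J, ESC ( I, ESC $ @ and ESC $ B.
ESCAPE = rb"\x1b(?:\([BJI]|\$[@B])"
ADJACENT_ESCAPES = re.compile(ESCAPE + ESCAPE)


def has_empty_segment(data: bytes) -> bool:
    """Return True if two escape sequences occur with nothing between."""
    return ADJACENT_ESCAPES.search(data) is not None


# Each string is encoded separately, so each output ends by switching
# back to ASCII and the second output starts by switching to JIS X 0208.
first = "日本".encode("iso2022_jp")
second = "語".encode("iso2022_jp")
joined = first + second

print(has_empty_segment(first), has_empty_segment(second))
print(has_empty_segment(joined))
```

Neither output has an empty segment on its own, but the joined bytes
contain ESC ( B immediately followed by ESC $ B at the seam.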

Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
sequence is immediately followed by another ISO-2022-JP escape
sequence. Chrome and Safari do, but their implementations of
ISO-2022-JP aren't independent of each other. Moreover, Chrome's
decoder implementations generally are informed by the Encoding
Standard (though the ISO-2022-JP decoder specifically might not be
yet), and I suspect that Safari's implementation (ICU) is either
informed by Unicode Security Considerations or vice versa.

The example given as rationale in Unicode Security Considerations,
obfuscating the ASCII string "delete", could be accomplished by
alternating between the ASCII and Roman states so that every other
character is in the ASCII state and the rest are in the Roman state.
Every escape sequence would then be followed by a character, so the
requirement would not cause any U+FFFD to be generated.
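
A minimal sketch of that obfuscation, with a hypothetical obfuscate
helper of my own: it switches state before every character, so the
raw bytes never contain the plain string, yet no two escape sequences
are adjacent.

```python
import re

ASCII_ESC = b"\x1b(B"  # switch to ASCII
ROMAN_ESC = b"\x1b(J"  # switch to JIS X 0201 Roman


def obfuscate(text: str) -> bytes:
    """Alternate ASCII and Roman states, one character per segment."""
    out = bytearray()
    in_roman = False
    for index, byte in enumerate(text.encode("ascii")):
        want_roman = index % 2 == 1
        if want_roman != in_roman:
            out += ROMAN_ESC if want_roman else ASCII_ESC
            in_roman = want_roman
        out.append(byte)
    if in_roman:
        out += ASCII_ESC  # finish in the ASCII state
    return bytes(out)


payload = obfuscate("delete")

# The plain string is hidden from a naive byte search.
hidden = b"delete" not in payload
# No escape sequence is immediately followed by another one.
no_empty_segments = re.search(rb"\x1b\([BJ]\x1b\([BJ]", payload) is None
# Removing the escape sequences recovers the original bytes.
recovered = re.sub(rb"\x1b\([BJ]", b"", payload)
```

None of the letters in "delete" is 0x5C or 0x7E, the two bytes where
Roman differs from ASCII, so the decoded text is the same in either
state.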

Is the requirement to generate U+FFFD when there is no content between
ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
transitions or useless transitions between ASCII and Roman are not
also required to generate U+FFFD? Would it even be feasible (in terms
of interop with legacy encoders) to make useless transitions between
ASCII and Roman generate U+FFFD?

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/