Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Mon Dec 10 04:06:19 CST 2018

We're about to remove from the WHATWG Encoding Standard the U+FFFD
generation for the case where there is no content between two
ISO-2022-JP escape sequences.

Is there anything wrong with my analysis that U+FFFD generation in
that case is not a useful security measure as long as unnecessary
transitions between the ASCII and Roman states do not generate U+FFFD?

On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>
> Context: https://github.com/whatwg/encoding/issues/115
>
> Unicode Security Considerations say:
> "3.6.2 Some Output For All Input
>
> Character encoding conversion must also not simply skip an illegal
> input byte sequence. Instead, it must stop with an error or substitute
> a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
> or an escape sequence in the output. (See also Section 3.5 Deletion of
> Code Points.) It is important to do this not only for byte sequences
> that encode characters, but also for unrecognized or "empty"
> state-change sequences. For example:
> [...]
> ISO-2022 shift sequences without text characters before the next shift
> sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
> require at least one character in a text segment between shift
> sequences. Security software written to the formal specification may
> not detect malicious text (for example, "delete" with a
> shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
> (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>
> The WHATWG Encoding Standard bakes this requirement into its
> ISO-2022-JP decoder algorithm
> (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder) by means of
> the "ISO-2022-JP output flag"
> (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag).
>
> encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
> WHATWG spec.
>
> After Gecko switched to encoding_rs from an implementation that didn't
> implement this U+FFFD generation behavior (uconv), a bug has been
> logged in the context of decoding Japanese email in Thunderbird:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>
> Ken Lunde also recalls seeing such email:
> https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>
> The root problem seems to be that the requirement gives ISO-2022-JP
> the unusual and surprising property that concatenating two ISO-2022-JP
> outputs from a conforming encoder can result in a byte sequence that
> is non-conforming as input to an ISO-2022-JP decoder.
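
The concatenation issue can be reproduced with a short Python sketch.
It assumes CPython's built-in "iso2022_jp" codec as a stand-in for a
conforming encoder; find_adjacent_escapes is a hypothetical helper
written for this illustration, not part of any library.

```python
# Escape sequences recognized by the ISO-2022-JP decoder in the
# Encoding Standard: ASCII, Roman, katakana, and two JIS X 0208 forms.
ESCAPES = (b"\x1b(B", b"\x1b(J", b"\x1b(I", b"\x1b$@", b"\x1b$B")


def find_adjacent_escapes(data: bytes) -> int:
    """Return the offset of the first escape sequence that is directly
    followed by another escape sequence, or -1 if there is none.

    Hypothetical helper. A plain scan is safe here because the ESC byte
    (0x1B) never occurs inside ISO-2022-JP character data.
    """
    for i in range(len(data) - 5):
        if data[i:i + 3] in ESCAPES and data[i + 3:i + 6] in ESCAPES:
            return i
    return -1


# Each output is well-formed on its own: it switches to JIS X 0208,
# emits one character, and returns to the ASCII state at the end.
first = "あ".encode("iso2022_jp")
second = "あ".encode("iso2022_jp")

# Plain byte concatenation puts the trailing "return to ASCII" escape of
# the first output right before the leading escape of the second one.
joined = first + second

print(find_adjacent_escapes(first))   # no adjacent escapes expected
print(find_adjacent_escapes(joined))  # offset of the empty segment
```

The joined byte string has no content between the two escape sequences
in the middle, which is the case where a decoder with the output flag
generates U+FFFD.
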
>
> Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
> sequence is immediately followed by another ISO-2022-JP escape
> sequence. Chrome and Safari do, but their implementations of
> ISO-2022-JP aren't independent of each other. Moreover, Chrome's
> decoder implementations generally are informed by the Encoding
> Standard (though the ISO-2022-JP decoder specifically might not be
> yet), and I suspect that Safari's implementation (ICU) is either
> informed by Unicode Security Considerations or vice versa.
>
> The example given as rationale in Unicode Security Considerations,
> obfuscating the ASCII string "delete", could be accomplished by
> alternating between the ASCII and Roman states so that every other
> character is in the ASCII state and the rest in the Roman state.
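
A minimal sketch of this argument, assuming a deliberately simplified
decoder that knows only the ASCII and Roman states. decode_sketch and
its flag_empty_segments parameter are hypothetical names, and this is a
model of the idea rather than the Encoding Standard's algorithm. The
empty segment here uses Roman instead of a double-byte state because
the sketch does not decode JIS X 0208.

```python
ESC_ASCII = b"\x1b(B"  # switch to the ASCII state
ESC_ROMAN = b"\x1b(J"  # switch to the Roman (JIS X 0201 Latin) state


def decode_sketch(data: bytes, flag_empty_segments: bool) -> str:
    """Decode a byte string that uses only the ASCII and Roman states.

    When flag_empty_segments is True, an escape sequence that follows
    another escape sequence with no character in between yields U+FFFD.
    """
    out = []
    roman = False
    escape_pending = False  # True while nothing was output since an escape
    i = 0
    while i < len(data):
        chunk = data[i:i + 3]
        if chunk in (ESC_ASCII, ESC_ROMAN):
            if escape_pending and flag_empty_segments:
                out.append("\ufffd")
            roman = chunk == ESC_ROMAN
            escape_pending = True
            i += 3
            continue
        byte = data[i]
        if byte >= 0x80 or byte == 0x1B:
            out.append("\ufffd")    # not handled by this sketch
        elif roman and byte == 0x5C:
            out.append("\u00a5")    # YEN SIGN in the Roman state
        elif roman and byte == 0x7E:
            out.append("\u203e")    # OVERLINE in the Roman state
        else:
            out.append(chr(byte))
        escape_pending = False
        i += 1
    return "".join(out)


# "delete" split by an empty segment: two escapes with nothing between.
empty_pair = b"del" + ESC_ROMAN + ESC_ASCII + b"ete"

# "delete" with every other letter in the Roman state: every escape is
# followed by a character, so no segment is empty.
alternating = (b"d" + ESC_ROMAN + b"e" + ESC_ASCII + b"l"
               + ESC_ROMAN + b"e" + ESC_ASCII + b"t" + ESC_ROMAN + b"e")

print(ascii(decode_sketch(empty_pair, True)))
print(ascii(decode_sketch(alternating, True)))
```

In this model the empty-segment check marks the first input with
U+FFFD, while the alternating input decodes to plain "delete" even with
the check enabled, so the byte-level obfuscation goes undetected.
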
>
> Is the requirement to generate U+FFFD when there is no content between
> ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
> transitions or useless transitions between ASCII and Roman are not
> also required to generate U+FFFD? Would it even be feasible (in terms
> of interop with legacy encoders) to make useless transitions between
> ASCII and Roman generate U+FFFD?
>
> --
> Henri Sivonen
> hsivonen at hsivonen.fi
> https://hsivonen.fi/

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/