Generating U+FFFD when there's no content between ISO-2022-JP escape sequences
hsivonen at hsivonen.fi
Mon Aug 17 01:38:54 CDT 2020
Sorry about the delay. There is now
On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️ <mark at macchiato.com> wrote:
> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
> Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>> We're about to remove the U+FFFD generation for the case where there
>> is no content between two ISO-2022-JP escape sequences from the WHATWG
>> Encoding Standard.
>> Is there anything wrong with my analysis that U+FFFD generation in
>> that case is not a useful security measure when unnecessary
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>> > Context: https://github.com/whatwg/encoding/issues/115
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> > Character encoding conversion must also not simply skip an illegal
>> > input byte sequence. Instead, it must stop with an error or substitute
>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)
>> > or an escape sequence in the output. (See also Section 3.5 Deletion of
>> > Code Points.) It is important to do this not only for byte sequences
>> > that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next shift
>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants
>> > require at least one character in a text segment between shift
>> > sequences. Security software written to the formal specification may
>> > not detect malicious text (for example, "delete" with a
>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> > The WHATWG Encoding Standard bakes in this requirement by means of the
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into its
>> > ISO-2022-JP decoder algorithm
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
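To make the rule concrete, here is a minimal sketch of a decoder that tracks whether any character has been emitted since the last escape sequence. It is a simplified model written for illustration, not the algorithm text of the Encoding Standard and not encoding_rs code. It handles only the ASCII and Roman states, and the names decode_sketch, flag_empty_segments and escape_pending are hypothetical.

```python
ESC = 0x1B

# Only the two single-byte designations are modeled here (an assumption made
# to keep the sketch short); a real decoder also handles the JIS X 0208 and
# katakana states.
ESCAPES = {b"(B": "ascii", b"(J": "roman"}


def decode_sketch(data: bytes, flag_empty_segments: bool) -> str:
    """Decode a toy ASCII/Roman subset of ISO-2022-JP.

    If flag_empty_segments is true, an escape sequence that directly follows
    another escape sequence produces U+FFFD.
    """
    out = []
    state = "ascii"
    escape_pending = False  # True from an escape until a character is emitted
    i = 0
    while i < len(data):
        if data[i] == ESC and data[i + 1:i + 3] in ESCAPES:
            if escape_pending and flag_empty_segments:
                out.append("\ufffd")  # no content since the previous escape
            state = ESCAPES[data[i + 1:i + 3]]
            escape_pending = True
            i += 3
            continue
        byte = data[i]
        if state == "roman" and byte == 0x5C:
            out.append("\u00a5")  # YEN SIGN replaces backslash in Roman
        elif state == "roman" and byte == 0x7E:
            out.append("\u203e")  # OVERLINE replaces tilde in Roman
        elif byte < 0x80 and byte != ESC:
            out.append(chr(byte))
        else:
            out.append("\ufffd")  # anything else is an error in this sketch
        escape_pending = False
        i += 1
    return "".join(out)


empty_segment = b"del\x1b(J\x1b(Bete"
strict = decode_sketch(empty_segment, flag_empty_segments=True)
lenient = decode_sketch(empty_segment, flag_empty_segments=False)
```

With the flag enabled the empty Roman segment shows up as U+FFFD in the middle of the word; with it disabled the input decodes to plain "delete".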
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements the
>> > WHATWG spec.
>> > After Gecko switched to encoding_rs from an implementation that didn't
>> > implement this U+FFFD generation behavior (uconv), a bug has been
>> > logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-440661403
>> > The root problem seems to be that the requirement gives ISO-2022-JP
>> > the unusual and surprising property that concatenating two ISO-2022-JP
>> > outputs from a conforming encoder can result in a byte sequence that
>> > is non-conforming as input to an ISO-2022-JP decoder.
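The concatenation property can be observed with any encoder that returns to ASCII at the end of its output. The sketch below uses Python's built-in iso2022_jp codec purely as an example of such an encoder; the names first, second, joined and ADJACENT_ESCAPES are hypothetical, and the regular expression covers only the common three-byte escape sequences.

```python
import re

# Each call produces a complete stream: it switches to JIS X 0208 for the
# kana and switches back to ASCII at the end.
first = "\u3042".encode("iso2022_jp")   # HIRAGANA LETTER A
second = "\u3044".encode("iso2022_jp")  # HIRAGANA LETTER I

# Naive byte-level concatenation, as a mail client or a script might do.
joined = first + second

# Two three-byte escape sequences with nothing between them.
ADJACENT_ESCAPES = re.compile(rb"\x1b[($][BJ@]\x1b[($][BJ@]")

print(first)
print(joined)
print(bool(ADJACENT_ESCAPES.search(joined)))
```

Neither stream contains an empty segment on its own, but the join places the trailing return-to-ASCII escape of the first directly before the leading escape of the second.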
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape
>> > sequence is immediately followed by another ISO-2022-JP escape
>> > sequence. Chrome and Safari do, but their implementations of
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's
>> > decoder implementations generally are informed by the Encoding
>> > Standard (though the ISO-2022-JP decoder specifically might not be
>> > yet), and I suspect that Safari's implementation (ICU) is either
>> > informed by Unicode Security Considerations or vice versa.
>> > The example given as rationale in Unicode Security Considerations,
>> > obfuscating the ASCII string "delete", could be accomplished by
>> > alternating between the ASCII and Roman states so that every other
>> > character is in the ASCII state and the rest are in the Roman state.
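As an illustration of that alternation, the following sketch builds such a byte stream and checks that it contains no empty segment while still hiding the literal bytes of "delete". The helper alternate_states and the variable names are hypothetical. The letters involved have the same byte values in ASCII and in JIS X 0201 Roman, so stripping the escapes is enough to recover the text here.

```python
import re

ESC_ASCII = b"\x1b(B"  # designate ASCII
ESC_ROMAN = b"\x1b(J"  # designate JIS X 0201 Roman


def alternate_states(text: bytes) -> bytes:
    """Prefix every byte with an escape, alternating ASCII and Roman."""
    out = bytearray()
    for index, byte in enumerate(text):
        out += ESC_ROMAN if index % 2 else ESC_ASCII
        out.append(byte)
    out += ESC_ASCII  # finish in the ASCII state
    return bytes(out)


obfuscated = alternate_states(b"delete")

# Every escape is followed by one character, so an empty-segment rule
# never triggers on this input.
has_empty_segment = re.search(rb"(\x1b\([BJ]){2}", obfuscated) is not None

# A scanner looking for the raw bytes does not find the word...
found_by_byte_search = b"delete" in obfuscated

# ...although removing the escape sequences recovers it.
stripped = re.sub(rb"\x1b\([BJ]", b"", obfuscated)
```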
>> > Is the requirement to generate U+FFFD when there is no content between
>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII
>> > transitions or useless transitions between ASCII and Roman are not
>> > also required to generate U+FFFD? Would it even be feasible (in terms
>> > of interop with legacy encoders) to make useless transitions between
>> > ASCII and Roman generate U+FFFD?
>> > --
>> > Henri Sivonen
>> > hsivonen at hsivonen.fi
>> > https://hsivonen.fi/
>> Henri Sivonen
>> hsivonen at hsivonen.fi