Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Mon Aug 17 02:17:38 CDT 2020

IMO, encodings, particularly ones depending on state such as this, may have multiple ways to output the same, or similar, sequences.  When means that pretty much any time an encoding transforms data any previous security or other validation style checks are no longer valid and any security/validation must be checked for again.  I've seen numerous mistakes due to people expecting encodings to play nicely, particularly if there are different endpoints that may use different implementations with slightly different behaviors.

-Shawn

-----Original Message-----
From: Unicode <unicode-bounces at unicode.org> On Behalf Of Henri Sivonen via Unicode
Sent: Sunday, August 16, 2020 11:39 PM
To: Mark Davis ☕️ <mark at macchiato.com>
Cc: Unicode Public <unicode at unicode.org>
Subject: Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences

Sorry about the delay. There is now
https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf

On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.
>
> Can you file this at http://www.unicode.org/reporting.html so that the committee can look at your proposal with an eye to changing http://www.unicode.org/reports/tr36/?
>
> Mark
>
>
> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode at unicode.org> wrote:
>>
>> We're about to remove the U+FFFD generation for the case where there 
>> is no content between two ISO-2022-JP escape sequences from the 
>> WHATWG Encoding Standard.
>>
>> Is there anything wrong with my analysis that U+FFFD generation in 
>> that case is not a useful security measure when unnecessary 
>> transitions between the ASCII and Roman states do not generate U+FFFD?
>>
>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen at hsivonen.fi> wrote:
>> >
>> > Context: https://github.com/whatwg/encoding/issues/115
>> >
>> > Unicode Security Considerations say:
>> > "3.6.2 Some Output For All Input
>> >
>> > Character encoding conversion must also not simply skip an illegal 
>> > input byte sequence. Instead, it must stop with an error or 
>> > substitute a replacement character (such as U+FFFD (   ) 
>> > REPLACEMENT CHARACTER) or an escape sequence in the output. (See 
>> > also Section 3.5 Deletion of Code Points.) It is important to do 
>> > this not only for byte sequences that encode characters, but also for unrecognized or "empty"
>> > state-change sequences. For example:
>> > [...]
>> > ISO-2022 shift sequences without text characters before the next 
>> > shift sequence. The formal syntaxes for HZ and most CJK ISO-2022 
>> > variants require at least one character in a text segment between 
>> > shift sequences. Security software written to the formal 
>> > specification may not detect malicious text  (for example, "delete" 
>> > with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)."
>> > (https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input)
>> >
>> > The WHATWG Encoding Standard bakes this requirement by the means of 
>> > "ISO-2022-JP output flag"
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag) into 
>> > its ISO-2022-JP decoder algorithm 
>> > (https://encoding.spec.whatwg.org/#iso-2022-jp-decoder).
>> >
>> > encoding_rs (https://github.com/hsivonen/encoding_rs) implements 
>> > the WHATWG spec.
>> >
>> > After Gecko switched to encoding_rs from an implementation that 
>> > didn't implement this U+FFFD generation behavior (uconv), a bug has 
>> > been logged in the context of decoding Japanese email in Thunderbird:
>> > https://bugzilla.mozilla.org/show_bug.cgi?id=1508136
>> >
>> > Ken Lunde also recalls seeing such email:
>> > https://github.com/whatwg/encoding/issues/115#issuecomment-44066140
>> > 3
>> >
>> > The root problem seems to be that the requirement gives ISO-2022-JP 
>> > the unusual and surprising property that concatenating two 
>> > ISO-2022-JP outputs from a conforming encoder can result in a byte 
>> > sequence that is non-conforming as input to a ISO-2022-JP decoder.
>> >
>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP 
>> > escape sequence is immediately followed by another ISO-2022-JP 
>> > escape sequence. Chrome and Safari do, but their implementations of 
>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's 
>> > decoder implementations generally are informed by the Encoding 
>> > Standard (though the ISO-2022-JP decoder specifically might not be 
>> > yet), and I suspect that Safari's implementation (ICU) is either 
>> > informed by Unicode Security Considerations or vice versa.
>> >
>> > The example given as rationale in Unicode Security Considerations, 
>> > obfuscating the ASCII string "delete", could be accomplished by 
>> > alternating between the ASCII and Roman states to that every other 
>> > character is in the ASCII state and the rest of the Roman state.
>> >
>> > Is the requirement to generate U+FFFD when there is no content 
>> > between ISO-2022-JP escape sequences useful if useless 
>> > ASCII-to-ASCII transitions or useless transitions between ASCII and 
>> > Roman are not also required to generate U+FFFD? Would it even be 
>> > feasible (in terms of interop with legacy encoders) to make useless 
>> > transitions between ASCII and Roman generate U+FFFD?
>> >
>> > --
>> > Henri Sivonen
>> > hsivonen at hsivonen.fi
>> > https://hsivonen.fi/
>>
>>
>>
>> --
>> Henri Sivonen
>> hsivonen at hsivonen.fi
>> https://hsivonen.fi/
>>

--
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/