<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>


</head>


<body dir="ltr">


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


As an example of a deployed ISO-2022-JP encoder that doesn't follow WHATWG behaviour, here's Python's (apparently contributed to Python by one


<span class="pl-c">Hye-Shik Chang</span>):</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


>>> "a¥bc~¥d".encode("iso-2022-jp")<br>


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


b'a\x1b(J\\\x1b(Bbc~\x1b(J\\\x1b(Bd'</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


This is, so far as I can tell, valid per the RFC (and of course ECMA-35 itself), but it is not what the WHATWG encoder would produce; its output would be (to use another bytestring literal) b'a\x1b(J\\bc\x1b(B~\x1b(J\\d\x1b(B'. The difference is that Python's encoder appears to use


 a preference order of codesets, with ASCII ahead of JIS-Roman, while the WHATWG logic is to encode the next character in the


<i>current</i> codeset if possible, and to switch to another only if it is not.</div>
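

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
To make the two switching policies concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not CPython's codec and not the WHATWG algorithm verbatim: it covers only ASCII and JIS-Roman (no JIS X 0208, no katakana, no special handling of ESC/SO/SI in the input), and every name in it (byte_in, encode, preference_order, stay_if_possible, and the two lookup tables) is made up for this example.</div>

<pre>
```python
ESC_ASCII = b"\x1b(B"  # escape sequence selecting ASCII
ESC_ROMAN = b"\x1b(J"  # escape sequence selecting JIS-Roman (JIS X 0201 Roman)
ESCAPES = {"ascii": ESC_ASCII, "roman": ESC_ROMAN}

# JIS-Roman differs from ASCII at two byte values: 0x5C is the yen sign
# and 0x7E is the overline, so backslash and tilde exist only in ASCII.
ROMAN_ONLY = {"\u00a5": 0x5C, "\u203e": 0x7E}
ASCII_ONLY = {"\\", "~"}


def byte_in(ch, codeset):
    """Return the byte encoding ch in codeset, or None if it has none."""
    if ch in ROMAN_ONLY:
        return ROMAN_ONLY[ch] if codeset == "roman" else None
    if ord(ch) in range(0x80):
        return None if (codeset == "roman" and ch in ASCII_ONLY) else ord(ch)
    raise ValueError("character is outside the subset covered by this sketch")


def encode(text, pick_codeset):
    """Encode text, letting pick_codeset(ch, current) choose the codeset."""
    out, current = bytearray(), "ascii"
    for ch in text:
        wanted = pick_codeset(ch, current)
        if wanted != current:
            out += ESCAPES[wanted]  # announce the switch before the character
            current = wanted
        out.append(byte_in(ch, current))
    if current != "ascii":
        out += ESC_ASCII  # return to ASCII at the end of the text
    return bytes(out)


def preference_order(ch, current):
    """Always take the first codeset, ASCII before JIS-Roman, that has ch."""
    return next(c for c in ("ascii", "roman") if byte_in(ch, c) is not None)


def stay_if_possible(ch, current):
    """Keep the current codeset if it has ch; otherwise switch to the other."""
    if byte_in(ch, current) is not None:
        return current
    return "roman" if current == "ascii" else "ascii"


sample = "a\u00a5bc~\u00a5d"  # the same string as in the example above
print(encode(sample, preference_order))
print(encode(sample, stay_if_possible))
```
</pre>

<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
With the preference_order policy the sketch reproduces the first bytestring above, and with stay_if_possible it reproduces the second, including the trailing return to ASCII. The only difference between the two runs is the policy function passed to encode.</div>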


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


<br>


</div>


<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">


-- Har<br>


</div>


<div>


<div id="appendonsend"></div>


<div style="font-family:Calibri,Arial,Helvetica,sans-serif; font-size:12pt; color:rgb(0,0,0)">


<br>


</div>


<hr tabindex="-1" style="display:inline-block; width:98%">


<div id="divRplyFwdMsg" dir="ltr"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> Unicode <unicode-bounces@unicode.org> on behalf of Henri Sivonen via Unicode <unicode@unicode.org><br>


<b>Sent:</b> 17 August 2020 08:38<br>


<b>To:</b> Mark Davis ☕️ <mark@macchiato.com><br>


<b>Cc:</b> Unicode Public <unicode@unicode.org><br>


<b>Subject:</b> Re: Generating U+FFFD when there's no content between ISO-2022-JP escape sequences</font>


<div> </div>


</div>


<div class="BodyFragment"><font size="2"><span style="font-size:11pt">


<div class="PlainText">Sorry about the delay. There is now<br>


<a href="https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf">https://www.unicode.org/L2/L2020/20202-empty-iso-2022-jp.pdf</a><br>


<br>


On Mon, Dec 10, 2018 at 1:14 PM Mark Davis ☕️ <mark@macchiato.com> wrote:<br>


><br>


> I tend to agree with your analysis that emitting U+FFFD when there is no content between escapes in "shifting" encodings like ISO-2022-JP is unnecessary, and for consistency between implementations should not be recommended.<br>


><br>


> Can you file this at <a href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html</a> so that the committee can look at your proposal with an eye to changing


<a href="http://www.unicode.org/reports/tr36/">http://www.unicode.org/reports/tr36/</a>?<br>


><br>


> Mark<br>


><br>


><br>


> On Mon, Dec 10, 2018 at 11:10 AM Henri Sivonen via Unicode <unicode@unicode.org> wrote:<br>


>><br>


>> We're about to remove the U+FFFD generation for the case where there<br>


>> is no content between two ISO-2022-JP escape sequences from the WHATWG<br>


>> Encoding Standard.<br>


>><br>


>> Is there anything wrong with my analysis that U+FFFD generation in<br>


>> that case is not a useful security measure when unnecessary<br>


>> transitions between the ASCII and Roman states do not generate U+FFFD?<br>


>><br>


>> On Thu, Nov 22, 2018 at 1:08 PM Henri Sivonen <hsivonen@hsivonen.fi> wrote:<br>


>> ><br>


>> > Context: <a href="https://github.com/whatwg/encoding/issues/115">https://github.com/whatwg/encoding/issues/115</a><br>


>> ><br>


>> > Unicode Security Considerations say:<br>


>> > "3.6.2 Some Output For All Input<br>


>> ><br>


>> > Character encoding conversion must also not simply skip an illegal<br>


>> > input byte sequence. Instead, it must stop with an error or substitute<br>


>> > a replacement character (such as U+FFFD ( � ) REPLACEMENT CHARACTER)<br>


>> > or an escape sequence in the output. (See also Section 3.5 Deletion of<br>


>> > Code Points.) It is important to do this not only for byte sequences<br>


>> > that encode characters, but also for unrecognized or "empty"<br>


>> > state-change sequences. For example:<br>


>> > [...]<br>


>> > ISO-2022 shift sequences without text characters before the next shift<br>


>> > sequence. The formal syntaxes for HZ and most CJK ISO-2022 variants<br>


>> > require at least one character in a text segment between shift<br>


>> > sequences. Security software written to the formal specification may<br>


>> > not detect malicious text  (for example, "delete" with a<br>


>> > shift-to-double-byte then an immediate shift-to-ASCII in the middle)."<br>


>> > (<a href="https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input">https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input</a>)<br>


>> ><br>


>> > The WHATWG Encoding Standard bakes this requirement by the means of<br>


>> > "ISO-2022-JP output flag"<br>


>> > (<a href="https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag">https://encoding.spec.whatwg.org/#iso-2022-jp-output-flag</a>) into its<br>


>> > ISO-2022-JP decoder algorithm<br>


>> > (<a href="https://encoding.spec.whatwg.org/#iso-2022-jp-decoder">https://encoding.spec.whatwg.org/#iso-2022-jp-decoder</a>).<br>


>> ><br>


>> > encoding_rs (<a href="https://github.com/hsivonen/encoding_rs">https://github.com/hsivonen/encoding_rs</a>) implements the<br>


>> > WHATWG spec.<br>


>> ><br>


>> > After Gecko switched to encoding_rs from an implementation that didn't<br>


>> > implement this U+FFFD generation behavior (uconv), a bug has been<br>


>> > logged in the context of decoding Japanese email in Thunderbird:<br>


>> > <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=1508136">https://bugzilla.mozilla.org/show_bug.cgi?id=1508136</a><br>


>> ><br>


>> > Ken Lunde also recalls seeing such email:<br>


>> > <a href="https://github.com/whatwg/encoding/issues/115#issuecomment-440661403">


https://github.com/whatwg/encoding/issues/115#issuecomment-440661403</a><br>


>> ><br>


>> > The root problem seems to be that the requirement gives ISO-2022-JP<br>


>> > the unusual and surprising property that concatenating two ISO-2022-JP<br>


>> > outputs from a conforming encoder can result in a byte sequence that<br>


>> > is non-conforming as input to an ISO-2022-JP decoder.<br>


>> ><br>


>> > Microsoft Edge and IE don't generate U+FFFD when an ISO-2022-JP escape<br>


>> > sequence is immediately followed by another ISO-2022-JP escape<br>


>> > sequence. Chrome and Safari do, but their implementations of<br>


>> > ISO-2022-JP aren't independent of each other. Moreover, Chrome's<br>


>> > decoder implementations generally are informed by the Encoding<br>


>> > Standard (though the ISO-2022-JP decoder specifically might not be<br>


>> > yet), and I suspect that Safari's implementation (ICU) is either<br>


>> > informed by Unicode Security Considerations or vice versa.<br>


>> ><br>


>> > The example given as rationale in Unicode Security Considerations,<br>


>> > obfuscating the ASCII string "delete", could be accomplished by<br>


>> > alternating between the ASCII and Roman states so that every other<br>


>> > character is in the ASCII state and the rest in the Roman state.<br>


>> ><br>


>> > Is the requirement to generate U+FFFD when there is no content between<br>


>> > ISO-2022-JP escape sequences useful if useless ASCII-to-ASCII<br>


>> > transitions or useless transitions between ASCII and Roman are not<br>


>> > also required to generate U+FFFD? Would it even be feasible (in terms<br>


>> > of interop with legacy encoders) to make useless transitions between<br>


>> > ASCII and Roman generate U+FFFD?<br>


>> ><br>


>> > --<br>


>> > Henri Sivonen<br>


>> > hsivonen@hsivonen.fi<br>


>> > <a href="https://hsivonen.fi/">https://hsivonen.fi/</a><br>


>><br>


>><br>


>><br>


>> --<br>


>> Henri Sivonen<br>


>> hsivonen@hsivonen.fi<br>


>> <a href="https://hsivonen.fi/">https://hsivonen.fi/</a><br>


>><br>


<br>


<br>


-- <br>


Henri Sivonen<br>


hsivonen@hsivonen.fi<br>


<a href="https://hsivonen.fi/">https://hsivonen.fi/</a><br>


<br>


</div>


</span></font></div>


</div>


</body>


</html>