Inadvertent copies of test data in L2/17-197 ?

Henri Sivonen via Unicode unicode at unicode.org
Mon Aug 7 02:36:00 CDT 2017


On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.

It contains the test outputs, with identical results shown only once
(the output exhibiting the spec-following behavior and the output
exhibiting the one-REPLACEMENT-CHARACTER-per-bogus-byte behavior).
Since the input doesn't make sense in PDF form, the document only
says where to find it (https://hsivonen.fi/broken-utf-8/test.html).
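
To make the second of those behaviors concrete, here is a minimal Ruby
sketch that emits one REPLACEMENT CHARACTER per bogus byte. The helper
name one_per_byte is hypothetical, and this is not the code that
generated one-per-byte.html; it only relies on String#scrub passing
each run of invalid bytes to its block.

```ruby
# Hypothetical helper, not the generator behind one-per-byte.html.
# Every byte of each invalid run is replaced by its own U+FFFD.
def one_per_byte(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
       .scrub { |bad| "\uFFFD" * bad.bytesize }
end

# A truncated three-byte sequence (E2 80) between "a" and "b".
sample = "a\xE2\x80b".b
puts one_per_byte(sample).dump
```

Because the block multiplies by the byte length of the invalid run, the
two stray bytes come out as two U+FFFD characters regardless of how
scrub groups them.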

> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
>
> Can you check/confirm? Any idea what might have caused this?

The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each:
https://hsivonen.fi/broken-utf-8/spec.html
https://hsivonen.fi/broken-utf-8/one-per-byte.html
https://hsivonen.fi/broken-utf-8/win32.html
https://hsivonen.fi/broken-utf-8/java.html
https://hsivonen.fi/broken-utf-8/python2.html (with non-conforming
output replaced by italic text saying what the bytes were)
https://hsivonen.fi/broken-utf-8/perl5.html
https://hsivonen.fi/broken-utf-8/icu.html

I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no
duplicates).

For reference, I tested the Ruby standard library with the following program:

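# Read the test file, tagging its bytes as UTF-8 without validating them.
data = IO.read("test.html", encoding: "UTF-8")
# Round-trip through UTF-16LE so that invalid sequences get replaced.
encoded = data.encode("UTF-16LE", invalid: :replace).encode("UTF-8")
IO.write("ruby.html", encoded)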

...where test.html was the file available at
https://hsivonen.fi/broken-utf-8/test.html
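
For trying the same round trip without downloading the file, the
following sketch applies it to an in-memory string. The method name
round_trip and the sample bytes are assumptions for illustration only;
they are not part of the test suite.

```ruby
# Hypothetical wrapper around the same UTF-8 -> UTF-16LE -> UTF-8 round trip.
def round_trip(str)
  str.dup.force_encoding(Encoding::UTF_8)
     .encode("UTF-16LE", invalid: :replace)
     .encode("UTF-8")
end

# FF can never appear in well-formed UTF-8, so it must be replaced.
result = round_trip("ok \xFF end")
puts result.valid_encoding?
puts result.include?("\uFFFD")
```

The result is well-formed UTF-8 with U+FFFD standing in for the invalid
byte, which is what gets written to ruby.html in the full program.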

-- 
Henri Sivonen
hsivonen at hsivonen.fi
https://hsivonen.fi/


