Inadvertent copies of test data in L2/17-197 ?
Henri Sivonen via Unicode
unicode at unicode.org
Mon Aug 7 02:36:00 CDT 2017
On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst <duerst at it.aoyama.ac.jp> wrote:
> I just had a look at http://www.unicode.org/L2/L2017/17197-utf8-retract.pdf
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.
It contains the test outputs, with identical results (the output exhibiting
the spec-following behavior and the output exhibiting the
one-REPLACEMENT-CHARACTER-per-bogus-byte behavior) shown only once. Since the input
doesn't make sense as a PDF, it only mentions where to find the input
> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
> Can you check/confirm? Any idea what might have caused this?
The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each:
https://hsivonen.fi/broken-utf-8/python2.html with non-conforming
output replaced with italic text saying what the bytes were
I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no
For reference, I tested the Ruby standard library with the following program:
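# Read the test file tagged as UTF-8; invalid byte sequences are kept as-is here.
data = IO.read("test.html", encoding: "UTF-8")
# Round-trip through UTF-16LE so that invalid sequences get replaced on the way.
encoded = data.encode("UTF-16LE", :invalid=>:replace).encode("UTF-8")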
...where test.html was the file available at
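
For readers without the test file at hand, the same round trip can be tried on a short in-memory string. This is only a minimal sketch: the helper name `replace_invalid_utf8` and the sample input are hypothetical and not part of the original test, and the sample uses a single 0xFF byte, which can never occur in well-formed UTF-8.

```ruby
# Hypothetical helper (not from the original message): applies the same
# UTF-8 -> UTF-16LE -> UTF-8 round trip to an in-memory byte string.
def replace_invalid_utf8(bytes)
  # Tag the bytes as UTF-8 without validating them, as IO.read does.
  data = bytes.dup.force_encoding(Encoding::UTF_8)
  data.encode("UTF-16LE", invalid: :replace).encode("UTF-8")
end

# Assumed sample input: ASCII "a", one bogus byte 0xFF, ASCII "b".
sample = "a\xFFb".b
result = replace_invalid_utf8(sample)

puts result.valid_encoding?
puts result.codepoints.map { |c| format("U+%04X", c) }.join(" ")
# prints "U+0061 U+FFFD U+0062"
```

With a single bogus byte, the spec-following behavior and the one-REPLACEMENT-CHARACTER-per-bogus-byte behavior agree. The sketch makes no claim about longer invalid sequences, which is where the two behaviors can differ.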
hsivonen at hsivonen.fi