Inadvertent copies of test data in L2/17-197?

Henri Sivonen via Unicode unicode at
Mon Aug 7 02:36:00 CDT 2017

On Mon, Aug 7, 2017 at 9:53 AM, Martin J. Dürst <duerst at> wrote:
> I just had a look at
> to use the test data in there for Ruby.
> I was under the impression from previous looks at it that it contained a lot
> of test data.

It contains the test outputs with identical results (output exhibiting
the spec-following behavior and output exhibiting the one REPLACEMENT
CHARACTER per bogus byte behavior) shown only once. Since the input
doesn't make sense as a PDF, it only mentions where to find the input

> However, when I looked at the test data more carefully (I had
> read the text before the test data carefully at least two times before, but
> not looked at the test data in that much detail), I discovered that there
> might be up to 7 copies of the same data. The first one starts on page 9,
> and then there's a new one about every 4 or 5 pages.
> Can you check/confirm? Any idea what might have caused this?

The test outputs are not identical. They should be the content of the
following files with a bit of introductory text before each: with non-conforming
output replaced with italic text saying what the bytes were

I inspected the PDF multiple times just now, and, as far as I can
tell, the content indeed matches what I described above (no

For reference, I tested the Ruby standard library with the following program:

data = IO.read("test.html", encoding: "UTF-8")
encoded = data.encode("UTF-16LE", :invalid=>:replace).encode("UTF-8")
IO.write("ruby.html", encoded)

...where test.html was the file available at
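
To make the effect of that round trip easier to see on a small input, here is a minimal sketch using the same encode calls on a three-byte string instead of the test file. The helper name replace_invalid_utf8 and the sample input are assumptions made for illustration; they are not part of the program above.

```ruby
# Hypothetical helper (not from the original program): round-trips the input
# through UTF-16LE so that bytes that are invalid as UTF-8 get replaced
# during the first conversion.
def replace_invalid_utf8(bytes)
  # Treat the raw bytes as UTF-8 without validating them yet.
  data = bytes.dup.force_encoding("UTF-8")
  # Converting to UTF-16LE with invalid: :replace substitutes U+FFFD for
  # invalid input; converting back yields well-formed UTF-8.
  data.encode("UTF-16LE", invalid: :replace).encode("UTF-8")
end

# 0xFF can never occur in well-formed UTF-8, so it must be replaced.
sample = "a\xFFb".b
result = replace_invalid_utf8(sample)

puts result.valid_encoding?
puts result.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

The sample uses a single stray byte on purpose. How many REPLACEMENT CHARACTERs come out for longer ill-formed sequences is exactly what the test file probes, so this sketch makes no claim about those cases.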

Henri Sivonen
hsivonen at
