Missing UAX#31 tests?

Mark Davis ☕️ via Unicode unicode at unicode.org
Sun Jul 8 04:21:59 CDT 2018


I'm surprised that the tests for 11.0 passed for a 10.0 implementation,
because the following should have triggered a difference for WB. Can you
check on this particular case?

÷ 0020 × 0020 ÷ #  ÷ [0.2] SPACE (WSegSpace) × [3.4] SPACE (WSegSpace) ÷ [0.3]
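
For anyone checking that case by hand, here is a minimal sketch of reading
one such line, assuming the format shown above: ÷ marks a break, × marks no
break, the other tokens are hex code points, and everything after # is a
comment. The name parse_test_line is made up for illustration, and offsets
are counted in code points.

```python
BREAK = "\u00f7"     # the "÷" marker: a boundary is expected here
NO_BREAK = "\u00d7"  # the "×" marker: no boundary is expected here


def parse_test_line(line):
    """Parse one line of a break test file (hypothetical helper).

    Returns (text, break_offsets), with offsets in code points,
    or None for blank and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    text = ""
    breaks = []
    for token in data.split():
        if token == BREAK:
            breaks.append(len(text))  # boundary before the next character
        elif token == NO_BREAK:
            continue                  # explicitly no boundary here
        else:
            text += chr(int(token, 16))  # hex code point, e.g. "0020"
    return text, breaks
```

An implementation under test would then be asked for its boundaries on the
returned text, and the two offset lists compared.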

About the testing:

The tests are generated so that they go through all the combinations of
pairs, and some combinations of triples. The generated test cases use a
sample from each partition of characters, to cut the file size down to a
reasonable level. That also means that some changes in the rules don't
cause changes in the test results. Because it is not possible to test
every combination, there is also provision for additional test cases, such
as those at the end of the files, e.g.:

https://unicode.org/Public/11.0.0/ucd/auxiliary/WordBreakTest.html
https://unicode.org/Public/10.0.0/ucd/auxiliary/WordBreakTest.html

We should extend those each time to make sure we cover combinations that
aren't covered by pairs. There were some additions to that end; if they
didn't cover enough cases, then we can look at your experience to add more.

I can suggest four strategies for further testing:

1. To do a full test, for each row check every combination obtained by
replacing each sample character by every other character in its
partition. E.g., for the above line that would mean testing every
<WSegSpace, WSegSpace> sequence.

2. Use a monkey test against ICU. That is, generate random combinations of
characters from different partitions and check that ICU and your
implementation are in sync.

3. During the beta period, test your previous-version implementation with
the new test files. If there are no failures, yet there are changes in the
rules, then raise that issue during the beta period so we can add tests.

4. If possible, during the beta period upgrade your implementation and test
against the new and old test files.
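
As a rough sketch of strategies 1 and 2, assuming you already have the
partitions as a mapping from each sample character to the members of its
partition, and two callables that return boundary offsets for a string.
All names here (expand_row, monkey_test, reference, candidate) are
hypothetical; in practice reference would wrap ICU's word break iterator,
which is not shown.

```python
import itertools
import random


def expand_row(sample_row, partitions):
    """Strategy 1: yield every string obtained by replacing each sample
    character in sample_row with each member of its partition.

    partitions maps a sample character to the characters in its partition
    (an assumed data layout, not something the UCD ships in this form).
    """
    choices = [partitions[ch] for ch in sample_row]
    for combo in itertools.product(*choices):
        yield "".join(combo)


def monkey_test(reference, candidate, partitions,
                rounds=1000, max_len=6, seed=0):
    """Strategy 2: compare two segmenters on random strings.

    reference and candidate take a string and return a list of boundary
    offsets. Returns a list of (text, expected, actual) mismatches.
    """
    rng = random.Random(seed)  # fixed seed so failures are reproducible
    classes = list(partitions.values())
    failures = []
    for _ in range(rounds):
        length = rng.randint(1, max_len)
        # pick a random partition, then a random member of it, per position
        text = "".join(rng.choice(rng.choice(classes)) for _ in range(length))
        expected = reference(text)
        actual = candidate(text)
        if expected != actual:
            failures.append((text, expected, actual))
    return failures
```

The full expansion grows as the product of the partition sizes, so it is
probably only practical for pairs and for the smaller partitions; the
monkey test trades completeness for run time.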

Anyone else have other suggestions for testing?

Mark

On Sun, Jul 8, 2018 at 6:52 AM, Karl Williamson via Unicode <
unicode at unicode.org> wrote:

> I am working on upgrading from Unicode 10 to Unicode 11.
>
> I used all the new files.
>
> The algorithms for some of the boundaries, like GCB and WB, have changed
> so that some of the property values no longer have code points associated
> with them.
>
> I ran the tests furnished in 11.0 for these boundaries, without having
> changed the algorithms from earlier releases.  All passed 100%.
>
> Unless I'm missing something, that indicates that the tests furnished in
> 11.0 do not contain instances that exercise these changes.  My guess is
> that the 10.0 tests were also deficient.
>
> I have been relying on the UCD to furnish tests that have enough coverage
> to sufficiently exercise the algorithms that are specified in UAX 31, but
> that appears to have been naive on my part
>