IdnaTest.txt and RFC 5893
alastair at alastairs-place.net
Thu Jan 5 03:46:53 CST 2017
On 4 Jan 2017, at 23:40, Markus Scherer <markus.icu at gmail.com> wrote:
> On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <alastair at alastairs-place.net> wrote:
> > RFC 5893 seems pretty clear to me, and the problem really is that the test vectors (which come from unicode.org) seem (to me) to be incorrect.
> https://tools.ietf.org/html/rfc5893#section-2 says "The following rule, consisting of six conditions, applies to labels in Bidi domain names."
> That's what the ICU code does -- applying the rule to each label -- and I assume that's the basis for the test data.
Absolutely. But the crucial part is “in Bidi domain names”. That is, the rule applies to *all* labels that are part of a Bidi domain name, not just RTL labels. The RFC does not say “applies to RTL labels in Bidi domain names”, and in fact it states the opposite explicitly (in the first bullet point at the end of section 2):
...Note that even LTR labels and pure ASCII labels have to be tested.
Not to mention the fact that parts 5 and 6 of the rule apply specifically to LTR labels.
So it’s quite clear that given the domain name “0à.א”, both “א” *and* “0à” need to be checked using the Bidi Rule. Unless someone can explain why “0à” does not fail the test (its first character, “0”, has Bidi class EN, whereas condition 1 requires L, R or AL), surely we all agree that line 74 is incorrect:
> B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א
and similarly with line 93:
> B; àˇ.\u05D0; ; xn--0ca88g.xn--4db # àˇ.א
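To make the failure concrete, here is a rough sketch of the six conditions as I read them in RFC 5893 section 2, written in Python using only the standard `unicodedata` module. It is not ICU's implementation, and the function names (`passes_bidi_rule`, `is_bidi_domain_name`, `check_domain`) are my own, invented for illustration. It assumes the labels are already in Unicode form and does no other IDNA processing.

```python
import unicodedata

# Bidi classes permitted in RTL labels (condition 2) and LTR labels (condition 5).
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def passes_bidi_rule(label):
    """Return True if a single label satisfies the RFC 5893 Bidi Rule."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return False  # assumption: treat an empty label as failing

    # Condition 1: the first character must be L, R or AL.
    first = classes[0]
    if first in ("R", "AL"):
        rtl = True
    elif first == "L":
        rtl = False
    else:
        return False

    # Find the last character that is not a trailing NSM.
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1]  # safe: the first character is never NSM here

    if rtl:
        if not set(classes) <= RTL_ALLOWED:           # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):       # condition 3
            return False
        if "EN" in classes and "AN" in classes:       # condition 4
            return False
    else:
        if not set(classes) <= LTR_ALLOWED:           # condition 5
            return False
        if last not in ("L", "EN"):                   # condition 6
            return False
    return True


def is_bidi_domain_name(labels):
    """A Bidi domain name has at least one character of class R, AL or AN."""
    return any(
        unicodedata.bidirectional(ch) in ("R", "AL", "AN")
        for label in labels
        for ch in label
    )


def check_domain(name):
    """Map each label to its Bidi Rule result; empty dict if not a Bidi name."""
    labels = name.split(".")
    if not is_bidi_domain_name(labels):
        return {}
    return {label: passes_bidi_rule(label) for label in labels}
```

Calling `check_domain("0\u00e0.\u05d0")` reports the label “0à” as failing (condition 1, since “0” is EN) and “א” as passing. `check_domain("\u00e0\u02c7.\u05d0")` reports “àˇ” as failing because U+02C7 has Bidi class ON, so the label does not end in L or EN (condition 6).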
> ICU does not currently check for multi-label bidi combinations.
I was a bit puzzled by this, because the code clearly does (both in the C++ and Java versions) and yet the online demo doesn’t appear to object to the above test cases. So I wrote a quick test program against the C++ version of ICU 58.2 and fed it both test cases, and, sure enough, ICU agrees that there is a Bidi error in both of the above cases.