IdnaTest.txt and RFC 5893
Mark Davis ☕️
mark at macchiato.com
Thu Jan 5 09:55:47 CST 2017
Alastair, thanks for finding it and bringing it up. I think you're right
that the problem is that the test generation code doesn't apply the bidi
criteria to *all* the labels if *any* of the labels are RTL, but instead
is probably working on a label-by-label basis. Thankfully, going by your
note, it looks like ICU does handle it right. (The test file generation
doesn't use the ICU code.)
Could you please report this via http://www.unicode.org/reporting.html so
that we make sure that it is tracked and brought up to the UTC?
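
To make the distinction above concrete, here is a minimal Python sketch of
the two behaviours. It is only an illustration: it is not the test
generation code and not the ICU code, the function names are hypothetical,
and it assumes Python's unicodedata module gives the Bidi classes that the
six conditions of the RFC 5893 section 2 rule refer to. It also skips the
rest of IDNA/UTS #46 processing (mapping, Punycode and so on).

```python
import unicodedata

# Bidi classes named in the six conditions of RFC 5893, section 2.
RTL_TRIGGER = {"R", "AL", "AN"}
RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def bidi_classes(label):
    return [unicodedata.bidirectional(ch) for ch in label]


def is_rtl_label(label):
    # A label containing any R, AL or AN character makes the name a Bidi name.
    return any(c in RTL_TRIGGER for c in bidi_classes(label))


def passes_bidi_rule(label):
    classes = bidi_classes(label)
    if not classes:
        return True  # assumption: empty labels are simply not checked
    first = classes[0]
    if first not in {"L", "R", "AL"}:  # condition 1
        return False
    end = len(classes)
    while classes[end - 1] == "NSM":  # trailing NSM is ignored for the end check
        end -= 1
    last = classes[end - 1]
    if first in {"R", "AL"}:  # conditions 2, 3 and 4
        return (all(c in RTL_ALLOWED for c in classes)
                and last in {"R", "AL", "EN", "AN"}
                and not ("EN" in classes and "AN" in classes))
    # conditions 5 and 6
    return all(c in LTR_ALLOWED for c in classes) and last in {"L", "EN"}


def ok_label_by_label(domain):
    # Suspected behaviour: only labels that are themselves RTL get checked.
    return all(passes_bidi_rule(l) for l in domain.split(".") if is_rtl_label(l))


def ok_whole_name(domain):
    # Intended behaviour: one RTL label means every label must pass the rule.
    labels = [l for l in domain.split(".") if l]
    if not any(is_rtl_label(l) for l in labels):
        return True
    return all(passes_bidi_rule(l) for l in labels)


for name in ("0\u00e0.\u05d0", "\u00e0\u02c7.\u05d0"):
    print(name, ok_label_by_label(name), ok_whole_name(name))
```

With the two names from lines 74 and 93 of the test file, the
label-by-label check accepts both, while the whole-name check rejects
both: "0à" starts with an EN character, and "àˇ" ends in an ON character.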
On Thu, Jan 5, 2017 at 10:46 AM, Alastair Houghton <
alastair at alastairs-place.net> wrote:
> On 4 Jan 2017, at 23:40, Markus Scherer <markus.icu at gmail.com> wrote:
> > On Wed, Jan 4, 2017 at 2:28 AM, Alastair Houghton <
> > alastair at alastairs-place.net> wrote:
> > RFC 5893 seems pretty clear to me, and the problem really is that the
> > test vectors (which come from unicode.org) seem (to me) to be incorrect.
> > https://tools.ietf.org/html/rfc5893#section-2 says "The following rule,
> > consisting of six conditions, applies to labels in Bidi domain names."
> > That's what the ICU code does -- applying the rule to each label -- and
> > I assume that's the basis for the test data.
> Absolutely. But the crucial part is “in Bidi domain names”. That is, it
> applies to *all* labels that are part of a Bidi domain name, not just RTL
> labels. It did not say “applies to RTL labels in Bidi domain names” and in
> fact even explicitly states that (in the first bullet point at the end of
> section 2):
> ...Note that even LTR labels and pure ASCII labels have to be tested.
> Not to mention the fact that parts 5 and 6 of the rule apply specifically
> to LTR labels.
> So it’s quite clear that given the domain name “0à.א”, both “א” *and* “0à”
> need to be checked using the Bidi Rule. Unless someone can explain why
> “0à” does not fail the test, surely we all agree that line 74 is incorrect:
> > B; 0à.\u05D0; ; xn--0-sfa.xn--4db # 0à.א
> and similarly with line 93:
> > B; àˇ.\u05D0; ; xn--0ca88g.xn--4db # àˇ.א
> > ICU does not currently check for multi-label bidi combinations.
> I was a bit puzzled by this, because the code clearly does (both in the
> C++ and Java versions) and yet the online demo doesn’t appear to object to
> the above test cases. So I wrote a quick test program against the C++
> version of ICU 58.2 and fed it both test cases, and, sure enough, ICU
> agrees that there is a BiDi error in both of the above cases.
> Kind regards,