Need reference to good ABNF for \uXXXX syntax

Martin J. Dürst duerst at it.aoyama.ac.jp
Fri Apr 16 00:33:34 CDT 2021


Hello Doug,

(Carsten cc'ed as a shortcut.)

On 2021-04-15 10:52, Doug Ewell via Unicode wrote:

> Martin J. Dürst wrote:
> 
>> So I guess you are looking for something like the regular expression
>> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
>> the above syntax (rather than byte sequences in UTF-8) and in ABNF.
> 
> Yes.
> 
>> The closest I was able to come up from memory may be
>> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
>> want.
> 
> No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post;

Sorry, I shouldn't have missed that.

> when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware.
> 
>> I'd guess it might be quicker for you to put something together on
>> your own (and then maybe run it by this list).
> 
> What Carsten Bormann came up with was this:
> 
> hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
> 
> non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
>                  ("D" %x30-37 2HEXDIG )
> 
> high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG
> 
> low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG
> 
> (My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)

What bothers me in this grammar is that the first "\u" isn't anywhere in 
sight, but the second one is there. It would be much clearer if either 
the first "\u" is at the start of hexchar, i.e.

hexchar = "\" %x75 (non-surrogate / (high-surrogate "\" %x75 low-surrogate))

or each "\u" is integrated into the rule it belongs to, as follows:

hexchar = non-surrogate / (high-surrogate low-surrogate)

non-surrogate = "\" %x75 (((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                  ("D" %x30-37 2HEXDIG))

high-surrogate = "\" %x75 "D" ("8" / "9" / "A" / "B") 2HEXDIG

low-surrogate = "\" %x75 "D" ("C" / "D" / "E" / "F") 2HEXDIG
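
As a quick sanity check, the language these rules are meant to describe can
be sketched as a Python regular expression. This is only a rough sketch, not
a reference implementation: the names is_hexchar and decode_hexchar are made
up for illustration, and it assumes that every 4-digit unit carries its own
"\u" prefix, that the "u" must be lowercase (%x75), and that hex digits match
case-insensitively, as quoted ABNF literals do.

```python
import re

# One hex digit; ABNF quoted literals are case-insensitive, so accept a-f too.
HEX = "[0-9A-Fa-f]"

# \uXXXX outside D800-DFFF: first digit is not D, or it is D followed by 0-7.
NON_SURROGATE = rf"\\u(?:[0-9A-CEFa-cef]{HEX}{{3}}|[Dd][0-7]{HEX}{{2}})"
# \uD800-\uDBFF
HIGH_SURROGATE = rf"\\u[Dd][89ABab]{HEX}{{2}}"
# \uDC00-\uDFFF
LOW_SURROGATE = rf"\\u[Dd][C-Fc-f]{HEX}{{2}}"

# hexchar = non-surrogate / (high-surrogate low-surrogate)
HEXCHAR = re.compile(rf"(?:{NON_SURROGATE}|{HIGH_SURROGATE}{LOW_SURROGATE})")


def is_hexchar(text: str) -> bool:
    """Return True if text is exactly one hexchar (hypothetical helper)."""
    return HEXCHAR.fullmatch(text) is not None


def decode_hexchar(text: str) -> str:
    """Decode one hexchar to the character it denotes (hypothetical helper)."""
    if not is_hexchar(text):
        raise ValueError(f"not a valid hexchar: {text!r}")
    # Each unit is 6 characters long: backslash, 'u', four hex digits.
    units = [int(text[i + 2:i + 6], 16) for i in range(0, len(text), 6)]
    if len(units) == 1:
        return chr(units[0])
    high, low = units
    # Standard UTF-16 surrogate pair arithmetic.
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
```

With this, a lone surrogate such as \uD83D is rejected, while the pair
\uD83D\uDE00 is accepted and decodes to U+1F600.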

The way it is written, it looks as if the convenience of ABNF details
(such as maybe line length) is dominating the expression of a clear
structure.

Regards,   Martin.

