Need reference to good ABNF for \uXXXX syntax

Doug Ewell doug at ewellic.org
Wed Apr 14 20:52:11 CDT 2021


Markus Scherer wrote:

> I was looking for something, but all I can find is either loose about
> surrogates (e.g.,
> https://docs.oracle.com/javase/specs/jls/se7/html/jls-3.html), or
> deals in code points rather than UTF-16 code units.

Yes, the text of the Java spec knows about concatenating a high surrogate and a low surrogate, but doesn't know about excluding unpaired surrogates. So the syntax on that page is really just pre-1996 UCS-2.
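
To make the gap concrete, here is a rough Python sketch of my own, not anything taken from the Java spec. LOOSE_ESCAPES, loose_accepts and has_unpaired_surrogate are made-up names for illustration. The loose pattern takes any four hex digits after \u, so it accepts input that a separate pairing check rejects.

```python
import re

# Hypothetical "loose" rule: any four hex digits after \u, with no pairing check.
LOOSE_ESCAPES = re.compile(r"(?:\\u[0-9A-Fa-f]{4})+")


def loose_accepts(text: str) -> bool:
    # True if the whole text is one or more loose escapes.
    return LOOSE_ESCAPES.fullmatch(text) is not None


def has_unpaired_surrogate(text: str) -> bool:
    # Pull out the 16-bit code units and walk them in order.
    units = [int(h, 16) for h in re.findall(r"\\u([0-9A-Fa-f]{4})", text)]
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return True
        i += 1
    return False
```

With this sketch, a lone \uD800 passes loose_accepts even though has_unpaired_surrogate flags it, which is exactly the case a surrogate-aware grammar has to exclude.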

> Can you say why you want/need strict 16-bit escapes for well-formed
> UTF-16 code units, rather than what others are doing?

It's for an update to RFC 8610, which defines CDDL, a metalanguage for expressing CBOR data structures. The syntax is already defined and out in the field, so it's too late to change it, but the ABNF describing it was incorrect and someone filed an erratum.

The discussion was on how to fix the ABNF, and I thought it would be better to find and validate a rule already published than to create an all-new, probably slightly different, and possibly buggy one.

In the end, some time after I wrote my message, the decision was made to create a new rule (see below). Fortunately it has a lot of eyes on it, and seems to be correct.

Martin J. Dürst wrote:

> So I guess you are looking for something like the regular expression
> on https://www.w3.org/International/questions/qa-forms-utf-8, but for
> the above syntax (rather than byte sequences in UTF-8) and in ABNF.

Yes.

> The closest I was able to come up from memory may be
> https://tools.ietf.org/html/rfc5137, but it's not exactly what you
> want.

No, that just repeats the Java spec's UCS-2 definition, in real ABNF instead of whatever the Java spec is using. I did mention RFC 5137 in my post; when I said it didn't include ABNF for the syntax we're talking about, I meant surrogate-aware ABNF.

> I'd guess it might be quicker for you to put something together on
> your own (and then maybe run it by this list).

What Carsten Bormann came up with was this:

hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)

non-surrogate = ((DIGIT / "A" / "B" / "C" / "E" / "F") 3HEXDIG) /
                ("D" %x30-37 2HEXDIG )

high-surrogate = "D" ("8" / "9" / "A" / "B") 2HEXDIG

low-surrogate = "D" ("C" / "D" / "E" / "F") 2HEXDIG

(My contribution was to define non-surrogate, high-surrogate, and low-surrogate separately instead of making this one behemoth rule.)
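
For anyone who wants to poke at the rule, here is a rough Python transliteration of it into a regular expression. This is my own sketch, not part of the erratum, and is_hexchar is a made-up name. It assumes the usual ABNF conventions from RFC 5234: quoted strings and HEXDIG match case-insensitively, while %x75 is strictly a lowercase "u".

```python
import re

# HEXDIG and quoted ABNF strings are case-insensitive, so "D" also matches "d".
HEXDIG = r"[0-9A-Fa-f]"

# 0000-CFFF and E000-FFFF, plus D000-D7FF.
NON_SURROGATE = rf"(?:[0-9A-CEFa-cef]{HEXDIG}{{3}}|[Dd][0-7]{HEXDIG}{{2}})"

# D800-DBFF
HIGH_SURROGATE = rf"[Dd][89ABab]{HEXDIG}{{2}}"

# DC00-DFFF
LOW_SURROGATE = rf"[Dd][C-Fc-f]{HEXDIG}{{2}}"

# hexchar = non-surrogate / (high-surrogate "\" %x75 low-surrogate)
HEXCHAR = re.compile(
    rf"(?:{NON_SURROGATE}|{HIGH_SURROGATE}\\u{LOW_SURROGATE})"
)


def is_hexchar(text: str) -> bool:
    # True if text is exactly one hexchar, i.e. what may follow the leading \u.
    return HEXCHAR.fullmatch(text) is not None
```

Under those assumptions, is_hexchar accepts "0041", "D7FF" and the pair D83D\uDE00, and rejects a lone "D800", a lone "DC00", and a pair written in the wrong order.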

J Decker wrote:

> There's also long encode in JS using \u{NNNNN}  where the N digits
> aren't required, because there's a framing of {}.... this allows one
> to specify A character without surrogate encoding.

Thanks, but the goal was not to find a better encoding for CDDL, but to find good ABNF for the encoding that CDDL already uses.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org




