Unpaired surrogates

Mon Oct 19 19:07:12 CDT 2015

On Mon, 19 Oct 2015 13:32:07 -0700
"Doug Ewell" <doug at ewellic.org> wrote:

> Richard Wordingham wrote:

> > It was once the
> > case that basic Unicode support in regular expressions required a
> > regular expression engine to be able to search for specified lone
> > surrogates - a real show-stopper for an engine working in UTF-8.
> > The Unicode collation algorithm conformance test once tested that
> > implementations of collation collated lone surrogates correctly.
> > Raising an exception was an automatic test failure! By contrast,
> > no-one's proposed collation rules for broken bits of UTF-8
> > characters or non-minimal length forms.

> Are these tests still included, or did someone notice that they were
> in conflict with the standard and remove them?

Markus Scherer has answered this question as it applies to collation.
For regular expressions, Requirement RL1.7 'Supplementary Code Points'
still reads:

"To meet this requirement, an implementation shall handle the full
range of Unicode code points, including values from U+FFFF to U+10FFFF.
In particular, where UTF-16 is used, a sequence consisting of a leading
surrogate followed by a trailing surrogate shall be handled as a single
code point in matching."

Now, as we know, UTF-32 does not handle the full range of Unicode code
points; it only handles scalar values.  In the discussion of UTS#18
RL1.7, my objections did result in the addition of:

"Note: It is permissible, but not required, to match an isolated
surrogate code point (such as \u{D800}), which may occur in Unicode
Strings. See Unicode String in the Unicode glossary."
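As a rough illustration of the distinction drawn above, here is a minimal Python sketch. It assumes CPython 3, where a str is a sequence of code points and may hold a lone surrogate; the variable names are only for illustration, and this says nothing about how any particular UTS#18 implementation behaves.

```python
import re

lone = "a\ud800b"      # string containing an isolated leading surrogate
supp = "\U00010000"    # one supplementary code point

# An engine working on code points can search for a lone surrogate.
found_lone = re.search("\ud800", lone) is not None

# A supplementary character is matched as a single code point (cf. RL1.7).
single = re.fullmatch(".", supp) is not None

def encodable(text: str, encoding: str) -> bool:
    """Return True if text has a well-formed encoding in the given UTF."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Surrogates are code points but not scalar values, so neither
# UTF-8 nor UTF-32 can represent the string holding one.
utf8_ok = encodable(lone, "utf-8")
utf32_ok = encodable(lone, "utf-32-le")
```

Here found_lone and single come out true, while utf8_ok and utf32_ok come out false, which is the asymmetry between code points and scalar values under discussion.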

I'm not sure that that text loosely associated with RL1.7 gets round
Requirement RL1.1, which still reads:

"To meet this requirement, an implementation shall supply a mechanism
for specifying any Unicode code point (from U+0000 to U+10FFFF), using
the hexadecimal code point representation."

Possibly a compliant implementation needs to parse hex codes for
surrogate code points, even if only to reject input containing them and
interpret them as a perverse alternative syntax for the perverse
expression \p{^any}. Or is \p{^any} actually matched by isolated
non-ASCII UTF-8 code units?  As there is no requirement for a regular
expression engine conforming to UTS#18 'Unicode Regular Expressions' to
handle non-conformant Unicode strings, this need not be a problem.
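The parsing question above might be sketched as follows in Python. The function parse_code_point and its allow_surrogates flag are hypothetical names invented for this illustration; the \u{...} syntax is the one used in the UTS#18 note, and the policy shown is one possible reading of RL1.1, not a statement of what the standard requires.

```python
import re

# Hex code point escape of the form \u{D800}, one to six hex digits.
ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")

def parse_code_point(escape: str, allow_surrogates: bool = False) -> int:
    """Parse a \\u{...} escape and return the code point it names.

    Hypothetical policy: the full range U+0000..U+10FFFF parses, but
    surrogate code points are rejected unless explicitly allowed.
    """
    m = ESCAPE.fullmatch(escape)
    if m is None:
        raise ValueError(f"not a hex code point escape: {escape!r}")
    cp = int(m.group(1), 16)
    if cp > 0x10FFFF:
        raise ValueError(f"beyond U+10FFFF: {escape!r}")
    if 0xD800 <= cp <= 0xDFFF and not allow_surrogates:
        raise ValueError(f"surrogate code point: {escape!r}")
    return cp
```

An engine working in UTF-8 could leave allow_surrogates off and report the error at pattern compile time, or instead compile the escape to something that never matches; either way it has to recognise the surrogate range when parsing.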

Richard.