Unicode Regular Expressions, Surrogate Points and UTF-8

Richard Wordingham richard.wordingham at ntlworld.com
Fri May 30 13:45:08 CDT 2014


Is there any good reason for UTS#18 'Unicode Regular Expressions' to
express its requirements in terms of codepoints rather than scalar
values?

I was initially worried by RL1.1 requiring that one be able to specify
surrogate codepoints in a pattern.  It would not be compliant for an
application to reject such patterns as syntactically or semantically
incorrect!  RL1.1 seemed to prohibit compliant regular expression
engines that only handled well-formed UTF-8 strings.
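
The codepoint versus scalar value distinction can be illustrated with a
small Python sketch.  The function names are made up for illustration
and come from neither UTS#18 nor any library; the only assumption is
that Python's strict "utf-8" codec refuses to encode surrogates.

```python
def is_scalar_value(cp):
    """A scalar value is any codepoint that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def encodable_in_utf8(cp):
    """True if the codepoint has a well-formed UTF-8 encoding."""
    try:
        chr(cp).encode("utf-8")  # strict error handling by default
    except UnicodeEncodeError:
        return False
    return True


# U+D800 is a codepoint but not a scalar value, so a pattern naming it
# has nothing to match in well-formed UTF-8 text.
for cp in (0x0041, 0xD800, 0x10000):
    print(f"U+{cp:04X}", is_scalar_value(cp), encodable_in_utf8(cp))
```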

Furthermore, consider attempting to handle CESU-8 text as a sequence of
UTF-8 code units.  The code unit sequence for U+10000, corresponding to
the UTF-16 code unit sequence D800 DC00, will be ED A0 80 ED B0 80.  If
one follows the lead of the 'best practice' for processing ill-formed
UTF-8 code unit sequences given in TUS Section 5.22, this will be
interpreted as *six* ill-formed sequences, ED, A0, 80, ED, B0 and 80,
because A0 and B0 cannot follow ED in well-formed UTF-8.  I am not aware
of any recommendation as to how to interpret these sequences as
codepoints.
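
To make the byte sequences concrete, here is a minimal Python sketch.
cesu8_encode is a hypothetical helper (Python has no built-in CESU-8
codec), and the result of the decode depends on CPython 3's UTF-8
decoder, which, as far as can be told from its behaviour, substitutes
one U+FFFD per maximal ill-formed subpart.

```python
def cesu8_encode(text):
    """Encode text as CESU-8: each UTF-16 code unit is encoded
    separately with the UTF-8 bit layout (hypothetical helper)."""
    out = bytearray()
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # 'surrogatepass' lets a lone surrogate through the UTF-8 encoder.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


data = cesu8_encode("\U00010000")
print(data.hex(" "))  # ed a0 80 ed b0 80

# Decode as if it were UTF-8, replacing each ill-formed piece by U+FFFD.
decoded = data.decode("utf-8", errors="replace")
print(len(decoded), [f"U+{ord(c):04X}" for c in decoded])
```

With a strict decoder (the default errors="strict") the same call
raises UnicodeDecodeError instead of producing replacement characters.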

While being able to specify a search for surrogate codepoint U+D800
might be useful when dealing with ill-formed UTF-16 Unicode sequences,
UTS#18 Section 1.7, which discusses requirement RL1.7, states that there
is no requirement for a one-codepoint pattern such as \u{D800} to match
a UTF-16 Unicode string consisting just of one code unit with the value
0xD800.  The convenient, possibly intended, consequence of this is that
the RL1.1 requirement to allow patterns to specify surrogate codepoints
can be satisfied by simply treating them as unmatchable; for example,
such a 1-character RE could be treated as the empty Unicode set
[\p{gc=Lo} && \p{gc=Mn}].
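
A rough sketch of the 'unmatchable' approach, using Python's re module
rather than a UTS#18 engine.  translate_escapes is a hypothetical
preprocessing step, and the empty negative lookahead (?!) stands in for
the empty set, since re has no \p{...} or set intersection syntax.

```python
import re

NEVER = "(?!)"  # empty negative lookahead: fails at every position


def translate_escapes(pattern):
    """Rewrite \\u{XXXX} escapes for Python's re module.

    Surrogate codepoints are accepted syntactically but turned into a
    subpattern that can never match (hypothetical helper).
    """
    def repl(m):
        cp = int(m.group(1), 16)
        if 0xD800 <= cp <= 0xDFFF:
            return NEVER
        return re.escape(chr(cp))  # chr() raises ValueError above 0x10FFFF

    return re.sub(r"\\u\{([0-9A-Fa-f]{1,6})\}", repl, pattern)


# The pattern is not rejected, but it matches nothing, not even a str
# that happens to contain a lone surrogate.
surrogate_re = re.compile(translate_escapes(r"\u{D800}"))
print(surrogate_re.search("abc\ud800"))  # None
print(re.search(translate_escapes(r"a\u{62}c"), "abc") is not None)  # True
```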

Now, I suppose one might want to specify a match for ill-formed (in
context) UTF-8 code unit subsequences such as E0 80 (not a valid
initial subsequence) and E0 A5 (lacking a trailing byte), but as
matching is not required, I don't see the point in UTS#18 being
changed to ask for an appropriate syntax to be added.
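
The difference between the two examples can be checked with Python's
incremental UTF-8 decoder, and a bytes-level pattern shows what such a
match would look like without any new syntax.  This is only a sketch:
could_begin_well_formed is a made-up name, and the behaviour shown is
that of CPython's decoder, not something UTS#18 specifies.

```python
import codecs
import re


def could_begin_well_formed(data):
    """True if data is well-formed UTF-8 or a truncated prefix of it."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False lets the decoder buffer an incomplete sequence.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return False
    return True


print(could_begin_well_formed(b"\xe0\xa5"))  # True: a trailing byte is missing
print(could_begin_well_formed(b"\xe0\x80"))  # False: 80 cannot follow E0

# Matching ill-formed subsequences is straightforward on raw bytes.
ill_formed = re.compile(rb"\xe0\x80")
print(ill_formed.search(b"abc\xe0\x80def") is not None)  # True
```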

Richard.

