Unicode Regular Expressions, Surrogate Points and UTF-8

Mark Davis ☕️ mark at macchiato.com
Sat May 31 08:41:10 CDT 2014

I think you have a point here. We should probably change to:

To meet this requirement, an implementation shall supply a mechanism for
specifying any Unicode scalar value (from U+0000 to U+D7FF and U+E000 to
U+10FFFF), using the hexadecimal code point representation.

and then in the notes say that the same notation can be used for codepoints
that are not scalar values, for implementations that handle them in Unicode
Mark <https://google.com/+MarkDavis>

*— Il meglio è l’inimico del bene —*

On Fri, May 30, 2014 at 8:45 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> Is there any good reason for UTS#18 'Unicode Regular Expressions' to
> express its requirements in terms of codepoints rather than scalar
> values?
>
> I was initially worried by RL1.1 requiring that one be able to specify
> surrogate codepoints in a pattern.  It would not be compliant for an
> application to reject such patterns as syntactically or semantically
> incorrect!  RL1.1 seemed to prohibit compliant regular expression
> engines that only handled well-formed UTF-8 strings.
>
> Furthermore, consider attempting to handle CESU-8 text as a sequence of
> UTF-8 code units.  The code unit sequence for U+10000 will,
> corresponding to the UTF-16 code unit sequence D800 DC00, be ED A0 80
> ED B0 80. If one follows the lead of the 'best practice' for processing
> ill-formed UTF-8 code unit sequences given in TUS Section 5.22, this
> will be interpreted as *four* ill-formed sequences, ED A0, 80, ED B0,
> and 80.  I am not aware of any recommendation as to how to interpret
> these sequences as codepoints.
>
> While being able to specify a search for surrogate codepoint U+D800
> might be useful when dealing with ill-formed UTF-16 Unicode sequences,
> UTS#18 Section 1.7, which discusses requirement RL1.7, states that there
> is no requirement for a one-codepoint pattern such as \u{D800} to match
> a UTF-16 Unicode string consisting just of one code unit with the value
> 0xD800.  The convenient, possibly intended, consequence of this is that
> the RL1.1 requirement to allow patterns to specify surrogate codepoints
> can be satisfied by simply treating them as unmatchable; for example,
> such a 1-character RE could be treated as the empty Unicode set
> [\p{gc=Lo} && \p{gc=Mn}].
>
> Now, I suppose one might want to specify a match for ill-formed (in
> context) UTF-8 code unit subsequences such as E0 80 (not a valid
> initial subsequence) and E0 A5 (lacking a trailing byte), but as
> matching is not required, I don't see the point in UTS#18 being
> changed to ask for an appropriate syntax to be added.
>
> Richard.
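
The CESU-8 byte sequence quoted above for U+10000 can be reproduced with a
short Python sketch. The helper names are hypothetical; the arithmetic is the
standard UTF-16 surrogate-pair split followed by the generic three-byte UTF-8
bit pattern applied to each surrogate separately.

```python
def utf16_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def three_byte_pattern(unit: int) -> bytes:
    """Apply the bit layout 1110xxxx 10xxxxxx 10xxxxxx to a 16-bit value.

    For surrogates this yields bytes that are NOT well-formed UTF-8.
    """
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def cesu8_supplementary(cp: int) -> bytes:
    """CESU-8 form of a supplementary code point: each surrogate encoded
    separately, six bytes in total."""
    high, low = utf16_surrogates(cp)
    return three_byte_pattern(high) + three_byte_pattern(low)


cesu = cesu8_supplementary(0x10000)   # ED A0 80 ED B0 80
utf8 = chr(0x10000).encode("utf-8")   # F0 90 80 80, the well-formed UTF-8 form

# Python's strict UTF-8 decoder rejects the CESU-8 bytes as ill-formed.
try:
    cesu.decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
```
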