Unicode Sets in 'Unicode Regular Expressions'

Richard Wordingham richard.wordingham at ntlworld.com
Tue May 27 19:19:26 CDT 2014

On Wed, 28 May 2014 00:56:40 +0200
Charlie Ruland ☘ <ruland at luckymail.com> wrote:

> So I take “Unicode set” to mean “set of Unicode characters” with
> their respective codepoints, whether decomposable or not.

The decomposability issue arises when trying to follow RL2.1
"Canonical Equivalence".  In a pattern such as "f\p{L}te",
\p{L} is not just a set of codepoints if the pattern is to be matched
by "fête" when processing NFD strings.  This is one reason I think Ken
is right when he says the ICU meaning is intended.  I believe I have a
coherent resolution of RL2.1, but I'm currently wrestling with the
other requirements that an implementation satisfying the spirit of
RL2.1 ought to address.
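
To make the problem concrete, here is a minimal Python sketch. It is only
an illustration, not the resolution mentioned above. Python's standard
"re" module has no \p{L}, so the helper names is_letter and
matches_codepointwise are hypothetical stand-ins that treat \p{L} as a set
of single codepoints (general category L*) and test the whole string
against "f\p{L}te".

```python
import unicodedata

def is_letter(ch):
    # Assumption: \p{L} is read as "one code point whose general category starts with L".
    return unicodedata.category(ch).startswith("L")

def matches_codepointwise(text):
    # Hypothetical whole-string match of f\p{L}te, one code point per pattern element.
    return (
        len(text) == 4
        and text[0] == "f"
        and is_letter(text[1])
        and text[2:] == "te"
    )

nfc = unicodedata.normalize("NFC", "f\u00eate")  # f, U+00EA, t, e
nfd = unicodedata.normalize("NFD", "f\u00eate")  # f, e, U+0302, t, e

print(len(nfc), matches_codepointwise(nfc))  # 4 True
print(len(nfd), matches_codepointwise(nfd))  # 5 False
```

In the NFD string the circumflex is a separate combining mark (U+0302,
general category Mn), so no single codepoint in \p{L} can cover "e" plus
the mark, and the codepoint-wise match fails even though the two strings
are canonically equivalent.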

