Unicode Sets in 'Unicode Regular Expressions'

Richard Wordingham richard.wordingham at ntlworld.com
Tue May 27 19:19:26 CDT 2014


On Wed, 28 May 2014 00:56:40 +0200
Charlie Ruland ☘ <ruland at luckymail.com> wrote:

> So I take “Unicode set” to mean “set of Unicode characters” with
> their respective codepoints, whether decomposable or not.

The decomposability issue arises when trying to follow RL2.1
"Canonical Equivalence".  In a pattern such as "f\p{L}te",
\p{L} is not just a set of codepoints if the pattern is to be matched
by "fête" when processing NFD strings.  This is one reason I think Ken
is right when he says the ICU meaning is intended.  I believe I have a
coherent resolution of RL2.1, but I'm currently wrestling with the
other requirements that an implementation satisfying the spirit of
RL2.1 ought to address.
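
To make the problem concrete, here is a minimal sketch in Python.  It
is only an illustration, not a proposed implementation: is_letter()
and match_codepoints() are hypothetical helpers, and \p{L} is
approximated as "General_Category begins with L".  The matcher treats
\p{L} as a set of single codepoints, which is exactly the reading
that breaks down on NFD input.

```python
import unicodedata

def is_letter(ch):
    # Approximation of \p{L}: any General_Category starting with "L".
    return unicodedata.category(ch).startswith("L")

def match_codepoints(text):
    # Naive whole-string match of f\p{L}te, where \p{L} consumes
    # exactly one codepoint (hypothetical helper for illustration).
    return (len(text) == 4
            and text[0] == "f"
            and is_letter(text[1])
            and text[2:] == "te")

nfc = unicodedata.normalize("NFC", "f\u00eate")   # f, U+00EA, t, e
nfd = unicodedata.normalize("NFD", nfc)           # f, e, U+0302, t, e

print(len(nfc), match_codepoints(nfc))   # 4 True
print(len(nfd), match_codepoints(nfd))   # 5 False
# Normalizing the subject to NFC first hides the problem in this case.
print(match_codepoints(unicodedata.normalize("NFC", nfd)))   # True
```

Normalizing to NFC rescues this example only because U+00EA has a
precomposed form; it would not help for a base letter plus combining
mark that has no precomposed equivalent.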

Richard.


