Regular Expressions and Canonical Equivalence

Philippe Verdy verdy_p at wanadoo.fr
Fri May 15 15:21:56 CDT 2015


2015-05-15 17:45 GMT+02:00 Richard Wordingham <
richard.wordingham at ntlworld.com>:

> I think this discussion on search and replace would benefit from some
> examples.  I don’t see your problem.  Is it based on experience?  I have
> some fairly simple examples.
>

Just consider a regexp that attempts to search for and substitute "é" (for
example with "É") and that has to locate it whether it is in NFC form
(single character) or NFD form (combining sequence). You'll also have to
match cases where other combining characters (with a non-zero combining
class different from the combining class of the acute accent) occur
between the base letter and the acute accent, e.g. <e, U+0323 COMBINING
DOT BELOW, U+0301 COMBINING ACUTE ACCENT>.

You then have to return discontiguous matches, but the replacement by "É"
must still preserve the other combining characters.
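
To make the case above concrete, here is a minimal sketch in Python. It is
not a regexp engine: it assumes the text may be renormalized as a whole, it
uses only the standard unicodedata module, and the function name
replace_e_acute is made up for illustration. It treats an acute accent as
part of the match only when no earlier mark of the same combining class
blocks it, and it leaves every other combining mark in place.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT, canonical combining class 230


def replace_e_acute(text: str) -> str:
    """Replace "é" by "É" under canonical equivalence (illustrative only)."""
    nfd = unicodedata.normalize("NFD", text)
    acute_ccc = unicodedata.combining(ACUTE)
    out = []
    i, n = 0, len(nfd)
    while i < n:
        # Take one character plus the non-starter marks that follow it.
        j = i + 1
        while j < n and unicodedata.combining(nfd[j]) != 0:
            j += 1
        base, marks = nfd[i], nfd[i + 1:j]
        if base == "e":
            # In NFD, marks are sorted by combining class. The acute is
            # unblocked only if it is the first mark of its own class.
            for mark in marks:
                if unicodedata.combining(mark) == acute_ccc:
                    if mark == ACUTE:
                        base = "E"  # discontiguous match: other marks stay
                    break
        out.append(base + marks)
        i = j
    # Assumption: the caller accepts NFC output for the whole string.
    return unicodedata.normalize("NFC", "".join(out))


print(replace_e_acute("caf\u00e9"))  # prints "cafÉ"
```

With this sketch, <e, U+0323, U+0301> becomes <E, U+0323, U+0301> (then
recomposed by NFC), while <e, U+0308, U+0301> is left alone, because the
diaeresis has the same combining class as the acute and so that sequence is
not canonically equivalent to "é" followed by a diaeresis.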

The situation is even worse if you are searching for strings from which you
want to discard only some combining characters (the replacement is empty):
the matches may then contain several discontiguities. Now imagine that the
replacement must substitute a single combining character for all of these
distinct combining characters. Filters would do this to eliminate combining
characters that are not suitable for a given language, or because a
linguistic orthographic rule permits such substitutions of foreign
combining characters, e.g. drop combining dots above, and replace all
combining characters below, except the cedilla, by a single one such as a
low line. This would also happen for languages that have changed or
simplified their orthography with respect to combining characters, or that
use two distinct orthographic conventions between which you want to
convert.
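
A filter of that kind could be sketched as follows, again in Python with
the standard unicodedata module. The function name simplify_marks is
hypothetical, and two assumptions are made: "combining characters below"
is read as "canonical combining class 220", and the single replacement
mark is U+0332 COMBINING LOW LINE. The cedilla has class 202, so it is
outside that set anyway, but it is excluded explicitly to mirror the rule.

```python
import unicodedata

DOT_ABOVE = "\u0307"  # COMBINING DOT ABOVE
CEDILLA = "\u0327"    # COMBINING CEDILLA (combining class 202)
LOW_LINE = "\u0332"   # COMBINING LOW LINE (combining class 220)
BELOW = 220           # assumption: "below" means combining class 220


def simplify_marks(text: str) -> str:
    """Drop dots above; merge all below marks into one low line."""
    out = []
    low_line_emitted = False
    for ch in unicodedata.normalize("NFD", text):
        ccc = unicodedata.combining(ch)
        if ccc == 0:
            # A starter begins a new combining sequence.
            low_line_emitted = False
            out.append(ch)
        elif ch == DOT_ABOVE:
            continue  # empty replacement: the mark is discarded
        elif ccc == BELOW and ch != CEDILLA:
            # Several distinct marks collapse into a single low line.
            if not low_line_emitted:
                out.append(LOW_LINE)
                low_line_emitted = True
        else:
            out.append(ch)  # all other marks are preserved
    return unicodedata.normalize("NFC", "".join(out))
```

For example, <a, U+0323, U+0331> would become <a, U+0332>, and U+0117
(e with dot above) would become a plain "e". Working on the NFD form keeps
the class-220 marks adjacent, so the single low line lands where the first
of them stood and the result stays in canonical order.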