Regular Expressions and Canonical Equivalence

Richard Wordingham richard.wordingham at
Fri May 15 10:45:03 CDT 2015

On Fri, 15 May 2015 02:38:17 +0200
Philippe Verdy <verdy_p at> wrote:

> 2015-05-14 20:13 GMT+02:00 Richard Wordingham <
> richard.wordingham at>:
> > If the interval list is compacted, at most one of the intervals will
> > contain a character properly having combining class 0.
> This is not a sufficient condition, there is also the case where two
> intervals contain combining characters with the same combining class:
> their relative order is significant because one is blocking the other
> (it limits the allowed reorderings that are canonically equivalent).

If two fully decomposed characters of combining class 0 are included in
the match to a subexpression, all the characters between them will be

The needs you perceive would be met by providing the start
and end points of the locations of the non-starters flanking the
matching string on the sides where it starts with a non-starter or ends
with a character with non-zero rccc.  (U+00E2 would probably have to
count as a non-starter for your purposes.)  However, I'm not sure that
passing the positions would not suffice.

Don't forget that the input string can be rearranged, preserving
canonical equivalence, so that the captured string is actually

I think this discussion on search and replace would benefit from some
examples.  I don’t see your problem.  Is it based on experience?  I have
some fairly simple examples.

My first example is the replacement of ô <U+006F LATIN SMALL LETTER O,
U+0302 COMBINING CIRCUMFLEX ACCENT> by â <U+00E2 LATIN SMALL LETTER A
WITH CIRCUMFLEX> in
the 4-character string buộc <U+0062, U+0075, U+1ED9, U+0063>.  U+1ED9
has the full decomposition <U+006F, U+0323, U+0302>.  The substring ô
has the discontiguous position, in inclusive:exclusive notation:

Component 1 at Position 2:Component 2 at Position 2 (content U+006F)
Component 3 at Position 2:Whole at Position 3 (content U+0302)

Now, the regular expression syntax for an identified substring suggests
that it is contiguous.  For substitution, it therefore makes most sense
to view the whole string as though it were the canonically equivalent
<U+0062, U+0075, U+006F, U+0302, U+0323, U+0063>, a form in which the
identified substring is contiguous.  Replacement should therefore
create something canonically equivalent to <U+0062, U+0075, U+00E2,
U+0323, U+0063>.
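
The equivalences claimed in this first example can be checked with
Python's standard unicodedata module.  This is only a sketch; the
variable names are mine, and str.replace is used merely because the
substring is contiguous in the rearranged form:

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def nfc(s):
    """Return the canonical composition (NFC) of s."""
    return unicodedata.normalize("NFC", s)


# The 4-character string: b, u, U+1ED9, c.
original = "bu\u1ed9c"

# The canonically equivalent form in which <U+006F, U+0302> is contiguous.
rearranged = "buo\u0302\u0323c"

# Both have the same NFD, <b, u, o, U+0323, U+0302, c>.
same_nfd = nfd(original) == nfd(rearranged)

# An ordinary contiguous replacement on the rearranged form.
replaced = rearranged.replace("o\u0302", "\u00e2")

print(same_nfd)
print([f"U+{ord(c):04X}" for c in replaced])
print([f"U+{ord(c):04X}" for c in nfc(replaced)])
```

The replaced string is <U+0062, U+0075, U+00E2, U+0323, U+0063>; under
NFC it composes to b, u, U+1EAD, c.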

In terms of program logic, I would expect the string editing to proceed
something like this:

1. Decompose characters that straddle range boundaries, so:

  a.  String becomes <U+0062, U+0075, U+006F, U+0323, U+0302, U+0063>

  b.  Identified substring location updates to:

      i.  Whole at Position 2: Whole at Position 3 (content U+006F)

      ii. Whole at Position 4: Whole at Position 5 (content U+0302)

2. First portion contains a character with canonical combining class 0,
so replace it by replacement string.

3. Delete other portions.

4. Apply any normalisation requirements.
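
The four steps above can be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not a definitive
implementation: the function name and its parameters are hypothetical,
the whole string is decomposed to NFD rather than only the characters
that straddle range boundaries, and the portions are given as half-open
(start, end) code point ranges in that decomposed string.  The case in
which no portion contains a character of combining class 0 is not
handled here.

```python
import unicodedata


def replace_discontiguous(text, portions, replacement, form="NFC"):
    """Replace a possibly discontiguous match (hypothetical helper).

    portions: half-open (start, end) index ranges into NFD(text).
    """
    # Step 1 (simplified assumption): decompose everything, so that no
    # character straddles a range boundary.
    chars = list(unicodedata.normalize("NFD", text))
    portions = sorted(tuple(p) for p in portions)

    # Step 2: find the first portion containing a character of
    # canonical combining class 0; it receives the replacement.
    target = None
    for start, end in portions:
        if any(unicodedata.combining(c) == 0 for c in chars[start:end]):
            target = (start, end)
            break
    if target is None:
        raise ValueError("no portion contains a starter; not covered here")

    out = []
    pos = 0
    for start, end in portions:
        out.extend(chars[pos:start])       # text outside the match is kept
        if (start, end) == target:
            out.extend(replacement)        # step 2: insert the replacement
        pos = end                          # step 3: other portions vanish
    out.extend(chars[pos:])

    # Step 4: apply the caller's normalisation requirement.
    return unicodedata.normalize(form, "".join(out))


# First example: portions are U+006F at 2:3 and U+0302 at 4:5.
result = replace_discontiguous("bu\u1ed9c", [(2, 3), (4, 5)], "\u00e2")
print([f"U+{ord(c):04X}" for c in result])
```

The result is canonically equivalent to <U+0062, U+0075, U+00E2,
U+0323, U+0063>.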

For my second example, let the replacement string be <U+006F, U+031D
COMBINING UP TACK BELOW> instead.  I would expect the same logic to
apply, yielding a substring <U+006F, U+031D, U+0323>, and would not be
concerned by its not being canonically equivalent to <U+006F, U+0323,

For my third example, consider the replacement of U+0302 by U+031B
COMBINING HORN in the 6-character string buộc <U+0062, U+0075,
U+006F, U+0302, U+0323, U+0063>.  The character is at location Whole at
Position 4:Whole at Position 5.  The identified substring does not
contain any characters of canonical combining class 0.
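
The combining classes involved in this example, and the effect of
making the substitution in either arrangement of the marks, can be
checked with a short Python sketch (variable names are mine):

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


# Canonical combining classes of the marks involved.
ccc = {f"U+{cp:04X}": unicodedata.combining(chr(cp))
       for cp in (0x0302, 0x031B, 0x0323)}
print(ccc)

text = "buo\u0302\u0323c"

# Substitute at the stated location, marks in the order given.
in_place = text.replace("\u0302", "\u031b")

# Substitute after first putting the string into NFD.
via_nfd = nfd(text).replace("\u0302", "\u031b")

# The two results differ as code point sequences but share one NFD.
equivalent = nfd(in_place) == nfd(via_nfd)
print(equivalent)
```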

U+031B has ccc=216 and U+0323 has ccc=220, so it matters little how the
characters between U+006F and U+0063 are arranged: the results are
canonically equivalent and the substitution should be made without

For my fourth example, consider again the replacement of U+0302 in the
6-character string buộc <U+0062, U+0075, U+006F, U+0302, U+0323,
U+0063>, but this time by U+0068 LATIN SMALL LETTER H.

We now have a problem.  Applying the substitution at this location
yields the string buoḥc (dot below the ‘h’), while applying the
substitution to the string in NFD form yields buọhc (dot below the
‘o’), which is visually distinct.  In some ways this is similar to the
problem of grouping text into collating elements for collation.  The
Unicode Collation Algorithm resolves conflicts on the basis of the NFD
form.  Requiring the string to be in strict NFD might not be suitable –
it breaks compatibility ideographs.  Also, I can imagine wanting to
make global substitutions so as to undo ill effects of normalisation.
There are many different ways to handle the problem, and I can imagine
a rich selection of flags for a substitution routine.  I would urge,
however, that the replacement text should be contiguous in some
canonical equivalent of the resulting string.
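
The conflict in the fourth example can be reproduced with a few lines
of Python.  Again this is only a sketch with names of my choosing; it
shows that the two ways of applying the substitution give results that
are not canonically equivalent:

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def nfc(s):
    """Return the canonical composition (NFC) of s."""
    return unicodedata.normalize("NFC", s)


text = "buo\u0302\u0323c"

# Substitute at the stated location: U+0323 now follows the 'h'.
in_place = text.replace("\u0302", "h")

# Substitute in the NFD form: U+0323 stays attached to the 'o'.
via_nfd = nfd(text).replace("\u0302", "h")

# A starter (h) has been inserted, so the dot below can no longer be
# reordered across it and the two results are not equivalent.
equivalent = nfd(in_place) == nfd(via_nfd)

print(equivalent)
print([f"U+{ord(c):04X}" for c in nfc(in_place)])
print([f"U+{ord(c):04X}" for c in nfc(via_nfd)])
```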

