Pure Regular Expression Engines and Literal Clusters

Richard Wordingham via Unicode unicode at unicode.org
Mon Oct 14 02:46:07 CDT 2019


On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AB.
> 
>    - A query /X/ can match the separated A and C in the target string
>    "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how
> should it behave? "pqb", "pbq", "bpq"?

If A contains a non-starter, pqbC.
If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant. 
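
To make the discontiguity concrete, here is a minimal Python sketch using
only the standard unicodedata module. The characters are an assumed
illustration, not taken from the thread: X is U+00E2, A is "a", C is
U+0302 (combining circumflex) and the intervening b is U+0323 (combining
dot below). The helper names nfd and canonically_equivalent are
hypothetical. It only shows that the match exists and that the three
candidate outputs from the question are not canonically equivalent to one
another; it does not decide which of the rules above applies.

```python
import unicodedata


def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)


def canonically_equivalent(s, t):
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return nfd(s) == nfd(t)


# Assumed illustration: X = U+00E2, equivalent to A + C = "a" + U+0302.
X = "\u00E2"
A, b, C = "a", "\u0323", "\u0302"   # b = combining dot below (ccc 220)
target = A + b + C

# A and C are separated in the target, so a plain substring search fails ...
contiguous_match = nfd(X) in nfd(target)

# ... yet the target is canonically equivalent to X followed by b.
discontiguous_match = canonically_equivalent(target, X + b)

# The three candidate outputs from the question, with b = U+0323.
candidates = ["pq" + b, "p" + b + "q", b + "pq"]
all_equivalent = all(canonically_equivalent(candidates[0], c) for c in candidates)
```

With these characters the substring test fails, the equivalence test
succeeds, and the three candidates differ under canonical equivalence,
which is the situation where the choice of output is not obvious.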

> If the input was in NFD (for
> example), should the output be rearranged/decomposed so that it is
> NFD? and so on.

That is not a new issue.  It exists already.

>    - A query /A/ can match *part* of the X in the target string
> "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
> should result: "apqBb"?

Yes, unless raising an exception is appropriate (see above).
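
A rough Python sketch of that outcome, again with assumed characters
rather than anything from the thread: X is U+00EA, which decomposes to
A = "e" followed by B = U+0302. Decomposing first is one simple way to
expose A; a real engine need not work this way.

```python
import unicodedata

# Assumed illustration: X = U+00EA, equivalent to A + B = "e" + U+0302.
target = "a\u00EAb"

# On the precomposed string there is no "e" code point to find.
naive = target.replace("e", "pq")

# After canonical decomposition, A is exposed and can be replaced on its own,
# leaving B (the circumflex) attached to whatever now precedes it.
decomposed = unicodedata.normalize("NFD", target)
result = decomposed.replace("e", "pq")
```

Here result is "a", "p", "q", U+0302, "b", i.e. the "apqBb" shape from the
question, with the circumflex now sitting on the "q".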

> The syntax and APIs for regex engines are not built to handle these
> features. It introduces enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.

What higher level?  If anything, I would say that the handler is at a
lower level (character fragments and the like).

The potential requirement should be restored, but not subsumed in
Levels 1 to 3.  It is a sufficiently different level of endeavour.

Richard.


