Pure Regular Expression Engines and Literal Clusters
Richard Wordingham via Unicode
unicode at unicode.org
Mon Oct 14 02:46:07 CDT 2019
On Sun, 13 Oct 2019 21:28:34 -0700
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:
> The problem is that most regex engines are not written to handle some
> "interesting" features of canonical equivalence, like discontinuity.
> Suppose that X is canonically equivalent to AB.
> - A query /X/ can match the separated A and C in the target string
> "AbC". So if I have code do [replace /X/ in "AbC" by "pq"], how
> should it behave? "pqb", "pbq", "bpq"?
If A contains a non-starter, pqbC.
If C contains a non-starter, Abpq.
Otherwise, if the results are canonically inequivalent, it should
raise an exception for attempting a process that is either ill-defined
or not Unicode-compliant.
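To make the discontiguity concrete, here is a minimal Python sketch. The code points are my own assumption, not taken from the thread: X is U+00E2, which decomposes to U+0061 U+0302, and U+0323 sits between the two parts in the target. The variable names are illustrative only. The sketch shows that the target is canonically equivalent to X followed by the interloper, that a code-point-based search finds nothing, and that the three candidate results in the quoted question are pairwise canonically inequivalent. It does not say which result is the right one.

```python
import re
import unicodedata


def nfd(s):
    """Return the NFD form of s."""
    return unicodedata.normalize("NFD", s)


def canonically_equivalent(s, t):
    """Two strings are canonically equivalent iff their NFD forms are equal."""
    return nfd(s) == nfd(t)


# Assumed concrete instance (not from the thread).
X = "\u00e2"            # LATIN SMALL LETTER A WITH CIRCUMFLEX
first = "a"             # starter, combining class 0
interloper = "\u0323"   # COMBINING DOT BELOW, combining class 220
last = "\u0302"         # COMBINING CIRCUMFLEX ACCENT, combining class 230
target = first + interloper + last

# X decomposes to first + last, and X followed by the interloper is
# canonically equivalent to the target: the match is discontiguous.
x_decomposes = nfd(X) == first + last
discontiguous = canonically_equivalent(X + interloper, target)

# A code-point-based engine finds neither X nor its decomposition.
naive_hits = [re.search(re.escape(q), target) for q in (X, first + last)]

# The three candidate results from the quoted question, with the
# interloper after, inside, or before the replacement "pq".
candidates = ["pq" + interloper, "p" + interloper + "q", interloper + "pq"]
all_inequivalent = len({nfd(c) for c in candidates}) == 3
```

Because the three candidates are canonically inequivalent, an implementation has to either pick one by a stated rule or refuse.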
> If the input was in NFD (for
> example), should the output be rearranged/decomposed so that it is
> NFD? and so on.
That is not a new issue. It exists already.
> - A query /A/ can match *part* of the X in the target string
> "aXb". So if I have code to do [replace /A/ in "aXb" by "pq"], what
> should result: "apqBb"?
Yes, unless raising an exception is appropriate (see above).
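The partial-match case can be sketched the same way. This is only one way to get the "apqBb" shape, by decomposing the target before a code-point-based substitution; it is not a claim about how any engine does or should do it. The concrete characters are again my assumption: "x", U+00E2 and "y" play the roles of "a", X and "b", and the query is the letter "a".

```python
import re
import unicodedata

target = "x\u00e2y"   # plays the role of "aXb"; U+00E2 is X
query = "a"           # plays the role of A, the first part of X

# Working on code points as stored, the query is not found at all.
naive = re.sub(query, "pq", target)

# After canonical decomposition, X becomes U+0061 U+0302 and the query
# matches its first part; the circumflex (the "B") is left behind.
decomposed = unicodedata.normalize("NFD", target)
result = re.sub(query, "pq", decomposed)
```

Here result is "xpq" followed by U+0302 and "y", so the leftover mark now combines with the "q" of the replacement.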
> The syntax and APIs for regex engines are not built to handle these
> features. It introduces enough complications in the code, syntax,
> and semantics that no major implementation has seen fit to do it. We
> used to have a section in the spec about this, but were convinced
> that it was better off handled at a higher level.
What higher level? If anything, I would say that the handler is at a
lower level (character fragments and the like).
The potential requirement should be restored, but not subsumed in
Levels 1 to 3. It is a sufficiently different level of endeavour.