Pure Regular Expression Engines and Literal Clusters

Mark Davis ☕️ via Unicode unicode at unicode.org
Sun Oct 13 23:28:34 CDT 2019


The problem is that most regex engines are not written to handle some
"interesting" features of canonical equivalence, like discontinuity.
Suppose that X is canonically equivalent to AB.

   - A query /X/ can match the separated A and B in the target string
   "AcB". So if I have code to do [replace /X/ in "AcB" by "pq"], how should
   it behave? "pqc", "pcq", "cpq"? If the input was in NFD (for example),
   should the output be rearranged/decomposed so that it is NFD? And so on.
   - A query /A/ can match *part* of the X in the target string "aXb". So
   if I have code to do [replace /A/ in "aXb" by "pq"], what should result:
   "apqBb"?
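
Both cases can be reproduced with ordinary code-point string operations. The following is only a rough sketch in Python with the standard unicodedata module. The concrete characters are my own choice for illustration (â for the precomposed character, COMBINING DOT BELOW as the intervening mark), and the nfd helper is not part of any regex API.

```python
import unicodedata

def nfd(s: str) -> str:
    # Canonical decomposition followed by canonical reordering.
    return unicodedata.normalize("NFD", s)

# Discontinuity: the query is a precomposed letter; the target is that same
# letter followed by a mark with a lower combining class.
query = "\u00E2"            # LATIN SMALL LETTER A WITH CIRCUMFLEX
target = "\u00E2\u0323"     # same letter + COMBINING DOT BELOW

literal_hit = query in target          # found in the text as stored
nfd_target = nfd(target)               # <a, U+0323, U+0302> after reordering
nfd_hit = nfd(query) in nfd_target     # <a, U+0302> is no longer contiguous

# Partial match: a plain "a" matches inside the decomposed letter, and the
# replacement leaves the circumflex attached to the wrong base.
replaced = nfd("x\u00E2y").replace("a", "pq")
```

Here literal_hit is true while nfd_hit is false, even though the two targets are canonically equivalent, and replaced ends up as "xpq" followed by a stray U+0302 and "y". Neither outcome is obviously what the caller wanted, which is the point of the two questions above.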

The syntax and APIs for regex engines are not built to handle these
features. They introduce enough complications in the code, syntax, and
semantics that no major implementation has seen fit to support them. We
used to have a section in the spec about this, but were convinced that it
was better off handled at a higher level.

Mark


On Sun, Oct 13, 2019 at 8:31 PM Asmus Freytag via Unicode <
unicode at unicode.org> wrote:

> On 10/13/2019 6:38 PM, Richard Wordingham via Unicode wrote:
>
> On Sun, 13 Oct 2019 17:13:28 -0700
> Asmus Freytag via Unicode <unicode at unicode.org> wrote:
>
>
> On 10/13/2019 2:54 PM, Richard Wordingham via Unicode wrote:
> Besides invalidating complexity metrics, the issue was what \p{Lu}
> should match.  For example, with PCRE syntax, GNU grep Version 2.25
> \p{Lu} matches U+0100 but not <A, U+0304>.  When I'm respecting
> canonical equivalence, I want both to match [:Lu:], and that's what I
> do. [:Lu:] can then match a sequence of up to 4 NFD characters.
>
> Formally, wouldn't that be rewriting \p{Lu} to match \p{Lu}\p{Mn}*;
> instead of formally handling NFD, you could extend the syntax to
> handle "inherited" properties across combining sequences.
>
> Am I missing anything?
>
> Yes.  There is no precomposed LATIN LETTER M WITH CIRCUMFLEX, so [:Lu:]
> should not match <U+004D LATIN CAPITAL LETTER M, U+0302 COMBINING
> CIRCUMFLEX ACCENT>.
>
> Why does it matter if it is precomposed? Why should it? (For anyone other
> than a character coding maven).
>
>  Now, I could invent a string property so
> that \p{xLu} meant (?:\p{Lu}\p{Mn}*).
>
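The difference between the two readings discussed above can be made concrete. This is a minimal sketch in Python using only the standard unicodedata module. Both function names are hypothetical. The second function approximates "canonically equivalent to a single Lu character" by composing to NFC, which is an assumption of mine and could in principle be thrown off by composition exclusions.

```python
import unicodedata

def is_lu_mn_star(s: str) -> bool:
    # The "inherited property" reading: one Lu code point followed by
    # zero or more Mn code points, i.e. \p{Lu}\p{Mn}* anchored at both ends.
    if not s or unicodedata.category(s[0]) != "Lu":
        return False
    return all(unicodedata.category(c) == "Mn" for c in s[1:])

def is_canonical_lu(s: str) -> bool:
    # The canonical-equivalence reading (approximated): the sequence
    # composes under NFC to a single code point whose category is Lu.
    composed = unicodedata.normalize("NFC", s)
    return len(composed) == 1 and unicodedata.category(composed) == "Lu"

samples = {
    "precomposed": "\u0100",    # LATIN CAPITAL LETTER A WITH MACRON
    "decomposed": "A\u0304",    # A + COMBINING MACRON
    "m_circumflex": "M\u0302",  # M + COMBINING CIRCUMFLEX ACCENT
}
results = {name: (is_lu_mn_star(s), is_canonical_lu(s))
           for name, s in samples.items()}
```

The first two samples satisfy both tests. The M with circumflex satisfies only the \p{Lu}\p{Mn}* reading, because no precomposed character exists for it to compose to.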
> I don't entirely understand what you said; you may have missed the
> distinction between "[:Lu:] can then match" and "[:Lu:] will then
> match".  I think only Greek letters expand to 4 characters in NFD.
>
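The expansion-length remark can be checked mechanically. Here is a rough sketch in Python with the standard unicodedata module. The helper name longest_nfd is mine, and whatever it reports depends on the Unicode version bundled with the interpreter, so I give no figure for Lu here.

```python
import sys
import unicodedata

def longest_nfd(category: str):
    # Return (length, code point) for the longest NFD expansion among all
    # code points whose General_Category equals `category`.
    best_len, best_cp = 0, None
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) != category:
            continue
        n = len(unicodedata.normalize("NFD", ch))
        if n > best_len:
            best_len, best_cp = n, cp
    return best_len, best_cp

# One Greek example: U+1F82 GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA
# AND YPOGEGRAMMENI decomposes to a base letter plus three marks.
greek = unicodedata.normalize("NFD", "\u1F82")
```

Calling longest_nfd("Lu") or longest_nfd("Ll") scans the whole code space, so it takes a moment, but it shows directly how long a canonically aware match for a given category may have to be.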
> When I'm respecting canonical equivalence/working with traces, I want
> [:insc=vowel_dependent:][:insc=tone_mark:] to match both <U+0E39 THAI
> CHARACTER SARA UU, U+0E49 THAI CHARACTER MAI THO> and its canonical
> equivalent <U+0E49, U+0E39>.  The canonical closure of that
> sequence can be messy even within scripts.  Some pairs commute: others
> don't, usually for good reasons.
>
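The Thai case can be checked from the canonical combining classes. The following is a minimal sketch in Python with the standard unicodedata module. Python does not expose Indic_Syllabic_Category, so this only shows the reordering and not the [:insc=...:] match itself, and the commutes helper is a hypothetical name for the usual rule that two adjacent marks may be swapped only when both have non-zero, different combining classes.

```python
import unicodedata

SARA_UU = "\u0E39"   # THAI CHARACTER SARA UU, ccc=103
MAI_THO = "\u0E49"   # THAI CHARACTER MAI THO, ccc=107
MAI_EK = "\u0E48"    # THAI CHARACTER MAI EK, ccc=107
SARA_I = "\u0E34"    # THAI CHARACTER SARA I, ccc=0
KO_KAI = "\u0E01"    # THAI CHARACTER KO KAI, used here as a base consonant

def commutes(a: str, b: str) -> bool:
    # Adjacent marks can be exchanged without changing canonical identity
    # only if both are non-starters with different combining classes.
    ca, cb = unicodedata.combining(a), unicodedata.combining(b)
    return ca != 0 and cb != 0 and ca != cb

vowel_first = KO_KAI + SARA_UU + MAI_THO
tone_first = KO_KAI + MAI_THO + SARA_UU

# Both orders normalize to the same NFD string, with the vowel first.
same_nfd = (unicodedata.normalize("NFD", vowel_first)
            == unicodedata.normalize("NFD", tone_first))
```

SARA UU and MAI THO commute, so a matcher working code point by code point has to accept both orders. SARA I with MAI THO, or MAI EK with MAI THO, do not commute, so for those pairs the order is significant.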
> Some models may be more natural for different scripts. Certainly, in SEA
> or Indic scripts, most combining marks are not best modeled with properties
> as "inherited". But for L/G/C etc. it would be a different matter.
>
> For general recommendations, such as UTS#18, it would be good to move the
> state of the art so that the "primitives" are in line with the way typical
> writing systems behave, so that people can write "linguistically correct"
> regexes.
>
> A./
>
>
> Regards,
>
> Richard.
>
>
>
>