Unicode Regular Expressions for Syllable Structure and Normalisation

Sun May 18 18:06:01 CDT 2014

While pondering the Indic Syllabic Category property and its
application in regular expressions, I found myself wondering which
Thai-script strings should match the 'regular expression'

\p{isc=Consonant}\p{isc=Nukta}\p{isc=Vowel_Dependent}

Now, with the present tables for the property Indic_Syllabic_Category,
this is not a problem.  However, U+0331 COMBINING MACRON BELOW serves
as a nukta, and the problem I foresee will surface once it is assigned
isc=Nukta, for U+0331 has canonical combining class 220 but U+0E38 THAI
CHARACTER SARA U and U+0E39 THAI CHARACTER SARA UU have canonical
combining class 103.  Note that the problem does not arise with this
expression when one adds U+0E3A THAI CHARACTER PHINTHU to the list of
nuktas, for its canonical combining class, 9, is lower than 103, so
the nukta-vowel order survives normalisation.

Using U+0E07 THAI CHARACTER NGO NGU as the consonant, and considering
the possibility of using U+034F COMBINING GRAPHEME JOINER to avoid
rendering problems with naïve rendering engines, for which of the
following strings ought a regular expression engine declare a match if
U+0331 is given isc=Nukta?

<U+0E07, U+0331, U+0E38> (Not in NFD)
<U+0E07, U+0E38, U+0331> (In NFD, but not the specified order)
<U+0E07, U+0331, U+034F, U+0E38> (Extraneous character)
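For anyone who wants to check the annotations above, here is a minimal
Python sketch using only the standard unicodedata module.  The
dictionary labels and the helper code_points are names made up for
this illustration.

```python
import unicodedata

# The three candidate strings, keyed by an informal description.
candidates = {
    "nukta, vowel": "\u0E07\u0331\u0E38",
    "vowel, nukta": "\u0E07\u0E38\u0331",
    "nukta, CGJ, vowel": "\u0E07\u0331\u034F\u0E38",
}

def code_points(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Canonical combining classes of the marks involved.
for ch in "\u0331\u0E38\u034F":
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

# Show each string next to its NFD form.
for label, s in candidates.items():
    nfd = unicodedata.normalize("NFD", s)
    status = "in NFD" if s == nfd else "not in NFD"
    print(f"{label}: {code_points(s)} -> {code_points(nfd)} ({status})")
```

The first string is reordered to the second because 103 sorts before
220; the third is left alone because U+034F has combining class 0 and
so blocks the reordering.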

I consulted UTS #18 Unicode Regular Expressions, and at first glance it
appeared that the requirement 'RL2.1 Canonical Equivalents' should
supply the answer.  However, it transpired that RL2.1 does not contain
any actual requirement!  It does suggest a three-part strategy:

1. Putting the text to be matched into a defined normalization form
(NFD or NFKD).

2. Having the user design the regular expression pattern to match
against that defined normalization form. For example, the pattern
should contain no characters that would not occur in that normalization
form, nor sequences that would not occur.

3. Applying the matching algorithm on a code point by code point basis,
as usual.

Part 1 (NFD, not NFKD) is reasonable.

Part 2 leaves me a bit confused.  Taken literally, there is no problem
with the pattern; it is in ASCII!  However, expanding the pattern out
to the possible sequences leaves one searching the NFD string <U+0E07,
U+0E38, U+0331> for the non-NFD substring <U+0E07, U+0331, U+0E38>.
Obviously, no matches will be found when Part 3 is applied.
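The failure can be reproduced with a rough Python sketch.  Python's re
module does not support \p{isc=...}, so the three character classes
below are hand-written stand-ins and are assumptions of mine: the
consonant and vowel ranges are only a subset chosen for illustration,
and U+0331 as a nukta is the hypothetical assignment discussed above.

```python
import re
import unicodedata

# Hand-written stand-ins for the property classes (assumptions, not UCD data).
CONSONANT = "[\u0E01-\u0E2E]"        # Thai consonants KO KAI..HO NOKHUK
NUKTA = "[\u0331]"                   # hypothetical isc=Nukta assignment
VOWEL_DEPENDENT = "[\u0E34-\u0E39]"  # SARA I..SARA UU only

# The pattern expanded literally: consonant, nukta, dependent vowel.
naive = re.compile(CONSONANT + NUKTA + VOWEL_DEPENDENT)

typed = "\u0E07\u0331\u0E38"                  # not in NFD
text = unicodedata.normalize("NFD", typed)    # Part 1: <U+0E07, U+0E38, U+0331>

print(naive.search(typed) is not None)   # True: the raw string matches
print(naive.search(text) is not None)    # False: the NFD text does not
```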

Is the correct solution to change the comprehensible regular expression
to 

\p{isc=Consonant}(\p{isc=Nukta}\p{isc=Vowel_Dependent}|
                  \p{isc=Vowel_Dependent}\p{isc=Nukta})

?

It does contain impossible sequences, but it will find the sequences
that should be found.

One drawback is that it will match <U+0E07, U+0E34 THAI CHARACTER SARA
I, U+0331>, which one might expect to be a homograph of the
canonically inequivalent sequence <U+0E07, U+0331, U+0E34>.  (The
combining marks do not interact typographically, but U+0E34 follows
the Indic pattern of having canonical combining class 0.  In practice,
dotted circles are liable to appear for either string.)
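A small Python sketch of the two-order pattern and its drawback
follows.  As Python's re module lacks \p{isc=...}, the character
classes are hand-written approximations (my assumption, covering only
a subset of Thai), and U+0331 is treated as a nukta hypothetically.

```python
import re
import unicodedata

# Hand-written stand-ins for the property classes (assumptions, not UCD data).
CONSONANT = "[\u0E01-\u0E2E]"        # Thai consonants
NUKTA = "[\u0331]"                   # hypothetical isc=Nukta assignment
VOWEL_DEPENDENT = "[\u0E34-\u0E39]"  # SARA I..SARA UU only

# Consonant followed by nukta and vowel in either order.
both_orders = re.compile(
    CONSONANT
    + "(?:" + NUKTA + VOWEL_DEPENDENT
    + "|" + VOWEL_DEPENDENT + NUKTA + ")"
)

def nfd(s):
    return unicodedata.normalize("NFD", s)

# The wanted case: SARA U (ccc 103) ends up before U+0331 in NFD.
print(both_orders.fullmatch(nfd("\u0E07\u0331\u0E38")) is not None)  # True

# The drawback: SARA I has ccc 0, so these two strings are both in NFD
# and canonically inequivalent, yet the pattern accepts both.
sara_i_first = "\u0E07\u0E34\u0331"
nukta_first = "\u0E07\u0331\u0E34"
print(nfd(sara_i_first) != nfd(nukta_first))                  # True
print(both_orders.fullmatch(sara_i_first) is not None)        # True
print(both_orders.fullmatch(nukta_first) is not None)         # True
```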

I have done a little mathematical work on regular expressions and
canonical equivalence, but that was with true regular expressions, i.e.
those recognisable by finite automata.  I worked with NFD strings, and came
to the conclusion that the result of concatenating strings should be
defined as the result of character-wise concatenation followed by
normalisation.  Even with this definition, concatenations of regular
expressions are still regular expressions, as in the standard theory.
(Character-wise concatenation, excluding non-NFD juxtapositions, also
yields a regular expression - the set of NFD strings can be defined by
a regular expression.) 

I came to the disappointing conclusion that

(\p{name=COMBINING DOT BELOW}\p{name=COMBINING ACUTE ACCENT})*

was not a true regular expression, at least, not in any sense that
allows the expression to denote an infinite set of strings.  One could
define * to be restricted to character-wise concatenations that yielded
NFD strings, but this is potentially very confusing.  It might be
argued to be in line with RL2.1 in UTS #18.
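To see why, note that normalising n character-wise repetitions of the
pair sorts the marks by combining class, giving n dots below (class
220) followed by n acutes (class 230).  The set of such strings has the
shape of the textbook non-regular language a^n b^n.  A minimal Python
check, in which normalised_power is a name invented for this sketch:

```python
import unicodedata

DOT_BELOW = "\u0323"   # COMBINING DOT BELOW, ccc 220
ACUTE = "\u0301"       # COMBINING ACUTE ACCENT, ccc 230

def normalised_power(n):
    """NFD of n character-wise repetitions of <dot below, acute>."""
    return unicodedata.normalize("NFD", (DOT_BELOW + ACUTE) * n)

for n in range(1, 5):
    result = normalised_power(n)
    # All the dots below come first, then all the acutes.
    print(n, result == DOT_BELOW * n + ACUTE * n)
```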

If one takes the approach I've outlined of including normalisation in
the concatenation operation, then one can revert to the definition

\p{isc=Consonant}\p{isc=Nukta}\p{isc=Vowel_Dependent}

and, because the text to be searched is considered only once it has
been converted to NFD, this will match both <U+0E07, U+0331, U+0E38>
and <U+0E07, U+0E38, U+0331>, but not <U+0E07, U+0331, U+034F, U+0E38>.

I don't know how acceptable this approach is.  Does anyone use it?

The handling of joiners, non-joiners and disruptors (i.e. U+034F) is
yet another topic.

Richard.