Denoting Abstract Substrings

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jun 28 22:59:46 CDT 2014


I believe it is fairly natural to think of physical sequences of code
units as representatives of equivalence classes corresponding to
abstract strings.  One of the most important such equivalences is
canonical equivalence, though one might want to apply some tailoring -
in which case the equivalence is no longer canonical equivalence
proper.

Given this abstraction, it is natural to want to be able to reference
substrings.  To this end, one may define the substrings of an abstract
string to be the equivalence classes that contain a physical substring
of some physical representative of the original abstract string.
(There is probably no need to restrict oneself to even vaguely
contiguous substrings.)

For example, I might want to express U+00E2 LATIN SMALL LETTER A WITH
CIRCUMFLEX (canonical decomposition <U+0061, U+0302>) as a substring
of a string canonically equivalent to U+1EAD LATIN SMALL LETTER A WITH
CIRCUMFLEX AND DOT BELOW, whose NFD equivalent is <U+0061, U+0323,
U+0302>.  This corresponds to a decomposition into a Vietnamese vowel
(U+00E2) plus a Vietnamese tone mark (U+0323).
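As a quick check - just my own illustration with Python's unicodedata
module, not part of any proposal - the decompositions work out as
claimed:

    import unicodedata

    # NFD of U+1EAD is <U+0061, U+0323, U+0302>: the dot below
    # (ccc 220) sorts before the circumflex (ccc 230).
    assert (unicodedata.normalize('NFD', '\u1EAD')
            == '\u0061\u0323\u0302')

    # U+00E2 decomposes to <U+0061, U+0302>, which is not a
    # contiguous substring of the NFD form above.
    assert unicodedata.normalize('NFD', '\u00E2') == '\u0061\u0302'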

Now, if the canonical decomposition of U+1EAD in UTF-8 is held as x =
<61, CC, A3, CC, 82>, I might, adapting the boundary-based notation,
choose to specify the *abstract* substring <U+00E2> as the
abstract substring x[0:1,3:5].  (This specifies code units 0, 3 and
4.)  If I specify that extracted substrings always contain entire scalar
values, I might, confusingly, abbreviate this notation to x[0, 3].
However, if U+1EAD is held as a single scalar value in UTF-8 string y =
<E1, BA, AD>, I want a notation that says, 'Take the first and third
components of the NFD equivalent of the scalar value held starting at
offset 0', e.g. "y[0.1, 0.3]".
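As a sketch of how the byte-range form might be resolved (the function
and its argument convention are my own invention for illustration):

    def abstract_substring(utf8_bytes, ranges):
        # Concatenate the selected code-unit ranges, e.g.
        # [(0, 1), (3, 5)] for x[0:1,3:5].  Each range is assumed
        # to cover entire scalar values, so the result decodes
        # cleanly as UTF-8.
        return b''.join(utf8_bytes[s:e] for s, e in ranges).decode('utf-8')

    x = b'\x61\xcc\xa3\xcc\x82'   # NFD of U+1EAD in UTF-8
    # x[0:1,3:5] yields <U+0061, U+0302>, i.e. the abstract
    # substring <U+00E2>.
    assert abstract_substring(x, [(0, 1), (3, 5)]) == '\u0061\u0302'

The scalar-value form y[0.1, 0.3] would additionally have to decompose
the scalar value at the given offset before indexing into its NFD
expansion.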

Using my own notation is likely to cause confusion - are there any
shared, workable schemes in use?  I'm putting together a demonstration
regular expression engine that works on 'traces' (see
http://en.wikipedia.org/wiki/Trace_monoid for the definition of a
'trace') rather than strings, but if it is to be 'useful' I see no
reason to restrict it to searching text in NFD.  I'm currently working
on
capturing subexpressions.
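For concreteness, the independence relation I have in mind is the
standard one for canonical equivalence: two characters commute exactly
when both have non-zero and distinct canonical combining classes.  A
toy normal-form computation along these lines (my own sketch in
Python, not the engine itself) is:

    import unicodedata

    def trace_normal_form(decomposed):
        # Within each run of non-starters, stable-sort by canonical
        # combining class.  Marks with distinct non-zero ccc commute;
        # all other adjacent pairs are order-significant.  This is
        # exactly NFD's canonical reordering step.
        out, run = [], []
        for c in decomposed:
            if unicodedata.combining(c):
                run.append(c)
            else:
                out.extend(sorted(run, key=unicodedata.combining))
                out.append(c)
                run = []
        out.extend(sorted(run, key=unicodedata.combining))
        return ''.join(out)

    # <a, circumflex, dot below> and <a, dot below, circumflex>
    # are representatives of the same trace:
    assert (trace_normal_form('a\u0302\u0323')
            == trace_normal_form('a\u0323\u0302'))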

My hope is that we will ultimately have regular expression
engines that fully grok canonical equivalence.  When RL2.1 in UTS#18
"Unicode Regular Expressions" last looked usable as a specification
(Version 13, 2008), the requirement that "an implementation shall
provide a mechanism for ensuring that all canonically equivalent
literal characters match" was too weak for my needs as a user.
Concatenation of expressions is rather more complicated for traces than
for strings, though still within the scope of mathematical regular
expressions, where 'regular' means recognisable by a finite automaton.
There are issues with Unicode sets - does "K" match "\p{ASCII} &
\p{block=Letterlike Symbols}"?  (The simplest solution seems to be to
exclude codepoints with singleton decompositions, such as U+212A KELVIN
SIGN, from the set of scalar values in Unicode sets.)
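To make the KELVIN SIGN example concrete (again, just an illustration
with Python's unicodedata):

    import unicodedata

    # U+212A KELVIN SIGN has the singleton decomposition <U+004B>,
    # so "K" and U+212A are canonically equivalent ...
    assert unicodedata.normalize('NFD', '\u212A') == 'K'

    # ... yet U+004B is in \p{ASCII} and U+212A is in
    # \p{block=Letterlike Symbols}, so under full canonical
    # equivalence "K" would satisfy both properties at once.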

As an aside, I'd have liked to add fully decomposed Unicode strings
under canonical equivalence to the Wikipedia article as an example of
traces, but I couldn't find a source.

Richard.

