Regular Expressions and Canonical Equivalence

Thu May 14 09:08:14 CDT 2015

Richard Wordingham <richard dot wordingham at ntlworld dot com> wrote:

> For example, I believe that one should be able to find
> [...]
> the Vietnamese letter ô U+00F4 LATIN SMALL LETTER O WITH
> CIRCUMFLEX in the word _buộc_ 'to bind' <U+0062, U+0075, U+1ED9 LATIN
> SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW, U+0063>.  As far as I
> can tell, U+1ED9 is not a letter of the Vietnamese alphabet; it is the
> combination <U+00F4 LATIN SMALL LETTER O WITH CIRCUMFLEX, U+0323
> COMBINING DOT BELOW> of Vietnamese letter and tone mark.

What you're looking for in this case is neither an NFC match nor an NFD
match, but a language-dependent match, as you imply further down. <1ED9>
decomposes to <006F 0323 0302>, because canonical ordering puts the dot
below (combining class 220) ahead of the circumflex (combining class
230). If you want a match with <00F4>, which decomposes to <006F 0302>,
your regex engine has to reorder the marks, or at least skip over the
intervening <0323>. It sounds unlikely that you'll find such an engine,
but there is a lot of Vietnamese-language–specific software out there,
so you never know.
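
To make the decompositions concrete, here is a rough Python sketch using
only the standard unicodedata module. It prints nothing by itself; the
helper names (nfd, hex_seq, clusters, contains_letter) are made up for
this illustration and are not part of any regex engine. It assumes the
letter being searched for is a single base character plus combining
marks, and it is only one possible reading of "reorder the marks", not a
description of how any existing engine behaves.

```python
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

def hex_seq(s):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

def clusters(s):
    """Split an NFD string into runs of one starter plus trailing marks."""
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch      # non-zero combining class: attach to the run
        else:
            out.append(ch)     # starter (class 0) begins a new run
    return out

def contains_letter(text, letter):
    """Hypothetical helper: True if some run in text is canonically
    equivalent to letter followed by zero or more extra marks."""
    target = nfd(letter)       # assumed: one base character plus marks
    for cluster in clusters(nfd(text)):
        if cluster[0] != target[0]:
            continue
        rest = list(cluster[1:])
        try:
            for mark in target[1:]:
                rest.remove(mark)   # take the letter's marks out of the run
        except ValueError:
            continue                # the run lacks one of the marks
        # Put the leftover marks after the letter and renormalize; if that
        # reproduces the run, the two are canonically equivalent.
        if nfd(target + "".join(rest)) == cluster:
            return True
    return False
```

With these definitions, hex_seq(nfd("\u1ED9")) gives "006F 0323 0302" and
hex_seq(nfd("\u00F4")) gives "006F 0302". A plain substring test such as
nfd("\u00F4") in nfd("bu\u1ED9c") is False, because <0323> sits between
the base letter and the circumflex, while
contains_letter("bu\u1ED9c", "\u00F4") is True.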

--
Doug Ewell | http://ewellic.org | Thornton, CO