Wild Card Collation Matches

Richard Wordingham richard.wordingham at ntlworld.com
Sun Jun 1 20:36:14 CDT 2014


In a fairly wild environment
(http://www.thaivisa.com/forum/topic/730564-new-front-end-to-ri-dictionary-alpha),
I encountered the following question:

"If you search for ก* do you expect to return words such as เก่ง and
ไก่?"

Now, as a regular expression, in UTS#18 'Unicode Regular Expressions'
Version 13 (dated 2008, superseded in 2012), RL3.5 comes pretty close
to this with ranges tailored for collation.  The pattern
[\u0E01-\u0E02]* would match both those words.  To be precise, one
would use a search for [ก-ไก]*.  RL3.5 has been withdrawn because
of difficulties, though I can't say that I see it as a major difficulty
that at least one of [A-z] and [a-Z] is empty.  Even POSIX is aware of
that little issue.
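
For contrast, here is a minimal Python sketch of what plain code point
order does with the same question.  The standard `re` and `fnmatch`
modules know nothing about collation, so this shows only the untailored
behaviour and is not an implementation of RL3.5.  The variable names are
mine.

```python
import fnmatch
import re

words = ["\u0E40\u0E01\u0E48\u0E07", "\u0E44\u0E01\u0E48"]  # เก่ง, ไก่

# Stored (logical) order puts the preposed vowel first, so a code point
# prefix test for U+0E01 KO KAI fails on both words.
prefix_hits = [w for w in words if fnmatch.fnmatchcase(w, "\u0E01*")]

# In code point order 'A' (U+0041) precedes 'z' (U+007A), so [A-z] is a
# non-empty range while [a-Z] is empty; Python rejects the latter outright.
upper_to_lower_ok = re.fullmatch("[A-z]", "m") is not None
try:
    re.compile("[a-Z]")
    lower_to_upper_ok = True
except re.error:
    lower_to_upper_ok = False
```

Neither word is found by the code point prefix test, which is why the
question only becomes interesting once collation order is brought in.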

Turning to a fully collation-based definition of searches, UTS#10
Unicode Collation Algorithm's definition DS2 comes closest to answering
the question for the UTC. DS2 reads:

DS2. The pattern string P has a match at Q[s,e] according to collation
C if C generates the same sort key for P as for Q[s,e], and the offsets
s and e meet the boundary condition B. One can also say P has a match
in Q according to C.
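
Read literally, DS2 can be sketched as a brute-force search.  The
following Python is only an illustration under assumptions of mine: the
function name `ds2_matches` is hypothetical, offsets are Python string
indices with Q[s,e] taken as the half-open slice text[s:e], and the
collation C and boundary condition B are passed in as callables.  The
stand-ins at the end are deliberately trivial and are not a real
collation.

```python
from typing import Callable, Hashable, List, Tuple


def ds2_matches(
    pattern: str,
    text: str,
    sort_key: Callable[[str], Hashable],
    boundary_ok: Callable[[str, int, int], bool],
) -> List[Tuple[int, int]]:
    """Return every (s, e) where text[s:e] has the same sort key as
    pattern and the offsets satisfy the boundary condition."""
    target = sort_key(pattern)
    hits = []
    for s in range(len(text) + 1):
        for e in range(s, len(text) + 1):
            if sort_key(text[s:e]) == target and boundary_ok(text, s, e):
                hits.append((s, e))
    return hits


# Stand-in for C: case-insensitive comparison only (an assumption).
toy_key = str.casefold


# Stand-in for B: every offset is acceptable (an assumption).
def any_offset(text: str, s: int, e: int) -> bool:
    return True


example = ds2_matches("ab", "xABy", toy_key, any_offset)
```

Note that DS2 as quoted compares a fixed pattern string P with a
substring; it has no notion of a wild card such as '*'.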

It's an easy job to create sequences of codepoints P starting with
U+0E01 THAI CHARACTER KO KAI that are tertiary matches for เก่ง and
ไก่ under both DUCET and the CLDR collations for Thai.  Can I therefore
say that the two strings match the pattern ก* according to these
collations?  (For ไก่, i.e. <U+0E44 THAI CHARACTER SARA AI MAIMALAI,
U+0E01 THAI CHARACTER KO KAI, U+0E48 THAI CHARACTER MAI EK>, one such
pattern is P = <U+0E01, U+034F COMBINING GRAPHEME JOINER, U+0E44,
U+0E48>.)
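
The effect can be imitated with a toy model in Python.  This is a
sketch under loud assumptions, not the UCA: it has no weights or levels,
it simply swaps a Thai preposed vowel (U+0E40..U+0E44) with a directly
following consonant (U+0E01..U+0E2E) and then drops the joiner U+034F
as if it were ignorable.  The names `toy_key`, `is_prevowel` and
`is_consonant` are hypothetical.

```python
CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, treated here as ignorable


def is_prevowel(ch: str) -> bool:
    # U+0E40..U+0E44: Thai vowels stored before their consonant
    return "\u0E40" <= ch <= "\u0E44"


def is_consonant(ch: str) -> bool:
    # U+0E01..U+0E2E: KO KAI .. HO NOKHUK
    return "\u0E01" <= ch <= "\u0E2E"


def toy_key(s: str) -> tuple:
    """Toy stand-in for a sort key: reorder prevowel + consonant pairs,
    then remove the joiner.  Not a real collation key."""
    out = []
    i = 0
    while i < len(s):
        if i + 1 < len(s) and is_prevowel(s[i]) and is_consonant(s[i + 1]):
            out += [s[i + 1], s[i]]  # consonant first, then the vowel
            i += 2
        else:
            out.append(s[i])
            i += 1
    return tuple(c for c in out if c != CGJ)


kai = "\u0E44\u0E01\u0E48"            # ไก่ in stored order
keng = "\u0E40\u0E01\u0E48\u0E07"     # เก่ง in stored order
pattern_p = "\u0E01\u034F\u0E44\u0E48"  # the pattern P given above

same_key = toy_key(kai) == toy_key(pattern_p)
both_start_with_ko_kai = all(toy_key(w)[0] == "\u0E01" for w in (kai, keng))
```

In this toy model P and ไก่ get the same key, and the keys of both
words begin with KO KAI, which is the sense in which they might be said
to match ก*.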

Disturbingly, another possible answer is that there is no match for
<U+0E01 THAI CHARACTER KO KAI> in either string because it only occurs
in the legacy/extended grapheme cluster <U+0E01, U+0E48>.
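
That reading can be checked roughly in Python.  The standard library
has no grapheme cluster segmentation, so the sketch below approximates
a cluster as a base character plus any following characters of general
category Mn or Me; that is an assumption of mine and only a crude
stand-in for the real segmentation rules.  The function names are
hypothetical.

```python
import unicodedata


def rough_clusters(s: str) -> list:
    """Split s into a base character plus following combining marks."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def matches_whole_clusters(needle: str, s: str) -> bool:
    """True if needle equals a run of whole clusters somewhere in s."""
    cl = rough_clusters(s)
    return any(
        "".join(cl[i:j]) == needle
        for i in range(len(cl))
        for j in range(i + 1, len(cl) + 1)
    )


kai = "\u0E44\u0E01\u0E48"         # ไก่
keng = "\u0E40\u0E01\u0E48\u0E07"  # เก่ง

kai_clusters = rough_clusters(kai)
ko_kai_alone = [matches_whole_clusters("\u0E01", w) for w in (kai, keng)]
ko_kai_with_mai_ek = [
    matches_whole_clusters("\u0E01\u0E48", w) for w in (kai, keng)
]
```

With cluster boundaries as the boundary condition, bare KO KAI is found
in neither word, while <U+0E01, U+0E48> is found in both.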

Richard.


