Compatibility decomposition for Hebrew and Greek final letters

Richard Wordingham richard.wordingham at
Fri Feb 20 09:01:34 CST 2015

On Fri, 20 Feb 2015 10:04:32 +0200
Eli Zaretskii <eliz at> wrote:

> > Date: Thu, 19 Feb 2015 22:02:57 +0000
> > From: Richard Wordingham <richard.wordingham at>
> > 
> > > First, collation data is overkill for search,
> > > since the order information is not required, so the weights are
> > > simply wasting storage.
> > 
> > The big waste is not in text-dependent storage, but in the
> > processing for search orders that bear little relationship to
> > alphabetical order.
> 
> Sorry, I don't think I follow: what is "processing for search orders"
> to which you allude here?

The examples in the CLDR root locale and in DUCET are the massive sets
of 'contractions' of consonants with vowels written before the
associated consonant in the scripts where spacing characters are stored
in the order written, namely Thai, Lao, Tai Viet and, soon, New Tai
Lue.  When customised collations are applied, there are enormous sets
for Burmese (in CLDR) and New Tai Lue (not published in CLDR).  Both
of these have 'logical order exception' final consonants.  (The
exception here is that the logical order of characters in a word is not
the order one wants for sorting.)
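
To illustrate what those vowel-consonant contractions achieve, here is a
minimal sketch.  It is not how DUCET or CLDR do it: they list a
contraction for each vowel-consonant pair, whereas this simply swaps the
two characters before comparing.  The function names are made up for the
example, and the consonant ranges are rough approximations covering only
Thai and Lao.

```python
# Vowels written (and stored) before the consonant they belong to.
# Thai U+0E40..U+0E44 and Lao U+0EC0..U+0EC4 only; Tai Viet and
# New Tai Lue are left out of this sketch.
PREPENDED_VOWELS = set(range(0x0E40, 0x0E45)) | set(range(0x0EC0, 0x0EC5))


def is_consonant(ch: str) -> bool:
    """Rough test: main Thai and Lao consonant ranges (an approximation)."""
    cp = ord(ch)
    return 0x0E01 <= cp <= 0x0E2E or 0x0E81 <= cp <= 0x0EAE


def sort_order_key(text: str) -> str:
    """Hypothetical helper: move each prepended vowel after its consonant.

    The result is only useful as a comparison key, not as display text.
    """
    chars = list(text)
    i = 0
    while i + 1 < len(chars):
        if ord(chars[i]) in PREPENDED_VOWELS and is_consonant(chars[i + 1]):
            # Swap so the consonant is compared first.
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2
        else:
            i += 1
    return "".join(chars)


# U+0E40 U+0E01 (vowel SARA E stored before KO KAI) versus U+0E02 (KHO KHAI).
words = ["\u0e02", "\u0e40\u0e01"]
by_code_point = sorted(words)
by_consonant = sorted(words, key=sort_order_key)
```

Sorting by raw code points puts the vowel-initial word after every
consonant-initial word; with the key, it sorts under its consonant.  A
table-driven collator needs one contraction per vowel-consonant pair to
get the same effect, which is where the bulk comes from.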

> I'm not talking about localized features, like for "å" to match "aa"
> in Danish locales.  I'm talking about matching strings that are
> equivalent under canonical and compatibility decompositions.

Nor was I.  I was talking about the user interface - commands, menus
and messages.
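
For reference, matching under canonical and compatibility decompositions
can be sketched with nothing but normalisation, no collation weights.
This is only an illustration using Python's standard unicodedata module;
the function names and the 'compatibility' flag are invented for the
example and do not describe any particular editor.

```python
import unicodedata


def fold(text: str, compatibility: bool = True) -> str:
    """Decompose text: NFKD if compatibility matching is wanted, else NFD."""
    form = "NFKD" if compatibility else "NFD"
    return unicodedata.normalize(form, text)


def contains(haystack: str, needle: str, compatibility: bool = True) -> bool:
    """Hypothetical search: substring test on the decomposed strings.

    A plain substring test ignores grapheme boundaries, so a bare "e"
    will also be found inside a decomposed "e" + combining accent.
    """
    return fold(needle, compatibility) in fold(haystack, compatibility)


# Superscript two has a compatibility decomposition to "2".
found_compat = contains("x\u00b2 + y", "2")
found_canonical_only = contains("x\u00b2 + y", "2", compatibility=False)

# Precomposed e-acute matches "e" + combining acute in either mode.
found_accent = contains("caf\u00e9", "cafe\u0301", compatibility=False)
```

With the flag on, a search for "2" finds the superscript; with it off,
only canonically equivalent strings match.  Nothing here is
locale-specific, so "å" would still not match "aa".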

> As for user sophistication, AFAIR, Microsoft Word finds "²" when you
> search for "2" by default, so it sounds like Word considers all users
> sophisticated enough for that.  I think that's a solid enough
> precedent to follow.

But what switches the match off?

