Compatibility decomposition for Hebrew and Greek final letters

Eli Zaretskii eliz at gnu.org
Fri Feb 20 01:51:45 CST 2015


> Date: Thu, 19 Feb 2015 13:08:57 -0800
> From: Markus Scherer <markus.icu at gmail.com>
> Cc: Philippe Verdy <verdy_p at wanadoo.fr>, Julian Bradfield <jcb+unicode at inf.ed.ac.uk>, 
> 	Unicode Mailing List <unicode at unicode.org>
> 
>     Sorry, I disagree. First, collation data is overkill for search,
>     since the order information is not required, so the weights are simply
>     wasting storage. Second, people do want to find, e.g., "²" when they
>     search for "2" etc.
> 
> Depends on what you do.

The context is text search, where the user enters the search string
and specifies the strength of the required matches, and the editor
then searches a (potentially very large) buffer of text.

> "the weights are simply wasting storage" is not really
> true, you do have to encode something for which characters are same or
> different, and it turns out that that comes close to defining a sort order.
> Some people also want to ignore accents, others don't.

I think decomposition to NFKD solves these issues, doesn't it?
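
To make that concrete, here is a minimal sketch in Python of what
NFKD-based folding could look like.  It is only an illustration, not
the editor's code: the helper name fold_for_search and its
ignore_accents flag are made up for this example, and Python's
standard unicodedata module stands in for whatever normalization
tables the editor would actually load.

```python
import unicodedata


def fold_for_search(text, ignore_accents=True):
    """Hypothetical helper: reduce text to a search key via NFKD."""
    # NFKD applies canonical and compatibility decompositions,
    # e.g. SUPERSCRIPT TWO becomes the plain digit "2".
    decomposed = unicodedata.normalize("NFKD", text)
    if ignore_accents:
        # Drop combining marks (characters with a nonzero canonical
        # combining class), so that "e" + COMBINING ACUTE becomes "e".
        decomposed = "".join(
            ch for ch in decomposed if unicodedata.combining(ch) == 0
        )
    return decomposed


# Searching for "2" can now match SUPERSCRIPT TWO.
superscript_matches = fold_for_search("\u00B2") == fold_for_search("2")

# The final forms have no compatibility decomposition, so this folding
# alone still keeps FINAL MEM apart from MEM and FINAL SIGMA from SIGMA.
final_mem_matches = fold_for_search("\u05DD") == fold_for_search("\u05DE")
final_sigma_matches = fold_for_search("\u03C2") == fold_for_search("\u03C3")
```

With this sketch, superscript_matches is true while final_mem_matches
and final_sigma_matches are false, which is the gap that prompted my
original question about the final letters.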

> As to your original question, Unicode collation would give you primary-equal
> "mem" and "sigma" characters.
> 05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
> FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
> 05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
> FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH
> 
> 03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
> 03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
> 1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
> ...
> 03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
> 
> You can certainly simplify a few things when you don't care about the order,
> therefore CLDR defines "search" tailorings. Some popular browsers use
> collation-based search for ctrl-F in-page search, either with strength=primary
> (ignore accent/case/etc. variants), or with asymmetric search. ICU implements
> those algorithms and carries the CLDR tailorings.
> 
> See http://www.unicode.org/reports/tr10/#Searching

Thanks.  I've studied that already, and I do know that collation data
can be used for search.  But it's still a lot of data that I'd like to
avoid loading, if possible.

