Compatibility decomposition for Hebrew and Greek final letters

Markus Scherer markus.icu at gmail.com
Thu Feb 19 15:08:57 CST 2015


On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii <eliz at gnu.org> wrote:

> Sorry, I disagree.  First, collation data is overkill for search,
> since the order information is not required, so the weights are simply
> wasting storage.  Second, people do want to find, e.g., "²" when they
> search for "2" etc.
>

That depends on what you do. "The weights are simply wasting storage" is not
really true: you do have to encode which characters are the same and which
are different, and it turns out that this comes close to defining a sort
order. Some people also want to ignore accents; others don't.

As to your original question, Unicode collation would give you
primary-equal "mem" and "sigma" characters.
05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH

03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
...
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
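
To make "primary-equal" concrete, here is a minimal Python sketch that builds
sort keys from the bracketed weights quoted above, roughly the way UTS #10
describes: all non-zero primary weights, then a level separator, then the
secondary weights, and so on. The names CE and sort_key are made up for this
illustration, the table holds only the eight characters listed, and this is
not how ICU stores or compares its data.

```python
# Collation elements (primary, secondary, tertiary), copied from the
# bracketed [pppp.ssss.tttt] weights in the lines quoted above.
# CE and sort_key are hypothetical names used only for this sketch.
CE = {
    "\u05DE": [(0x1F81, 0x0020, 0x0002)],      # HEBREW LETTER MEM
    "\uFB26": [(0x1F81, 0x0020, 0x0005)],      # HEBREW LETTER WIDE FINAL MEM
    "\u05DD": [(0x1F81, 0x0020, 0x0019)],      # HEBREW LETTER FINAL MEM
    "\uFB3E": [(0x1F81, 0x0020, 0x0002),       # HEBREW LETTER MEM WITH DAGESH
               (0x0000, 0x005F, 0x0002)],
    "\u03C3": [(0x1C95, 0x0020, 0x0002)],      # GREEK SMALL LETTER SIGMA
    "\u03F2": [(0x1C95, 0x0020, 0x0004)],      # GREEK LUNATE SIGMA SYMBOL
    "\U0001D6D3": [(0x1C95, 0x0020, 0x0005)],  # MATHEMATICAL BOLD SMALL FINAL SIGMA
    "\u03C2": [(0x1C95, 0x0020, 0x0019)],      # GREEK SMALL LETTER FINAL SIGMA
}


def sort_key(text, strength=3):
    """Build a sort key from the first `strength` levels (1=primary, 3=tertiary).

    Raises KeyError for characters outside the tiny table above.
    """
    elements = [ce for ch in text for ce in CE[ch]]
    key = []
    for level in range(strength):
        if level:
            key.append(0)  # level separator between primary/secondary/tertiary
        # Zero weights are ignorable at this level and are skipped.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


# Mem and final mem are equal at primary strength, different at tertiary.
print(sort_key("\u05DE", 1) == sort_key("\u05DD", 1))  # True
print(sort_key("\u05DE") == sort_key("\u05DD"))        # False
```

Mem with dagesh also comes out primary-equal to plain mem, because the
primary weight of its second collation element is zero; the two differ only
from the secondary level on.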

You can certainly simplify a few things when you don't care about the
order, which is why CLDR defines "search" tailorings. Some popular browsers
use collation-based search for ctrl-F in-page search, either with
strength=primary (ignoring accent/case/etc. variants) or with asymmetric
search. ICU implements those algorithms and carries the CLDR tailorings.
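
As a rough illustration of what strength=primary search means, the Python
sketch below maps each character to its primary weight and matches the
pattern against the text on those weights alone. PRIMARY and find_primary
are hypothetical names, the table holds only four of the characters quoted
above, and characters outside it simply fall back to their code point, which
is an assumption of this sketch and not the UCA implicit-weight rule. Real
implementations such as ICU also handle ignorable elements, contractions,
and match boundaries, none of which is attempted here.

```python
# Primary weights taken from the quoted data; everything else in this
# sketch (names, fallback rule) is an illustrative assumption.
PRIMARY = {
    "\u05DE": 0x1F81,  # HEBREW LETTER MEM
    "\u05DD": 0x1F81,  # HEBREW LETTER FINAL MEM
    "\u03C3": 0x1C95,  # GREEK SMALL LETTER SIGMA
    "\u03C2": 0x1C95,  # GREEK SMALL LETTER FINAL SIGMA
}


def primaries(text):
    """Map each character to a comparable primary-level token."""
    return [
        ("w", PRIMARY[ch]) if ch in PRIMARY else ("cp", ord(ch))
        for ch in text
    ]


def find_primary(text, pattern):
    """Return every index where pattern matches text at primary strength."""
    t, p = primaries(text), primaries(pattern)
    if not p:
        return []
    return [i for i in range(len(t) - len(p) + 1) if t[i:i + len(p)] == p]


# Searching for final sigma finds both the initial and the final sigma.
print(find_primary("\u03C3\u03BF\u03C6\u03BF\u03C2", "\u03C2"))  # [0, 4]
# Searching for mem finds the final mem at the end of the Hebrew word.
print(find_primary("\u05E9\u05DC\u05D5\u05DD", "\u05DE"))        # [3]
```

Because every character here maps to exactly one token, the returned
indexes are also string indexes; with real collation data that mapping is
not one-to-one.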

See http://www.unicode.org/reports/tr10/#Searching

Best regards,
markus