Compatibility decomposition for Hebrew and Greek final letters

Markus Scherer markus.icu at gmail.com
Fri Feb 20 11:49:20 CST 2015


On Thu, Feb 19, 2015 at 11:51 PM, Eli Zaretskii <eliz at gnu.org> wrote:

> I think decomposition to NFKD solves these issues, doesn't it?
>

Not completely. Judging from your question, you expected more mappings than
NFKD has. You might want to try the mappings that are used as input for
deriving the DUCET (Default Unicode Collation Element Table):
http://www.unicode.org/Public/UCA/latest/decomps.txt
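
To illustrate the gap: NFKD leaves the Greek final sigma and the Hebrew final letters unchanged, so a search that wants to treat final and non-final forms alike needs an extra folding step. The following is only a minimal sketch using Python's standard unicodedata module. The table FINAL_FORM_FOLD and the function names are hypothetical, made up for this example; they are not taken from decomps.txt or from any Unicode data file.

```python
import unicodedata

# Hypothetical folding table (an assumption for illustration only):
# maps each final form to its regular counterpart.
FINAL_FORM_FOLD = {
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u05da": "\u05db",  # HEBREW LETTER FINAL KAF -> KAF
    "\u05dd": "\u05de",  # HEBREW LETTER FINAL MEM -> MEM
    "\u05df": "\u05e0",  # HEBREW LETTER FINAL NUN -> NUN
    "\u05e3": "\u05e4",  # HEBREW LETTER FINAL PE -> PE
    "\u05e5": "\u05e6",  # HEBREW LETTER FINAL TSADI -> TSADI
}


def nfkd_leaves_unchanged(ch):
    """Return True if NFKD has no mapping for this character."""
    return unicodedata.normalize("NFKD", ch) == ch


def search_key(text):
    """Apply NFKD, then fold final letter forms via the table above."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(FINAL_FORM_FOLD.get(ch, ch) for ch in decomposed)
```

With this, nfkd_leaves_unchanged() holds for every key of the table, while search_key() gives the same key for a word spelled with a final form and with the regular form. Whether such folding is desirable depends on the search use case.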

For a character-based search, you should still try to work with canonical
equivalence, for example by applying the FCD check and normalizing when
that fails. http://www.unicode.org/notes/tn5/
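
A rough sketch of that "check FCD, normalize only on failure" idea, assuming Python's standard unicodedata module. The function names are hypothetical, and this is a simplified reading of the check described in UTN #5, not ICU's implementation: a string passes if, for each character whose decomposition starts with a non-zero combining class, that class is not lower than the trailing combining class of the previous character's decomposition.

```python
import unicodedata


def _lead_trail_ccc(ch):
    """Combining classes of the first and last code points of ch's
    full canonical decomposition."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(nfd[0]), unicodedata.combining(nfd[-1])


def is_fcd(text):
    """Simplified FCD check: fail when a character's leading combining
    class is non-zero and lower than the previous trailing class."""
    prev_trail = 0
    for ch in text:
        lead, trail = _lead_trail_ccc(ch)
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True


def ensure_fcd(text):
    """Return text unchanged if it passes the check, else its NFD form."""
    return text if is_fcd(text) else unicodedata.normalize("NFD", text)
```

For example, a precomposed U+00E1 on its own passes the check even though it is not in NFD, so it is left alone; U+00E1 followed by U+0323 (dot below) fails and gets normalized. The point of the design is that most real text passes the cheap check and never pays for normalization.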

> Thanks.  I've studied that already, and I do know that collation data
> can be used for search.  But it's still a lot of data that I'd like to
> avoid loading, if possible.
>

Sure, as I said, it depends on what you need and want.

FYI, the ICU data file corresponding to the DUCET is about 160 kB (for UCA
7.0) and could be reduced if limited to one specific use case, but the
collation and string-search code is large and complex.

Best regards,
markus