Compatibility decomposition for Hebrew and Greek final letters

Thu Feb 19 13:31:07 CST 2015

The decompositions are not needed for plain-text searches, which can use the
collation data instead. With the collation data, you can ignore differences
such as capitalisation and diacritics at the primary level, map some groups
of base letters to a single entry, or turn a diacritic into a significant
primary difference (for example, in German, equating 'ae' and 'ä' at the
primary level).
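
To make this concrete, here is a rough Python sketch of a "primary-level"
match key. It is not the UCA and does not read any CLDR collation data; it
only approximates the idea with case folding plus removal of combining
marks. The names crude_primary_key and GERMAN_TAILORING are made up for
this illustration, and the tailoring table is an assumption, not real
locale data.

```python
import unicodedata

# Hypothetical tailoring table (illustration only, not CLDR data):
# treat the umlauted vowels as their two-letter spellings.
GERMAN_TAILORING = {"ä": "ae", "ö": "oe", "ü": "ue"}

def crude_primary_key(text, tailoring=None):
    """Return a rough match key that ignores case and diacritics."""
    # Compose first so tailoring keys such as 'ä' match as single characters,
    # then fold case (this removes capitalisation differences).
    folded = unicodedata.normalize("NFC", text).casefold()
    # Apply the locale-specific replacements before diacritics are dropped.
    for source, target in (tailoring or {}).items():
        folded = folded.replace(source, target)
    # The canonical decomposition separates base letters from combining marks.
    decomposed = unicodedata.normalize("NFD", folded)
    # Drop the combining marks (this removes diacritic differences).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(crude_primary_key("Ärger"))                    # untailored
print(crude_primary_key("Ärger", GERMAN_TAILORING))  # tailored
print(crude_primary_key("ΛΌΓΟΣ") == crude_primary_key("λόγος"))
```

Note that only the canonical decomposition (NFD) is used here. The Greek
final sigma is still matched with the regular sigma, because Unicode case
folding maps one to the other; no compatibility decomposition is involved.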

Yes, collation must use the canonical decompositions, but it does not need
to follow the compatibility decompositions for all locales. This is done for
the root locale and the DUCET, with some exceptions that consider the rules
of the most important language using an encoded letter and all its
*canonical* equivalents.
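
The canonical/compatibility distinction can be inspected directly in the
UCD. The following Python sketch uses the standard unicodedata module; the
helper name decomposition_kind is only an illustration. It relies on the
UCD convention that compatibility mappings carry a formatting tag such as
<final>, while canonical mappings have none.

```python
import unicodedata

def decomposition_kind(ch):
    """Classify the decomposition mapping listed in the UCD for one character.

    Caveat: Hangul syllables decompose algorithmically and have no mapping
    listed, so this simple check reports "none" for them.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag in angle brackets.
    return "compatibility" if mapping.startswith("<") else "canonical"

for ch in ("\u00E4",   # LATIN SMALL LETTER A WITH DIAERESIS
           "\uFE8E",   # ARABIC LETTER ALEF FINAL FORM
           "\u03C2",   # GREEK SMALL LETTER FINAL SIGMA
           "\u05DA"):  # HEBREW LETTER FINAL KAF
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), decomposition_kind(ch))
```

With the data shipped in current Python versions, the Arabic presentation
form has a compatibility mapping, while the Greek final sigma and the
Hebrew final kaf have no decomposition mapping of either kind.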

Compatibility decompositions in the UCD have little use: the characters that
carry them should be preserved in encoded texts and in transformations of
text. The decompositions are just suggestions which *may* be useful:
- for rendering text (the most important use is in character mappings
within fonts, or in fallback mappings implemented in the rendering engine),
- or for mappings to legacy encodings (e.g. when converting to GSM for SMS
services, or converting for display in text-only devices and terminals
using a limited OEM charset).
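
As an illustration of the second use, a converter to a legacy charset could
consult the compatibility decomposition only as a fallback, for characters
that the target charset cannot represent, and leave everything else alone.
This is a minimal Python sketch under that assumption; encode_with_fallback
is a made-up name, and latin-1 merely stands in for a limited charset
(Python's standard codecs do not include GSM 03.38).

```python
import unicodedata

def encode_with_fallback(text, encoding):
    """Encode text, using NFKD only for characters the charset lacks."""
    out = bytearray()
    for ch in text:
        try:
            # Characters the charset supports are kept exactly as encoded.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Fallback: try the compatibility decomposition of this one
            # character; whatever still cannot be encoded becomes '?'.
            fallback = unicodedata.normalize("NFKD", ch)
            out += fallback.encode(encoding, errors="replace")
    return bytes(out)

# U+FB01 LATIN SMALL LIGATURE FI is not in latin-1, U+00E4 is.
print(encode_with_fallback("\uFB01n \u00E4", "latin-1"))
```

The source text itself is never normalized; the decomposition is applied
only at the conversion boundary, character by character.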

2015-02-19 12:59 GMT+01:00 Eli Zaretskii <eliz at gnu.org>:

> > Date: Thu, 19 Feb 2015 11:47:24 GMT
> > From: Julian Bradfield <jcb+unicode at inf.ed.ac.uk>
> >
> > In Arabic, the variant of a letter is determined entirely by its
> > position, so there is no compelling need to represent the forms
> > separately (as characters rather than glyphs) save for the existence of
> > legacy standards (and if there is, you can use the ZWJ/ZWNJ hacks). Thus the
> > forms would not have been encoded but for the legacy standards.
> > Whereas in Hebrew, non-final forms appear finally in certain contexts
> > in normal text; and in Greek, while Greek text may have a determinate
> > choice between σ and ς, there are many contexts where the two symbols
> > are distinguished (not least maths).
>
> Got it, thanks.