precomposed characters (was: Unicode organization is still anti-Serbian and anti-Macedonian)

Sun Feb 16 07:13:29 CST 2014

On Sat, 15 Feb 2014 19:39:59 +0100
"Janusz S. Bien" <jsbien at mimuw.edu.pl> wrote:

> Quote/Cytat - Richard Wordingham <richard.wordingham at ntlworld.com>  
> (Sat 15 Feb 2014 07:25:51 PM CET):
> > Each precomposed character adds a small processing
> > overhead to an extremely large number of computers, not just to the
> > computers that actually use it.

> This is a very strong claim. Would you be so kind as to elaborate?

The following need to be stored simply because the character has been
assigned:

name (typically for character pick-lists)
script (typically for breaking text runs by script)
casing (upper/lower/titlecase)
collation properties (not strictly necessary)

There are many other properties, but many of them will often be covered
by default rules and may not need to be stored explicitly.
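
As a rough illustration of the per-character data listed above, here is
a minimal Python sketch using the standard unicodedata module.  The
helper name stored_properties is my own invention, and the script
property is omitted because that module does not expose it; the sketch
is only meant to show that a lookup exists for every assigned
character, whether or not anyone uses it.

```python
import unicodedata

def stored_properties(ch):
    """Collect a few per-character properties a Unicode library must hold.

    Hypothetical helper for illustration only; real libraries store
    many more properties, usually in compressed tables.
    """
    return {
        "name": unicodedata.name(ch, ""),           # used by character pick-lists
        "category": unicodedata.category(ch),       # general category, e.g. "Ll"
        "upper": ch.upper(),                        # casing data
        "lower": ch.lower(),
        "decomposition": unicodedata.decomposition(ch),
    }

# U+00E9 LATIN SMALL LETTER E WITH ACUTE, a typical precomposed character
props = stored_properties("\u00e9")
for key, value in props.items():
    print(key, repr(value))
```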

The only likely subsetting options I can think of would be to not
support the supplementary planes or to not support CJK characters.
This data will be moved when an operating system is installed, and the
files are liable to be moved or replaced at other times.  I will concede
that it is possible that this information may not need to be moved from
disk to memory - the data is likely to be ordered by codepoint, and if
nearby codepoints are never used either, it will not need to be loaded.

Some data files are mapped to memory, but unfortunately I can't
comment on the processing overhead of increasing their size if the
additional data is not accessed.

The operation that will be most significantly affected is
composition.  I am assuming that composition information will be used
even in the presence of a composition exclusion, e.g. to select the
best glyph from a font.  (One could optimise this away by potentially
rendering the canonical decomposition of a precomposed character
differently to the precomposed character.)  The composition data,
consisting of the pairs of characters to which precomposed characters
decompose, will be stored in codepoint order of the decomposition.  The
net effect of this is that the existence of unused composition data
will increase the number of cache misses, and thus increase the amount
of processing required.
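
To make the layout concrete, here is a sketch in Python of a
composition table ordered by the codepoints of the decomposition and
searched by bisection.  This is an assumption about how such a table
might be organised, not a description of any particular library; the
names build_composition_table and compose_pair are hypothetical.  The
table deliberately includes pairs subject to composition exclusion,
matching the assumption made above.

```python
import bisect
import sys
import unicodedata

def build_composition_table():
    """Return canonical pair decompositions as sorted
    (first, second, composed) tuples of codepoints."""
    table = []
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters with no decomposition, and compatibility
        # decompositions, which carry a "<tag>" prefix.
        if not decomp or decomp.startswith("<"):
            continue
        parts = decomp.split()
        if len(parts) == 2:
            table.append((int(parts[0], 16), int(parts[1], 16), cp))
    table.sort()  # codepoint order of the decomposition
    return table

def compose_pair(table, first, second):
    """Binary-search the table for the pair; return the precomposed
    character, or None if no such composition exists."""
    key = (ord(first), ord(second))
    i = bisect.bisect_left(table, key + (0,))
    if i < len(table) and table[i][:2] == key:
        return chr(table[i][2])
    return None

table = build_composition_table()
print(repr(compose_pair(table, "e", "\u0301")))  # e + COMBINING ACUTE ACCENT
```

Every additional precomposed character adds one more entry to this
table, so the entries a given user actually needs are spread over more
memory, whether or not the new entries are ever looked up.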

If there is not a separate store of compositions not subject to
composition exclusion, then the same effect will occur whenever a
composition happens as part of the transform of a character string to
NFC or NFKC, e.g. in the processing of a non-ASCII internet domain name.
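
A small Python sketch of the distinction, using the standard
unicodedata.normalize function.  The helper to_nfc and the sample
strings are mine, chosen only for illustration: the first pair composes
under NFC, while the second is subject to composition exclusion and so
stays decomposed even though a precomposed character (U+0958) exists.

```python
import unicodedata

def to_nfc(text):
    """Normalise text to NFC (hypothetical thin wrapper for illustration)."""
    return unicodedata.normalize("NFC", text)

# A label typed with a combining accent, as might occur in a
# non-ASCII domain name: "e" + U+0301 composes to U+00E9.
composed_label = to_nfc("cafe\u0301")

# U+0915 U+093C is the canonical decomposition of U+0958, but U+0958 is
# a composition exclusion, so NFC leaves the pair as it is ...
excluded_pair = to_nfc("\u0915\u093c")

# ... and even decomposes the precomposed character when it meets it.
excluded_char = to_nfc("\u0958")

print(ascii(composed_label), ascii(excluded_pair), ascii(excluded_char))
```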

If data access is not carefully optimised, there will be many more
occasions when unused decompositions will nevertheless add to the
processing load.

Richard.