Grapheme clusters and east asian width

Richard Wordingham richard.wordingham at ntlworld.com
Wed Sep 16 14:33:51 CDT 2015


On Wed, 16 Sep 2015 02:45:27 +0100
Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:

> This will delimit a single grapheme cluster, but if I try to add up
> their east asian widths (W, N, N), this would result in 4 columns.

> Does something naïve work, such as looking up only the east asian width
> of the first scalar value in the grapheme cluster and using 2 columns
> if it is F or W and 1 column otherwise? Or are there counterexamples
> where this breaks? Is there anything more clever that can be done?

The silence is a bit worrying, but I can't see why that wouldn't work
for normal text in CJK scripts.  (Hangul LLLLLVVVVTTTT would probably
cause some problems!)
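
For concreteness, the first-scalar-value heuristic quoted above might be
sketched as follows in Python.  This is only an illustration under
assumptions: the names cluster_width and line_width are made up here, the
input is assumed to be already segmented into grapheme clusters (the
standard library does not do that), and unicodedata reflects whichever
Unicode version the Python build ships.

```python
import unicodedata

def cluster_width(cluster):
    """Columns for one grapheme cluster, judged only by its first scalar value.

    Hypothetical helper: returns 2 if the first scalar value has
    East Asian Width F or W, and 1 otherwise.
    """
    if not cluster:
        return 0
    first = cluster[0]
    return 2 if unicodedata.east_asian_width(first) in ("F", "W") else 1

def line_width(clusters):
    """Sum the heuristic widths of pre-segmented grapheme clusters."""
    return sum(cluster_width(c) for c in clusters)

# Assumed pre-segmented input: a CJK ideograph, a Latin letter with a
# combining acute accent, and a fullwidth Latin capital A.
example = ["\u4e2d", "e\u0301", "\uff21"]
total = line_width(example)
```

Note that a Hangul cluster such as L L L V V (say "\u1100\u1100\u1100\u1161\u1161")
gets 2 columns from this sketch however many jamo it contains, which is
exactly where I would expect the trouble to start.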

Have you addressed the issue of Indic scripts?  There are
discontiguous grapheme clusters composed of indecomposable code points
(e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g.
U+0BCA TAMIL VOWEL SIGN OO), and whether consonant + virama +
consonant is one cell or two may even depend on the font (e.g.
Devanagari).  How are you handling ligatures between grapheme clusters,
e.g. English <f, i>?  There are Tamil and Tai Tham examples of
compulsory ligatures, shri and naa.  Looking further ahead, there are
characters in the pipeline that should be either Mc or Mn depending on
what the base consonant is!
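
The decomposable/indecomposable distinction for the two vowel signs
mentioned above can be checked with a small sketch like the one below.
The helper name canonical_parts is hypothetical, and the result depends
on the Unicode data bundled with the Python build in use.

```python
import unicodedata

def canonical_parts(ch):
    """Return the canonical decomposition of ch as a list of characters.

    Hypothetical helper: returns an empty list when ch has no
    decomposition or only a compatibility one (marked with "<...>").
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp or decomp.startswith("<"):
        return []
    return [chr(int(cp, 16)) for cp in decomp.split()]

# U+0BCA TAMIL VOWEL SIGN OO splits into two parts, U+0BC6 and U+0BBE.
tamil_oo = canonical_parts("\u0bca")

# U+17C4 KHMER VOWEL SIGN OO has no decomposition, so the list is empty.
khmer_oo = canonical_parts("\u17c4")
```

Either way the cluster is a single unit for segmentation, so a
column-counting scheme cannot rely on decomposition to tell how many
cells the rendered result takes.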

You have dealt with grapheme clusters with a width of one cell and a
depth of two, haven't you?  Actually, there's a good argument for some
grapheme clusters occupying cells above and below the line!

Richard.


