Grapheme clusters and east asian width

Richard Wordingham richard.wordingham at ntlworld.com
Thu Sep 17 13:59:04 CDT 2015


On Thu, 17 Sep 2015 19:30:41 +0300
Eli Zaretskii <eliz at gnu.org> wrote:

> > Date: Thu, 17 Sep 2015 17:25:34 +0100
> > From: Daniel Bünzli <daniel.buenzli at erratique.ch>
> > Cc: richard.wordingham at ntlworld.com, unicode at unicode.org
> > 
> > Le jeudi, 17 septembre 2015 à 17:24, Eli Zaretskii a écrit :
> > > > Is there a formal definition of the algorithm used ? This [1]
> > > > is not very helpful.
> > >  
> > > They just use a table of values, AFAIK.
> > 
> > But is it standardized or everyone has its own table ?  
> 
> I don't know, but I'm sure you will find out if you look into the
> glibc sources.  They are publicly available.

Shouldn't that be the locale sources?  That would then make sense, for
ambiguous width is resolved differently in Eastern and Western
traditions.
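
As a rough illustration of what "resolved differently" means, here is a
minimal Python sketch.  It is not glibc's table or algorithm: the
function name char_width and the east_asian_context flag are
hypothetical, and the only real data used is the East_Asian_Width
property exposed by Python's unicodedata module.

```python
import unicodedata

def char_width(ch, east_asian_context=False):
    """Column width of one code point (hypothetical helper, not glibc).

    Wide (W) and Fullwidth (F) characters take two columns. Ambiguous (A)
    characters take two columns only when the caller asks for the East
    Asian convention; everything else is treated as one column.
    """
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        return 2 if east_asian_context else 1
    return 1

# U+03B1 GREEK SMALL LETTER ALPHA has East_Asian_Width=A (ambiguous).
print(char_width("\u03b1"))                           # 1
print(char_width("\u03b1", east_asian_context=True))  # 2
```

The same code point thus gets a different width depending on the
convention chosen, which is why a per-locale table would make sense.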

However, the calculation from single character width to string width is
quite naïve - the widths are just added up, at least in some versions of
glibc!  This doesn't work when a spacing mark decomposes into two spacing
marks - <U+0B95 TAMIL LETTER KA, U+0BCB TAMIL VOWEL SIGN OO> gets a width
of 2, while the canonically equivalent string <U+0B95, U+0BC7 TAMIL VOWEL
SIGN EE, U+0BBE TAMIL VOWEL SIGN AA> gets a width of 3!  This affects the
positioning of text following them in gnome-terminal.
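
The discrepancy can be reproduced with a small Python sketch.  This is
an approximation of a naive per-code-point sum, not the glibc code
itself: naive_width is a hypothetical helper, and its rule (nonspacing,
enclosing and format characters count 0, Wide/Fullwidth count 2,
everything else including spacing marks counts 1) is an assumption made
for the illustration.

```python
import unicodedata

def naive_width(s):
    """Sum per-code-point widths with no notion of grapheme clusters.

    Hypothetical helper: Mn/Me/Cf count as 0 columns, Wide and Fullwidth
    as 2, and everything else (including spacing marks, Mc) as 1.
    """
    total = 0
    for ch in s:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

composed = "\u0b95\u0bcb"          # KA + VOWEL SIGN OO
decomposed = "\u0b95\u0bc7\u0bbe"  # KA + VOWEL SIGN EE + VOWEL SIGN AA

# The two strings are canonically equivalent.
assert unicodedata.normalize("NFD", composed) == decomposed

print(naive_width(composed))    # 2
print(naive_width(decomposed))  # 3
```

In this particular case, normalizing to NFC before summing would make
the two results agree, but that is only an observation about this
example, not a general fix.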

Richard.
