Grapheme clusters and east asian width

Richard Wordingham richard.wordingham at ntlworld.com
Wed Sep 16 20:25:47 CDT 2015


On Wed, 16 Sep 2015 22:34:17 +0100
Daniel Bünzli <daniel.buenzli at erratique.ch> wrote:

> Le mercredi, 16 septembre 2015 à 20:33, Richard Wordingham a écrit :
> > Have you addressed the issue of Indic scripts? There are
> > discontiguous grapheme clusters composed of indecomposable code
> > points (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code
> > points (e.g. U+0BCA TAMIL VOWEL SIGN OO),  
> 
> Not sure I understand what you mean here.

In Khmer, a sequence <KA, sign OO> is rendered with glyphs in the
order /sign E, KA, sign AA/, and in Tamil a sequence <KA, sign OO> is
rendered with the glyphs in the order /sign EE, KA, sign AA/.  All the
glyphs have non-zero advance width.

In both cases <KA, sign OO> splits into two legacy grapheme clusters,
<KA> and <sign OO>, but is a single extended grapheme cluster.

In Tamil, <KA, sign OO> is in NFC but not in NFD, and splits into 

> > and whether consonant + virama + consonant is one cell or two may
> > even depend on the font (e.g. Devanagari).  
> 
> Well anything that is related to font metrics is out of scope from
> the point of view of a tty as I can't get the information.

You asked, "Is there any guidance on how to combine the information
given by grapheme clusters and the east asian width property to do
fixed-width layouts in terminal emulators ?".  From this, I deduced that
you are trying to write a terminal emulator.  Are you actually trying
to work out how a terminal emulator someone else wrote will position
characters?

Whether consonant + virama + consonant is one cell or two isn't a
question of font metrics. For example, consider the sequence <U+0921
DEVANAGARI LETTER DDA, U+094D DEVANAGARI SIGN VIRAMA, U+0921>.  This is
composed of two legacy and extended grapheme clusters, <U+0921, U+094D>
and <U+0921>. In the 'Lohit Hindi' font, the two consonants are arranged
vertically with no other representation of VIRAMA; horizontally, this
is a single cell. In the 'gargi' font, one gets two instances of DDA
side by side, with VIRAMA visible below the first.  Both fonts are
fully compliant with Unicode.

If the terminal you are working with emulates a VT100, I believe it
should be possible to ask it what the current cursor position is.  At
http://www.ccs.neu.edu/research/gpc/VonaUtils/vona/terminal/VT100_Escape_Codes.html ,
the query and response are called getcursor DSR and cursor CPR.
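
Such a dialogue might look roughly like the sketch below. It assumes
Python 3 on a POSIX system with stdin and stdout attached to a terminal
that answers the query; the names parse_cpr, query_cursor and
measured_width are hypothetical helpers, not part of any library. The
query is ESC [ 6 n and the reply has the form ESC [ row ; column R.
There is no timeout, so a terminal that never answers would leave
query_cursor blocked.

```python
import re
import sys
import termios
import tty

DSR_CURSOR = "\x1b[6n"  # Device Status Report: request the cursor position
CPR_PATTERN = re.compile(r"\x1b\[(\d+);(\d+)R")  # Cursor Position Report


def parse_cpr(reply):
    """Return (row, column), both 1-based, from a CPR reply, or None if absent."""
    m = CPR_PATTERN.search(reply)
    return (int(m.group(1)), int(m.group(2))) if m else None


def query_cursor(stream_in=sys.stdin, stream_out=sys.stdout):
    """Ask the terminal where the cursor is (assumes both streams are a tty)."""
    fd = stream_in.fileno()
    saved = termios.tcgetattr(fd)
    try:
        tty.setcbreak(fd)  # unbuffered input, and the reply is not echoed
        stream_out.write(DSR_CURSOR)
        stream_out.flush()
        reply = ""
        while not reply.endswith("R"):
            ch = stream_in.read(1)
            if not ch:  # end of input before a full reply arrived
                break
            reply += ch
    finally:
        termios.tcsetattr(fd, termios.TCSADRAIN, saved)
    return parse_cpr(reply)


def measured_width(text):
    """Columns the cursor advanced while printing text, or None if unknown.

    Assumes the text neither wraps nor moves the cursor to another row.
    """
    start = query_cursor()
    sys.stdout.write(text)
    sys.stdout.flush()
    end = query_cursor()
    if start is None or end is None or start[0] != end[0]:
        return None
    return end[1] - start[1]
```

Calling measured_width on a test string from an interactive session
would report how far that particular terminal moved the cursor, which
is the quantity that matters for layout, whatever properties or font
the emulator consulted to get there.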

> For
> example it seems that U+1F400 to U+1F579 have an east-asian width of
> N but will actually occupy two columns in the built-in osx terminal;
> of course these scalar values are not east asian text per se.

In so far as the property is useful, they probably should be ea=Wide.
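
For reference, a naive width estimate built on the property could look
like the sketch below, assuming Python 3 and its unicodedata module; the
function name naive_columns and the rule itself (two cells for Wide or
Fullwidth, none for nonspacing or enclosing marks, one otherwise) are
illustrative assumptions, not something a terminal is obliged to follow.

```python
import unicodedata


def naive_columns(text):
    """Rough cell count for text on a fixed-width grid.

    Counts 2 for East_Asian_Width W or F, 0 for nonspacing (Mn) and
    enclosing (Me) marks, and 1 for everything else.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # assumed to take no cell of its own
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


print(naive_columns("abc"))
print(naive_columns("\u4e2d\u6587"))
print(naive_columns("\u0921\u094d\u0921"))  # DDA, VIRAMA, DDA
```

For the DDA, VIRAMA, DDA sequence discussed above this estimate gives
two cells, which agrees with one of the two fonts and not the other.
The value reported for code points such as U+1F400 depends on the
version of the Unicode data bundled with the interpreter.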

> Of course the best way would be to be able to hand out a string to
> the tty for it to measure. But then it already seems impossible to
> test whether a terminal is able to handle UTF-8 or not…
 
> Maybe trying to use that east asian width property, was not a good
> idea to start with.

If you're trying to work out what a particular emulator will do, the
starting point is its documentation.  For many, the useful
documentation may turn out to be the source code, which is not always
available. However, a successful dialogue with the terminal would avoid
these problems.  It may even offer a solution to the problems of
terminal size and text wrapping behaviour.

Richard.


