Grapheme clusters and east asian width

Daniel Bünzli daniel.buenzli at erratique.ch
Wed Sep 16 16:34:17 CDT 2015


On Wednesday, 16 September 2015 at 21:27, Dominikus Dittes Scherkl wrote:
> Why add them up?
> I think every grapheme cluster of Hangul syllables would simply have
> width 2 - that is the concept of CJK characters.

I don't personally know how CJK characters behave in general with respect to width; that's why I'm asking. I'm just trying to find a simple, best-effort, data-driven algorithm for the problem at hand, using standard properties and, if possible, without building in assumptions about scripts.
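
To make the idea concrete, here is a minimal sketch of what such a best-effort, data-driven estimate might look like, using only Python's standard unicodedata module. The function name estimated_width and the per-code-point rules are assumptions made for illustration, not a settled algorithm: it works on code points rather than grapheme clusters (the standard library has no grapheme segmentation), treats combining marks and format characters as zero-width, and counts East Asian Width W and F as two columns.

```python
import unicodedata


def estimated_width(text: str) -> int:
    """Best-effort column estimate, computed one code point at a time.

    Hypothetical helper: the rules below are illustrative assumptions,
    not a standard algorithm.
    """
    total = 0
    for ch in text:
        # Combining marks and format characters are assumed to take no column.
        if unicodedata.combining(ch) != 0 or unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue
        # Wide (W) and Fullwidth (F) code points are assumed to take two columns.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            total += 2
        else:
            # Everything else, including control characters, is counted as one.
            total += 1
    return total
```

Note that this sums widths over code points, which is exactly the choice being questioned above for Hangul; a cluster-based variant would need a grapheme segmenter first.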


On Wednesday, 16 September 2015 at 20:33, Richard Wordingham wrote:
> Have you addressed the issue of Indic scripts? There are
> discontiguous grapheme clusters composed of indecomposable code points
> (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g.
> U+0BCA TAMIL VOWEL SIGN OO),  

Not sure I understand what you mean here.

> and whether consonant + virama + consonant is one cell or two may even depend on the font (e.g.
> Devanagari).  

Well, anything that depends on font metrics is out of scope from the point of view of a tty, since I can't get that information. For example, it seems that U+1F400 to U+1F579 have an East Asian Width of N but actually occupy two columns in the built-in OS X terminal; of course, these scalar values are not East Asian text per se.
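
For anyone who wants to check such a property value themselves, the following small Python sketch prints the East Asian Width of a code point together with the version of the Unicode database it was read from. The helper name describe_width is hypothetical. The reported value depends on the Unicode version bundled with the Python build, so it may not match what is described here.

```python
import unicodedata


def describe_width(code_point: int) -> str:
    """Report a code point's East_Asian_Width as seen by this Python build."""
    ch = chr(code_point)
    name = unicodedata.name(ch, "<unnamed>")
    eaw = unicodedata.east_asian_width(ch)
    return (
        f"U+{code_point:04X} {name}: East_Asian_Width={eaw} "
        f"(Unicode {unicodedata.unidata_version})"
    )


# The value printed for U+1F400 depends on the bundled Unicode database.
print(describe_width(0x1F400))
```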

> How are you handling ligatures between grapheme clusters,
> e.g. English <f, i>?  

Here again I'd need font information. I expect the tty not to make ligatures between f and i.


Of course, the best way would be to hand a string to the tty and have it measure it. But then it already seems impossible to test whether a terminal is able to handle UTF-8 or not…
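
The closest thing to such a test that I know of is to look at what the locale environment claims, which says nothing about what the terminal actually does. A minimal sketch of that heuristic, assuming the usual POSIX precedence of LC_ALL over LC_CTYPE over LANG (the function name locale_claims_utf8 is made up for illustration):

```python
import os
from typing import Mapping, Optional


def locale_claims_utf8(env: Optional[Mapping[str, str]] = None) -> bool:
    """Guess whether the locale environment advertises UTF-8.

    This only inspects environment variables; it cannot verify that the
    terminal really decodes UTF-8.
    """
    env = os.environ if env is None else env
    # The first non-empty variable wins, following POSIX precedence.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            normalized = value.lower().replace("-", "")
            return "utf8" in normalized
    return False
```

Called without arguments it reads os.environ; passing a mapping makes it easy to try out specific settings.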

Maybe trying to use the East Asian Width property was not a good idea to start with.

Best,

Daniel