Proposed Expansion of Grapheme Clusters to Whole Aksharas - Implementation Issues

Richard Wordingham via Unicode unicode at unicode.org
Mon Jan 22 21:06:04 CST 2018


On Sun, 21 Jan 2018 22:34:12 -0800
Mark Davis ☕️ via Unicode <unicode at unicode.org> wrote:

> I was looking at the feedback in http://www.unicode.org/review/pri355/,
> and didn't see yours there. Could you please file your feedback
> there? (Nothing on this list is tracked by the committee...)

This is the submission I have just made:

The major principled issue I have is that UAX#29 can no longer claim to
have a sound definition of the concept of a 'user-perceived character'.
Perhaps it never did.

Some of the claims in UAX#29 would be more convincing if there were
evidence to back them up.  For example, this evening I did a quick bit
of research and asked the Korean owner of the local Korean restaurant
how many letters there were in the hangul spelling of 'Gangnam'.  She
traced out the spelling of the word (강남) and came back with the
answer '6'. UAX#29 claims it has 2 user-perceived characters.  You
might also argue that she has spent too long in England to be a useful
informant.

The following old paragraph causes grief for me:

"As far as a user is concerned, the underlying representation of text
is not important, but it is important that an editing interface present
a uniform implementation of what the user thinks of as characters.
Grapheme clusters commonly behave as units in terms of mouse selection,
arrow key movement, backspacing, and so on. For example, when a grapheme
cluster is represented internally by a character sequence consisting of
base character + accents, then using the right arrow key would skip
from the start of the base character to the end of the last accent."

The problem is that many editors read this as saying that the arrow
keys should move by whole grapheme clusters.  The result of this is that in
many applications, to replace the first character of a grapheme cluster
one must retype the entire grapheme cluster.  With a grapheme cluster
of three characters, as is common in Thai and Korean, this is
irritating.  With a grapheme cluster of four or five characters, as is
common in Northern Thai, it is annoying.

The prospect of the grapheme cluster being extended to include a whole
akshara fills me with dismay.  Consider the Northern Thai word ᩉ᩠ᨾᩰᩬᩫᩡ
<U+1A49 HIGH HA, U+1A60 SAKOT, U+1A3E MA, U+1A70 SIGN OO, U+1A6C SIGN OA
BELOW, U+1A6B SIGN O, U+1A61 SIGN A> /mɔʔ/ 'scrumptious'.  At present,
this 7 character word is split into three grapheme clusters, of lengths
2, 4 and 1.  However, it is clearly a single akshara.  To change the
first character, I would have to also retype the other 6 characters.
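
The split can be reproduced with a rough Python sketch.  It is not the
UAX#29 algorithm: it only attaches nonspacing, enclosing and spacing
marks to the preceding character, and NOT_SPACING_MARK hard-codes just
the three Tai Tham spacing marks that UAX#29 excludes from SpacingMark.
The names rough_clusters and attaches_to_previous are made up for
illustration.

```python
import unicodedata

# Assumption: only the Tai Tham exclusions from SpacingMark are listed;
# UAX#29 also excludes marks in other scripts.
NOT_SPACING_MARK = {0x1A61, 0x1A63, 0x1A64}


def attaches_to_previous(ch):
    """Rough stand-in for the Extend and SpacingMark properties."""
    category = unicodedata.category(ch)
    if category in ("Mn", "Me"):
        return True
    return category == "Mc" and ord(ch) not in NOT_SPACING_MARK


def rough_clusters(text):
    """Split text into approximate grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters and attaches_to_previous(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# HIGH HA, SAKOT, MA, SIGN OO, SIGN OA BELOW, SIGN O, SIGN A
WORD = "\u1A49\u1A60\u1A3E\u1A70\u1A6C\u1A6B\u1A61"

print([len(cluster) for cluster in rough_clusters(WORD)])
```

With cluster-wise deletion, correcting the initial HIGH HA costs a
retype of 2 characters today, and of all 7 if the whole akshara became
one cluster.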

My first thought was that changing software that way would breach the
UK's Equality Act 2010, by further restricting the ability of Northern
Thai users to do character by character editing.  (My wife's
protected characteristic extends to me for the purposes of the
Act.)  However, there may be a get-out in the form of Schedule 3,
paragraph 30
(https://www.legislation.gov.uk/ukpga/2010/15/schedule/3/paragraph/30).
The supplier of the service can claim that they only supply a character
by character editing facility to the ethnic groups using simple scripts,
and that they are under no obligation to supply the service to members
of other ethnic groups.  The paragraph reads:

"If a service is generally provided only for persons who share a
protected characteristic, a person (A) who normally provides the
service for persons who share that characteristic does not contravene
section 29(1) or (2)—

(a) by insisting on providing the service in the way A normally provides
it, or

(b) if A reasonably thinks it is impracticable to provide the service to
persons who do not share that characteristic, by refusing to provide
the service."

But what an embarrassing defence to offer!

However, there is another reason for rejecting the extension of
grapheme clusters to whole aksharas.  Currently, U+1A63 TAI THAM
VOWEL SIGN AA starts a grapheme cluster.  However, for non-defective
text, it is part of the same akshara as the preceding grapheme
cluster.  Now, the decision to make U+1A63 start a new grapheme cluster
is intrinsically reasonable.  It can have its own stack with a subscript
consonant and even a vowel, and it is not difficult to find manuscripts
showing a line break before it, e.g. L2/07-007 Figure 9b Leaf 2 lines
2/3, ᩈᨾᩮᩣᨴ᩠ᨴᨾ-ᩣᨶᩮᩉᩥ.

I believe that the akshara should be a level of text above the grapheme
cluster.  Ideally, it would be below the level of a word, but of course
in Sanskrit, word boundaries readily occur within present day grapheme
clusters.  (I made this recommendation in L2/17-122.)

Further comments apply to the definition of akshara boundaries,
regardless of whether they are to coincide with the boundaries of
grapheme clusters.

The proposed rules do not work well where a conjunct may fall back to a
visible virama.  This is particularly the case with Tamil, where
conjuncts are restricted to K.SSA and SH.R.II.  Johny Cibu provided an
example where the title துக்ளக் is broken as [ta-u, ka-virama, lla,
ka-virama].  However, as per the proposed algorithm it would be
[ta-u, ka-virama-lla, ka-virama].

http://www.chennaispider.com/attachments/Resources/3486-7144-Thuglak-Tamil-Magazine-Chennai.jpg
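
The two segmentations of the title can be compared with a small Python
sketch.  It is a simplification and not the text of the proposed rules:
the hypothetical function split_units attaches every combining mark to
the preceding character, and, when virama_links is set, also attaches
whatever follows a virama.

```python
import unicodedata

VIRAMA = "\u0BCD"  # TAMIL SIGN VIRAMA


def split_units(text, virama_links):
    """Split text into units; optionally let a virama link to what follows."""
    units = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # Simplification: anything after a virama is linked, not only consonants.
        links = virama_links and bool(units) and units[-1].endswith(VIRAMA)
        if units and (is_mark or links):
            units[-1] += ch
        else:
            units.append(ch)
    return units


# TA, VOWEL SIGN U, KA, VIRAMA, LLA, KA, VIRAMA
TITLE = "\u0BA4\u0BC1\u0B95\u0BCD\u0BB3\u0B95\u0BCD"

print([len(u) for u in split_units(TITLE, virama_links=False)])  # [2, 2, 1, 2]
print([len(u) for u in split_units(TITLE, virama_links=True)])   # [2, 3, 2]
```

The first result matches the break on the magazine cover; the second
glues LLA onto the preceding KA and virama.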

For native intuition, I would cite the Tamil letter-counting account at
https://venkatarangan.com/blog/content/binary/Counting%20Letters%20in%20an%20Unicode%20String.pdf.
What the author counts is not spacing glyphs, but vowel letters and
consonant characters, with two significant modifications.  Firstly,
K.SSA counts as just one consonant; secondly, SH.R.II is also counted as
containing a single consonant.  In other words, the Tamil virama
character works as a pure killer except in those two environments.
This is also the story the TUNE protagonists tell us.  It will be an
inelegant rule for UAX#29, but, unfortunately, reality is messy.
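
As a minimal sketch of that counting rule in Python, assuming my
reading of the account is right: count every character of general
category Lo (vowel letters and consonants), then take one off for each
K.SSA or SH.R.II sequence.  The function name count_tamil_letters and
the CONJUNCTS tuple are my own inventions, and the sketch makes no
attempt to restrict itself to Tamil.

```python
import unicodedata

# The two sequences in which the virama is not a pure killer.
CONJUNCTS = (
    "\u0B95\u0BCD\u0BB7",        # K.SSA: KA, VIRAMA, SSA
    "\u0BB6\u0BCD\u0BB0\u0BC0",  # SH.R.II: SHA, VIRAMA, RA, VOWEL SIGN II
)


def count_tamil_letters(text):
    """Count vowel letters and consonants, treating each conjunct as one."""
    letters = sum(1 for ch in text if unicodedata.category(ch) == "Lo")
    for conjunct in CONJUNCTS:
        # Each conjunct holds two consonant characters but counts once.
        letters -= text.count(conjunct)
    return letters


print(count_tamil_letters("\u0BA4\u0BC1\u0B95\u0BCD\u0BB3\u0B95\u0BCD"))  # 4
print(count_tamil_letters("\u0B95\u0BCD\u0BB7"))                          # 1
```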


To quote Johny Cibu further:

"Malayalam could be a similar story. In case of Malayalam, it can be
font specific because of the existence of traditional and reformed
writing styles. A conjunct might be a ligature in traditional; and it
might get displayed with explicit virama in the reformed style. For
example see the poster with word ഉസ്താദ് broken as [u, sa-virama,
ta-aa, da-virama] - as it is written in the reformed style. As per the
proposed algorithm, it would be [u, sa-virama-ta-aa, da-virama]. These
breaks would be used by the traditional style of writing."

https://upload.wikimedia.org/wikipedia/en/6/64/Ustad_Hotel_%282012%29_-_Poster.jpg

I believe there is a problem with the first two examples in Table
12-33.  If one suffixed <U+0D15 MALAYALAM LETTER KA, U+0D3E MALAYALAM
VOWEL SIGN AA> to the first two examples, yielding *പാലു്കാ and
 *എ്ന്നാകാ, one would have three Malayalam aksharas, not two extended
grapheme clusters as the proposed rules would say.



