Kirat Rai Decompositions, was Re: Compatibility decomposables that are not compatibility characters

Tue Feb 22 08:00:29 CST 2022

On 2/18/22 17:06, Richard Wordingham via Unicode wrote:
> On Fri, 18 Feb 2022 14:46:13 -0500
> "Mark E. Shoulson via Unicode" <unicode at corp.unicode.org> wrote:
>
>> Perhaps relevant to this thread, I was just reading in
>> https://www.unicode.org/L2/L2022/22043-kirat-rai.pdf L2/22-043,
>> proposal to encode Kirat Rai Script, where it remarks regarding the
>> vowels:
>>
>>> These should all be encoded atomically. This is because
>>> linguistically these vowels are not composed of two
>>> separate characters, they are single vowels in their own right. It
>>> is true that the custom encoded Kirat Rai font uses decomposed vowel
>>> signs as a matter of expediency, but this decision should not
>>> influence the right way to encode the script. Because the glyph for
>>> some of the vowels (aa and e) are part of the shape of the last 3
>>> vowels (ai, o, au) there should be canonical decompositions for the
>>> last 3 vowels. With these decompositions, Do Not Use tables are not
>>> necessary.
>> If the vowels are to be encoded atomically, and it sounds like they
>> should be, shouldn't we *not* want to have canonical decompositions
>> for them?  I thought Unicode was trying to avoid precomposed
>> characters at this point.  I guess it's too late to hope for "only
>> one right way to spell it" out of Unicode, but is that still
>> something we try to approach?  It almost seems to me that canonical
>> decompositions also stem from cases of "things that wouldn't be
>> encoded if they were proposed now," and if so it would not really
>> make sense to propose anything with a canonical decomposition.  Or am
>> I misunderstanding the attitude towards canonical decompositions, or
>> the proposal's statement?
> X technology should obviously be opposed wherever possible.  We should
> make it impossible to enter these vowel symbols at a single stroke
> when using a simple X keyboard or even an MSKLC keyboard creator.  We
> must keep professional keyboard writers in work.
>
> Your wording is confusing.  There are several different options:
>
> 1) Only allow encoding for single vowels (the Khmer model)
> 2) Do not encode visually compound vowels (the Myanmar model)
> 3) Allow visually compound vowels as sequences or as single characters
> (the south Indian model)
>
> The proposal argues for (3), which rather assumes that canonical
> equivalence will be taken seriously.  At least we don't have the
> problem presented by doubled multipart south Indian vowels.
>
> Model (1) calls forth a need for stop lists, and potential confusion
> when a compound vowel notation is later found to be needed.  (From
> the Southern Thai point of view, there seems to be a vowel missing from
> the Khmer script which it would be very tempting to just encode as
> <U+17C1, U+17B7>, though in *Khmer* usage it is arguably just a glyph
> variant of U+17BE KHMER VOWEL SIGN OE.)
>
> I think you're calling for (2), which with current technology seems to
> make keyboard creation unduly complicated or fragile if we want users
> to be able to treat KIRAT RAI VOWEL SIGN O as a single entity.  (Do
> users have such a perception?  We'll probably be told that it's not a
> user-perceived character.)

Sorry to have been confusing, and I'm not so much "calling for" one 
answer or another as asking what's more in line with what we do.  The 
text in the proposal says "These should all be encoded atomically. This 
is because linguistically these vowels are not composed of two separate 
characters, they are single vowels in their own right."  This would seem 
to me to be proposing that the seemingly-compound characters be encoded 
instead as single characters, because they are not viewed as being 
compound.  And that makes sense to me, as well, albeit we also go in the 
other direction, in not encoding compound letters like "ll" or "ch" in 
Welsh as separate letters.

But then the proposal goes on to say "Because the glyph for some of the 
vowels (aa and e) are part of the shape of the last 3 vowels (ai, o, au) 
there should be canonical decompositions for the last 3 vowels," which 
sounds to me like the atomic single "ai" vowel is to be given a 
canonical decomposition into its simpler components, i.e., "ai" is 
basically a precomposed character, like é, which has atomic existence 
but is canonically equivalent to e + ◌́.  As I understand it, that would 
be #3 in your list above.  And I thought that was considered a Bad Thing 
these days, that we were trying to avoid, when possible, having too many 
ways to represent the "same" (canonically equivalent) text.  Am I wrong 
about that, in general?
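
To make the é comparison concrete, here is a minimal sketch in Python using
the standard unicodedata module. The helper name canonically_equivalent is
made up for illustration; it only compares NFD forms, which is one common way
to test canonical equivalence, not the only one.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are canonically equivalent
    if their NFD (canonical decomposition) forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Two different code point sequences...
print(precomposed == decomposed)
# ...that normalization treats as the same text.
print(canonically_equivalent(precomposed, decomposed))
# The canonical decomposition recorded in the character database.
print(unicodedata.decomposition(precomposed))
```

A proposed vowel sign with a canonical decomposition would presumably behave
the same way: one atomic code point plus an equivalent two-character spelling.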

I guess if I were to be "calling for" anything, it would be... um, now 
I'm finding your wording unclear.  I think #1 in your list, by which I 
intend that aa and e and ai and o and au and everything would each be 
given its own code-point, and that none of those code-points would be 
canonically equivalent to a sequence of the others.  #2 sounds like 
encoding only the vowel-signs which don't look like sequences of others, 
and ai and o and au could only be represented as sequences, which seems 
to run counter to the proposal (not that decisions can't be made counter 
to proposals), and #3 sounds like encoding each vowel as its own 
character, as in #1, *and* the "compound" vowels could be represented 
either by their own codepoints or by sequences of "simple" vowels, and 
the two representations would be canonically equivalent, and that 
situation, to me, seems undesirable.
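
As a rough illustration of the difference between #1 and #3, already-encoded
scripts can be queried with Python's unicodedata module. The sketch below
assumes U+0BCA TAMIL VOWEL SIGN O as a stand-in for the south Indian model and
U+17BE KHMER VOWEL SIGN OE for the Khmer model; canonical_spellings is a
made-up helper name, and Kirat Rai itself is not involved since it is only a
proposal at this point.

```python
import unicodedata

def canonical_spellings(ch: str) -> set:
    """Hypothetical helper: the NFC and NFD forms of a character.
    One element means a single spelling; two means the character
    also has a canonically equivalent sequence."""
    return {unicodedata.normalize("NFC", ch), unicodedata.normalize("NFD", ch)}

tamil_o = "\u0bca"    # TAMIL VOWEL SIGN O, decomposes to U+0BC6 U+0BBE
khmer_oe = "\u17be"   # KHMER VOWEL SIGN OE, no canonical decomposition

for name, ch in (("Tamil O", tamil_o), ("Khmer OE", khmer_oe)):
    forms = canonical_spellings(ch)
    listing = sorted(" ".join(f"U+{ord(c):04X}" for c in s) for s in forms)
    print(name, len(forms), listing)
```

The Tamil sign yields two equivalent spellings (the situation in #3), while
the Khmer sign yields only itself (the situation in #1).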

Am I making sense?

~mark