Tai Tham Text Encoding

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jul 23 11:12:44 CDT 2022


Most characters for writing words in the Tai Tham script in normal
texts have been encoded, though there are a few exceptions, of which
TAI THAM LETTER LAO LOW HA is the most prominent.  (This is
mostly handled by repurposing TAI THAM LETTER LOW HA, which is not used
in Lao.  Their relationship is like that between U+11034 BRAHMI LETTER
LLA and U+11075 BRAHMI LETTER OLD TAMIL LLA.)  On close reading of the TUS,
perhaps we also need to disunify U+1A58 TAI THAM SIGN MAI KANG LAI
depending on how it may be positioned relative to a following syllable
with a preposed vowel.  (It was originally proposed as two separate
characters, distinguished by shape rather than positioning.)  We may
need some monstrosities such as 'INVISIBLE MAI SAM' (though I'd rather
use CGJ). 

However, I am having a hard time persuading people that there is a
defined encoding for combinations of characters that rendering engines
should respect.  What I regard as the basic definition of the encoding
of text is contained in the approved proposals, rather than in TUS or
any emanation thereof.

What should I call the specification of the encoding of text, as
opposed to the encoding of characters?  Would it be suitable to refer
to it as 'text encoding'?

I am trying to work out what in the way of Tai Tham text encoding is
laid down by the TUS and its emanations, such as the Unicode Character
Database. It is significant that the Indic syllabic category is
informative and by policy does not reflect sequencing requirements.
What I am left with is the general properties of marks, the principle
of canonical equivalence (which is still widely flouted) and the
specific text in the Tai Tham section.
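
As an illustration of how little canonical equivalence alone pins down,
here is a minimal Python sketch using only the standard unicodedata
module.  The sample sequence is a bare run of marks chosen purely for
illustration, not a real word, and the helper names are hypothetical.
The results depend on the Unicode version bundled with the Python build.

```python
import unicodedata


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are equal."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def describe(text):
    """Return (name, general category, canonical combining class) per code point."""
    return [
        (unicodedata.name(ch, "<unnamed>"),
         unicodedata.category(ch),
         unicodedata.combining(ch))
        for ch in text
    ]


# Illustrative only: the same three characters in two different orders.
HIGH_KA, SAKOT, TONE_1 = "\u1A20", "\u1A60", "\u1A75"
tone_first = HIGH_KA + TONE_1 + SAKOT
sakot_first = HIGH_KA + SAKOT + TONE_1

if __name__ == "__main__":
    for row in describe(sakot_first):
        print(row)
    # Different code point sequences, yet normalization treats them alike.
    print(tone_first == sakot_first)
    print(canonically_equivalent(tone_first, sakot_first))
```

Marks whose canonical combining class is 0 are never reordered by
normalization, so canonical equivalence says nothing about their
relative order.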

Now, extracting specifications is a bit tricky.  For example, consider
"*Tone Marks*. Tai Tham has two combining tone marks, U+1A75 tai tham
sign tone-1 and U+1A76 tai tham sign tone-2, which are used in Tai Lue
and in Northern Thai. These are rendered above the vowel over the base
consonant."  In modern Tai Khuen, what I take to be TONE-1 is rendered
to the right of the larger vowels over the base consonant, such as
VOWEL SIGN I.  Should I therefore conclude that what I have taken to be
TONE-1 is something else?  That would be ridiculous.  We also have the
statement in TUS Section 2.11 that "all sequences of character codes
are permitted".

I think I can extract some meaning from the text in the same section:

"Tone marks are represented in logical order following the vowel over
the base consonant or consonant stack. If there
is no vowel over a base consonant, then the tone is rendered directly
over the consonant; this is the same way tones are treated in the Thai
script."

Consider the word ᨠᩮᩬᩥ᩵ᩁ <HIGH KA, SIGN E, SIGN OA BELOW, SIGN I,
TONE-1> in a typical Northern Thai style.  The central stack, from top
to bottom, is TONE-1, SIGN I, HIGH KA, SIGN OA BELOW.  If there were 'no
vowel over the base consonant', then TONE-1 would be rendered directly
over the base consonant, which is not how it is written.  Therefore the
term 'vowel' refers to a vowel character rather than a complete
phonetic vowel.  Therefore the logical order of the marks above and
below is either <SIGN OA BELOW, SIGN I, TONE-1>, as in the
proposals, or <SIGN I, TONE-1, SIGN OA BELOW>.  The USE (Universal
Shaping Engine) insists on <SIGN I, SIGN OA BELOW, TONE-1>!  (The USE
order could be corrected by its override
method.)
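
To make the three candidate orders concrete, the following Python sketch
spells each one out as code points.  It is only an illustration: the
dictionary labels and function names are hypothetical, and the final RA
of the example word is left out, as in the sequence listed above.
Whether normalization keeps the three orders apart depends on the
canonical combining classes in the Unicode Character Database, so the
script reports that count rather than assuming it.

```python
import unicodedata

HIGH_KA = "\u1A20"        # TAI THAM LETTER HIGH KA
SIGN_E = "\u1A6E"         # TAI THAM VOWEL SIGN E
SIGN_OA_BELOW = "\u1A6C"  # TAI THAM VOWEL SIGN OA BELOW
SIGN_I = "\u1A65"         # TAI THAM VOWEL SIGN I
TONE_1 = "\u1A75"         # TAI THAM SIGN TONE-1

PREFIX = HIGH_KA + SIGN_E

# The three orders of the marks above and below that are discussed in the text.
CANDIDATE_ORDERS = {
    "proposals": PREFIX + SIGN_OA_BELOW + SIGN_I + TONE_1,
    "alternative reading": PREFIX + SIGN_I + TONE_1 + SIGN_OA_BELOW,
    "USE": PREFIX + SIGN_I + SIGN_OA_BELOW + TONE_1,
}


def code_points(text):
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def distinct_after_normalization(orders, form="NFD"):
    """Count how many different strings remain once each order is normalized."""
    return len({unicodedata.normalize(form, s) for s in orders.values()})


if __name__ == "__main__":
    for label, text in CANDIDATE_ORDERS.items():
        print(f"{label}: {code_points(text)}")
    print("distinct under NFD:", distinct_after_normalization(CANDIDATE_ORDERS))
```

If the count printed is 3, canonical equivalence cannot reconcile the
orders, and a renderer or font has to be told which one is intended.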

By contrast, there is some useful text on the position of U+1A7B TAI
THAM SIGN MAI SAM in character code sequences.

In summary, my two main questions are:

Is 'encoding of text' the correct phrase for the definition of the
correct arrangement?  Is it appropriate to submit a proposal for the
standardisation of Tai Tham text encoding?

Richard.




