Encoding Sequence for Tai Tham Matres Lectionis

Richard Wordingham richard.wordingham at ntlworld.com
Sat Jan 3 19:29:05 CST 2015


I have two questions, but I begin with some preliminaries in case I am
labouring under any misapprehensions.

Firstly, I assume that any legible text in the Tai Tham script with a
well-defined pronunciation in one of the main languages using the Tai
Tham script (Pali, Tai Khün, Tai Lue, Northern Thai and Lao) either:

1) Contains an unencoded character;
2) Has a unique (up to canonical equivalence) correct encoding;
3) Has a glyph with multiple encodings; or
4) Reveals a deficiency in the specification of the encoding of the
script.

Glyphs with multiple encodings most commonly occur with styles that do
not distinguish U+1A62 TAI THAM VOWEL SIGN MAI SAT and U+1A76 TAI THAM
SIGN TONE-2.  These can generally be resolved on the basis of the
pronunciation.

Secondly, what is the definition of the encoding?  Is it just the
Unicode standard, or is it the sequence of approved proposals plus the
Unicode standard (with the latest approval taking precedence)?  I
presume the proposals are relevant, as otherwise there might not be a
defined coding difference between the second syllable of /ɲaʔɲuʔ/
'shaky' and the word /hui/ 'to sprinkle'.  The proposals lead to the
encoding <U+1A49 TAI THAM LETTER HIGH HA, U+1A60 TAI THAM SIGN SAKOT,
U+1A3F TAI THAM LETTER LOW YA, U+1A69 TAI THAM VOWEL SIGN U> for the
former and <U+1A49, U+1A69, U+1A60, U+1A3F> for the latter.  The visual
difference lies in the positioning of the vowel; there is no visual
justification for claiming that the dependent consonant is subjoined
to the vowel in either case.

Similarly, there is nothing in TUS itself to specify whether /kuː/
(Lao /kʰuː/) 'pair' is spelt <U+1A23 TAI THAM LETTER LOW KA, U+1A6A TAI
THAM VOWEL SIGN UU, U+1A76 TAI THAM SIGN TONE-2> or <U+1A23, U+1A76,
U+1A6A>.  Unlike the corresponding sequences in Thai, these two
sequences are not canonically equivalent.
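
The equivalence claims above can be checked mechanically.  The following
is a minimal sketch, assuming Python's standard unicodedata module (whose
results depend on the Unicode data bundled with the interpreter).  The
Thai sequence (KHO KHWAI with SARA UU and MAI EK) is my own illustrative
analogue and is not taken from any proposal; the helper name is likewise
hypothetical.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Tai Tham /kuː/ 'pair': VOWEL SIGN UU and SIGN TONE-2 in both orders.
tham_vowel_first = "\u1A23\u1A6A\u1A76"
tham_tone_first = "\u1A23\u1A76\u1A6A"

# Thai analogue (assumption, for comparison only):
# KHO KHWAI with SARA UU and MAI EK in both orders.
thai_vowel_first = "\u0E04\u0E39\u0E48"
thai_tone_first = "\u0E04\u0E48\u0E39"

# The two words contrasted earlier: second syllable of 'shaky' vs 'to sprinkle'.
shaky_second_syllable = "\u1A49\u1A60\u1A3F\u1A69"
sprinkle = "\u1A49\u1A69\u1A60\u1A3F"

pairs = [
    ("Tai Tham UU/TONE-2 orders", tham_vowel_first, tham_tone_first),
    ("Thai SARA UU/MAI EK orders", thai_vowel_first, thai_tone_first),
    ("'shaky' syllable vs 'sprinkle'", shaky_second_syllable, sprinkle),
]
for label, a, b in pairs:
    print(f"{label}: equivalent = {canonically_equivalent(a, b)}")
```

The Thai pair normalises to a single order because both marks have
non-zero canonical combining classes, whereas the Tai Tham vowel signs
are starters, so normalisation leaves each Tai Tham order as it is.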

Before LANNA VOWEL SIGN AM and LANNA VOWEL SIGN TALL AM were rejected,
the basic syllable structure for encoding was
<pre-vowel_consonant_stack, vowels_before, vowels_below, vowels_above,
tones_etc, vowels_after, post-vowel_consonant_stack>.  Apart from the
first element of the pre-vowel consonant stack, each element of the
consonant stacks was either a pair of SAKOT and consonant letter or a
consonant sign.
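
The syllable structure just described can be sketched as a regular
expression.  This is only an illustration under stated assumptions: the
assignment of vowel signs to the before/below/above/after classes is my
own reading of the code chart, each class is allowed to repeat freely,
and independent vowels, digits and punctuation are ignored.  All names
in the code are hypothetical.

```python
import re

# Character classes (ASSUMPTION: positional grouping is illustrative only).
CONSONANT = "\u1A20-\u1A4C"        # letters HIGH KA .. LOW HA
CONSONANT_SIGN = "\u1A55-\u1A5E"   # MEDIAL RA .. CONSONANT SIGN SA
SAKOT = "\u1A60"
V_BEFORE = "\u1A6E-\u1A72"         # E, AE, OO, AI, THAM AI
V_BELOW = "\u1A69\u1A6A\u1A6C"     # U, UU, OA BELOW
V_ABOVE = "\u1A62\u1A65-\u1A68\u1A6B\u1A73\u1A74"
TONES_ETC = "\u1A75-\u1A7C"        # TONE-1 .. KHUEN-LUE KARAN
V_AFTER = "\u1A61\u1A63\u1A64\u1A6D"

# A stack element is either <SAKOT, consonant letter> or a consonant sign.
STACK_ELEMENT = f"(?:{SAKOT}[{CONSONANT}]|[{CONSONANT_SIGN}])"

SYLLABLE = re.compile(
    f"[{CONSONANT}]{STACK_ELEMENT}*"      # pre-vowel consonant stack
    f"[{V_BEFORE}]*[{V_BELOW}]*[{V_ABOVE}]*"
    f"[{TONES_ETC}]*[{V_AFTER}]*"
    f"{STACK_ELEMENT}*"                   # post-vowel consonant stack
)


def fits_template(sequence: str) -> bool:
    """True if the whole sequence matches the single-syllable template."""
    return SYLLABLE.fullmatch(sequence) is not None


examples = {
    "'shaky' 2nd syllable": "\u1A49\u1A60\u1A3F\u1A69",
    "'to sprinkle'": "\u1A49\u1A69\u1A60\u1A3F",
    "'pair', vowel then tone": "\u1A23\u1A6A\u1A76",
    "'pair', tone then vowel": "\u1A23\u1A76\u1A6A",
}
for label, seq in examples.items():
    print(f"{label}: fits = {fits_template(seq)}")
```

Under these assumptions the template accepts both orderings of the
subjoined YA and the vowel U, but only the vowel-before-tone ordering
of 'pair'.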

The script has made use of three consonant letters to indicate
vowels: U+1A3F LETTER LOW YA, U+1A45 LETTER WA and U+1A4B LETTER A.  The
subscript form of LETTER A has for most purposes evolved into a vowel
symbol, U+1A6C TAI THAM VOWEL SIGN OA BELOW, and presents no known
issues.  The combinations <U+1A60, U+1A3F> and <U+1A60, U+1A45>
represent vowels, generally /e/ and /o/ in Tai Khün and Tai Lue
and /iːa/ and /uːa/ in Northern Thai and Lao.  These may reasonably be
regarded as matres lectionis.  The question then arose of how to order
them with respect to any other vowels or tone marks.  Thai suggested
that the mater lectionis should come last, treating the syllable as a
pair of chained syllables, but because of Tai Khün feedback they were
included in the pre-vowel consonant stack.  For interaction with other
vowel symbols, this decision is reflected in the 2007 proposal
http://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf .

I have been indexing the Lanna script spellings in the 'Northern Thai
Dictionary of Palm-Leaf Manuscripts', and I have encountered puzzles with
some very Siamese spellings.

Q1.  Should I treat the mater lectionis as part of the initial stack or
as starting a chained syllable when an unexpected written vowel
appears to precede it?  Specifically:

Q1a. Should I encode a certain writing of /kuaʔ/ 'a wooden or
woven-bamboo tray' as <U+1A20 TAI THAM LETTER HIGH KA, U+1A60, U+1A45,
U+1A62 TAI THAM VOWEL SIGN MAI SAT, U+1A61 TAI THAM VOWEL SIGN A> or as
<U+1A20 TAI THAM LETTER HIGH KA, U+1A62 TAI THAM VOWEL SIGN MAI SAT,
U+1A60, U+1A45, U+1A61 TAI THAM VOWEL SIGN A>?  The usual spelling of
this word would be <U+1A20 TAI THAM LETTER HIGH KA, U+1A60, U+1A45,
U+1A6B TAI THAM VOWEL SIGN O, U+1A61 TAI THAM VOWEL SIGN A>.

Q1b. Should I encode a certain writing of /luːa/ 'firewood' other than
as <U+1A49 TAI THAM LETTER HIGH HA, U+1A56 TAI THAM CONSONANT SIGN
MEDIAL LA, U+1A62 TAI THAM VOWEL SIGN MAI SAT, U+1A60, U+1A45>, and if
so, how?  The usual writing of the word would be encoded as <U+1A49,
U+1A56, U+1A60, U+1A45, U+1A6B TAI THAM VOWEL SIGN O>.

Q1c. I see three reasonable encodings of the writing of /sawiːan/ 'a
large woven basket for holding unhusked rice'.  The choice between (ii)
and (iii) depends on the answer to Q2.  The three choices are:
(i) <U+1A48 TAI THAM LETTER HIGH SA, U+1A60, U+1A45, U+1A7B TAI THAM
SIGN MAI SAM, U+1A66 TAI THAM VOWEL SIGN II, U+1A60, U+1A3F, U+1A41 TAI
THAM LETTER RA>
(ii) <U+1A48 TAI THAM LETTER HIGH SA, U+1A60, U+1A45, U+1A7B TAI THAM
SIGN MAI SAM, U+1A60, U+1A3F, U+1A66 TAI THAM VOWEL SIGN II, U+1A41 TAI
THAM LETTER RA> and
(iii) <U+1A48 TAI THAM LETTER HIGH SA, U+1A60, U+1A45, U+1A60, U+1A3F,
U+1A7B TAI THAM SIGN MAI SAM, U+1A66 TAI THAM VOWEL SIGN II, U+1A41 TAI
THAM LETTER RA>.
Which encoding should I choose?

Q2. Where should I put the MAI SAM in the encoding of the fuller usual
writing of /sawiːan/?  Should I write
(i) <U+1A48 TAI THAM LETTER HIGH SA, U+1A60, U+1A45, U+1A7B TAI THAM
SIGN MAI SAM, U+1A60, U+1A3F, U+1A41 TAI THAM LETTER RA> or
(ii) <U+1A48 TAI THAM LETTER HIGH SA, U+1A60, U+1A45, U+1A60, U+1A3F,
U+1A7B TAI THAM SIGN MAI SAM, U+1A41 TAI THAM LETTER RA>?
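
To make the two candidates easier to compare, here is a minimal sketch,
assuming Python's standard unicodedata module, that spells out each
sequence by character name and reports where MAI SAM falls.  The
dictionary labels and the helper name are hypothetical; the code points
are those given in (i) and (ii) above.

```python
import unicodedata

MAI_SAM = "\u1A7B"

candidates = {
    "Q2 (i)": "\u1A48\u1A60\u1A45\u1A7B\u1A60\u1A3F\u1A41",
    "Q2 (ii)": "\u1A48\u1A60\u1A45\u1A60\u1A3F\u1A7B\u1A41",
}


def describe(sequence: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per code point in the sequence."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, seq in candidates.items():
    # Zero-based position of MAI SAM within the code point sequence.
    print(f"{label}: MAI SAM at index {seq.index(MAI_SAM)}")
    for line in describe(seq):
        print("    " + line)
```

The two candidates contain the same code points and differ only in
whether MAI SAM comes before or after the <SAKOT, LOW YA> pair.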

The TUS does not specify where the MAI SAM representing the typically
anaptyctic vowel /a/ should go.  (In this case, /swiːan/ *is* a possible
Northern Thai word.)  The previously cited 2007 proposal says, "it is
stored following the subjoined form to indicate the consonant being at
the start of a new syllable".  However, this moves a mark which is
positioned like a vowel or tone mark into the consonant cluster's
sequence of code points.

The Maefahluang dictionary (p719 of Revision 1) actually writes the mai
sam after the RA.  Should this be regarded as a typographical error?  I
have not been able to discern a pattern in the positioning in that
dictionary of mai sam used to indicate a hidden syllable boundary.

Richard.


