USE Indic Syllabic Category

Richard Wordingham via Unicode unicode at unicode.org
Sat Feb 23 05:39:44 CST 2019


On Sat, 23 Feb 2019 14:46:27 +0800
梁海 Liang Hai via Unicode <unicode at unicode.org> wrote:

> >>> once the USE acknowledges that subjoined consonants may follow
> >>> vowels    
> >> 
> >> I expect to update the USE spec to address this soon.  
> > 
> > That seems welcome news.  I still don't know what the problem with
> > supporting them has been.  
> 
> USE wasn’t designed to allow such a syllable structure. Tai Tham’s
> being supported by USE is kind of an oversight. And although it’s
> appropriate to allow conjoined consonants to follow post-base-spacing
> vowel signs, it’s not really a trivial debate whether USE should
> allow conjoined consonants to non-post-base-spacing (ie, pre-base,
> above-base, and below-base) vowel signs—considering the ambiguity.

1. "The goal of the clustering logic is to enable what is graphically
consistent with a given script’s rules, rather than enforcing
particular orthographic or linguistic rules. Such considerations should
be applied at another layer, such as a spelling checker." - USE
Specification.

There are very few cases that cannot be resolved by a spell-checker
once word boundaries are resolved.  Pali and Tai phonology (but
Lao is TBC) conspire to keep the numbers down.

2. The UTC membership had this discussion when discussing the proposals
on the Unicore list.

3. Ambiguity is often font-dependent with above- and below-base vowels,
and with tone marks.  Marks above are frequently positioned relative to
the phonetically preceding spacing consonant element - <SAKOT, BA/PA>,
<SAKOT, LA>, <SAKOT, YA> and <SAKOT, SA> are common coda ("sakot")
consonants that are spacing.  In Northern Thai, <SIGN U, SAKOT, YA> is
frequently, and <SIGN UU, SAKOT, BA/PA> can be, written with the vowel
largely to the left of the subscript consonant.  Apart from <SIGN U,
SAKOT, YA>, Northern Thai largely avoids <vowel below, subscript
consonant>, preferring the minor ambiguity of, for example, <RA,
SIGN UU, BA> being either /huːp/ or /luː paʔ/.  (These two forms are
a doublet.)

4. Subjoined consonants after vowels are explicitly noted in the TUS for
the Khmer script, and I
suspect they're important for Tai languages in the Khmer ('Khom')
script.

5. For visual proofing, one can use colour-coding - people are welcome
to copy the relevant logic from my Da Lekh Si font.  Word processor
support for colour distinctions is limited, but it is in place in
several browsers.  Most of each akshara is in the foreground colour, so
it works with syntax highlighting and similar existing uses of
colour-coding.

6. The Sanskrit clusters grv- and gvr- are ambiguous in several
Sanskrit-capable Indic scripts.  (I haven't yet had the chance to study
how Sanskrit is written in Tai Tham, though I do know of one
inscription.)

7. The ambiguity of <SAKOT, BA> and <SAKOT, PA> was called out when
<SAKOT, BA> was allowed as the usual subscript of U+1A37 TAI THAM
LETTER BA.

8. The biggest ambiguity issue is the use of <SAKOT, U+1A4B TAI THAM
LETTER A> for U+1A6C TAI THAM VOWEL SIGN OA BELOW.  The USE is
powerless to deal with this.  I wish someone would let me in on the
evidence that they are actually distinct.

9. There is actually a problem with CVC aksharas being wrongly encoded,
paradoxically because of the USE's poor support for Tai Tham.

HarfBuzz allows an OpenType font to shape Tai Tham text even if it does
not declare support for the script.  Such fonts have to do Indic
rearrangement themselves, and this is generally done by means of
ligatures for <preposed vowel, consonant>.  Consequently, a cluster
<HIGH HA, SAKOT, NA, SIGN E> gets encoded as <HIGH HA, SIGN E, SAKOT,
NA>, as there are scores of clusters and five preposed vowels, too many
combinations for such ligatures to cover.  I know
it is possible to do rearrangement properly given access to GSUB; I
have a Tai Tham via ASCII mode in my Da Lekh fonts, and I have to do
some rearrangement to clean up after the USE.
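
A minimal sketch of the effect described in point 9, in Python.  It is
a toy model only, not HarfBuzz's or the USE's algorithm: the function
names (ligature_only, full_rearrangement) are hypothetical, every "TAI
THAM LETTER" is treated as a possible base, and medial consonant signs,
tone marks and other marks are ignored.  The five preposed vowels are
assumed to be SIGN E, AE, OO, AI and THAM AI.

```python
import unicodedata

def tt(name):
    """Look up a Tai Tham character by the rest of its Unicode name."""
    return unicodedata.lookup("TAI THAM " + name)

HIGH_HA = tt("LETTER HIGH HA")
NA = tt("LETTER NA")
SAKOT = tt("SIGN SAKOT")
SIGN_E = tt("VOWEL SIGN E")
# Assumption: these are the five preposed vowel signs.
PREPOSED = {tt("VOWEL SIGN " + v) for v in ("E", "AE", "OO", "AI", "THAM AI")}

def is_letter(unit):
    # Simplification: any single "TAI THAM LETTER ..." counts as a base.
    return len(unit) == 1 and unicodedata.name(unit, "").startswith("TAI THAM LETTER ")

def units(text):
    """Split text into units, keeping <SAKOT, x> together as one subscript."""
    out, i = [], 0
    while i < len(text):
        step = 2 if text[i] == SAKOT and i + 1 < len(text) else 1
        out.append(text[i:i + step])
        i += step
    return out

def ligature_only(text):
    """Toy model of a font that only ligates adjacent <base, preposed vowel>."""
    u = units(text)
    i = 0
    while i + 1 < len(u):
        if is_letter(u[i]) and u[i + 1] in PREPOSED:
            u[i], u[i + 1] = u[i + 1], u[i]   # vowel displayed before base
            i += 2
        else:
            i += 1
    return "".join(u)

def full_rearrangement(text):
    """Toy model of proper reordering: the vowel moves before the whole cluster."""
    out = []
    for unit in units(text):
        if unit in PREPOSED:
            k = len(out)
            while k > 0 and out[k - 1].startswith(SAKOT):
                k -= 1                        # skip back over subscript consonants
            if k > 0 and is_letter(out[k - 1]):
                k -= 1                        # and over the base itself
            out.insert(k, unit)
        else:
            out.append(unit)
    return "".join(out)

logical = HIGH_HA + SAKOT + NA + SIGN_E       # <HIGH HA, SAKOT, NA, SIGN E>
workaround = HIGH_HA + SIGN_E + SAKOT + NA    # <HIGH HA, SIGN E, SAKOT, NA>
```

Under this model, ligature_only(logical) leaves the vowel after the
cluster, whereas ligature_only(workaround) gives the same vowel-first
order that full_rearrangement gives for either input.  That is the
pressure towards typing the second sequence.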

There was a brief, happy period when HarfBuzz's SEA shaping engine was
available for Tai Tham, but this was deleted in favour of an
implementation of the USE.  There are now two bunches of Tai Tham
fonts which simply don't work on Microsoft browsers - Graphite fonts
and the DIY OpenType Indic rearrangers.

Richard.


