Specification of Encoding of Plain Text

Richard Wordingham richard.wordingham at ntlworld.com
Fri Jan 13 19:18:09 CST 2017


On Fri, 13 Jan 2017 10:27:35 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> This points to another interesting issue. A number of languages have 
> seen orthographic reforms that affect the use of complex scripts.

> Now then, a decision: do you support both the old and the new style
> in the same rule-set? If vestiges remain in general use, you may not
> have a choice, but, what if the rules for old and new (or for
> different languages in the same script) actually conflict?

What we have seen in Khmer is a change that almost prohibits CVC
orthographic clusters.  (I don't count nikahits, visargas or fragments
of vowels as C.)  However, that is a rule of the language; it does not
need to be a rule of the script.

I am not sure that the old and new rules should conflict.  We are
presumably talking about a change made before the script was soundly
encoded; it seems unreasonable that renderers should suddenly refuse to
handle text that was previously valid.

Now, I can think of a potential problem with Northern Thai ᨴᩘ᩠ᩃᩣ᩠ᨿ
<U+1A34 TAI THAM LOW TA, U+1A58 MAI KANG LAI, U+1A60 SAKOT, U+1A43 LA,
U+1A63 SIGN AA, U+1A60 SAKOT, U+1A3F LOW YA> 'all'.  It is a single,
chained orthographic syllable.  This appears to be contrary to Tai Khün
grammar, and it is not clear to me how a modern Tai Khün font should render
it. (It's also contrary to USE, but so is most of the language.)  The
problem is that U+1A58 is a final, spacing mark in Tai Khün, while
further east it is a repha-like mark - it corresponds to kinzi in
Burmese.
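
For anyone wanting to inspect that sequence, here is a minimal Python
sketch.  It only rebuilds the string from the code points listed above
and prints what the standard unicodedata module reports for each one;
the helper names from_notation and describe are made up for this
illustration.

```python
import unicodedata

# Code points exactly as listed in the message above.
SEQUENCE = ["U+1A34", "U+1A58", "U+1A60", "U+1A43", "U+1A63", "U+1A60", "U+1A3F"]


def from_notation(labels):
    """Turn a list of 'U+XXXX' labels into a Python string."""
    return "".join(chr(int(label[2:], 16)) for label in labels)


def describe(text):
    """Return (code point, general category, character name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]


word = from_notation(SEQUENCE)
for row in describe(word):
    print(*row)
```

The categories and names printed depend on the Unicode version bundled
with the Python build in use.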

The solution I anticipate is that it must be rendered as a
non-spacing mark even in Tai Khün when it cannot be interpreted as a
spacing mark.  Has anyone handled this issue?  My intended solution will
allow a common sequence of code points for the old style (U+1A58
as kinzi), the intermediate Northern Thai styles, and the new style
(U+1A58 as a final consonant).
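
As a rough illustration of that kind of context-dependent decision, a
Python sketch follows.  The test it uses (U+1A58 immediately followed
by U+1A60 SAKOT, as in the word above, means it cannot be a final and
so is treated as non-spacing) is an assumption made purely for
illustration; the exact condition is not specified here, and a real
font would express this in its shaping tables rather than in code like
this.  The function name mai_kang_lai_roles is hypothetical.

```python
MAI_KANG_LAI = "\u1A58"
SAKOT = "\u1A60"


def mai_kang_lai_roles(text):
    """Classify each U+1A58 in text as 'non-spacing' or 'spacing'.

    Illustrative assumption: a MAI KANG LAI directly followed by SAKOT
    is inside a chained syllable, so it cannot be read as a final
    spacing mark and is classified as non-spacing.
    """
    roles = []
    for i, ch in enumerate(text):
        if ch != MAI_KANG_LAI:
            continue
        chained = text[i + 1 : i + 2] == SAKOT
        roles.append((i, "non-spacing" if chained else "spacing"))
    return roles


# The Northern Thai word from the message: MAI KANG LAI sits mid-syllable.
print(mai_kang_lai_roles("\u1A34\u1A58\u1A60\u1A43\u1A63\u1A60\u1A3F"))
# A bare consonant + MAI KANG LAI, where a final reading is possible.
print(mai_kang_lai_roles("\u1A34\u1A58"))
```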

> In the case that I cited, that combination of language/script was
> taken as out of scope for other reasons; now, for general text, are
> there situations where you'd want separate sets of rules for each
> language?

For determining which language a text might belong to, different rules
would be appropriate.  However, for deciding whether to render text,
that seems ridiculous.  Converting renderable multilingual text to plain
text would make it unrenderable, which is surely undesirable.

Having said that, there do appear to be potential problems in the Lanna
script arising from interactions of spelling and layout style.  In some
styles, the consonant (and vowel) stack turns right at a certain depth,
and therefore can reasonably contain more items than a strictly
vertical stack.  As both styles appear in material published in Chiang
Mai, I'd be loath to declare different validity rules.  I'd rather
treat any problems as the surfacing of a renderer limitation.

Richard.


