Specification of Encoding of Plain Text

Tue Jan 10 14:44:30 CST 2017

On Tue, 10 Jan 2017 00:06:05 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:
> On 1/9/2017 2:24 PM, Richard Wordingham wrote:

I'll take your last point first.

>> One might hope that the subsection about 'logical order' in TUS 9.0
>> Section 2.2 Unicode Design Principles would help, but:

>> 1) Section 3 'Conformance' says nothing about logical order; and
>> 2) The subsection about 'logical order' seems to assume that there
>> exists a common practice; it does not actually place any requirement
>> on this common practice.

> I don't think either of these general sections are intended to
> provide the correct or expected ordering of characters for complex
> scripts. Any preferred ordering that doesn't result by happenstance
> from normalization would presumably be described in the text of the
> script section, such as Section 16.4 Khmer, in TUS 9.0.0.

The key word here is 'preferred'.  Your reply, while not completely
clear, confirms my view that Unicode does not *specify* an overall
character ordering for Khmer, despite the section's having a BNF regexp
for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}.
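
To make that pattern concrete, here is a minimal sketch of it as a
Python regular expression, reading {x} as "optional x". The character
classes below are simplifying assumptions made for illustration (for
example, the base class simply takes the whole range U+1780..U+17B3,
and the "other signs" class is a rough selection); they are not copied
from TUS, and the names BASE, SUBSCRIPT, fits_pattern and so on are
hypothetical. The test strings are the two orderings of _khnyom_ and
the new spelling of ឱ្យ discussed elsewhere in this message.

```python
import re

# Assumed, simplified character classes (not taken verbatim from TUS).
BASE = "[\u1780-\u17B3]"           # B: consonants and independent vowels
ROBAT = "\u17CC"                   # R: KHMER SIGN ROBAT
SHIFTER = "[\u17C9\u17CA]"         # C: consonant shifters
SUBSCRIPT = "\u17D2[\u1780-\u17B3]"  # S: COENG + consonant/independent vowel
JOINER = "[\u200C\u200D]"          # Z: ZWNJ or ZWJ
VOWEL = "[\u17B6-\u17C5]"          # V: dependent vowel signs
OTHER = "[\u17C6-\u17C8\u17CB\u17CD-\u17D1\u17DD]"  # O: other signs (rough)

# B{R|C}{S{R}}*{{Z}V}{O}{S}
SYLLABLE = re.compile(
    f"{BASE}(?:{ROBAT}|{SHIFTER})?(?:{SUBSCRIPT}{ROBAT}?)*"
    f"(?:{JOINER}?{VOWEL})?{OTHER}?(?:{SUBSCRIPT})?"
)


def fits_pattern(text: str) -> bool:
    """Return True if the whole string is one syllable under the sketch."""
    return SYLLABLE.fullmatch(text) is not None


# KHA, COENG, NYO, VOWEL SIGN U, NIKAHIT
khnyom_documented = "\u1781\u17D2\u1789\u17BB\u17C6"
# KHA, NIKAHIT, COENG, NYO, VOWEL SIGN U
khnyom_alternative = "\u1781\u17C6\u17D2\u1789\u17BB"
# INDEPENDENT VOWEL QOO TYPE ONE, COENG, YO
ooy = "\u17B1\u17D2\u1799"

for label, text in [("documented", khnyom_documented),
                    ("alternative", khnyom_alternative),
                    ("ooy", ooy)]:
    print(label, fits_pattern(text))
```

Under these assumed classes the first and third strings fit the
pattern and the second does not. That shows only what the pattern
describes; it says nothing about whether the second string is
prohibited.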

>> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
>> "Ordering of Syllable Components" would lead one to believe that the
>> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
>> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
>> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.

> Richard,
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf

But as you acknowledge, the specification only covers a strict subset of
legitimate Khmer script text, even of text composed of encoded Khmer
characters. It excludes some text given in TUS Section 16.4.  Indeed,
Section 7.4 of the proposal to ICANN even excludes the new spelling of
the word ឱ្យ (ooy, give) - <U+17B1 KHMER INDEPENDENT VOWEL QOO TYPE ONE,
U+17D2 KHMER SIGN COENG, U+1799 KHMER LETTER YO>, for U+17B1 is not a
consonant!

I have ignored the logical gaps in your reply; nothing in the *Unicode*
standard prohibits or deprecates the sequence <U+1781, U+17C6, U+17D2,
U+1789, U+17BB>, even though it does not satisfy the regexp I quoted
above.
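
As a rough check that normalization does not settle the matter either,
the following sketch compares the two orderings with Python's standard
unicodedata module. The variable and function names are hypothetical,
and the result depends on the Unicode data bundled with the Python
version in use.

```python
import unicodedata

# KHA, COENG, NYO, VOWEL SIGN U, NIKAHIT
documented = "\u1781\u17D2\u1789\u17BB\u17C6"
# KHA, NIKAHIT, COENG, NYO, VOWEL SIGN U
alternative = "\u1781\u17C6\u17D2\u1789\u17BB"


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Show each character with its canonical combining class (ccc).
for ch in documented:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} "
          f"{unicodedata.name(ch)}")

print(canonically_equivalent(documented, alternative))
```

Canonical reordering only swaps adjacent marks with non-zero combining
classes, so it cannot move the consonant NYO past the other characters.
The two strings therefore remain distinct under normalization, and the
final line prints False.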

>> So, you are not alone in thinking that the COENG goes between
>> consonants. 

I do not support the heresy that COENG may only occur between
consonants.

I do wonder if the Khmer Generation Panel opened their Pali grammars.
How would they propose to write the accusative singular of nouns in
-i?  The accusative singular of non-neuter nouns ends in -iṁ, which I
would expect to be written <U+17B7 KHMER VOWEL SIGN I, U+17C6 KHMER SIGN
NIKAHIT>, which is what I perceive at the end of a line in the middle
of the second left-hand page at
http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php .  Do they
expect one to use U+17B9 KHMER VOWEL SIGN Y?  (Thai scholars once had
to resort to such an expedient.)

Richard.