Specification of Encoding of Plain Text

Asmus Freytag asmusf at ix.netcom.com
Tue Jan 10 15:12:47 CST 2017


On 1/10/2017 12:44 PM, Richard Wordingham wrote:
> On Tue, 10 Jan 2017 00:06:05 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
> I'll take your last point first.
>
>>> One might hope that the subsection about 'logical order' in TUS 9.0
>>> Section 2.2 Unicode Design Principles would help, but:
>   
>>> 1) Section 3 'Conformance' says nothing about logical order; and
>>> 2) The subsection about 'logical order' seems to assume that there
>>> exists a common practice; it does not actually place any requirement
>>> on this common practice.
>   
>> I don't think either of these general sections is intended to
>> provide the correct or expected ordering of characters for complex
>> scripts. Any preferred ordering that doesn't result by happenstance
>> from normalization would presumably be described in the text of the
>> script section, such as Section 16.4 Khmer, in TUS 9.0.0.
> The key word here is 'preferred'.  Your reply, while not completely
> clear, confirms my view that Unicode does not *specify* an overall
> character ordering for Khmer, despite the section's having a BNF regexp
> for Khmer syllables - B{R|C}{S{R}}*{{Z}V}{O}{S}.

You are possibly misreading my use of the word "preferred".
>
>>> For example, a naive reading of TUS 9.0 Section 16.4 Subsection
>>> "Ordering of Syllable Components" would lead one to believe that the
>>> word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
>>> U+17D2 KHMER SIGN COENG, U+1789 KHMER LETTER NYO, U+17BB KHMER VOWEL
>>> SIGN U, U+17C6 KHMER SIGN NIKAHIT>.
>> Richard,
>> the group of Khmer experts that developed the recent label generation
>> rules for root zone domain names considers that ordering the only one
>> supported; you can find their specification here:
>> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
> But as you acknowledge, the specification only covers a strict subset of
> legitimate Khmer script text, even of text composed of encoded Khmer
> characters.

The advantage of the text I brought to your attention is that it is
formalized and was created with local expertise. The disadvantage,
from your perspective, is that its scope does not match your intended
use case.

> It excludes some text given in TUS Section 16.4.  Indeed,
> Section 7.4 of the proposal to ICANN even excludes the new spelling of
> the word ឱ្យ (ooy, give) - <U+17B1 KHMER INDEPENDENT VOWEL QOO TYPE ONE,
> U+17D2 KHMER SIGN COENG, U+1799 KHMER LETTER YO>, for U+17B1 is not a
> consonant!
>
> I have ignored the logical gaps in your reply; nothing in the *Unicode*
> standard prohibits or deprecates the sequence <U+1781, U+17C6, U+17D2,
> U+1789, U+17BB>, even though it does not satisfy the regexp I quoted
> above.
Unicode clearly doesn't forbid most sequences in complex scripts, even
if they cannot be expected to render properly and would otherwise stump
the native reader.

However, the descriptions are detailed enough to let you find out
whether or not you are using characters as intended.
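
As a rough illustration of that kind of check, the sketch below tests the
sequences discussed in this thread against the syllable pattern quoted
above, B{R|C}{S{R}}*{{Z}V}{O}{S}. The character classes are simplified
assumptions based on the layout of the Khmer block, not the definitions
from TUS 16.4, and the name matches_syllable is hypothetical. It is not a
validator for Khmer text in general.

```python
import re

# Assumed, simplified character classes for the Khmer block (U+1780..U+17FF).
# They approximate the classes named in TUS 16.4 and are not the official
# definitions.
BASE = "[\u1780-\u17B3]"        # B: consonants and independent vowels
ROBAT = "\u17CC"                # R: KHMER SIGN ROBAT
SHIFTER = "[\u17C9\u17CA]"      # C: MUUSIKATOAN, TRIISAP
COENG = "\u17D2"                # KHMER SIGN COENG
SUBSCRIPT = COENG + BASE        # S: COENG + consonant or independent vowel
VOWEL = "[\u17B6-\u17C5]"       # V: dependent vowel signs
JOINER = "[\u200C\u200D]"       # Z: ZWNJ or ZWJ
OTHER = "[\u17C6-\u17C8\u17CB\u17CD-\u17D1\u17DD]"  # O: other signs (assumed set)

# B{R|C}{S{R}}*{{Z}V}{O}{S}, reading {x} as "optional" and {x}* as "zero or more".
SYLLABLE = re.compile(
    f"{BASE}(?:{ROBAT}|{SHIFTER})?(?:{SUBSCRIPT}{ROBAT}?)*"
    f"(?:{JOINER}?{VOWEL})?{OTHER}?(?:{SUBSCRIPT})?"
)

def matches_syllable(text: str) -> bool:
    """Return True if the whole string is one syllable under the sketch above."""
    return SYLLABLE.fullmatch(text) is not None

khnyom_documented = "\u1781\u17D2\u1789\u17BB\u17C6"  # KHA, COENG, NYO, U, NIKAHIT
khnyom_alternate = "\u1781\u17C6\u17D2\u1789\u17BB"   # KHA, NIKAHIT, COENG, NYO, U
ooy = "\u17B1\u17D2\u1799"                            # QOO TYPE ONE, COENG, YO

print(matches_syllable(khnyom_documented))
print(matches_syllable(khnyom_alternate))
print(matches_syllable(ooy))
```

Under these assumptions the first and third sequences fit the pattern and
the second does not. A failed match only says that a sequence falls
outside the described pattern; it does not make the sequence invalid
Unicode.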
>
>>> So, you are not alone in thinking that the COENG goes between
>>> consonants.
> I do not support the heresy that COENG may only occur between
> consonants.
Remember, I gave you the scope for that. Your example is well taken, but
it comes from a different scope, one where explicitly accounting for some
other sequences is necessary. No disagreement.

A./
>
> I do wonder if the Khmer Generation Panel opened their Pali grammars.
> How would they propose to write the accusative singular of nouns in
> -i?  The accusative singular of non-neuter nouns ends in -iṁ, which I
> would expect to be written <U+17B7 KHMER VOWEL SIGN I, U+17C6 KHMER SIGN
> NIKAHIT>, which is what I perceive at the end of a line in the middle
> of the second left-hand page at
> http://watkhemararatanaram.org/tipitaka/viney_beidok_05b.php .  Do they
> expect one to use U+17B9 KHMER VOWEL SIGN Y?  (Thai scholars once had
> to resort to such an expedient.)
>
> Richard.
>
>


