Specification of Encoding of Plain Text

Mark Davis ☕️ mark at macchiato.com
Tue Jan 10 03:11:41 CST 2017

What I really wish we had is a machine-readable set of regexes for
each complex script (and for each language-script combination that
differs from the default for that script).

Such a regex R could be used for determining the well-formed ordering of
code points within words. The regex need not be for syllables, or grapheme
clusters, or any other formal construct. The *only* requirement it would
need to fulfill is that you could determine well-formed words with:

word := (R)+

That is, if R were (C V C? | V C?) then any of CVC, CVCVC, VC, V, CV would
pass the test, but CCV would fail. Ideally R would be as simple as possible
(but no simpler).
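
A minimal sketch of that test in Python, assuming each code point has
already been reduced to the single letter "C" or "V" (that classification
step, and the names R, WORD and is_well_formed, are hypothetical and only
for illustration):

```python
import re

# The toy unit from the message: R = (C V C? | V C?)
R = r"(?:CVC?|VC?)"

# word := (R)+ ; fullmatch requires the whole string to be covered.
WORD = re.compile(rf"(?:{R})+")


def is_well_formed(word: str) -> bool:
    """Return True if the string of C/V classes matches (R)+."""
    return WORD.fullmatch(word) is not None


if __name__ == "__main__":
    for candidate in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
        print(candidate, is_well_formed(candidate))
```

A real per-script regex would be written over code points or character
classes rather than over the letters C and V, but the acceptance test
would have the same shape.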


On Tue, Jan 10, 2017 at 9:06 AM, Asmus Freytag <asmusf at ix.netcom.com> wrote:

> On 1/9/2017 2:24 PM, Richard Wordingham wrote:
> > Where, if anywhere, is the encoding of plain text specified?  I am
> > particularly concerned with the arrangement of the code sequences for
> > non-spacing abstract characters once one has determined an encoding for
> > the abstract characters.
> > For example, a naive reading of TUS 9.0 Section 16.4 Subsection
> > "Ordering of Syllable Components" would lead one to believe that the
> > word _khnyom_ 'I' shall be encoded as <U+1781 KHMER LETTER KHA,
> Richard,
> the group of Khmer experts that developed the recent label generation
> rules for root zone domain names considers that ordering the only one
> supported, a specification you find here:
> https://www.icann.org/en/system/files/files/proposal-khmer-lgr-15aug16-en.pdf
> That document states:
> *7.4 Context of COENG Sign (U+17D2)*
> The sign ្ KHMER SIGN COENG (U+17D2) used for subscripting consonants must
> occur between two consonants. If it occurs between any other categories, it
> is not in a valid context so the label is not well formed. Further, the
> consonant following it must not include ឡ KHMER LETTER LA (U+17A1), ...
> So, you are not alone in thinking that the COENG goes between consonants.
> Did they just make this up? No, they followed what is laid out in the
> standard:
> Page 621 in Unicode 9.0.0, you find
> (http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf)
> *Subscript Consonants.* Subscript consonant signs differ from independent
> consonant
> characters and are called coeng (literally, “foot, leg”) after their
> subscript position. While a
> consonant character can constitute an orthographic syllable by itself, a
> subscript consonant
> sign cannot. Note that U+17A1 C khmer letter la does not have a
> corresponding subscript
> consonant sign in standard Khmer.... Subscript consonant signs are used to
> represent any
> consonant following the first consonant in an orthographic syllable.
> and on page 624:
> .... each of these [subscript consonant] signs is represented by the
> sequence of two characters: a
> special control character (U+17D2 khmer sign coeng) and a corresponding
> consonant
> character.
> with sufficient clarity (as do all the examples and tables).
> > However, on further investigation,
> > I cannot find any text that says that <U+1781, U+17C6, U+17D2, U+1789,
> > U+17BB> would not be compliant with the Unicode standard.  Have I
> > missed anything?
> In this example, your coeng operator U+17D2 is out of order: while it is
> followed by a consonant, it does not in turn immediately follow the main
> consonant, because a sign NIKAHIT is inserted in your example.
> Again, from the Root Zone LGR document we find an explicit rule:
> *7.10 Context of NIKAHIT SIGN (U+17C6)*
> The sign ្ំ KHMER SIGN NIKAHIT (U+17C6) can only be preceded by a
> consonant or a shifter or one of the subset of dependent vowels tagged
> “dependent-vowel-1” in the repertoire table (្ ្ុ), i.e. vowel signs AA and
> U.
> That would allow the NIKAHIT to be placed where you suggest, if it were
> not for the
> rule on the coeng operator (7.4).
> Now, it is a known fact that the label generation rules are slightly more
> restrictive than the rules for general text. (See also section 5 in that
> document).
> See the text on p. 622 in TUS 9.0.0 where the following *exception* is
> noted:
> "The subscript consonant signs in the Khmer script can be used to denote a
> final consonant,
> although this practice is uncommon."
> The associated example shows MAIN CONSONANT + VOWEL + NIKAHIT + COENG +
> Another exception that is noted on p. 623 is the following:
> "While these subscript consonant signs are usually attached to a consonant
> character, they
> can also be attached to an independent vowel character. Although this
> practice is relatively
> rare, it is used in one very common word, meaning “to give.”"
> Taken together, it would appear that, unless your example fits the first
> of these two exceptions,
> the NIKAHIT in it is out of order.
> (The label generation rules disallow both of these exceptions,
> in an attempt to streamline the rules, sacrificing a number of potential
> domain names. Equivalent
> rule sets for validating text would have to be more complete).
> > One might hope that the subsection about 'logical order' in TUS 9.0
> > Section 2.2 Unicode Design Principles would help, but:
> > 1) Section 3 'Conformance' says nothing about logical order; and
> > 2) The subsection about 'logical order' seems to assume that there
> > exists a common practice; it does not actually place any requirement
> > on this common practice.
> > Richard.
> I don't think either of these general sections is intended to provide the
> correct or expected ordering of characters for complex scripts. Any
> preferred ordering that doesn't result by happenstance from normalization
> would presumably be described in the text of the script section, such as
> Section 16.4 Khmer, in TUS 9.0.0.
> http://www.unicode.org/versions/Unicode9.0.0/ch16.pdf
> A./
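
The COENG context rule quoted above from section 7.4 of the LGR proposal
can be sketched as a check over code points. This is a rough illustration
only, not the LGR's actual rule set: it covers just the part of the rule
that is quoted (the quotation ends in "...", so any further excluded
consonants are not modelled), it assumes that "consonant" means the Khmer
block range U+1780..U+17A2, and the function names are hypothetical. The
second test sequence is simply a reordering of the same five code points
with the COENG placed directly after the first consonant.

```python
KHMER_SIGN_COENG = 0x17D2
KHMER_LETTER_LA = 0x17A1


def is_khmer_consonant(cp: int) -> bool:
    # Assumption: consonants are U+1780 (KA) through U+17A2 (QA).
    return 0x1780 <= cp <= 0x17A2


def coeng_context_ok(code_points: list[int]) -> bool:
    """Check the quoted part of rule 7.4 for every COENG in the sequence."""
    for i, cp in enumerate(code_points):
        if cp != KHMER_SIGN_COENG:
            continue
        # COENG must sit between two characters, so it cannot be first or last.
        if i == 0 or i == len(code_points) - 1:
            return False
        before, after = code_points[i - 1], code_points[i + 1]
        # Both neighbours must be consonants.
        if not (is_khmer_consonant(before) and is_khmer_consonant(after)):
            return False
        # The following consonant must not be LA (U+17A1).
        if after == KHMER_LETTER_LA:
            return False
    return True


if __name__ == "__main__":
    queried = [0x1781, 0x17C6, 0x17D2, 0x1789, 0x17BB]    # NIKAHIT before COENG
    reordered = [0x1781, 0x17D2, 0x1789, 0x17BB, 0x17C6]  # COENG between consonants
    print(coeng_context_ok(queried))
    print(coeng_context_ok(reordered))
```

As the reply notes, the label generation rules are stricter than the rules
for general text, so a validator for running text would also need to admit
the two exceptions cited from pp. 622 and 623 of TUS 9.0.0.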