Specification of Encoding of Plain Text

Fri Jan 13 11:47:24 CST 2017

On Fri, 13 Jan 2017 01:34:48 -0800
Asmus Freytag <asmusf at ix.netcom.com> wrote:

> I believe that any attempt to define a "regex" that describes *all
> legal text* in a given script is a-priori doomed to failure.
> 
> Part of the problem is that writing systems work not unlike human 
> grammars in a curious mixture of pretty firm rules coupled to lists
> of exceptions. (Many texts by competent authors will contain 
> "ungrammatical" sentences that somehow work despite or because of not 
> following the standard rules). The Khmer issue that started the 
> discussion showed that there can be a single word that needs to be 
> handled exceptionally.

It's a single word in the *current* orthography for the Khmer language
in Cambodia. According to Michel Antelme, on pp. 20-21 of "Inventaire
provisoire des caractères et divers signes des écritures khmères
pré-modernes et modernes employés pour la notation du khmer, du
siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
(http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
of writing was much commoner until it was largely eliminated by a
spelling reform in the first half of the 20th century.  The Thai
Wikipedia page on the use of the Khmer script to write Thai
(https://th.wikipedia.org/wiki/อักษรขอมไทย) gives examples for final
consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
COENG NGO (ទ័្ង = ทั้ง).

> If you try to capture all the exceptions in the general rules, the
> set of rules gets complicated, but is also likely to be way too
> permissive to be useful.

If the rule set is meant to check for proper use of code points,
overgeneration is far preferable to undergeneration.
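
To make the trade-off concrete, here is a minimal Python sketch.  Both
patterns are deliberately crude stand-ins, not a real description of
Khmer syllable structure: the character ranges, the names STRICT and
PERMISSIVE, and the two test words are assumptions chosen for
illustration.  The first test word is built from the code points the
COENG VO example above is assumed to consist of (LO, VOWEL SIGN AE,
COENG, VO).

```python
import re
import unicodedata

# Simplified character classes (assumption: real rules need many more cases).
CONS = "[\u1780-\u17A2]"    # consonants KA..QA
COENG = "\u17D2"            # KHMER SIGN COENG
VOWEL = "[\u17B6-\u17C5]"   # dependent vowel signs AA..AU
SIGN = "[\u17C6-\u17D1]"    # NIKAHIT..VIRIAM, lumped together

# Undergenerating sketch: subscript consonants only directly after the base.
STRICT = re.compile(
    f"(?:{CONS}(?:{COENG}{CONS})*{VOWEL}?{SIGN}*)+"
)

# Overgenerating sketch: also allows COENG + consonant after vowel or sign.
PERMISSIVE = re.compile(
    f"(?:{CONS}(?:{COENG}{CONS})*{VOWEL}?{SIGN}*(?:{COENG}{CONS})*)+"
)


def word(*names):
    """Build a string from Unicode character names."""
    return "".join(unicodedata.lookup(n) for n in names)


# Final consonant written as COENG VO after the vowel sign.
final_coeng = word("KHMER LETTER LO", "KHMER VOWEL SIGN AE",
                   "KHMER SIGN COENG", "KHMER LETTER VO")

# A conventionally ordered word: KHA, COENG MO, AE, RO.
conventional = word("KHMER LETTER KHA", "KHMER SIGN COENG",
                    "KHMER LETTER MO", "KHMER VOWEL SIGN AE",
                    "KHMER LETTER RO")

for label, text in (("final coeng", final_coeng),
                    ("conventional", conventional)):
    print(label,
          "strict:", STRICT.fullmatch(text) is not None,
          "permissive:", PERMISSIVE.fullmatch(text) is not None)
```

The strict pattern rejects a spelling that is attested, which is the
kind of failure that matters when validating code point usage.  The
permissive one accepts it, at the price of also admitting sequences
that no orthography uses.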

> The Khmer LGR for the Root Zone, for example, deliberately disallows
> the exception (in the word for "give") so that it can be stated (a)
> more compactly and (b) does not allow the exceptional sequencing of
> certain characters to become applicable outside the single exception.
> 
> An LGR is concerned with *single* instances of each word. Even the
> most common word in a language can only be registered once in each
> zone.

A label does not have to be a single word.  For example, there are
several, if not many, domain names matching give*.com, where the first
element is clearly the word 'give'.

> Even if the BNFs did nothing more than capture succinctly the 
> information presented in text and tables, they would be useful.

> For scripts where things like ZWJ and CGJ are optional, it doesn't
> make sense to run them into the standard BNF - that just messes
> things up. It is much more useful to provide generic context
> information of how to add them to existing text.

> For example, the CGJ is really intended to go between letters. So, 
> describe that context.

It can be quite useful next to combining marks.  For example, it may be
used to distinguish a diaeresis from an umlaut mark in Fraktur.
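
A small Python sketch of why this works at the encoding level.  It
assumes the convention that the bare U+0308 stands for the umlaut and
CGJ followed by U+0308 stands for the diaeresis; the variable names are
only illustrative.  The point shown is that the two sequences stay
distinct under normalization, because CGJ (U+034F) has combining class
0 and keeps the mark from composing with the base letter.

```python
import unicodedata

CGJ = "\u034F"         # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"   # COMBINING DIAERESIS

# Assumed convention: bare mark = umlaut, CGJ + mark = diaeresis.
umlaut = "a" + DIAERESIS
diaeresis = "a" + CGJ + DIAERESIS

nfc_umlaut = unicodedata.normalize("NFC", umlaut)
nfc_diaeresis = unicodedata.normalize("NFC", diaeresis)

# The bare sequence composes to U+00E4; the CGJ sequence is left alone.
print([f"U+{ord(c):04X}" for c in nfc_umlaut])
print([f"U+{ord(c):04X}" for c in nfc_diaeresis])

# The distinction also survives decomposition.
same_after_nfd = (unicodedata.normalize("NFD", nfc_umlaut)
                  == unicodedata.normalize("NFD", nfc_diaeresis))
print("equal after NFD:", same_after_nfd)
```

Here CGJ sits between a letter and a combining mark, not between two
letters, so a context rule that only allows it between letters would
exclude this use.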

Richard.