Specification of Encoding of Plain Text

Asmus Freytag asmusf at ix.netcom.com
Fri Jan 13 12:27:35 CST 2017


On 1/13/2017 9:47 AM, Richard Wordingham wrote:
> On Fri, 13 Jan 2017 01:34:48 -0800
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>
>> I believe that any attempt to define a "regex" that describes *all
>> legal text* in a given script is a-priori doomed to failure.
>>
>> Part of the problem is that writing systems work not unlike human
>> grammars in a curious mixture of pretty firm rules coupled to lists
>> of exceptions. (Many texts by competent authors will contain
>> "ungrammatical" sentences that somehow work despite or because of not
>> following the standard rules). The Khmer issue that started the
>> discussion showed that there can be a single word that needs to be
>> handled exceptionally.
> It's a single word in the *current* orthography for the Khmer language
> in Cambodia. According to Michel Antelme, on pp20-1 of "Inventaire
> provisoire des caractères et divers signes des écritures khmères
> pré-modernes et modernes employés pour la notation du khmer, du
> siamois, des dialectes thaïs méridionaux, du sanskrit et du pāli"
> (http://aefek.free.fr/iso_album/antelme_bis.pdf), this manner
> of writing was much commoner until it was largely eliminated by a
> spelling reform in the first half of the 20th century.

This points to another interesting issue. A number of languages have 
seen orthographic reforms that affect the use of complex scripts.

Now then, a decision: do you support both the old and the new style in 
the same rule set? If vestiges of the old style remain in general use, 
you may not have a choice; but what if the rules for old and new (or 
for different languages in the same script) actually conflict?

>   The Thai
> Wikipedia page on the use of the script for Thai
> (https://th.wikipedia.org/wiki/อักษรขอมไทย) gives examples for final
> consonants with COENG VO (លែ្វ = แล้ว), COENG NO (បេ្ន = เป็น) and
> COENG NGO (ទ័្ង​ = ทั้ง).

In the case that I cited, that combination of language and script was 
taken as out of scope for other reasons. For general text, though: are 
there situations where you'd want a separate set of rules for each 
language?
>
>> If you try to capture all the exceptions in the general rules, the
>> set of rules gets complicated, but is also likely to be way too
>> permissive to be useful.
> If it is checking for proper use of code points, overgeneration is far
> preferable to undergeneration.

Agreed. For modeling general text you don't want to exclude anything 
that can actually occur. The question, then, is what you *can* exclude.

If you think of spell-checking as a scenario, overgeneration is not 
acceptable. Instead, you have a standard dictionary that covers the 
"general vocabulary", and a well-defined mechanism that allows the 
user to add "exceptions".
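
Something like this minimal sketch (in Python; the dictionary 
contents and names are made-up illustrations, not any real 
spell-checker's API):

    # Standard dictionary plus user exceptions: a word is accepted only
    # if it appears in one of the two sets, so the checker never
    # overgenerates.
    standard_dictionary = {"give", "given", "giving"}  # "general vocabulary"
    user_exceptions = set()                            # user-added words

    def add_exception(word):
        """The well-defined mechanism for the user to add exceptions."""
        user_exceptions.add(word)

    def is_accepted(word):
        return word in standard_dictionary or word in user_exceptions

    add_exception("givenness")
    print(is_accepted("givenness"))  # True, but only via the exception list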

My point is that you cannot design a ruleset without having a very 
well-defined use case. If you divide the rules into "building blocks", 
then it may be easier to address different use cases than if you 
simply provide a "maximally permissive" set of rules.
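
To make the "building blocks" idea concrete, a sketch (the character 
classes below are rough Khmer-flavored placeholders chosen only to 
illustrate the mechanism, not an actual description of any 
orthography):

    import re

    # Named building blocks that different rule sets can share.
    BLOCKS = {
        "C":     "[\u1780-\u17A2]",        # base consonant (placeholder)
        "V":     "[\u17B6-\u17C5]",        # dependent vowel (placeholder)
        "COENG": "\u17D2[\u1780-\u17A2]",  # coeng + consonant (placeholder)
    }

    def build(pattern):
        """Expand block names such as {C} into their regex fragments."""
        return re.compile(pattern.format(**BLOCKS))

    # Different use cases compose the same blocks differently:
    strict_syllable = build("{C}(?:{COENG}){{0,2}}{V}?")  # e.g. registry rules
    permissive_run  = build("(?:{C}|{COENG}|{V})+")       # e.g. general text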

I'm skeptical that a one-size-fits-all set of rules can be devised 
that would also be useful.

For rules that strongly err on the side of overgeneration, it might make 
more sense to simply define the few contexts that are deemed 
impermissible and set the rest to "anything goes".
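
In code, that reduces to a short deny-list; a sketch (both 
impermissible contexts below are invented for illustration, not real 
orthographic rules):

    import re

    # "Anything goes" except for a few explicitly impermissible contexts.
    IMPERMISSIBLE = [
        re.compile("\u17D2\u17D2"),  # two coeng signs in a row (invented)
        re.compile("^\u17D2"),       # text starting with a coeng (invented)
    ]

    def is_permitted(text):
        return not any(p.search(text) for p in IMPERMISSIBLE)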
>
>> The Khmer LGR for the Root Zone, for example, deliberately disallows
>> the exception (in the word for "give") so that it can be stated (a)
>> more compactly and (b) does not allow the exceptional sequencing of
>> certain characters to become applicable outside the single exception.
>>
>> An LGR is concerned with *single* instances of each word. Even the
>> most common word in a language can only be registered once in each
>> zone.
> A label does not have to be a single word.  For example, there are
> several, if not many, domain names matching give*.com, where the first
> element is clearly the word 'give'.

Correct, but each compound can still occur only once. I cite this 
example only because the local body that drafted the rules decided 
that this was a reasonable tradeoff (complexity vs. generality) for 
the purpose of top-level domain names (i.e., ".give*", not 
"give*.com").

For that application, complexity has a relatively high negative weight 
associated with it, and complete coverage, while desirable, is not given 
the same high positive weight that it would have in describing ordinary 
text.

>> Even if the BNFs did nothing more than capture succinctly the
>> information presented in text and tables, they would be useful.
>> For scripts where things like ZWJ and CGJ are optional, it doesn't
>> make sense to run them into the standard BNF - that just messes
>> things up. It is much more useful to provide generic context
>> information of how to add them to existing text.
>> For example, the CGJ is really intended to go between letters. So,
>> describe that context.

(I forgot to make clear that this was a bit of a hypothetical.)

> It can be quite useful next to combining marks.  For example, it may be
> used to distinguish a diaeresis from an umlaut mark in Fraktur.

Even if it is intended to go anywhere, even between digits, symbols, 
and punctuation, it's much easier to describe that behavior separately 
than to try to insert it at every location in every regex. What I'm 
thinking of is a description that gives a "skeleton word" and then 
states that this skeleton can be decorated (or whatever your preferred 
term is) by inserting a CGJ anywhere.

The same goes for ZWJ/ZWNJ in any script where they don't have a 
recognized specific effect in particular sequences.
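
A sketch of that approach (Python; the skeleton pattern is a stand-in, 
not a rule for any actual script): instead of weaving an optional 
CGJ/ZWJ/ZWNJ into every position of every regex, strip the decorations 
and match the skeleton against what remains.

    import re

    CGJ, ZWJ, ZWNJ = "\u034F", "\u200D", "\u200C"
    DECORATIONS = {CGJ, ZWJ, ZWNJ}  # characters allowed "anywhere", where
                                    # they have no script-specific effect

    skeleton = re.compile("[a-z]+")  # stand-in for a real skeleton-word rule

    def matches_decorated(word):
        """Match against the skeleton after removing the decorations."""
        stripped = "".join(ch for ch in word if ch not in DECORATIONS)
        return skeleton.fullmatch(stripped) is not None

    print(matches_decorated("ab\u034Fc"))  # True: a CGJ may appear anywhere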


