Specification of Encoding of Plain Text

Asmus Freytag asmusf at ix.netcom.com
Fri Jan 13 03:34:48 CST 2017

I believe that any attempt to define a "regex" that describes *all legal 
text* in a given script is a priori doomed to failure.

Part of the problem is that writing systems, not unlike human 
grammars, work as a curious mixture of fairly firm rules coupled with 
lists of exceptions. (Many texts by competent authors contain 
"ungrammatical" sentences that somehow work despite, or because of, not 
following the standard rules.) The Khmer issue that started this 
discussion showed that there can be a single word that needs to be 
handled exceptionally.

If you try to capture all the exceptions in the general rules, the set 
of rules not only becomes complicated but is also likely to be far too 
permissive to be useful.

The Khmer LGR for the Root Zone, for example, deliberately disallows the 
exception (in the word for "give") so that the rules (a) can be stated 
more compactly and (b) do not allow the exceptional sequencing of certain 
characters to become applicable outside that single word.

An LGR is concerned with *single* instances of each word: even the most 
common word in a language can be registered only once in each zone. 
Therefore, such a drastic treatment is a perfectly good solution. For a 
rendering engine, you'd want to be much more permissive, perhaps even 
attempt to display patently "wrong" sequences. For a validation tool 
(a spell checker, say) you would aim for some other sweet spot. Finally, 
to determine the "first word" or "first syllable" for formatting purposes 
(such as "drop caps") yet another selection may apply.

As a result, I believe it would be most useful if a regex or BNF could 
be created for the "typical" / "idealized" description of a "word" in 
the various scripts.
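To make the idea concrete, here is a minimal sketch of what such an 
"idealized" description might look like, expressed as a Python regex for 
a deliberately simplified Khmer-style orthographic syllable. The 
character classes and the structure (base, optional subscripts, optional 
vowel and sign) are illustrative assumptions for this example, not the 
normative Khmer rules or the actual Root Zone LGR:

```python
import re

# Illustrative building blocks for a simplified Khmer orthographic
# syllable; ranges are assumptions for this sketch, not normative.
BASE = "[\u1780-\u17A2\u17A5-\u17B3]"   # consonant or independent vowel
COENG = "\u17D2[\u1780-\u17A2]"         # coeng (subscript) consonant
VOWEL = "[\u17B6-\u17C5]"               # dependent vowel sign
SIGN = "[\u17C6-\u17D1]"                # various diacritic signs

# base, up to two subscripts, optional vowel sign, optional sign
SYLLABLE = re.compile(f"{BASE}(?:{COENG}){{0,2}}(?:{VOWEL})?(?:{SIGN})?")

# First syllable of "Khmer": KHA + coeng MO + vowel sign AE
print(bool(SYLLABLE.fullmatch("\u1781\u17D2\u1798\u17C2")))  # True
# A coeng sequence with no base consonant should not match
print(bool(SYLLABLE.fullmatch("\u17D2\u1798")))              # False
```

Exactly as argued above, a pattern like this captures the "typical" 
shape compactly, and the exceptional cases would then be layered on 
separately rather than folded into it.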

Then, depending on the facts in question, the BNF could be augmented 
with more or less formalized descriptions of variations, exceptions, etc.

The idea would be to provide "building blocks" that can be used to 
assemble rules tailored to various scenarios by the reader of the 
standard. (Because of that, they should be part of the description 
section, not a data file...)
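One way to picture these "building blocks" is as named fragments that a 
reader recombines per scenario. The sketch below is hypothetical (the 
block names, the template helper, and the toy `[a-z]` letter class are 
all invented for illustration); it shows a strict rule set, as a 
registry might use, next to a permissive one, as a renderer might use, 
assembled from the same pieces:

```python
import re

# Hypothetical "building blocks": named regex fragments.
BLOCKS = {
    "letter": "[a-z]",     # toy letter class for illustration
    "joiner": "\u200D",    # ZWJ, optional in many scripts
}

def assemble(template: str, blocks: dict) -> re.Pattern:
    """Fill {name} placeholders in a template with block fragments."""
    return re.compile(template.format(**blocks))

# Strict variant (e.g. registration): no joiners allowed.
strict = assemble("{letter}+", BLOCKS)
# Permissive variant (e.g. rendering): ZWJ permitted between letters.
permissive = assemble("{letter}(?:{joiner}?{letter})*", BLOCKS)

print(bool(strict.fullmatch("ab\u200Dc")))      # False
print(bool(permissive.fullmatch("ab\u200Dc")))  # True
```

The point is that both rule sets share the same vocabulary of blocks, so 
the reader of the standard tailors the assembly, not the blocks.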

Even if the BNFs did nothing more than capture succinctly the 
information presented in text and tables, they would be useful.

For scripts where things like ZWJ and CGJ are optional, it doesn't make 
sense to fold them into the standard BNF - that just obscures things. It 
is much more useful to provide generic context information about how to 
add them to existing text.

For example, the CGJ is really intended to go between letters. So, 
describe that context.
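A context description like that translates directly into a local check. 
The sketch below (the function name and the use of general category L* 
as the definition of "letter" are my assumptions, not anything 
normative) validates exactly the stated context: every CGJ must sit 
between two letters:

```python
import unicodedata

CGJ = "\u034F"  # U+034F COMBINING GRAPHEME JOINER

def cgj_in_letter_context(text: str) -> bool:
    """Check the local context of each CGJ: it should appear between
    two letters (general category starting with 'L'). An illustrative
    sketch of a per-character context rule, not a global grammar."""
    for i, ch in enumerate(text):
        if ch == CGJ:
            if i == 0 or i == len(text) - 1:
                return False
            before = unicodedata.category(text[i - 1])
            after = unicodedata.category(text[i + 1])
            if not (before.startswith("L") and after.startswith("L")):
                return False
    return True

print(cgj_in_letter_context("a\u034Fb"))  # True
print(cgj_in_letter_context("a\u034F "))  # False
```

Note how the rule stays attached to the character it governs, which is 
exactly the "local context" style of description advocated below.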

Overall, describing the local contexts for a given character or class of 
characters has proven to be more useful in the LGR project than 
attempting to write global rules.


On 1/13/2017 1:02 AM, Richard Wordingham wrote:
> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>> Latin is not a complex script,...
> Unlike the common script, which notably has U+2044 FRACTION SLASH.
> That statement is actually dubious from a typographical point of view.
>> ...so it was only an illustration.
> But it's good for looking for the non-obvious issues.
>> A more serious effort would look at some of the issues from
>> http://unicode.org/reports/tr29/, for example.
> I don't think we want to have to repeat them all for each script.
> Putting common-script punctuation and numbers in the regex will add
> obscurity, and possibly be a maintainability issue.
> Richard.
