Specification of Encoding of Plain Text

Richard Wordingham richard.wordingham at ntlworld.com
Tue Jan 10 13:40:13 CST 2017


On Tue, 10 Jan 2017 10:11:41 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> What I really wish we had would be a machine readable set of regexes
> for each complex script (and for each language-script combination
> that is different than the default for that script).

What would the status of these regexes be?  For example, the Khmer
script already has a regex for words sensu stricto, but there doesn't
seem to be any formal requirement to conform to it or, more
immediately useful to users, to attempt to support it if one claims
to support Khmer.

I like the idea, but it seems to have a lot of nits, which I shall now
pick.

The regexes should also be human-comprehensible.

I'm dubious of the concept of each language-script combination
potentially having a regex, or indeed of the script having a *default*
regex.  Would this be used to do the equivalent of saying that English
doesn't have the letter thorn, or, for example, prohibiting most complex
onsets from Lao? 

> Such a regex R could be used for determining the well-formed ordering
> of code points within words. The regex need not be for syllables, or
> grapheme clusters, or any other formal construct. The *only*
> requirement it would need to fulfill is that you could determine
> well-formed words with:

> word := (R)+
 
> That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV
> would pass the text, but CCV would fail. Ideally R would be as simple
> as possible (but no simpler).

Several Indian languages only allow independent vowels word initially.
You wouldn't be able to capture that with (R)+.
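
To make the (R)+ proposal and this objection concrete, here is a small
Python sketch.  It treats C and V in the quoted example as literal
symbols.  The second half assumes an invented three-symbol alphabet
(I = independent vowel, C = consonant, M = dependent vowel sign) purely
for illustration; neither pattern is a real regex for any script.

```python
import re

# The quoted example: R = (C V C? | V C?), word := (R)+
R = r"(?:CVC?|VC?)"
WORD = re.compile(rf"{R}+")

def is_word(s: str) -> bool:
    """True if the whole string is one or more repetitions of R."""
    return WORD.fullmatch(s) is not None

# Hypothetical alphabet (an assumption, not taken from any standard):
# I = independent vowel, C = consonant, M = dependent vowel sign.
# If R has to admit I at all, then (R)+ admits I in any position.
REPEATED = re.compile(r"(?:I|CM?)+")
# Restricting I to word-initial position needs a shape other than (R)+.
ANCHORED = re.compile(r"(?:I|CM?)(?:CM?)*")

if __name__ == "__main__":
    for s in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
        print(s, is_word(s))
    for s in ["ICM", "CMI"]:
        print(s, bool(REPEATED.fullmatch(s)), bool(ANCHORED.fullmatch(s)))
```

The toy (R)+ pattern accepts CMI, with the independent vowel in final
position, while the anchored pattern rejects it; the anchored pattern
is no longer of the form (R)+.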

Would the regexes be on strings or on traces (strings modulo canonical
equivalence)?  The language recognised by the regex for the Universal
Shaping Engine (USE) is notoriously not closed under canonical
equivalence.
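
A minimal Python sketch of what "not closed under canonical
equivalence" means.  The pattern below is a made-up toy, not the USE
regex: it demands a circumflex (U+0302) before a dot below (U+0323),
which is the opposite of canonical order.  The helper names are
illustrative only.

```python
import re
import unicodedata

# Toy pattern on strings: base letter, optional U+0302, optional U+0323.
TOY = re.compile("[a-z]\u0302?\u0323?")

def accepts_string(s: str) -> bool:
    """Match the code point sequence exactly as given."""
    return TOY.fullmatch(s) is not None

# One way to work on traces instead: write the pattern against NFD order
# and normalise the input to NFD before matching.
TOY_NFD = re.compile("[a-z]\u0323?\u0302?")

def accepts_trace(s: str) -> bool:
    """Match modulo canonical equivalence by normalising to NFD first."""
    return TOY_NFD.fullmatch(unicodedata.normalize("NFD", s)) is not None

if __name__ == "__main__":
    s1 = "a\u0302\u0323"                   # circumflex, then dot below
    s2 = unicodedata.normalize("NFD", s1)  # dot below, then circumflex
    print(accepts_string(s1), accepts_string(s2))
    print(accepts_trace(s1), accepts_trace(s2))
```

The two strings are canonically equivalent, yet the string-based toy
pattern accepts one and rejects the other; the NFD-based variant
treats them alike.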

Most non-spacing marks should not occur double - though I think the
most significant trouble with them is with fonts that won't then show
them double.  Barring them could make for a tricky regex.  But, if we
applied that to the Latin script, should we allow f̂̂ (the Fourier
transform of the Fourier transform of f) as a word?  Tibetan allows
some non-spacing marks to occur triple.
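
As a rough sketch of a check outside the regex itself, the Python
below measures the longest run of one identical non-spacing mark
(general category Mn) and compares it with a per-script limit.  The
function names and the max_repeat parameter are hypothetical, and the
check only sees adjacent repeats in the string as given.

```python
import unicodedata

def longest_mark_run(s: str) -> int:
    """Length of the longest run of one identical non-spacing mark."""
    best = 0
    run = 0
    prev = None
    for ch in s:
        if unicodedata.category(ch) == "Mn":
            # Extend the run only if this mark repeats the previous character.
            run = run + 1 if ch == prev else 1
        else:
            run = 0
        best = max(best, run)
        prev = ch
    return best

def within_limit(s: str, max_repeat: int = 1) -> bool:
    """True if no non-spacing mark repeats more than max_repeat times."""
    return longest_mark_run(s) <= max_repeat

if __name__ == "__main__":
    double_hat = "f\u0302\u0302"  # f with two combining circumflexes
    print(within_limit(double_hat))                # limit of one
    print(within_limit(double_hat, max_repeat=3))  # a looser, Tibetan-style limit
```

Whether the limit for a given script should be one, two or three is
exactly the kind of per-script decision the regexes would have to
encode.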

Richard.


