Specification of Encoding of Plain Text

Mark Davis ☕️ mark at macchiato.com
Thu Jan 12 07:12:09 CST 2017


On Tue, Jan 10, 2017 at 8:40 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Tue, 10 Jan 2017 10:11:41 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> > What I really wish we had would be a machine readable set of regexes
> > for each complex script (and for each language-script combination
> > that is different than the default for that script).
>
> What would the status of these regexes be?  For example, the Khmer
> script already has a regex for words sensu stricto, but there doesn't
> seem to be any formal requirement to conform to it or, more
> immediately usefully to users, attempt to support it if one claims to
> support Khmer.
>

I think the goal would be to provide guidance on the preferred
ordering/choice of code points for representing a particular visual order
of glyphs. That is, to help guide the usage of characters in complex
scripts.

The target wouldn't even be all scripts, but rather complex ones, where it
may not be simple to determine the ordering of code points.

And, as Asmus said, the goal would be for them to be sufficiently "detailed
to let you find out whether you are using characters as intended, or not".


> I like the idea, but it seems to have a lot of nits, which I shall now
> pick.
>

I'm sure there are plenty; those are just an opener.

>
> The regexes should also be human-comprehensible.
>

I agree that comprehension is a goal. I'd imagine using a BNF regex, like
the following. This is simple, since I'm just doing Latin, but you can see
what I mean.

word = base* ;
base = (latinLetter latinMn*) ;
latinLetter = [[:scx=Latn:]&[:L:]] ;
latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;

which turns into the single regex expression:

([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

See:
http://unicode.org/cldr/utility/bnf.jsp?a=word=base*;%0Dbase=(latinLetter+latinMn*);%0DlatinLetter=[[:scx=Latn:]%26[:L:]];%0DlatinMn=[[:scx=Latn:][:scx=Common:]%26[:Mn:]];
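
As a rough illustration of how the four rules above collapse into the
single expression, here is a minimal Python sketch that inlines rule
references textually. The function name expand_bnf and the inlining
strategy are assumptions made for illustration; this is not the code behind
the CLDR BNF utility, and it only handles non-recursive grammars.

```python
import re

BNF = """
word = base* ;
base = (latinLetter latinMn*) ;
latinLetter = [[:scx=Latn:]&[:L:]] ;
latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
"""

def expand_bnf(bnf: str, start: str) -> str:
    """Inline rule references until only one expression is left.

    Hypothetical helper: assumes 'name = body ;' statements and a
    non-recursive grammar.
    """
    rules = {}
    for statement in bnf.split(";"):
        if "=" in statement:
            # Split on the first '=' only; later ones belong to [:scx=...:].
            name, body = statement.split("=", 1)
            rules[name.strip()] = body.strip()

    # Match whole rule names only, so 'Latn' or 'Mn' are left alone.
    names = re.compile(r"\b(" + "|".join(map(re.escape, rules)) + r")\b")

    expr = rules[start]
    for _ in range(len(rules)):  # enough passes for a non-recursive grammar
        expanded = names.sub(lambda m: rules[m.group(1)], expr)
        if expanded == expr:
            break
        expr = expanded

    # Whitespace in the BNF only separates tokens, so drop it at the end.
    return "".join(expr.split())

print(expand_bnf(BNF, "word"))
```

The printed result can be compared by eye with the single expression given
above.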

A more complex script might have:

word = prefix base* postfix ;
...

One could draw on the work done in Harfbuzz and the Universal Shaping
Engine to push this along for different scripts.


> I'm dubious of the concept of each language-script combination
> potentially having a regex,


I think a language-script combination is only useful if it must vary from
the default for the script.


> or indeed of the script having a *default*
> regex.



> Would this be used to do the equivalent of saying that English
> doesn't have the letter thorn, or, for example, prohibiting most complex
> onsets from Lao?
>

And for those scripts, the goal would be to represent the core functioning
of the script. So it could be broader than what is needed for any
particular language using that script.



> > Such a regex R could be used for determining the well-formed ordering
> > of code points within words. The regex need not be for syllables, or
> > grapheme clusters, or any other formal construct. The *only*
> > requirement it would need to fulfill is that you could determine
> > well-formed words with:
>
> > word := (R)+


> > That is, if R were (C V C? | V C?) then any of CVC CVCVC VC V CV
> > would pass the test, but CCV would fail. Ideally R would be as simple
> > as possible (but no simpler).
>
> Several Indian languages only allow independent vowels word initially.
> You wouldn't be able to capture that with (R)+.
>

That was a typo; it should have been just R (which could have more complex
internal structure with repetition, as above).
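
For concreteness, the toy pattern from the quoted example can be checked
mechanically. The sketch below works on strings of the letters C and V
rather than real code points, and it keeps the (R)+ form exactly as quoted;
under the correction above, that repetition would simply live inside R
itself. The names R, WORD and is_well_formed are made up for the example.

```python
import re

# R from the quoted example: (C V C? | V C?)
R = r"(?:CVC?|VC?)"

# The quoted formulation: a word is one or more repetitions of R.
WORD = re.compile(rf"(?:{R})+")

def is_well_formed(skeleton: str) -> bool:
    """True if the whole consonant/vowel skeleton is covered by (R)+."""
    return WORD.fullmatch(skeleton) is not None

for skeleton in ["CVC", "CVCVC", "VC", "V", "CV", "CCV"]:
    print(skeleton, is_well_formed(skeleton))
```

Only CCV is rejected here, because no alternative of R can start with two
consonants.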


>
> Would the regexes be on strings or on traces (strings modulo canonical
> equivalence)?  The language recognised by the regex for the Universal
> Shaping Engine (USE) is notoriously not closed under canonical
> equivalence.
>

It's unclear to me as yet which would be the most useful.

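To make the strings-versus-traces distinction concrete, here is a small
Python sketch. The pattern TOY is a made-up rule for illustration only (it
is not taken from USE or from any script specification): it expects a dot
below before an acute. Matching raw strings rejects a canonically
equivalent spelling, while normalizing to NFD first treats both spellings
alike.

```python
import re
import unicodedata

# Hypothetical toy rule: 'a', optional COMBINING DOT BELOW (U+0323, ccc 220),
# then optional COMBINING ACUTE ACCENT (U+0301, ccc 230).
TOY = re.compile("a\u0323?\u0301?")

def matches_string(s: str) -> bool:
    """Apply the regex to the code point sequence exactly as given."""
    return TOY.fullmatch(s) is not None

def matches_trace(s: str) -> bool:
    """Apply the regex after NFD, so canonical equivalents behave alike."""
    return TOY.fullmatch(unicodedata.normalize("NFD", s)) is not None

below_first = "a\u0323\u0301"
above_first = "a\u0301\u0323"  # canonically equivalent to below_first

print(matches_string(below_first), matches_string(above_first))
print(matches_trace(below_first), matches_trace(above_first))
```

The first line printed differs between the two spellings; the second does
not, because NFD reorders the marks by combining class before matching.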

> Most non-spacing marks should not occur double - though I think the
> most significant trouble with them is with fonts that won't then show
> them double.  Barring them could make for a tricky regex.  But, if we
> applied that to the Latin script, should we allow f̂̂ (the Fourier
> transform of the Fourier transform of f) as a word?  Tibetan allows
> some non-spacing marks to occur triple.
>

There is always a choice as to how strict to make them. The goal shouldn't
be so tight as to exclude legitimate words, and trying to be too
fine-grained can make the expressions overly complicated. Moreover, there
isn't any question as to how "f̂̂ (the Fourier transform of the Fourier
transform of f)" would be spelled, so there is no need to exclude it. But
preventing spoofing wouldn't be the goal.
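
If a stricter profile ever did want to flag doubled marks, one option might
be a separate pass rather than extra machinery inside the regex. The Python
sketch below is only an assumption about how such a check could look; the
function name doubled_marks is invented here, and it would also flag the
Tibetan cases mentioned above, so it is a diagnostic rather than a
well-formedness rule.

```python
import unicodedata

def doubled_marks(text: str) -> list[tuple[int, str]]:
    """List (index, mark) for each nonspacing mark (General_Category Mn)
    that is identical to the code point immediately before it."""
    return [
        (i, ch)
        for i, (prev, ch) in enumerate(zip(text, text[1:]), start=1)
        if ch == prev and unicodedata.category(ch) == "Mn"
    ]

# f followed by two COMBINING CIRCUMFLEX ACCENT (U+0302) marks.
print(doubled_marks("f\u0302\u0302"))
# A single circumflex is not flagged.
print(doubled_marks("f\u0302"))
```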


> Richard.
>
>