Specification of Encoding of Plain Text

Thu Jan 12 14:03:29 CST 2017

That was just an example, off the top of my head, of the format to use
with regex; I don't pretend that it is vetted. Latin is not a complex
script, so it was only an illustration. However, it was just brain freeze
on my part not to include Inherited or ZWJ as well. A more serious effort
would look at some of the issues from http://unicode.org/reports/tr29/, for
example. On the other hand, CGJ is not a problem: it is Mn
<http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say) U+064B
ARABIC FATHATAN has scx=Arabic,Syriac, so it wouldn't be included.
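
To make the amended shape concrete, here is a rough Python sketch of
"word = base* ; base = latinLetter latinMark*" with ZWJ and combining
marks allowed after a letter. Python's standard library has no Script or
Script_Extensions data, so the two predicates below are crude stand-ins
based on character names (an assumption for illustration only, not the
real property values); the function names are hypothetical too.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER (gc=Cf), allowed explicitly


def is_latin_letter(ch):
    # Crude stand-in for [[:scx=Latn:]&[:L:]]: a letter (gc=L*) whose
    # character name starts with "LATIN ". This is an approximation.
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("LATIN "))


def is_latin_mark(ch):
    # Crude stand-in for "Mn marks that may follow a Latin letter":
    # gc=Mn with a name starting with "COMBINING ", plus ZWJ.
    # CGJ (U+034F) qualifies; ARABIC FATHATAN (U+064B) does not.
    if ch == ZWJ:
        return True
    return (unicodedata.category(ch) == "Mn"
            and unicodedata.name(ch, "").startswith("COMBINING "))


def is_latin_word(text):
    # word = base* ; base = latinLetter latinMark*
    # A mark is acceptable only once some letter has preceded it.
    seen_letter = False
    for ch in text:
        if is_latin_letter(ch):
            seen_letter = True
        elif seen_letter and is_latin_mark(ch):
            continue
        else:
            return False
    return True


if __name__ == "__main__":
    samples = ["Ca\u200Desar", "Llan\u034Fgollen", "s\u0304", "a\u064B"]
    for sample in samples:
        print(ascii(sample), is_latin_word(sample))
```

With these stand-ins, Richard's three examples are accepted and a Latin
letter followed by U+064B is rejected. A serious version would take the
sets from real Script_Extensions data rather than from character names.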

Mark

On Thu, Jan 12, 2017 at 7:42 PM, Richard Wordingham <
richard.wordingham at ntlworld.com> wrote:

> On Thu, 12 Jan 2017 14:12:09 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
>
> > I agree that comprehension is a goal. I'd imagine using a BNF regex,
> > like the following. This is simple, since I'm just doing Latin, but
> > you can see what I mean.
>
> > word = base* ;
> > base = (latinLetter latinMn*) ;
> > latinLetter = [[:scx=Latn:]&[:L:]] ;
> > latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> >
> > which turns into the single regex expression:
> >
> > ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*
>
> Ouch!  That's alarmingly wrong.  You've excluded the likes of
> English 'Ca‍esar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word
> doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
> of Thai SO SUEA as 's̄'.  Fixing it isn't easy.  At least, I assume
> Arabic harakat don't attach to Latin letters in your conception of
> Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
> work well.
>
> The problem may be conflicting requirements on the Script_Extensions
> property.
>
> Richard.
>
>