Specification of Encoding of Plain Text

Richard Wordingham richard.wordingham at ntlworld.com
Thu Jan 12 12:42:42 CST 2017


On Thu, 12 Jan 2017 14:12:09 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> I agree that comprehension is a goal. I'd imagine using a BNF regex,
> like the following. This is simple, since I'm just doing Latin, but
> you can see what I mean.

> word = base* ;
> base = (latinLetter latinMn*) ;
> latinLetter = [[:scx=Latn:]&[:L:]] ;
> latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
> 
> which turns into the single regex expression:
> 
> ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

Ouch!  That's alarmingly wrong.  You've excluded the likes of
English 'Ca‍esar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word
doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
of Thai SO SUEA as 's̄'.  Fixing it isn't easy.  At least, I assume
Arabic harakat don't attach to Latin letters in your conception of
Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
work well.
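
To make the three failures concrete, here is a minimal Python sketch that
emulates the quoted pattern using only the standard library. Python's
unicodedata module does not expose Script or Script_Extensions, so the
helper names (is_latin_letter, is_latin_mn, matches_word) and the
ASSUMED_INHERITED table are hypothetical stand-ins: the table simply
hard-codes the three code points named above as lying outside both
scx=Latn and scx=Common, as the message implies. It is not read from the
Unicode Character Database, and the values may differ between Unicode
versions.

```python
import unicodedata

# Assumption taken from the message, not from the UCD: these code points
# are treated as script Inherited, i.e. in neither scx=Latn nor scx=Common.
ASSUMED_INHERITED = {0x200D, 0x034F, 0x0304}  # ZWJ, CGJ, COMBINING MACRON


def is_latin_letter(ch: str) -> bool:
    """Rough stand-in for [[:scx=Latn:]&[:L:]], keyed on the character name."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("LATIN "))


def is_latin_mn(ch: str) -> bool:
    """Rough stand-in for [[:scx=Latn:][:scx=Common:]&[:Mn:]].

    Simplification: every Mn outside ASSUMED_INHERITED is allowed, which is
    more permissive than the real set.
    """
    return (unicodedata.category(ch) == "Mn"
            and ord(ch) not in ASSUMED_INHERITED)


def matches_word(text: str) -> bool:
    """Emulate (latinLetter latinMn*)* matched against the whole string."""
    i = 0
    while i < len(text):
        if not is_latin_letter(text[i]):
            return False
        i += 1
        while i < len(text) and is_latin_mn(text[i]):
            i += 1
    return True


samples = {
    "Caesar": "Caesar",
    "Caesar with ZWJ": "Ca\u200desar",
    "Llangollen with CGJ": "Llan\u034fgollen",
    "s with macron": "s\u0304",
}
results = {label: matches_word(text) for label, text in samples.items()}
for label, ok in results.items():
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Under these assumptions only the plain 'Caesar' is accepted. ZWJ fails
whatever its script, because its General_Category is Cf rather than Mn;
the other two fail only through the assumed script values.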

The problem may be conflicting requirements on the Script_Extensions
property.

Richard.


