Specification of Encoding of Plain Text

Richard Wordingham richard.wordingham at ntlworld.com
Thu Jan 12 15:26:02 CST 2017

On Thu, 12 Jan 2017 21:03:29 +0100
Mark Davis ☕️ <mark at macchiato.com> wrote:

> That was just an example off the top of my head of the format for
> using with regex; I don't pretend that it is vetted. Latin is not a
> complex script, so it was only an illustration. However, it was just
> brain freeze on my part to not also include Inherited or ZWJ. A more
> serious effort would look at some of the issues from
> http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> is not a problem: it is Mn
> <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.

Ah, I had not appreciated that sc=Inherited does not imply
scx=Inherited. Using Script_Extensions to document the international
combining characters that are used, for example, with Thai bases could
have all sorts of undesirable knock-on effects.
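
A minimal sketch of the sc/scx distinction discussed above, in Python. The
two small tables are a hypothetical, hand-copied excerpt and not a real
library API: the scx value for U+064B is the one quoted in the message
above, and the fallback rule (a character with no explicit
Script_Extensions entry gets the set containing only its Script value) is
the default described in UAX #24. The names SC, SCX, in_sc and in_scx are
invented for illustration. A real tool would load Scripts.txt and
ScriptExtensions.txt from the Unicode Character Database.

```python
import unicodedata

# Hypothetical hand-copied excerpt of Script (sc) values.
SC = {
    0x0041: "Latin",      # LATIN CAPITAL LETTER A
    0x064B: "Inherited",  # ARABIC FATHATAN
}

# Hypothetical excerpt of explicit Script_Extensions (scx) entries.
SCX = {
    0x064B: frozenset({"Arabic", "Syriac"}),  # value quoted in the thread
}


def script(cp: int) -> str:
    """Return the Script value from the excerpt table."""
    return SC[cp]


def script_extensions(cp: int) -> frozenset:
    """Return the explicit scx set if listed, else the singleton {sc}."""
    return SCX.get(cp, frozenset({script(cp)}))


def in_sc(cp: int, name: str) -> bool:
    """Model of a regex test such as \\p{sc=name}."""
    return script(cp) == name


def in_scx(cp: int, name: str) -> bool:
    """Model of a regex test such as \\p{scx=name}."""
    return name in script_extensions(cp)


if __name__ == "__main__":
    fathatan = 0x064B
    print(in_sc(fathatan, "Inherited"))   # sc is Inherited
    print(in_scx(fathatan, "Inherited"))  # scx lists only Arabic and Syriac
    print(in_scx(fathatan, "Arabic"))
    # General_Category of U+034F COMBINING GRAPHEME JOINER, from the stdlib.
    print(unicodedata.category("\u034F"))
```

Under these assumptions a set built from scx=Latin plus scx=Inherited would
not pick up U+064B, although a set built from sc=Inherited would.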

