Specification of Encoding of Plain Text
Mark Davis ☕️
mark at macchiato.com
Fri Jan 13 03:38:30 CST 2017
If you know of combining marks whose scx values should include Thai, please
let us know.
Also, by "Latin is not a complex script" I mean it in the narrow sense I
stated, where the goal is the ordering of characters. That is, nobody would
normally wonder whether 0.5 when expressed by a sequence with U+2044
FRACTION SLASH should be written as the sequence <2, U+2044 FRACTION SLASH,
There will always be some edge cases, but the target is Tibetan or Myanmar,
not Latin or Cyrillic.
On Thu, Jan 12, 2017 at 10:26 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
> On Thu, 12 Jan 2017 21:03:29 +0100
> Mark Davis ☕️ <mark at macchiato.com> wrote:
> > That was just an example off the top of my head of the format for
> > using with regex; I don't pretend that it is vetted. Latin is not a
> > complex script, so it was only an illustration. However, it was just
> > brain freeze on my part to not also include Inherited or ZWJ. A more
> > serious effort would look at some of the issues from
> > http://unicode.org/reports/tr29/, for example. On the other hand, CGJ
> > is not a problem: it is Mn
> > <http://unicode.org/cldr/utility/character.jsp?a=034F>. And (say)
> > U+064B ARABIC FATHATAN has scx=Arabic,Syriac, so wouldn't be included.
> Ah, I had not appreciated that sc=Inherited does not imply
> scx=Inherited. Using Script_Extensions to document the international
> combining characters that are used, for example, with Thai bases could
> have all sorts of undesirable knock-on effects.
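
The General_Category claim quoted above (CGJ is Mn) can be checked with a minimal sketch like the one below. It assumes only Python's standard unicodedata module; the helper name describe is made up for illustration. The standard library does not expose Script or Script_Extensions, so the scx values mentioned in the thread are not checked here, and the results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(cp):
    """Return (name, General_Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch)

# Code points mentioned in the thread: CGJ, ARABIC FATHATAN, FRACTION SLASH.
for cp in (0x034F, 0x064B, 0x2044):
    name, gc, ccc = describe(cp)
    print(f"U+{cp:04X} {name}: gc={gc}, ccc={ccc}")
```

A pattern built on \p{Mn} would therefore match CGJ without listing it separately; whether a given mark also passes a script test depends on its scx value, which has to come from the Unicode Character Database or a library that exposes it.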