Standardised Encoding of Text

Mark Davis ☕️ mark at
Sun Aug 9 10:10:01 CDT 2015

While it would be good to document more scripts, and more language options
per script, that is always subject to getting experts signed up to develop

What I'd really like to see instead of documentation is a data-based

For example, perhaps the addition of real data to CLDR for a
"basic-validity-check" on a language-by-language basis. It might be
possible to use a BNF grammar for the components, for which we are already
set up. For example, something like (this was a quick and dirty

$word := $syllable+;
$syllable := $B [$R $C] ($S $R?)* ($Z? $V)? $O? $S?;
# UnicodeSets
$R := [\u17CC];
$C := [<consonant shifter>];
$S := [<subscript consonant><independent vowel sign>];
$V := [<dependent vowel sign>];
$Z := [:joiner:];
$O := [...];
$B := [[:sc=khmer:]&[:L:]-$R-$C-$S-$V-$Z-$O];

The more these could use existing properties,
like Indic_Positional_Category or Indic_Syllabic_Category, the better.
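
To make the idea of an executable check concrete, here is a minimal sketch in Python that compiles the syllable rule above into a regular expression, reading each bare letter in the rule as the corresponding $-variable. It is not how CLDR does or would do this. The function names and the TOY_SETS mapping are hypothetical, and the sets are ASCII stand-ins rather than real Khmer data, since the actual sets are only outlined above.

```python
import re
from typing import Mapping


def char_class(chars: str) -> str:
    """Return a regex character class matching any one character in chars."""
    if not chars:
        raise ValueError("each set needs at least one character")
    return "[" + "".join(re.escape(c) for c in chars) + "]"


def build_word_pattern(sets: Mapping[str, str]) -> "re.Pattern[str]":
    """Compile $word := $syllable+ from the per-class character sets."""
    B, R, S, V, Z, O = (char_class(sets[k]) for k in "BRSVZO")
    # [R C] in the grammar: one character from either set.
    RC = char_class(sets["R"] + sets["C"])
    # $syllable := $B [$R $C] ($S $R?)* ($Z? $V)? $O? $S?
    syllable = f"{B}{RC}(?:{S}{R}?)*(?:{Z}?{V})?{O}?{S}?"
    return re.compile(f"(?:{syllable})+")


def is_valid_word(pattern: "re.Pattern[str]", text: str) -> bool:
    """True if the whole string is one or more well-formed syllables."""
    return pattern.fullmatch(text) is not None


# Hypothetical stand-in sets: one ASCII letter per class, NOT Khmer data.
TOY_SETS = {"B": "b", "R": "r", "C": "c", "S": "s", "V": "v", "Z": "z", "O": "o"}
TOY_PATTERN = build_word_pattern(TOY_SETS)

if __name__ == "__main__":
    for sample in ["br", "bcsrzvos", "brbc", "b", "rb", "brz"]:
        print(sample, is_valid_word(TOY_PATTERN, sample))
```

With real data, the mapping would be filled from the UnicodeSets for each language, and the rule itself would live in the data rather than in code; the sketch hard-codes it only to stay short.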

Doing this would have far more of an impact than just a textual
description, in that it could be executed by code, for at least a reference

Mark

*— Il meglio è l’inimico del bene —*

On Sun, Aug 9, 2015 at 3:58 PM, Richard Wordingham <
richard.wordingham at> wrote:

> On Sun, 9 Aug 2015 14:46:31 +0300
> "Erkki I Kolehmainen" <eik at> wrote:
> > Sorry, but I find myself having a serious problem in understanding
> > what this is about.
> In some cases the TUS lays down in detail the order of characters and
> their interpretation.  While Europeans have canonical combining classes
> to standardise the order of combining marks, lesser breeds tend not to
> receive them.  It gets even worse when combining marks are defined by
> the combination of control character(s) and what appears to be a base
> character.  For example, the order for the Khmer script was laid
> down in great detail.  Similarly, the order for Burmese was laid out in
> great detail.  However, as support for other languages was added to
> the 'Myanmar' script, the ordering rules to cover the new characters
> were not promptly laid down.
> So the question is, how does one rectify the situation where the text
> in the Unicode Standard for a script is woefully inadequate?
> Richard.