Standardised Encoding of Text

Mark Davis ☕️ mark at
Sun Aug 9 14:14:38 CDT 2015

Mark <>

*— The best is the enemy of the good —*

On Sun, Aug 9, 2015 at 7:10 PM, Richard Wordingham <
richard.wordingham at> wrote:

> On Sun, 9 Aug 2015 17:10:01 +0200
> Mark Davis ☕️ <mark at> wrote:
> > While it would be good to document more scripts, and more language
> > options per script, that is always subject to getting experts signed
> > up to develop them.
> >
> > What I'd really like to see instead of documentation is a data-based
> > approach.
> >
> > For example, perhaps the addition of real data to CLDR for a
> > "basic-validity-check" on a language-by-language basis.
> CLDR is currently not useful.  Are you really going to get Mayan time
> formats when the script is encoded? Without them, there will be no CLDR
> data.

That is a misunderstanding. CLDR provides both locale-specific (language-specific)
data for formatting, collation, etc., and data about languages themselves. It is
not limited to the first.

> > It might be
> > possible to use a BNF grammar for the components, for which we are
> > already set up.
> Are you sure?

I said "might be possible". That normally indicates a degree of
uncertainty. That is: no, I'm not sure.

There is no reason to be unnecessarily argumentative; it doesn't exactly
encourage people to explore solutions to a problem.

> Microsoft's Universal Script Engine (USE) intended design
> has a rule for well-formed syllables which essentially contains a
> fragment, when just looking at dependent vowels:
> [:InPC=Top:]*[:InPC=Bottom:]*
> Are you set up to say whether the following NFD Tibetan fragment
> conforms to it?
> The sequence of InPC values is <Bottom, Top>.  There are other examples
> around, but this is a pleasant one to think about.

> (The USE definition got more complicated when confronted with harsh
> reality.  That confrontation may have happened very early in the
> design.)
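To make the quoted fragment concrete, here is a minimal sketch of checking a character sequence against the Top\* Bottom\* pattern. The InPC table below is a tiny hand-copied excerpt for two Tibetan vowel signs (not an API lookup); a real implementation would read the UCD's IndicPositionalCategory.txt.

```python
import re

# Hand-copied Indic_Positional_Category values for two Tibetan dependent
# vowels; real data comes from IndicPositionalCategory.txt in the UCD.
INPC = {
    "\u0F72": "Top",     # TIBETAN VOWEL SIGN I
    "\u0F74": "Bottom",  # TIBETAN VOWEL SIGN U
}

def matches_use_fragment(seq: str) -> bool:
    """Check whether the InPC sequence of `seq` matches Top* Bottom*."""
    cats = [INPC.get(ch, "Other") for ch in seq]
    return re.fullmatch(r"(Top )*(Bottom )*",
                        "".join(c + " " for c in cats)) is not None
```

With this sketch, a <Top, Bottom> sequence such as U+0F72 U+0F74 passes, while the <Bottom, Top> ordering described above fails the fragment.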
> > For example, something like (this was a quick and
> > dirty transcription):
> >
> > $word := $syllable+;
> <snip>
> Martin Hosken put something like that together for the Lanna script.
> On careful inspection:
> (a) It seemed to allow almost anything;
> (b) It was not too lax.
> Much later, I have realised that
> (c) It was too strict if read as it was meant to be read, i.e. not
> literally.
> (d) It overlooked a logogram for 'elephant' that contains a marginally
> dependent vowel.
> Though it might indeed be useful in general, the formal description
> would need to be accompanied by an explanation of what was happening.
> The problem with the Lanna script is that it allows a lot of
> abbreviation, and it makes sense to store the undeleted characters in
> their normal order.  The result of this is that one often can't say a
> sequence is non-standard unless you know roughly how to pronounce it.
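As a toy illustration of the kind of quick-and-dirty grammar under discussion, a rule like `$word := $syllable+;` can be compiled into a regular expression. The consonant and vowel classes below are invented placeholders, not real script data.

```python
import re

# Invented placeholder classes -- a real description would use Unicode
# properties for the script in question.
CONSONANT = "[kctnpm]"
VOWEL = "[aeiou]"

# $syllable := $consonant $vowel ;   $word := $syllable+ ;
SYLLABLE = f"{CONSONANT}{VOWEL}"
WORD = re.compile(f"(?:{SYLLABLE})+")

def is_word(s: str) -> bool:
    """True if the whole string parses as one or more syllables."""
    return WORD.fullmatch(s) is not None
```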

I don't think any algorithmic description would capture all and only those
strings that would be acceptable to writers of the language. What you'd end
up with is a mechanism that had three values: clearly ok (e.g., cat), clearly
bogus (e.g., a\u0308\u0308\u0308\u0308), and somewhere in between.
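Such a three-valued check might be sketched as follows, with deliberately crude, hypothetical thresholds: pure ASCII letters count as clearly ok, and more than two identical consecutive combining marks count as clearly bogus.

```python
import unicodedata

def rough_validity(text: str) -> str:
    """Three-way classification sketch: 'ok', 'bogus', or 'uncertain'.

    The thresholds here are invented for illustration only.
    """
    run = 1
    for prev, cur in zip(text, text[1:]):
        # Count runs of the same combining mark repeated back to back.
        if cur == prev and unicodedata.combining(cur):
            run += 1
            if run > 2:
                return "bogus"
        else:
            run = 1
    if text.isascii() and text.isalpha():
        return "ok"
    return "uncertain"
```

So "cat" classifies as ok, "a\u0308\u0308\u0308\u0308" as bogus, and most other strings land in the uncertain middle bucket.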

> > Doing this would have far more of an impact than just a textual
> > description, in that it could be executed by code, for at least a
> > reference implementation.
> I don't like the idea of associating the description with language
> rather than script.  Imagine the trouble you'll have with Tamil
> purists.  They'll probably want to ban several consonants.  You'll end
> up needing a locale for Sanskrit in the Tamil script.

Someone was just saying "However, as support for other languages was added to
the 'Myanmar' script, the ordering rules to cover the new characters
were not promptly laid down."

If the goal for the script rules is to cover all languages customarily
written with that script, one way to do that is to develop the language
rules as they come, and make sure that the script rules are broadened if
necessary for each language. But there is also utility in having the
language rules themselves, especially for high-frequency languages.

> Richard.
