Standardised Encoding of Text
Richard Wordingham
richard.wordingham at ntlworld.com
Sun Aug 9 12:10:14 CDT 2015
On Sun, 9 Aug 2015 17:10:01 +0200
Mark Davis ☕️ <mark at macchiato.com> wrote:
> While it would be good to document more scripts, and more language
> options per script, that is always subject to getting experts signed
> up to develop them.
>
> What I'd really like to see instead of documentation is a data-based
> approach.
>
> For example, perhaps the addition of real data to CLDR for a
> "basic-validity-check" on a language-by-language basis.
CLDR is currently not useful. Are you really going to get Mayan time
formats when the script is encoded? Without them, there will be no CLDR
data. I would like to add data to a Pali in Thai script locale (or two
- there are two Thai-script Pali writing systems, one with an implicit
vowel and another without) to get proper word- and line-breaking.
However, I'm stymied because the basic requirements for a locale are
beyond me.
It's telling that, the last time I looked, there was no Latin locale. I
don't know the usage of the administration of the Church of Rome, which
appears to be what CLDR wants for Latin. (My first degree
was conferred in Latin, and it wasn't conferred in Rome.) Fortunately,
one doesn't need that for a Latin spell-checker, and the default word-
and line-breaking work well enough.
Until someone sets up locale data for Tai Khuen (or Tai Lue), we
probably won't have a locale to store Lanna script rules with.
> It might be
> possible to use a BNF grammar for the components, for which we are
> already set up.
Are you sure? The intended design of Microsoft's Universal Shaping
Engine (USE) has a rule for well-formed syllables which, looking only
at dependent vowels, essentially contains the fragment:
[:InPC=Top:]*[:InPC=Bottom:]*
Are you set up to say whether the following NFD Tibetan fragment
conforms to it?
Example: <U+0F71 TIBETAN VOWEL SIGN AA, U+0F72 TIBETAN VOWEL SIGN I>
The sequence of InPC values is <Bottom, Top>. There are other examples
around, but this is a pleasant one to think about.
(The USE definition got more complicated when confronted with harsh
reality. That confrontation may have happened very early in the
design.)
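The Tibetan question above can be checked mechanically. What follows is
only a minimal sketch: Python's unicodedata module does not expose InPC,
so the two InPC values are hard-coded from the example given above, and
the names INPC, CODE, FRAGMENT and conforms are made up for the
illustration. It is not a description of how USE or CLDR do anything.

```python
import re
import unicodedata

# InPC (Indic_Positional_Category) values for the two characters in the
# example, as stated in the message. Hard-coded assumption: unicodedata
# has no InPC lookup, so this table covers only these two characters.
INPC = {
    "\u0f71": "Bottom",  # TIBETAN VOWEL SIGN AA
    "\u0f72": "Top",     # TIBETAN VOWEL SIGN I
}

# One letter per InPC value, so the fragment becomes an ordinary regex.
CODE = {"Top": "T", "Bottom": "B"}

# Stand-in for [:InPC=Top:]*[:InPC=Bottom:]*
FRAGMENT = re.compile(r"T*B*")


def conforms(text):
    """Return True if the InPC sequence of text matches Top* Bottom*."""
    encoded = "".join(CODE[INPC[ch]] for ch in text)
    return FRAGMENT.fullmatch(encoded) is not None


sample = "\u0f71\u0f72"  # <U+0F71, U+0F72>

# The InPC sequence of the sample, in encoded order.
print([INPC[ch] for ch in sample])

# Whether the sample conforms to the fragment.
print(conforms(sample))

# Whether the sample is already in NFD; the canonical combining classes
# of the two marks decide which order normalisation keeps.
print(unicodedata.normalize("NFD", sample) == sample)
print([unicodedata.combining(ch) for ch in sample])
```

The sample is rejected because a Bottom mark precedes a Top mark, yet
that is the order normalisation preserves, so swapping the two
characters to satisfy the fragment would leave the text unnormalised.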
> For example, something like (this was a quick and
> dirty transcription):
>
> $word := $syllable+;
<snip>
Martin Hosken put something like that together for the Lanna script.
On careful inspection:
(a) It seemed to allow almost anything;
(b) It was not too lax.
Much later, I have realised that
(c) It was too strict if read as it was meant to be read, i.e. not
literally.
(d) It overlooked a logogram for 'elephant' that contains a marginally
dependent vowel.
Though it might indeed be useful in general, the formal description
would need to be accompanied by an explanation of what was happening.
The problem with the Lanna script is that it allows a lot of
abbreviation, and it makes sense to store the undeleted characters in
their normal order. The result of this is that one often can't say a
sequence is non-standard unless one knows roughly how to pronounce it.
> Doing this would have far more of an impact than just a textual
> description, in that it could be executed by code, for at least a
> reference implementation.
I don't like the idea of associating the description with language
rather than script. Imagine the trouble you'll have with Tamil
purists. They'll probably want to ban several consonants. You'll end
up needing a locale for Sanskrit in the Tamil script.
Richard.