One encoding per shape (was Re: Long standing problem with Vedic tone markers and post-base visarga/anusvara)

Richard Wordingham via Unicode unicode at unicode.org
Wed Jan 1 19:05:26 CST 2020


On Wed, 1 Jan 2020 20:11:04 +0000
James Kass via Unicode <unicode at unicode.org> wrote:

> On 2020-01-01 11:17 AM, Richard Wordingham via Unicode wrote:
> 
>  > That's exactly the sort of mess that jack-booted renderers are
>  > trying to minimise.  Their principle is that there should be only
>  > one encoding per shape, though to be fair:
>  >
>  > 1) some renderers accept canonical equivalents.
>  > 2) tolerance may be allowed for ligating (ZWJ, ZWNJ, CGJ),
>  > collating (CGJ, SHY) and line-breaking controls (SHY, ZWSP, WJ).
>  > 3) Superseded chillu encodings are still supported.  
> 
> There was never any need for atomic chillu form characters.  

> The 
> principle of only one encoding per shape is best achieved when every 
> shape gets an atomic encoding.

I should have written 'per word shape'.  I should also have added that
most renderers attempt to handle Mongolian, even though its encoding
represents Middle Mongolian phonetics rather than characters.  Also,
they don't attempt to sort out the per-language subsets of the Arabic
script, which leads to a bad mess at Wiktionary when Unicode characters
differ only in a few forms.
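
To make points 1 and 3 of the quoted list concrete, here is a minimal
Python sketch using only the standard unicodedata module.  The helper
name canonically_equivalent is purely illustrative and is not taken
from any renderer; it compares NFD forms.  The sample strings are my
own choice of examples: a Devanagari and a Bengali pair that are
canonically equivalent, and the atomic Malayalam chillu N against the
older NA + virama + ZWJ sequence, which normalisation does not unify.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if the two strings have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# DEVANAGARI LETTER QA versus KA + NUKTA: canonically equivalent.
qa_atomic = "\u0958"
qa_sequence = "\u0915\u093C"

# BENGALI VOWEL SIGN O versus VOWEL SIGN E + VOWEL SIGN AA: canonically equivalent.
bengali_o_atomic = "\u0995\u09CB"
bengali_o_sequence = "\u0995\u09C7\u09BE"

# MALAYALAM LETTER CHILLU N versus the older NA + VIRAMA + ZWJ spelling:
# two encodings for one shape that are NOT canonically equivalent.
chillu_n_atomic = "\u0D7B"
chillu_n_sequence = "\u0D28\u0D4D\u200D"

for label, a, b in [
    ("Devanagari qa", qa_atomic, qa_sequence),
    ("Bengali ko", bengali_o_atomic, bengali_o_sequence),
    ("Malayalam chillu n", chillu_n_atomic, chillu_n_sequence),
]:
    print(label, canonically_equivalent(a, b))
```

A renderer that accepts canonical equivalents gets the first two pairs
for free from normalisation; the chillu pair has to be supported
explicitly if both spellings are to produce the same shape.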

> Glyph-based encoding is incompatible 
> with Unicode character encoding principles.

Visual encoding sometimes works - phonetic order for Thai is so
complicated that it is unsurprising that its definition is partly
missing from Unicode 1.0.  The official history hides behind
incompatibility with the Thai national standard, but phonetic order was
simply too complicated for Thai.  Additionally, Thais don't agree on
where preposed vowels go relative to Pali consonant clusters - they
don't agree that all of them should appear in the middle of the
cluster.  (I suppose the positioning rule could have been made a
stylistic feature of fonts.)
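
The difference between visual and phonetic order can be seen directly
in the stored code points.  The following is a small sketch, again
using only Python's unicodedata module; the helper describe and the two
sample syllables are assumptions made for illustration.  Thai stores a
preposed vowel before the consonant it is pronounced after, whereas
Devanagari stores the consonant first and leaves the reordering to the
renderer.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """List each code point of text, in storage order, with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Thai "ke": the vowel SARA E is displayed on the left and is also stored first.
thai_ke = "\u0E40\u0E01"

# Devanagari "ki": VOWEL SIGN I is displayed on the left but is stored second.
deva_ki = "\u0915\u093F"

for sample in (thai_ke, deva_ki):
    for line in describe(sample):
        print(line)
    print()

# The Thai preposed vowel is an ordinary letter (general category Lo),
# while the Devanagari one is a spacing combining mark (Mc).
print(unicodedata.category("\u0E40"), unicodedata.category("\u093F"))
```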

An analogue is Lao collation.  While syllable boundaries can
overwhelmingly be discerned in modern Lao, Lao collations are too
complicated to be accepted for ICU if they are to support anything but
single syllables.  CLDR collation (interpreted as a specification with
the normal use of specification language for the form of definitions)
can just cope, whereas the UCA can't, but the tables are huge. 
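
For a sense of where the difficulty starts, here is a deliberately
naive Python sketch.  It is not the CLDR or UCA algorithm and makes no
attempt at the syllable-level ordering discussed above; it only swaps
each Lao preposed vowel (U+0EC0..U+0EC4) with the character that
follows it before comparing.  The function name naive_lao_key and the
two sample words are assumptions made for illustration.

```python
# Lao vowels that are written, and stored, before the consonant.
LAO_PREVOWELS = {chr(cp) for cp in range(0x0EC0, 0x0EC5)}


def naive_lao_key(word: str) -> str:
    """Swap each preposed vowel with the following character (illustration only)."""
    chars = list(word)
    i = 0
    while i < len(chars) - 1:
        if chars[i] in LAO_PREVOWELS:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair just swapped
        else:
            i += 1
    return "".join(chars)


words = [
    "\u0EC0\u0E81",  # VOWEL SIGN E + KO
    "\u0E82\u0EB2",  # KHO SUNG + VOWEL SIGN AA
]

by_code_point = sorted(words)                 # the preposed vowel decides the order
by_consonant = sorted(words, key=naive_lao_key)  # the initial consonant decides

print([[f"{ord(c):04X}" for c in w] for w in by_code_point])
print([[f"{ord(c):04X}" for c in w] for w in by_consonant])
```

Plain code point order puts the word beginning with the preposed vowel
last, while the swapped key sorts it under KO, ahead of KHO SUNG.
Anything beyond this single swap is where the table sizes mentioned
above come in.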

Richard.


