Unicode is more than shapes (was: Tibetan Paluta)

Richard Wordingham via Unicode unicode at unicode.org
Mon May 1 07:14:18 CDT 2017


On Mon, 1 May 2017 07:17:05 +0200
Philippe Verdy via Unicode <unicode at unicode.org> wrote:

> 2017-04-29 21:21 GMT+02:00 Naena Guru via Unicode
> <unicode at unicode.org>:

> > Anyway, Unicode is only about DISPLAYING a script: There's a shape
> > here; Let's find how to get it by assembling other shapes or by
> > creating a code point for it. What is short, long or longer in
> > speech is no concern for Unicode.

When there is considerable variation in shape, describing the function
of a character can be of great help in determining the character code
to enter for some relatively obscure character.
 
> Wrong. Unicode is absolutely not about how to "display" any script
> (except symbols and notational symbols). Unicode does not encode
> glyphs. Unicode encodes "abstract characters" according to their
> semantics, in order to assign them properties allowing meaningful
> transformations of text and in order to allow performing searches
> (with collation algorithms).

Of course, display is a very important transformation process!  However,
for many applications, an important part of display is knowing when to
split text between lines, and in easy cases that can be done using
knowledge of character properties.  In hard cases, the user has to
insert line-breaking permissions and even prohibitions.  There are
special characters for these functions.
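As a rough illustration of such permissions and prohibitions, here is a
minimal Python sketch.  It is not the Unicode line breaking algorithm
(UAX #14); it assumes, purely for illustration, that the only break
opportunities are U+0020 SPACE and U+200B ZERO WIDTH SPACE, and that
U+00A0 NO-BREAK SPACE never allows a break.  The function name
break_chunks is made up for this example.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible permission to break
NBSP = "\u00a0"  # NO-BREAK SPACE: looks like a space, prohibits a break


def break_chunks(text):
    """Split text into chunks between which a line break is allowed.

    Simplifying assumption: only SPACE and ZWSP are break opportunities.
    Everything else, including NBSP, stays inside its chunk.
    """
    chunks, current = [], []
    for ch in text:
        if ch in (" ", ZWSP):
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks


# Thai has no spaces between words, so the user supplies the permission.
thai_chunks = break_chunks("\u0e01\u0e34\u0e19" + ZWSP + "\u0e02\u0e49\u0e32\u0e27")

# The no-break space keeps "10 km" on one line.
latin_chunks = break_chunks("10" + NBSP + "km away")
```

A real implementation would take break opportunities from the
Line_Break property and, for scripts such as Thai, from a dictionary.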

It's somewhat misleading to say that searches use collation
algorithms.  What is true is that folding can use enough of the same
computational processes that much of the code for collation may be
re-used for search.  Different data tables are frequently appropriate.
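To make the distinction concrete, here is a minimal Python sketch of
search by folding, using only the standard library.  It is an
assumption-laden illustration, not how any particular search engine or
the Unicode Collation Algorithm works: the fold shown (decompose, drop
combining marks, case-fold) is just one possible folding, and the
function names are made up for this example.

```python
import unicodedata


def fold(text):
    """Fold text for loose matching.

    Steps: canonical decomposition, removal of combining marks,
    then full case folding.  Other foldings are equally possible.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def folded_contains(haystack, needle):
    """Substring search on folded text; no ordering is involved."""
    return fold(needle) in fold(haystack)


found = folded_contains("Cr\u00e8me Br\u00fbl\u00e9e", "creme brulee")
```

The fold only decides whether two strings match.  Collation must also
decide which of two different strings comes first, which is why the
data tables for the two tasks often differ.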

> Anyway Unicode makes some exceptions to the logical model only for
> roundtrip compatibility with other standards that used another
> encoding model widely used, notably in Thai: these are the exception
> where there are "prepended" letters.

What "logical" model?  I don't think you know how Thai works.  The key
feature is that the Indic consonant stack has no delimiter in Thai,
which makes the phonetic placement of preposed vowels ambiguous.  In
some of the other relevant features that I am aware of, Lao works quite
differently.  Tai Viet was encoded in visual order.
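The difference in storage order can be seen by listing code points.
The following Python sketch uses only the standard unicodedata module;
the sample strings and the helper name describe are chosen for
illustration.  In the Thai sample the preposed vowel is stored before
the consonant, as it is written; in the Devanagari sample the vowel
sign is drawn to the left of the consonant but stored after it.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' string per code point, in storage order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


thai = "\u0e40\u0e01"  # vowel SARA E stored first, then consonant KO KAI
deva = "\u0915\u093f"  # consonant KA stored first, then VOWEL SIGN I

thai_order = describe(thai)
deva_order = describe(deva)

# The Thai preposed vowel is an ordinary letter (Lo); the Devanagari
# vowel sign is a spacing combining mark (Mc) attached to the consonant.
thai_vowel_category = unicodedata.category(thai[0])
deva_vowel_category = unicodedata.category(deva[1])
```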

You forget one other change.  New Tai Lue switched from phonetic
order to visual order because it hadn't been worth Microsoft's while
to implement the simple rendering engine.  The Universal Shaping Engine
(USE) should prevent this happening again with straightforward
complex scripts, but good intentions (namely, replacing the working
renderer from HarfBuzz and thus Firefox, Chrome and LibreOffice with an
emulation of the USE) may unintentionally repeat the process with 'Old
Tai Lue'.  Using phonetic order in Tai Tham distinguishes homographs
(if I may use the term here) that would usually be collated
differently. 

> There was some havoc also for
> some scripts in India because of roundtrip compatibility with an
> Indian standard (criticized by many users of Tamil and some other
> Southern Indic scripts that don't follow directly the paradigm
> created for getting some limited transliteration with Devanagari:
> that initial desire was abandoned but the legacy Indic scripts in
> India were imported as is to Unicode)

The havoc is because half-forms are a north Indian innovation, not an
ancient Indic feature.  Tamil suffered from the ISCII conflation, in
what became the Unicode virama, of two functions: combining consonants
and merely marking the absence of a vowel.  Tibetan and
Khmer led the way in splitting the concepts, and the Unicode virama
in the Myanmar script was disunified into an invisible stacker and a
pure killer.  Many of the Tamil complaints arise because the implicit
vowel is ill-suited to Tamil, but an attempt to move away from that
system about two thousand years ago did not persist.
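For reference, the characters involved can be listed with Python's
standard unicodedata module.  The selection below and the role labels
attached to each character are my own summary for illustration, not
normative descriptions from the standard.

```python
import unicodedata

# (code point, informal role) pairs; the role text is a gloss, not a property.
VIRAMA_LIKE = [
    (0x094D, "Devanagari: one sign both forms conjuncts and kills the vowel"),
    (0x0BCD, "Tamil: same single-character model inherited from ISCII"),
    (0x0F84, "Tibetan: vowel killer; stacks use subjoined letters instead"),
    (0x0F90, "Tibetan: example of a separately encoded subjoined letter"),
    (0x17D2, "Khmer: invisible sign that subjoins the following consonant"),
    (0x1039, "Myanmar: invisible stacker"),
    (0x103A, "Myanmar: visible pure killer"),
]

names = {cp: unicodedata.name(chr(cp)) for cp, _ in VIRAMA_LIKE}

for cp, role in VIRAMA_LIKE:
    print(f"U+{cp:04X} {names[cp]:<32} {role}")
```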

Richard.

