Variations and Unifications ?

Philippe Verdy verdy_p at wanadoo.fr
Thu Mar 17 02:47:26 CDT 2016


One problem caused by disunification is the added complexity of algorithms
handling text.

I forgot an important case where disunification also occurred: combining
sequences are the "normal" encoding, but legacy charsets encoded the
precomposed characters separately, and Unicode had to map them for
round-trip compatibility. This had a consequence: the creation of
additional properties (i.e. for "canonical equivalences") in order to
reconcile the two sets of encodings and allow some form of equivalence.
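
To illustrate the equivalence problem, here is a minimal Python sketch
using the standard unicodedata module. The helper name canonically_equal
is only illustrative, and comparing in NFD is an assumption (NFC would
serve equally well):

```python
import unicodedata

# Two encodings of the same abstract character.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE (legacy-style)
sequence = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD).

    Hypothetical helper for illustration; it is not part of any library.
    """
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A code-point-by-code-point comparison sees two different strings.
print(precomposed == sequence)

# The canonical decomposition mapping is the extra property that bridges them.
print(unicodedata.decomposition(precomposed))

# Only a normalization-aware comparison treats them as equivalent.
print(canonically_equal(precomposed, sequence))
```

Every comparison, search or sort that wants to treat the two forms alike
has to go through a step like this one.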

In fact this is general: each time we disunify a character, we have to add
new properties, and possibly update the algorithms to take these properties
into account and find some form of equivalences.

So disunification solves one problem but creates others. We have to weigh
the benefits and costs of using the disunified characters against those of
using the "normal" characters (possibly in sequences).

But given the number of cases where we have to support sequences (even if
it's only combining sequences for canonical equivalences), we should really
disfavor disunifying characters unless there is a real need for it: if it's
possible with sequences, don't disunify.

A famous example (based on a legacy decision which was bad in my opinion,
as the cost was not considered) was the disunification of Latin/Greek
letters for mathematical purposes, only to force a specific style. But the
alternative representation using sequences (using variation selectors, for
example, as the addition of specific modifiers for "styles" like "bold",
"italic" or "monospace" was rejected with good reasons) was not really
analyzed in terms of benefits and costs, using the algorithms we already
have (and that could have been updated). Admittedly, mathematical symbols
are (normally...) not used at all in the same context as plain alphabetic
letters (even if there's absolutely no guarantee that they will always be
distinguishable from them when they occur in some linguistic text rendered
with the same style...).
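
As a rough illustration of that cost, the Python sketch below inspects one
of the disunified mathematical letters with the standard unicodedata
module. The helper fold_math_style is a hypothetical name; it relies on the
compatibility (not canonical) decomposition that these letters carry, so a
plain comparison, and even NFC, treats them as unrelated to the ordinary
letter:

```python
import unicodedata

math_bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
plain_a = "A"               # LATIN CAPITAL LETTER A


def fold_math_style(text: str) -> str:
    """Fold styled mathematical letters back to plain letters via NFKC.

    Hypothetical helper. Note that NFKC also folds many other
    compatibility characters, so it is a blunt tool.
    """
    return unicodedata.normalize("NFKC", text)


print(unicodedata.name(math_bold_a))

# The link to the plain letter exists only as a <font> compatibility mapping.
print(unicodedata.decomposition(math_bold_a))

# Canonical normalization leaves the styled letter untouched...
print(unicodedata.normalize("NFC", math_bold_a) == math_bold_a)

# ...so an application has to opt in to compatibility folding to match it.
print(fold_math_style(math_bold_a) == plain_a)
```

The mapping is one more property that every search or matching algorithm
has to know about if it wants to find "A" in both forms.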

The naive thinking that disunification will make things simpler is
completely wrong, given that an application that ignored all character
properties and used only isolated characters would break legitimate rules
in many cases, even for rendering purposes. It is in fact simpler to keep
the possible sequences that are already encoded (or that could be extended
to cover more cases: e.g. add new variation sequences, introduce some new
modifiers, not just new combining characters, and so on).
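
A small Python sketch of what goes wrong when code handles isolated
characters only. The function combining_sequences is a hypothetical helper
that groups each base character with the marks that follow it, using only
the Canonical_Combining_Class property. This is an assumption made for
brevity and not full grapheme cluster segmentation (variation selectors and
other marks with class 0 are not attached by it):

```python
import unicodedata


def combining_sequences(text: str) -> list[str]:
    """Group each character with the combining marks that follow it.

    Simplified illustration: a character is attached to the previous
    group when its canonical combining class is non-zero.
    """
    groups: list[str] = []
    for ch in text:
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch      # mark: attach to the preceding base
        else:
            groups.append(ch)     # base (or class-0 character): new group
    return groups


word = "cafe\u0301"  # "café" written with a combining acute accent

# Reversing isolated code points detaches the accent from its base letter.
naive_reversed = word[::-1]

# Reversing whole sequences keeps the accent on the "e".
aware_reversed = "".join(reversed(combining_sequences(word)))

print(combining_sequences(word))
print(naive_reversed.encode("unicode_escape").decode("ascii"))
print(aware_reversed.encode("unicode_escape").decode("ascii"))
```

So an application already needs the character properties just to handle
combining sequences correctly; supporting a few more kinds of sequences
extends machinery that has to exist anyway.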

We were strongly told: Unicode encodes characters, not glyphs. This should
be remembered (and the cost caused by disunifying distinct glyphs is also a
good argument against it).


2016-03-17 8:20 GMT+01:00 Asmus Freytag (t) <asmus-inc at ix.netcom.com>:

> On 3/16/2016 11:11 PM, Philippe Verdy wrote:
>
> "Disunification may be an answer?" We should avoid it as well.
>
> Disunification is only acceptable when
> - there's a complete disunification of concepts....
>
>
> I think answering this question depends on the understanding of "concept",
> and on understanding what it is that Unicode encodes.
>
> When it comes to *symbols*, which is where the discussion originated,
> it's not immediately obvious what Unicode encodes. For example, I posit
> that Unicode does not encode the "concept" for specific mathematical
> operators, but the individual "symbols" that are used for them.
>
> For example PRIME and DOUBLE PRIME can be used for minutes and seconds
> (both of time and arc) as well as for other purposes. Unicode correctly
> does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it
> up to the notational convention to relate the concept and the symbol.
>
> Thus we have a case where multiple concepts match a single symbol. For the
> converse, we take the well-known case of COMMA and FULL STOP which can both
> be used to separate a decimal fraction.
>
> Only in those cases where a single concept is associated so exclusively
> with a given symbol, do we find the situation that it makes sense to treat
> variations in shape of that symbol as the same symbol, but with different
> glyphs.
>
> For some astrological symbols that is the case, but for others it is not.
> Therefore, the encoding model for astrological text cannot be uniform.
> Where symbols have exclusive association with a concept, the natural
> encoding is to encode symbols with an understood set of variant glyphs.
> Where concepts are denoted with symbols that are also used otherwise, then
> the association of concept to symbol must become a matter of notational
> convention and cannot form the basis of encoding: the code elements have to
> be on a lower level, and by necessity represent specific symbol shapes.
>
> A./
>