A sign/abbreviation for "magister"
Philippe Verdy via Unicode
unicode at unicode.org
Sat Nov 3 16:55:17 CDT 2018
I can give other interesting examples of why the Unicode "character
encoding model" is the best option.
Just consider how the Hangul alphabet is (now) encoded: its consonant
letters are encoded "twice" (as leading and trailing jamos) because they
carry semantic distinctions needed for efficient processing of Korean text,
where syllable boundaries are significant to disambiguate the text. This
apparent "double encoding" also has a visual model (still in use today)
whereby syllables are *preferably* (not mandatorily) rendered in a
well-defined square layout. But the square layout causes significant
rendering issues (notably at small font sizes), so it is also possible to
render a syllable by aligning its letters horizontally. This was done with
the "compatibility jamos" used on old terminals and printers (but
unfortunately without marking the syllable boundaries explicitly before,
after, or in the middle of groups of consonants); due to the need to
preserve roundtrip compatibility with the non-UCS encodings, the
"compatibility jamos" had to be encoded separately, even though their use
is no longer recommended for normal Korean text, which should explicitly
encode syllable boundaries by distinguishing leading and trailing
consonants (this is equivalent to the distinction of letter case in Latin:
leading jamos in Hangul are exactly like our Latin capital consonants,
trailing jamos in Hangul are exactly like our Latin small letters; the
vowel jamos in Hangul, however, are unicameral... for now). But Hangul is
still a true alphabet (it is in fact much simpler than Greek or Cyrillic,
and Latin is the most complex script in the world!).

Thanks to this newer (recommended) encoding of Hangul, which adopts a
**semantic** and **logical** model, it is possible to process Korean text
very efficiently (and in fact very simply). The earlier attempt at encoding
Korean was made while the ISO 10646 goals were thought to be sufficient (so
it was a **visual** encoding): it failed even though that earlier encoding
entered the first versions of Unicode, and it created a severe precedent in
which the stability of Unicode (and its upward compatibility) was broken.
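To make the "efficient processing" point concrete, here is a minimal sketch
in Python using the arithmetic mapping between precomposed Hangul syllables
and conjoining jamos defined in chapter 3 of the Unicode standard (the
function name is mine; this is only an illustration, not normative code):

    # Minimal sketch: arithmetically decompose a precomposed Hangul syllable
    # into its conjoining jamos (leading L, vowel V, optional trailing T).
    S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
    V_COUNT, T_COUNT = 21, 28
    N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
    S_COUNT = 11172               # total number of precomposed syllables

    def decompose_hangul(ch):
        """Return the <L, V[, T]> jamo sequence for a precomposed syllable."""
        s = ord(ch) - S_BASE
        if not 0 <= s < S_COUNT:
            return ch  # not a precomposed Hangul syllable
        l = L_BASE + s // N_COUNT
        v = V_BASE + (s % N_COUNT) // T_COUNT
        t = T_BASE + s % T_COUNT
        jamos = [chr(l), chr(v)]
        if t != T_BASE:            # T index 0 means "no trailing consonant"
            jamos.append(chr(t))
        return "".join(jamos)

    print([hex(ord(c)) for c in decompose_hangul("\uD55C")])  # U+D55C HAN
    # ['0x1112', '0x1161', '0x11ab'] : leading H + vowel A + trailing N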
I can also cite the case of Egyptian hieroglyphs: there is still no way to
render them correctly, because we lack a stable orthography that would
drive the encoding of the missing **semantic** characters (for this reason
Egyptian hieroglyphs still require a higher-layer protocol, as there is
still no accepted orthographic norm that successfully represents all
possible semantic variations, but also because research on old Egyptian
hieroglyphs is still very incomplete). The same can be said about Mayan
hieroglyphs. And because there is still no semantic encoding of real texts,
it is almost impossible to process text in these scripts: the characters
encoded are ONLY basic glyphs (we don't know what their allowed variations
are, so we cannot use them safely to compose combining sequences: they are
merely a collection of symbols, not a human script). In my opinion there
was absolutely no urgency to encode them in the UCS (except by not
resisting the pressure to allow fonts containing these glyphs to be
interchanged; but it remains impossible to encode and compose complete
texts with only these fonts: you still need an orthographic convention, and
there is still no consensus about it; likewise, the standard higher-level
protocols like HTML/CSS cannot compose them correctly and efficiently).
This encoding was not necessary, as these fonts containing collections of
glyphs could have remained encoded with a private-use convention, i.e. with
PUAs required only by the attempted (but not agreed) protocols.
I think, on the contrary, that Visible Speech or the Duployé shorthands
will reach a point where they have developed a stable orthographic
convention: there will be a standard, and this standard will request that
Unicode encode the missing **semantic** characters.
This path should also be followed now for encoding emojis (there is an
early development of an orthography for them; it is done by Unicode itself,
but I'm not sure this is part of its mission: emoji orthographic
conventions should be made by a separate committee). Unfortunately Unicode
is starting to create this orthography without developing what should come
with it: its integration into the Unicode "character encoding model" (which
should then be reviewed to meet the goals wanted for the composition of
emoji sequences). A clear set of character properties for emojis needs to
be developed, and then the emoji subcommittee can work with it (like what
the IRG does for ideographic scripts). But for now every revision of the
emojis adds new incompatibilities and inefficiencies for processing text
correctly (for example, it is nearly impossible to define the boundaries
between clusters of emojis).
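As a concrete illustration of the cluster-boundary problem, here is a small
sketch; it assumes the third-party Python "regex" module, whose \X pattern
matches extended grapheme clusters as defined by UAX #29:

    # Small sketch: emoji ZWJ sequences and flag pairs are single grapheme
    # clusters made of several code points, so boundaries are not obvious.
    import regex  # third-party; \X matches an extended grapheme cluster

    family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl
    flags  = "\U0001F1EB\U0001F1F7" * 2                    # two French flags

    print(len(regex.findall(r"\X", family)), len(family))  # 1 cluster, 5 code points
    print(len(regex.findall(r"\X", flags)),  len(flags))   # 2 clusters, 4 code points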
Just consider what is also still missing for Egyptian and Mayan
hieroglyphs, for Visible Speech, or for the Duployé shorthands: please
resist these pressures, and stop making the rules for emojis ever more
complex. We need rules, and these rules must be integrated in the character
encoding model and in the first chapters of the Unicode standard!
But please don't resist so much the legitimate goal of adding a few simple
semantic characters that can greatly increase the usability and
"universality" of the UCS: this can be done without continuously adding new
duplicate encodings. The duplicate encodings can be kept, but should be
considered only as legacy, i.e. like other "compatibility characters", no
longer recommended but still usable.
This should be handled just like the Hangul compatibility "half-width"
jamos in the last block of the BMP, in which T and L consonants are not
distinguished (only L consonants are encoded and are ambiguously reused for
T consonants) and only TL clusters are unambiguous (but cannot be safely
associated with surrounding T compatibility jamos, so it is impossible to
compose them safely into syllabic squares, and impossible to determine some
semantic differences when the syllable boundaries can only be "guessed"
with a heuristic and some dictionary lookup that finds only the most
probable meaning).
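To make this ambiguity concrete, here is a small check with Python's
standard unicodedata module (just an illustration): the compatibility and
half-width jamos normalize only to *leading* conjoining jamos, so a
trailing reading cannot be recovered from them:

    import unicodedata

    compat    = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)
    halfwidth = "\uFFA1"  # HALFWIDTH HANGUL LETTER KIYEOK

    # Both fold to U+1100, the *leading* conjoining jamo; nothing maps them
    # to the trailing form U+11A8, so the L/T distinction is simply lost.
    print(hex(ord(unicodedata.normalize("NFKD", compat))))     # 0x1100
    print(hex(ord(unicodedata.normalize("NFKD", halfwidth))))  # 0x1100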
These legacy characters (introduced by Unicode itself, for bad reasons or
because the UTC did not resist some commercial pressures) have just
polluted the UCS needlessly and made everything more complex (and for a
long time): they remain there as apparent duplicates with no clear
semantics, and they cause various problems (including security problems).
Most of these "compatibility characters" are now strongly discouraged, or
even forbidden, in uses where security is an issue. This is the case for
almost all superscripts and subscripts (those not justified by roundtrip
compatibility with a legacy standard). Yet Unicode must now keep these
characters in its own standard to preserve roundtrip compatibility with its
own initial versions!
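For instance (a small illustration with Python's standard unicodedata
module), the purely stylistic superscript characters simply fold away under
NFKC, which is exactly why they are problematic in identifiers and other
security-sensitive contexts:

    import unicodedata

    # Superscript/modifier compatibility characters lose their distinction
    # under NFKC, folding to the plain base letters or digits.
    for ch in ("\u00B2", "\u1D43", "\u2071"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
              f"{unicodedata.normalize('NFKC', ch)!r}")
    # U+00B2 SUPERSCRIPT TWO -> '2'
    # U+1D43 MODIFIER LETTER SMALL A -> 'a'
    # U+2071 SUPERSCRIPT LATIN SMALL LETTER I -> 'i'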
But this does not mean that these characters cannot be deprecated and later
treated as "compatibility characters", even if they are not part of the
current standard normalizations NFKD and NFKC (which have limited legacy
use). These NFKC and NFKD forms should now be replaced by two more
convenient "Legacy Normalization Forms", which I would abbreviate as "NFLC"
and "NFLD". They would be very useful, for example, for default collations
in the DUCET or the CLDR "root" locale, except that they would not be
frozen, as the existing NFKC and NFKD are, by the very limited
"compatibility mappings" found in the historic main file of the UCD, which
cannot follow the evolution of recommended best practices.
Unlike NFKC and NFKD, NFLC and NFLD would be an extensible superset based
on MUTABLE character properties (these can also be "decomposition
mappings", except that once a character is added to the new property file
it won't be removed, so there is some stability as well: the decision to
"deprecate" an old encoding can only be made when there is a new
recommendation, and if that recommendation itself later changes and is
deprecated, the previous "legacy decomposition mappings" can simply be
decomposed again to the newly recommended decompositions). Unlike in NFKC
and NFKD, a "legacy decomposition" is not "final" for all future versions:
a future version may remap them by just adding new entries for the new
characters considered "legacy" and no longer recommended. This new
properties file would allow evolution and adaptation to human languages,
and would allow correcting past errors in the standard. The file could have
this form:
# deprecated codepoint(s) ; new preferred sequence ; Unicode version in which it was deprecated
101234 ; 101230 0300... ; 10.0
This file could also be used to deprecate some old variation sequences, or
some old clusters made of multiple characters that are not individually
deprecated.
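Here is a hypothetical sketch of how such a file could be consumed (Python;
the function names, field handling and the sample mapping below are my own
assumptions for illustration, not real UCD data or an agreed format):

    # Hypothetical sketch: parse lines in the proposed format
    # "deprecated codepoint(s) ; new preferred sequence ; Unicode version"
    # and apply the resulting legacy mappings to a string.
    def load_legacy_mappings(lines):
        mappings = {}
        for line in lines:
            line = line.split("#", 1)[0].strip()   # drop comments and blanks
            if not line:
                continue
            old, new, version = (field.strip() for field in line.split(";"))
            old_seq = "".join(chr(int(cp, 16)) for cp in old.split())
            new_seq = "".join(chr(int(cp, 16)) for cp in new.split())
            mappings[old_seq] = (new_seq, version)
        return mappings

    def apply_legacy_mappings(text, mappings):
        # Replace the longest deprecated sequences first so that deprecated
        # multi-codepoint clusters win over their individual characters.
        for old, (new, _ver) in sorted(mappings.items(), key=lambda kv: -len(kv[0])):
            text = text.replace(old, new)
        return text

    # Invented sample entry (NOT real data): fold a deprecated superscript.
    sample = load_legacy_mappings(["00AA ; 0061 ; 12.0  # FEMININE ORDINAL INDICATOR -> a"])
    print(apply_legacy_mappings("n\u00AA 1", sample))   # prints "na 1"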
Thanks.
On Sat, 3 Nov 2018 at 21:45, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> As an additional remark, I find that Unicode is slowly abandoning its
> initial goal of encoding texts logically and semantically. This was in
> contrast to the initial ISO 10646, which aimed to produce a giant
> visual encoding, based only on code charts (without any character
> properties except glyph names and an almost mandatory "representative
> glyph" that in fact allowed no variation at all).
>
> The initial ISO 10646 goal failed to reach global adoption. What proved
> to be extremely successful (and allowed easier processing of text, without
> limiting the variation of glyph designs needed and wanted for the
> orthography of human languages) was the Unicode character encoding model,
> based on logical, semantic encoding. This drove the worldwide adoption
> (and now the rapid abandonment of legacy charsets, all based on visual
> appearance and basic code charts, like ISO 10646 and all past 7-bit and
> 8-bit ISO standards, or other national standards, including those of
> China, Japan and Europe, or those made and promoted by private hardware
> manufacturers or software providers, frequently with legal restrictions
> as well, such as MacRoman with its well-known Apple logo).
>
> It is dispiriting to see that Unicode does not resist this, and even
> now refuses the idea of adding just a few simple combining characters
> (which fit perfectly in its character encoding model, and still allow
> efficient text processing, and rendering with reasonable fallbacks) that
> would explicitly encode the semantics. A good example in Latin: look at
> why the lowercase eth letter seems to have three codes: this is because
> they have different semantics but also map to different uppercase letters;
> being able to transform letter case, and being able to use collation for
> plain-text search, are extremely useful features possible only because of
> the Unicode character properties, and impossible with just the visual
> encoding and charts of ISO 10646. The same is true of Latin A versus
> Cyrillic A and Greek ALPHA: the semantics is the first goal to respect,
> thanks to the Unicode character properties and the Unicode character
> model, whereas the visual encoding is definitely not a goal.
>
> So before encoding characters in Unicode, glyph variation alone is not
> enough (it occurs everywhere in human languages): you need proof, with
> contrasting pairs, showing that the glyph difference makes a semantic
> difference and requires different processing (different character
> properties).
>
> Unicode has succeeded everywhere ISO 10646 had failed: efficient
> processing of human languages with their wide variation of orthographies
> and visual appearance. The other goals (supporting technical notations
> like IPA, maths, music, and now emojis!), driven by glyph requirements
> (each mandated in its own relevant standard), are where Unicode can
> and even should promote the use of variation sequences, and definitely
> not dual encoding as was done (Unicode abandoning its most useful goal,
> not resisting the pressure of some industries: this has just created
> more issues, with more difficulties to correctly and efficiently process
> texts written in human languages).
>
> The more Unicode evolves, the more I see that it will turn the UCS into
> what ISO 10646 attempted to do (and failed at): a visual encoding,
> refusing to encode **efficiently** any semantic difference. And this will
> become a severe problem later, with the constant evolution of human
> languages.
>
> I urge Unicode to maintain its "character encoding model" as the path to
> follow, and it should be driven by semantic goals. It has every feature
> needed for that: combining sequences (including CGJ, because of the
> canonical equivalences that were needed due to roundtrip compatibility
> with legacy non-UCS charsets), variation selectors (ONLY to optionally add
> some *semantic* restrictions within the largely allowed variation of
> glyphs and still preserve the distinction between contrasting pairs, but
> NOT as a way to encode non-semantic styles), and character properties to
> allow efficient processing.
>
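Regarding the eth example in the quoted message above: assuming the three
letters meant are U+00F0, U+0111 and U+0256 (my interpretation, for
illustration only), a quick check with Python's standard unicodedata module
shows that they carry distinct case mappings even though their capital
forms look nearly identical:

    import unicodedata

    # Three visually "eth-like" lowercase letters and their distinct
    # uppercase mappings (which letters were meant is my assumption).
    for ch in ("\u00F0", "\u0111", "\u0256"):
        up = ch.upper()
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
              f"U+{ord(up):04X} {unicodedata.name(up)}")
    # U+00F0 LATIN SMALL LETTER ETH           -> U+00D0 LATIN CAPITAL LETTER ETH
    # U+0111 LATIN SMALL LETTER D WITH STROKE -> U+0110 LATIN CAPITAL LETTER D WITH STROKE
    # U+0256 LATIN SMALL LETTER D WITH TAIL   -> U+0189 LATIN CAPITAL LETTER AFRICAN D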