On emoji and the two rightwards black arrows

Fri Oct 30 19:19:14 CDT 2015

IMHO, all mappings from other encodings are best efforts only, not
normative. In many cases those mappings are ambiguous, including for some
legacy encodings that have been widely used for decades and are still in
use today (such as CP437).

The reason is that the old registrations for legacy 8-bit charsets only
showed charts of approximate glyphs (often of poor quality: printed on
paper at low resolution, with stray dots of ink, then scanned at poor
resolution), with no actual properties (and often without even listing a
name for each character). For a long time those charts were interpreted
differently by different vendors (such as printer or screen manufacturers,
at a time when dot-matrix printers and displays had poor resolution), and
sometimes glyphs changed slightly between device models or versions from
the same vendor.

So characters in those mapping tables were widely used to mean different
variants of characters that are now distinguished in the UCS. For example,
in CP437 there is the symbol that looks either like a big epsilon or like
an "is member of" math symbol. The mappings to the UCS for the other
symbols that look like Greek letters in CP437 and similar charsets are not
really set in stone: it is not even clear whether they should map to UCS
symbols or to UCS Greek letters. The same applies to various geometric
symbols, including arrows and bullets.

Those mappings are just there to help convert old documents to the UCS,
but the choice is sometimes questionable, and some corrections may be
needed to select another character, depending on the context of use.
Unfortunately, the existing mappings only document the mapping of each
legacy code position to a single suggested code point, and not the other
possible alternatives.
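
To make that limitation concrete, here is a minimal Python sketch. It
assumes Python's built-in cp437 codec stands in for the "single suggested"
mapping. The ALTERNATIVES table and the candidates() helper are
hypothetical names invented for illustration, and the alternatives listed
are only examples, not taken from any published data file.

```python
# Sketch only: ALTERNATIVES and candidates() are hypothetical, for illustration.

# Hypothetical table: legacy CP437 byte -> other plausible UCS readings,
# beyond the single character that the stock codec suggests.
ALTERNATIVES = {
    0xEE: ["\u2208"],  # ELEMENT OF, besides GREEK SMALL LETTER EPSILON
    0xE4: ["\u2211"],  # N-ARY SUMMATION, besides GREEK CAPITAL LETTER SIGMA
}


def candidates(byte_value):
    """Return the codec's suggested character first, then any alternatives."""
    suggested = bytes([byte_value]).decode("cp437")
    return [suggested] + ALTERNATIVES.get(byte_value, [])


# Show every candidate reading for the "epsilon or member-of" position.
print([f"U+{ord(c):04X}" for c in candidates(0xEE)])
```

A converter built this way could keep the suggested character by default
and still let a user, or some context rule, pick one of the alternatives.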

Then we fall into the category of characters that are easily confusable:
maybe these mapping tables do not need to be changed, but should be used
together with the data files related to confusable characters (that list
was initiated during the development of IDNA). There is other data
available (visible in the Unicode charts) that also indicates a few
related or similar characters, but these are mostly notes that are not set
in stone, and this data is difficult to use.

So in summary, those mapping tables are just suggestions, and implementers
may still map legacy encodings to different subsets of the UCS. But we
should be concerned about conversion in the other direction, from the UCS
to legacy encodings: all candidate UCS code points should be reverse-mapped
to the same legacy code position (as much as possible). Those mapping
tables are therefore not part of the stable standard and there is no
stability policy about them (IMHO, no such policy should be adopted). They
are just contributions to help the transition to the UCS; they are subject
to updates when better mappings are developed later, and some applications
or vendors will still develop their own preferences.
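
The reverse direction can be sketched in Python as follows. This is only an
illustration under stated assumptions: the CANDIDATES table and the
to_legacy() helper are hypothetical, the candidate lists are examples, and
Python's built-in cp437 codec is used as the fallback for everything else.

```python
# Sketch only: CANDIDATES and to_legacy() are hypothetical, for illustration.

# Hypothetical table: legacy CP437 byte -> every UCS character that an
# implementer might accept as a reading of that position.
CANDIDATES = {
    0xEE: ["\u03b5", "\u2208"],  # GREEK SMALL LETTER EPSILON, ELEMENT OF
    0xE4: ["\u03a3", "\u2211"],  # GREEK CAPITAL LETTER SIGMA, N-ARY SUMMATION
}

# Invert the table so that every candidate points back to one legacy byte.
REVERSE = {ch: byte for byte, chars in CANDIDATES.items() for ch in chars}


def to_legacy(text):
    """Encode text to CP437 bytes, folding all listed candidates together."""
    out = bytearray()
    for ch in text:
        if ch in REVERSE:
            out.append(REVERSE[ch])
        else:
            # Fall back to the stock codec, which raises UnicodeEncodeError
            # for characters that CP437 cannot represent at all.
            out.extend(ch.encode("cp437"))
    return bytes(out)


# Both readings of the ambiguous position end up on the same legacy byte.
print(to_legacy("\u03b5") == to_legacy("\u2208"))
```

The stock codec alone would reject U+2208, since it only knows the single
suggested code point; the inverted table is what makes the round trip
tolerant of the other candidates.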

If you consider the two UCS characters in question, my opinion is that they
are basically the same, and the mappings from Zapf Dingbats, DPRK or
Wingdings/Webdings are just kept for historical reasons and are not
necessarily the best ones. I would see no violation of the standard if a
font mapped both UCS characters to exactly the same glyph, using metrics
that create a coherent set of black arrows: either the DPRK metrics for all
4 arrows, or the Zapf Dingbats metrics for all 4 arrows. Their
disunification is not really justified, except to work with applications or
documents that used fonts which did not map all of them and were made to
work only with DPRK-encoded documents or only with Dingbats-encoded
documents. The disunification is based only on those specific old
(defective) fonts; modern fonts should not be defective and should map all
of these characters as if they were aliased, without any need to
distinguish them visually.

But because they are not canonically equivalent, these characters should be
explicitly listed in the list of confusables. Which version will be
preferred, and which versions will be aliased to the preferred form for
applications like IDNA, is a question to work out, as this is a possible
security concern if some of these characters are allowed in identifiers
that are intended to be secure.
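
As a rough illustration in Python: normalization alone does not unify the
two arrows, so an identifier check would need an extra folding step. The
FOLD table and both helper functions below are hypothetical, and the choice
of U+27A1 as the preferred form is arbitrary; it is an assumption made for
this sketch, not something taken from the published confusables data.

```python
import unicodedata

BLACK_RIGHTWARDS_ARROW = "\u27a1"  # in the Dingbats block
RIGHTWARDS_BLACK_ARROW = "\u2b95"  # in Miscellaneous Symbols and Arrows

# Hypothetical folding table: alias -> preferred form. Which form should be
# preferred is an open question; U+27A1 is picked arbitrarily here.
FOLD = {RIGHTWARDS_BLACK_ARROW: BLACK_RIGHTWARDS_ARROW}


def same_after_normalization(a, b, form="NFKC"):
    """True if the two strings become identical under the given form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def fold_confusables(text):
    """Replace each aliased character with its (assumed) preferred form."""
    return "".join(FOLD.get(ch, ch) for ch in text)


# Normalization keeps the two arrows distinct; the folding step merges them.
print(same_after_normalization(BLACK_RIGHTWARDS_ARROW, RIGHTWARDS_BLACK_ARROW))
print(fold_confusables("a\u2b95b") == fold_confusables("a\u27a1b"))
```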

2015-10-30 19:51 GMT+01:00 J.S. Choi <js_choi at icloud.com>:

> # On emoji and the two rightwards black arrows
>
> This is a long post, and I apologize for that; it’s a somewhat complicated
> topic. The post is about two encoded characters:
> U+27A1 Black Rightwards Arrow <http://www.unicode.org/charts/PDF/U2700.pdf>
> and U+2B95 Rightwards Black Arrow
> <http://www.unicode.org/charts/PDF/U2B00.pdf>.
>
> (...)

> In any case, I might make a formal proposal in the future, but I first want
> to determine here how probable that such a proposal would be discussed.
> What would you say the answers to those three questions are?
>