On emoji and the two rightwards black arrows
"J. S. Choi"
js_choi at icloud.com
Tue Nov 3 12:59:40 CST 2015
Thanks for the reply!
> IMHO, all mappings from other encodings are just best efforts but not normative. In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today…
> …these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).
If the compatibility mappings are not normative or guaranteed to be stable, then that would weaken one of the two objections to the changes proposed in my questions 1 and 2. The compatibility-mapping and IDNA issues are merely supplemental to my main questions, though.
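To make the ambiguity concrete, here is a minimal sketch using the CP437 example from the quoted message below (the symbol that looks like either an epsilon or an "is member of" sign). It assumes Python 3 and its bundled cp437 codec, which is only one vendor's table and not a normative mapping; the helper name round_trips is made up for illustration.

```python
import unicodedata

# The CP437 byte whose glyph can be read as an epsilon or as an
# "is member of" sign. Python's bundled cp437 codec is one table among
# several possible ones, not a normative mapping.
LEGACY_BYTE = b"\xee"

decoded = LEGACY_BYTE.decode("cp437")
print("U+{:04X} {}".format(ord(decoded), unicodedata.name(decoded)))


def round_trips(ch, encoding="cp437"):
    """Return True if ch encodes back to LEGACY_BYTE (hypothetical helper)."""
    try:
        return ch.encode(encoding) == LEGACY_BYTE
    except UnicodeEncodeError:
        # The codec has no reverse mapping for this code point.
        return False


# The Greek letter round-trips; the visually similar math symbol does not.
print(round_trips("\u03b5"))  # GREEK SMALL LETTER EPSILON
print(round_trips("\u2208"))  # ELEMENT OF
```

The table records a single suggested code point for the legacy byte, and the visually similar alternative has no reverse mapping at all, which is the one-directional behavior the quoted message describes.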
> Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.
Perhaps this is true. But regardless of whether the 2014 disunification of U+27A1 (the Zapf Dingbat) from U+2B05–U+2B07 (the DPRK/Wingdings arrows) was justified, or whether the creation of U+2B95 in 2014 was justified, both happened nonetheless; the opportunity to object to them seems to have already passed.
U+2B95 now exists, and it exists with the express purpose of completing U+2B05–U+2B07, based on Michel Suignard’s new representative glyphs and Mark Davis’ comments from earlier this year. However, U+2B95’s current absence from UTR #51 and emoji-data.txt, and its lack of text/emoji standardized variation sequences, are perhaps inconsistent with that purpose. The three questions remain:
1. Should U+2B95 be added to the set of emoji characters as given by UTR #51 and emoji-data.txt, in order to complete the harmonization with U+2B05–U+2B07 from 2014?
2. If question 1’s answer is yes, then should U+2B95 be given text/emoji standardized variation sequences, just as U+2B05–U+2B07 already have?
3. Regardless of the answers to the above, should notes clarifying the differences in intended usage between U+2B95 (the right black arrow completing U+2B05–U+2B07) and U+27A1 (the Zapf Dingbat) be added to their entries in the Standard’s code charts? This might clear up a lot of confusion among users and font creators, and would only make clearer what has already been made explicit by 7.0’s glyph changes.
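For anyone who wants to look at the characters involved, here is a minimal sketch, assuming Python 3.5 or later (whose unicodedata module knows about Unicode 7.0 additions). The helper names describe and variation_sequences are made up for illustration. The code only builds the strings that text/emoji variation sequences would consist of; it says nothing about whether such sequences are actually standardized for U+2B95, which is exactly what question 2 asks.

```python
import unicodedata

# The Zapf Dingbats arrow, the three existing black arrows, and U+2B95.
ARROWS = ["\u27A1", "\u2B05", "\u2B06", "\u2B07", "\u2B95"]

TEXT_VS = "\uFE0E"   # VARIATION SELECTOR-15 (text presentation)
EMOJI_VS = "\uFE0F"  # VARIATION SELECTOR-16 (emoji presentation)


def describe(ch):
    """Format a code point with its character name (hypothetical helper)."""
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<no name in this Unicode data>"))


def variation_sequences(ch):
    """Build the (text, emoji) sequences for a base character.

    This is plain string concatenation; whether a given sequence is
    standardized has to be checked against the Unicode data files.
    """
    return ch + TEXT_VS, ch + EMOJI_VS


for ch in ARROWS:
    print(describe(ch))
```

Printing the names side by side makes the near-identical naming of U+27A1 and U+2B95 easy to see, which is part of why question 3 suggests clarifying notes in the code charts.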
I’m also uncertain as to how I’d even initiate a formal process on this. This isn’t even a proposal for a new character; it’s a proposal for the inclusion of an already added character and for the addition of clarifying information in the code charts. The forms at http://www.unicode.org/L2/summary.html wouldn’t seem to fit this kind of change.
J. S. Choi
> On Oct 30, 2015, at 7:19 PM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> IMHO, all mappings from other encodings are just best efforts but not normative. In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today (such as CP437):
> The reason for that is that the old registrations for legacy 8-bit charsets only showed charts of glyphs with approximative glyphs (often with poor quality, with low resolution rendering on printed papers, and various polluting dots of inks, later scanned with poor resolution), but no actual properties (and often without even listing any name for them). And for long those charts have been interpreted differently by different vendors (such as printer or screen manufacturers, in a time where dot-matrix printers or displays had poor resolution), and sometimes with glyphs changing slightly between devices models or versions from the same vendor.
> So characters in those mapping tables were widely used to mean different variants of characters that are now distinguished in the UCS (e.g. in CP437, the symbol that looks either like a big epsilon or like an "is member of" math symbol; the mappings to the UCS for other symbols that look like Greek letters in CP437 charsets and similar are not really set in stone, and it is not even clear whether they will map to UCS symbols or to UCS Greek letters; the same applies to various geometric symbols, including arrows and bullets).
> Those mappings are just there to help converting some old documents to the UCS, but the choice is sometimes questionable and some corrections may need to be done to select another character, depending on the context of use. Unfortunately, the existing mappings only document mappings of legacy code positions to a single suggested codepoint, and not their other possible alternatives.
> Then we fall into the categories of characters that are easily confusable: maybe these mapping tables do not need to be changed, but should be used together with the data files related to confusable characters (the list was initiated during the development of IDNA). There are other data available (visible in Unicode charts) that also indicate a few related/similar characters, but these are mostly notes that are not engraved in stone, and this data is difficult to use.
> So in summary, those mapping tables are just suggestions and implementers may still map legacy encodings to different subsets of the UCS. But we should be concerned by the conversion in the other direction, from the UCS to legacy mappings: all candidate UCS code points should be reverse-mapped to the same legacy code position (as much as possible). Those mapping tables are then not part of the stable standard and there's no stability policy about them (IMHO, such a policy should not be adopted). They are just contributions in order to help the transition to the UCS, and they are also subject to updates when needed if better mappings are developed later, and some applications or vendors will still develop their own preferences.
> If you consider the two UCS characters in question, my opinion is that they are basically the same and mappings from Zapf Dingbats or DPRK or Wingdings/Webdings are just kept for historical reasons, but not necessarily the best ones. And I would see no violation of the standard if a font was made that mapped both UCS characters to exactly the same glyph, using metrics that create a coherent set of black arrows using either the DPRK metrics for all 4 arrows, or the Zapf Dingbats metrics for all 4 arrows. Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.
> But because they are not canonically equivalent, these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).
> 2015-10-30 19:51 GMT+01:00 J.S. Choi <js_choi at icloud.com>:
> # On emoji and the two rightwards black arrows
> (…) The post is about two encoded characters:
> U+27A1 Black Rightwards Arrow <http://www.unicode.org/charts/PDF/U2700.pdf>
> and U+2B95 Rightwards Black Arrow <http://www.unicode.org/charts/PDF/U2B00.pdf>.
> In any case, I might make a formal proposal in the future, but I first want to determine here how probable that such a proposal would be discussed. What would you say the answers to those three questions are?