Dealing with Unencodeable Characters

Philippe Verdy verdy_p at wanadoo.fr
Thu Oct 6 13:06:56 CDT 2016


PUA characters are still used to map corporate logos (from Windows and
Apple/MacOS) in the fonts shipped with the relevant systems.

Microsoft then opted to include these corporate logos (and specific UI
icons) in a separate font, also with PUA mappings, and then added new PUA
fonts as needed.

E.g.:
* "Segoe MDL2 Assets" on Windows 10 (even though many of these symbols are
also encoded separately with standard codes, they are duplicated here only
to make sure they have a coherent design and metrics instead of being taken
from various random fonts). There are for example icons representing
battery levels, wifi reception levels with bars, status icons for muting
on/off some devices or UI services for talks, cameras, selection of screen,
enabling/disabling the touch interface, displaying the state of headphones,
presenting incoming phone calls or keeping them silent... and several
variants of common arrows and common geometric symbols, or even some
characters for the Windows calculator such as common arithmetic signs.
You'll note many variants of arrow heads. Maybe these characters are also
used as internal fallbacks in IE/Edge, but all this is left completely
undocumented (deliberately, in my opinion, to make sure that other users
will not create and exchange documents intended to be interoperable).
* "Webdings" contains various elaborate icons that are designed to be
realistic rather than symbolic, sometimes in several locale-sensitive
variants (e.g. the Earth globe, centered on America, or Europe/Africa, or
on Asia/Australia). Here again you'll find various arrow heads for
displaying UI buttons.
* "Wingdings" and "Wingdings 2" here again map various forms of arrows and
arrow heads, plus some emojis or enclosed characters, or decorative
characters. "Wingdings" also includes another Windows logo at position
0xFF; these fonts are not mapped to Unicode but to 8-bit code positions
0x21..0xFF.
* "Wingdings 3" uses a mix of non-Unicode mappings in 0x21..0xFF and
repeated mappings of the same glyphs at regular Unicode positions (in
0x2000..0x9FFF): they recur in every block of 0x100 code positions, i.e.
each glyph is mapped 128 or 129 times in that font. None of these
characters have a Unicode mapping.
* You probably remember the case of the "Marlett" font created to support
the UI of Windows 7 (but most positions are assigned to .notdef/"tofu") and
that has a position 0x57 mapped to a Windows logo. There's also an old font
"MT Extra" made by MathType (in 1996 according to its details), containing
some maths symbols (probably still used by some modules in the Equation
Editor for compatibility with documents created with old versions of
Office). These two fonts use only 8-bit code mappings (in 0x21..0xFF, but
most of them are mapped to a .notdef/"tofu" glyph).

Such fonts are installed and used by specific software modules, and at
discrete font sizes and not even hinted (they could as well use collections
of scalable vector graphics, but a single font allows these symbols to be
more efficiently loaded and to be hinted for low resolution display at
small font sizes). They may still be used in other applications but without
any guarantee of interoperability or support for upgrades/downgrades across
Windows versions. In fact these fonts are not really supported outside of
the specific software modules needing them to render their UI. They may
disappear or change significantly at any time.
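
Since text relying on such PUA mappings is not interoperable, it can help
to detect it before exchanging a document. Here is a minimal sketch in
Python, assuming only that "private use" means the Unicode general category
Co; the function name find_private_use and the sample code point U+E850 are
chosen purely for illustration and do not come from any of the fonts above.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') pairs for each private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        # General category "Co" covers the BMP PUA and planes 15 and 16.
        if unicodedata.category(ch) == "Co"
    ]


# U+E850 is an arbitrary PUA code point used only as an example.
sample = "Battery \ue850 level"
print(find_private_use(sample))
```

A non-empty result only says that the text depends on some private
agreement (typically a specific font); it cannot say which one.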

2016-10-06 16:54 GMT+02:00 Charlotte Buff <irgendeinbenutzername at gmail.com>:

> One of Unicode's goals is round-trip compatibility with old legacy
> character sets, which is why we gathered many compatibility characters over
> time that would normally have been out of scope for the standard. It's why
> Zapf Dingbats and Arabic presentation forms are in Unicode, for example.
> However, there are some characters that form part of these sets yet are
> deliberately not encoded in Unicode because they were considered unsuitable
> for inclusion. The two that come to mind are the Windows logo from
> Wingdings and the Shibuya 109 emoji from the original Japanese vendor sets.
>
> Given that these two have no Unicode equivalents, their source character
> sets are not fully compatible with Unicode, i.e. there is going to be data
> loss and confusion when trying to convert into or from Unicode.
>
> If theoretically I wanted to convert an old Shift JIS document containing
> emoji to Unicode, how should I ideally handle Shibuya 109?
>
> I remember the early emoji proposal documents originally contained "emoji
> compatibility symbols" which were used to map to source characters that
> weren't meant to be included with a specified semantic. I believe STATUE OF
> LIBERTY was one of those characters and was simply called EMOJI
> COMPATIBILITY SYMBOL-XX so that that specific landmark wouldn't strictly be
> part of Unicode. Obviously this approach ultimately wasn't implemented,
> but I wonder whether there could be designated compatibility characters for
> this kind of issue. Private use characters are an obvious choice but of
> course their meaning is user-defined, so while all other emoji in my Shift
> JIS document would receive an unambiguous Unicode mapping, Shibuya 109
> would remain vague and very limited in interchange options.
>
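
For the conversion question quoted above, one possible approach is a
private agreement that keeps the conversion reversible. The following is
only a sketch in Python under stated assumptions: the byte sequences, the
PUA code point U+E000, and the names STANDARD_TABLE, PRIVATE_AGREEMENT,
to_unicode and to_legacy are all hypothetical placeholders, not values from
any vendor table or existing converter.

```python
# Placeholder stand-in for a real vendor-to-Unicode mapping table.
# The legacy byte sequence is hypothetical.
STANDARD_TABLE = {
    b"\xf0\x02": "\U0001F5FD",  # U+1F5FD STATUE OF LIBERTY
}

# Hypothetical private agreement: a legacy code with no Unicode equivalent
# (standing in for "Shibuya 109") gets a BMP private-use code point.
PRIVATE_AGREEMENT = {
    b"\xf0\x01": "\ue000",
}


def to_unicode(code):
    """Map one legacy code to Unicode, falling back to the private agreement."""
    if code in STANDARD_TABLE:
        return STANDARD_TABLE[code]
    # U+FFFD marks a code that neither table knows: information is lost.
    return PRIVATE_AGREEMENT.get(code, "\ufffd")


def to_legacy(char):
    """Map one Unicode character back to its legacy code, or None if unknown."""
    for table in (STANDARD_TABLE, PRIVATE_AGREEMENT):
        for code, mapped in table.items():
            if mapped == char:
                return code
    return None


# The private-use fallback survives a round trip through Unicode.
print(to_legacy(to_unicode(b"\xf0\x01")) == b"\xf0\x01")
```

The round trip only works between parties sharing the same agreement, which
is exactly the limitation on interchange described in the quoted message.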