Emoji mappings in Shift JIS / CP932/943

Christoph Päper christoph.paeper at crissov.de
Sat Dec 3 16:37:12 CST 2016

Markus Scherer <markus.icu at gmail.com>:
> On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper <christoph.paeper at crissov.de> wrote:
>> Could and should custom vendor extensions like the ones documented in [EmojiSources.txt] be included in these mappings?
> They could, but it would be best for vendors to publish their actual mappings rather than others guessing them.

If an existing character encoding forms the (sole) base of an addition to Unicode, shouldn’t it be part of the UTC’s job to document these sources? This was obviously done in the case of Japanese emoji, hence the existence of EmojiSources.txt, but for some reason that’s been kept separate from related mapping data files. 

I’m not sure the documentation is equally well available for emojis (also) taken from ARIB, W*dings etc. (cf. https://twitter.com/FakeUnicode/status/801740535073361920), and I have never seen an authoritative mapping from ASCII emoticons and line art or from kaomojis to Unicode emojis. (There are plenty of implementations of conversion routines, some open-source or well documented, others not.)

> At this point, the Emoji vendor mappings are not very relevant any more because Unicode has added many Emoji symbols that are not in the old vendor charsets.

Sure, but hardly anybody will ever want to convert Unicode emojis to Shift JIS, just (still rarely) the other way around.
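
To illustrate that direction, here is a minimal sketch of turning EmojiSources.txt-style data into a Shift JIS to Unicode lookup table. It assumes the semicolon-separated field layout of that file (Unicode code point or sequence, then the DoCoMo, KDDI and SoftBank Shift JIS codes, any of which may be empty). The function name `parse_emoji_sources`, the vendor labels and the sample lines are hypothetical and only serve the example; the sample values are not verified against the real file.

```python
def parse_emoji_sources(lines):
    """Build {(vendor, shift_jis_code): unicode_string} from EmojiSources.txt-style lines."""
    vendors = ("docomo", "kddi", "softbank")  # assumed column order
    table = {}
    for raw in lines:
        # Drop comments and skip empty lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        # Field 0 holds one or more hexadecimal code points separated by spaces.
        text = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        # Fields 1-3 hold the vendor codes; an empty field means "no mapping".
        for vendor, code in zip(vendors, fields[1:4]):
            if code:
                table[(vendor, int(code, 16))] = text
    return table


# Illustrative input only: values are placeholders in the assumed layout.
sample = [
    "# Unicode; DoCoMo; KDDI; SoftBank",
    "0023 20E3;F985;F489;F7B0",
    "1F34E;;F6B1;",  # hypothetical line with two empty vendor fields
]

table = parse_emoji_sources(sample)
print(table[("docomo", 0xF985)])
```

Keying the table on the vendor as well as the code is a deliberate choice in this sketch, since the carriers did not share a single code space.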

>> Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained at all as characters get added to subsequent releases of Unicode?
> I am not aware of anyone working on them. If there is one that you think would be valuable to add or update, you can propose specific data.

For __ML at least, there seem to be more up-to-date mappings available at <https://www.w3.org/2003/entities/2007/htmlmathml.ent> or <https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV format as preferred at Unicode.
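
Converting the JSON variant into a flat, semicolon-separated listing would not take much. The following is a rough sketch, assuming the structure used by entities.json (each entity name maps to an object with a `codepoints` array); the function name `entities_to_rows` and the output layout are my own assumptions, not an established Unicode format. It works on a small inline sample rather than fetching the file.

```python
import json


def entities_to_rows(entities):
    """Return 'CODEPOINTS;name' rows from an entities.json-style dict."""
    rows = []
    for name, info in sorted(entities.items()):
        # Skip legacy names that lack the trailing semicolon (e.g. "&Aacute"),
        # since they would only duplicate the regular entry.
        if not name.endswith(";"):
            continue
        codepoints = " ".join(f"{cp:04X}" for cp in info["codepoints"])
        rows.append(f"{codepoints};{name[1:-1]}")
    return rows


# Small inline sample in the assumed structure.
sample = json.loads(r'''
{
  "&Aacute;": {"codepoints": [193], "characters": "\u00C1"},
  "&Aacute": {"codepoints": [193], "characters": "\u00C1"},
  "&NotEqualTilde;": {"codepoints": [8770, 824], "characters": "\u2242\u0338"}
}
''')

rows = entities_to_rows(sample)
for row in rows:
    print(row)
```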

I haven’t gone through all of them, but I think most entries claiming a missing equivalent character in Unicode are outdated. Then there are some edge cases, e.g. Apple could easily have claimed that U+1F34E or U+1F34F maps to their company logo in their typefaces/charsets/encodings. (There’s no Window emoji, by the way, just a Door or a Frame with Picture and ❖.)

> https://w3techs.com/technologies/history_overview/character_encoding

Sure, the conversion to UTF-8 on the Internet is finally happening, but there’ll always be someone who’s tasked with rescuing or investigating some obscure files from a floppy or mainframe.
