From christoph.paeper at crissov.de Fri Dec 2 06:35:41 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 2 Dec 2016 13:35:41 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
Message-ID:

I understand from

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/readme.txt

that Windows codepage 932 (IBM CP943) is basically (a superset of) Shift-JIS (JIS X 0208 A1). There are at least 3 related mapping files:

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT
- http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt
- http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT

I don't know much about Shift-JIS, so this question may sound stupid: Could and should custom vendor extensions like the ones documented in

- http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt

be included in these mappings?

Related English Wikipedia articles:

- https://en.wikipedia.org/wiki/JIS_X_0208
- https://en.wikipedia.org/wiki/Shift_JIS
- https://en.wikipedia.org/wiki/Code_page_932
- https://en.wikipedia.org/wiki/Code_page_943

____

Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained at all as characters get added to subsequent releases of Unicode? For instance, I think that

- http://www.unicode.org/Public/MAPPINGS/VENDORS/MISC/SGML.TXT

(dated 25 July 1997, last modified 8 April 2002) includes several `????` that could be specified nowadays, e.g.:

- epsiv ISOgrk3 0x???? # variant epsilon
+ epsiv ISOgrk3 0x03F5 # GREEK LUNATE EPSILON SYMBOL

From markus.icu at gmail.com Fri Dec 2 10:46:14 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 2 Dec 2016 08:46:14 -0800
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To:
References:
Message-ID:

On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper wrote:

> Could and should custom vendor extensions like the ones documented in
>
> - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt
>
> be included in these mappings?

They could, but it would be best for vendors to publish their actual mappings rather than others guessing them.

At this point, the Emoji vendor mappings are not very relevant any more because Unicode has added many Emoji symbols that are not in the old vendor charsets.

In general, the biggest value of the Unicode mapping tables was for cross-reference with existing standards when Unicode was being established.

> Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained
> at all as characters get added to subsequent releases of Unicode?

I am not aware of anyone working on them. If there is one that you think would be valuable to add or update, you can propose specific data.

Viele Grüße,
markus

PS: One of my favorite charts: https://w3techs.com/technologies/history_overview/character_encoding

From verdy_p at wanadoo.fr Fri Dec 2 12:15:22 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 2 Dec 2016 19:15:22 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To:
References:
Message-ID:

2016-12-02 17:46 GMT+01:00 Markus Scherer:

> On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper <christoph.paeper at crissov.de> wrote:
>
>> Could and should custom vendor extensions like the ones documented in
>>
>> - http://unicode.org/Public/UCD/latest/ucd/EmojiSources.txt
>>
>> be included in these mappings?
>
> They could, but it would be best for vendors to publish their actual
> mappings rather than others guessing them.

Sometimes these vendors no longer exist, or have been bought by another company that has stopped all development and support for their legacy systems. Then all that remains is a community of users who have documents or data to adapt to the newer standard: they'll try to "guess" some best-fit mappings so they can still use the data and encoded documents that remain in their archives (and not always in an easily printable form such as PDFs).

However, the most important documents to save are unlikely to contain emojis, which are basically used in interactive conversations between individual users who have not archived them (and probably don't want these conversations to be archived for long, considering that most of them are private).

From christoph.paeper at crissov.de Sat Dec 3 16:37:12 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sat, 3 Dec 2016 23:37:12 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To:
References:
Message-ID:

Markus Scherer:
> On Fri, Dec 2, 2016 at 4:35 AM, Christoph Päper wrote:
>> Could and should custom vendor extensions like the ones documented in [EmojiSources.txt] be included in these mappings?
>
> They could, but it would be best for vendors to publish their actual mappings rather than others guessing them.

If an existing character encoding forms the (sole) base of an addition to Unicode, shouldn't it be part of the UTC's job to document these sources? This was obviously done in the case of Japanese emoji, hence the existence of EmojiSources.txt, but for some reason that's been kept separate from related mapping data files.

I'm not sure the documentation is equally well available for emojis (also) taken from ARIB, W*dings etc. (cf. https://twitter.com/FakeUnicode/status/801740535073361920) and I have never seen an authoritative mapping from ASCII emoticons and line-art or from kaomojis to Unicode emojis. (There are plenty of implementations of conversion routines, some open-source or well documented, others not.)

> At this point, the Emoji vendor mappings are not very relevant any more because Unicode has added many Emoji symbols that are not in the old vendor charsets.

Sure, but hardly anybody will ever want to convert Unicode emojis to Shift JIS, just (still rarely) the other way around.

>> Furthermore, are the files in /Public/MAPPINGS/ supposed to be maintained at all as characters get added to subsequent releases of Unicode?
>
> I am not aware of anyone working on them. If there is one that you think would be valuable to add or update, you can propose specific data.

For __ML at least, there seem to be more up-to-date mappings available at <https://www.w3.org/2003/entities/2007/htmlmathml.ent> or <https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV format as preferred at Unicode.

I haven't gone through all of them, but I think most entries claiming a missing equivalent character in Unicode are outdated. Then there are some edge cases, e.g. Apple could easily have claimed that U+1F34E or U+1F34F maps to their company logo in their typefaces/charsets/encodings. (There's no Window emoji, by the way, just a Door or a Frame with Picture and ?.)

> https://w3techs.com/technologies/history_overview/character_encoding

Sure, the conversion to UTF-8 on the Internet is finally happening, but there'll always be someone who's tasked with rescuing or investigating some obscure files from a floppy or mainframe.
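
[Editorial sketch, not part of the original message.] To find the entries that still claim a missing Unicode equivalent, one could scan a mapping file for `0x????` placeholders. The following is a minimal Python sketch that assumes the whitespace-separated line layout quoted in this thread (`name set 0xXXXX # comment`); the real SGML.TXT layout may differ in details, and the names `EntityMapping`, `parse_line` and `unmapped_entities` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class EntityMapping(NamedTuple):
    name: str                   # entity name, e.g. "epsiv"
    entity_set: str             # entity set, e.g. "ISOgrk3"
    code_point: Optional[int]   # None when the file says 0x????
    comment: str                # text after '#', may be empty


# Assumed layout: name, set, 0x-prefixed code (or question marks), optional comment.
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+0x([0-9A-Fa-f?]{4,6})\s*(?:#\s*(.*))?$")


def parse_line(line: str) -> Optional[EntityMapping]:
    """Parse one mapping line; return None for blank or comment-only lines."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name, entity_set, code, comment = match.groups()
    code_point = None if "?" in code else int(code, 16)
    return EntityMapping(name, entity_set, code_point, comment or "")


def unmapped_entities(lines: Iterable[str]) -> List[str]:
    """Return the names of all entries that have no code point assigned."""
    parsed = (parse_line(line) for line in lines)
    return [entry.name for entry in parsed if entry and entry.code_point is None]


if __name__ == "__main__":
    sample = [
        "# comment-only line",
        "epsiv ISOgrk3 0x???? # variant epsilon",
        "epsiv ISOgrk3 0x03F5 # GREEK LUNATE EPSILON SYMBOL",
    ]
    print(unmapped_entities(sample))
```

The resulting list of names could then be checked by hand against a newer entity table before proposing specific data.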
From markus.icu at gmail.com Sat Dec 3 17:21:25 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Sat, 3 Dec 2016 15:21:25 -0800
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To:
References:
Message-ID:

On Sat, Dec 3, 2016 at 2:37 PM, Christoph Päper wrote:

> If an existing character encoding forms the (sole) base of an addition to
> Unicode, shouldn't it be part of the UTC's job to document these sources?
> This was obviously done in the case of Japanese emoji, hence the existence
> of EmojiSources.txt, but for some reason that's been kept separate from
> related mapping data files.

For the Japanese carriers, we had information about their Shift-JIS VDC assignments (Vendor-Defined Characters) but not about their non-VDC Shift-JIS usage. We only documented the VDC assignments, in a form that documented our decisions on unifying symbols across the three main carriers (which was in turn based on their 2006 cross-mapping agreement) and encoding of Unicode code points.

But I think you are right, there probably was not really a good reason to put EmojiSources.txt into the UCD rather than into MAPPINGS.

You could submit a proposal to move EmojiSources.txt to the MAPPINGS.

> I'm not sure the documentation is equally well available for emojis (also)
> taken from ARIB, W*dings etc. (cf.
> https://twitter.com/FakeUnicode/status/801740535073361920)

The W*dings are not charsets but symbol fonts which were used with a generic Unicode PUA range. (After standardization, they may have gained mappings for the new assignments.)

I think the ARIB symbols were lists of characters that wanted to be encoded in Unicode so that PUA and VDCs could be avoided, so there probably was no charset to map to either.

In any case, there might well be examples of characters from other charsets whose mappings are documented in the proposal docs rather than in MAPPINGS. If you find examples of such, you could collect the data and propose their additions to MAPPINGS.

Remember that the Unicode Consortium is run by volunteers. Yes, many of us work for member companies, but we often do Unicode work in addition to our "main jobs". (Some continue to contribute even after retirement!)

> and I have never seen an authoritative mapping from ASCII emoticons and
> line-art or from kaomojis to Unicode emojis. (There are plenty
> implementations of conversion routines, some open-source or well
> documented, others not.)

I would say that, as far as Unicode is concerned, the canonical "mapping" for those are the Unicode *sequences* that are used to represent them.

>> At this point, the Emoji vendor mappings are not very relevant any more
>> because Unicode has added many Emoji symbols that are not in the old vendor
>> charsets.
>
> Sure, but hardly anybody will ever want to convert Unicode emojis to Shift
> JIS, just (still rarely) the other way around.

Good point.

I assume most do something like what we (Google) do: Take a base Shift-JIS mapping (we use windows-932 I think), remove the VDC-range mappings that conflict with a vendor's emoji range, and add the vendor's emoji mappings. You can see examples for this in Android's ICU source tree.

> For __ML at least, there seem to be more up-to-date mappings available at
> <https://www.w3.org/2003/entities/2007/htmlmathml.ent> or
> <https://html.spec.whatwg.org/multipage/entities.json>, but not in a CSV
> format as preferred at Unicode.
>
> I haven't gone through all of them, but I think most entries claiming a
> missing equivalent character in Unicode are outdated.

Maybe the user community is better served via w3.org and/or whatwg.org; if so, we could add a link from the MAPPINGS files to there.

markus
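
[Editorial sketch, not part of the original message.] The "base mapping plus vendor overlay" approach described above could look roughly like the following in Python. This is only an illustration under stated assumptions, not the ICU or Android implementation: the base charset is Python's built-in `cp932` codec, the function name `decode_sjis_with_overlay` is hypothetical, and the single overlay entry is a placeholder rather than verified carrier data (real entries would come from a vendor table or EmojiSources.txt).

```python
from typing import Dict


def decode_sjis_with_overlay(data: bytes, overlay: Dict[bytes, str],
                             base: str = "cp932") -> str:
    """Decode Shift-JIS bytes, letting overlay entries win over the base charset.

    Any byte sequence present in `overlay` (e.g. a vendor-defined emoji code)
    is mapped to its overlay string; everything else goes through `base`.
    """
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        # Shift-JIS lead bytes are 0x81-0x9F and 0xE0-0xFC; all others are single-byte.
        is_lead = 0x81 <= byte <= 0x9F or 0xE0 <= byte <= 0xFC
        width = 2 if is_lead and i + 1 < len(data) else 1
        chunk = data[i:i + width]
        if chunk in overlay:
            out.append(overlay[chunk])  # the vendor mapping replaces the base one
        else:
            out.append(chunk.decode(base, errors="replace"))
        i += width
    return "".join(out)


if __name__ == "__main__":
    # Placeholder overlay: one vendor-range code mapped to U+2600 for illustration.
    vendor_overlay = {b"\xf8\x9f": "\u2600"}
    text = decode_sjis_with_overlay(b"A\xf8\x9f\x82\xa0", vendor_overlay)
    print([hex(ord(ch)) for ch in text])
```

Looking up the overlay first has the same effect as removing the conflicting VDC-range mappings from the base table, without having to edit the base table itself.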
From verdy_p at wanadoo.fr Sat Dec 3 17:26:47 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 4 Dec 2016 00:26:47 +0100
Subject: Emoji mappings in Shift JIS / CP932/943
In-Reply-To:
References:
Message-ID:

2016-12-03 23:37 GMT+01:00 Christoph Päper:

> I haven't gone through all of them, but I think most entries claiming a
> missing equivalent character in Unicode are outdated. Then there are some
> edge cases, e.g. Apple could easily have claimed that U+1F34E or U+1F34F
> maps to their company logo in their typefaces/charsets/encodings. (There's
> no Window emoji, by the way, just a Door or a Frame with Picture and ?.)

There's also U+229E "plus in a square" ⊞ (from mathematical operators), which currently best approximates the Windows symbol; it is used for example in documents needing to show keystrokes used by a UI. Notably on several wikis -- where it is also decorated (like all other function keys or alphanumeric keys) by some CSS-generated frames, background colors with linear gradients and shadows to simulate the form of a physical key.

For the key on Apple keyboards, the Apple logo is usually replaced by the U+2318 "fleuron"-like symbol ⌘ (from technical symbols), initially intended to mean "point of interest" (now used on Mac keyboards, and long used in documents originating from Apple itself and in its UI), so that the Apple logo is not needed for documenting or implementing a UI for MacOS.

From reini at cpanel.net Sun Dec 4 05:09:36 2016
From: reini at cpanel.net (Reini Urban)
Date: Sun, 4 Dec 2016 12:09:36 +0100
Subject: Mixed-Script confusables in prog.languages
Message-ID: <2BC2D526-7AAE-4D42-A915-D024F32EEC01@cpanel.net>

I'm working on adding Mixed-Script confusable protection to a programming language, cperl, a perl5 fork, for security reasons, for its identifiers, i.e. variable names, package names, function names, literals.

This is a bit different from the typical use cases of libidna, in email or browsers.

Is anybody aware of any other language implementation which does confusable or mixed-script protection? I think R has something, because it has this header: https://cran.r-project.org/bin/windows/extsoft/3.4/include/unicode/uspoof.h but I found nothing else, which is quite annoying.

My approach is as follows:

* Normalize identifiers (NFC) and only store normalized variants. This should catch bidi spoofs, combining characters and such.
* Check each Unicode code point for its Script property and, besides Latin, Common and Inherited, only allow the first script encountered, but error on any other mixed script. Additional scripts need to be declared (https://github.com/perl11/cperl/issues/229), in perl like this:

    use utf8 'Greek', 'Cyrillic';

utf8 is a pragma to allow Unicode identifiers, not strings, to be added to the symbol table. Obviously this has risks when reviewing a codebase, which might even bypass test suites.

This is fast enough, and has no measurable costs in the parser.

Unicode has a nice security/confusables.txt table which could be used for more fine-grained checks, yes. But I fear this is too much overhead for the generic parser, and I think that avoiding the problem by forbidding mixed scripts, or requiring them to be declared, is much easier and more declarative.

Of course there exist several languages which require more than one script, like Japanese = Hiragana and Katakana and maybe Han, Korean = Hangul + Han, ... or African languages as some have other than Latin roots, e.g. Ethiopian from Semitic. Indian languages also sound problematic, and all the Old_