From davidj_faulks at yahoo.ca Sat Feb 6 08:11:27 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Sat, 6 Feb 2016 14:11:27 +0000 (UTC)
Subject: Uranian Astrology Symbols
References: <1705355011.185822.1454767887083.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1705355011.185822.1454767887083.JavaMail.yahoo@mail.yahoo.com>

Hello,

I'm investigating the possibility of adding more astrology symbols to
Unicode. There is a branch of Western astrology known as "Uranian
Astrology", or the "Hamburg School", which among other things uses a set
of 8 "astrological planets" (Cupido, Hades, Zeus, Kronos, Apollon,
Admetos, Vulcanus, Poseidon). These "planets" have well-defined symbols.

Here are some sites on Uranian Astrology:

http://theuranianastrologer.com/
http://uraniansociety.com/
http://arlenekramer.net/uranian.asp
http://www.uranian-institute.org/
http://eastrologer.net/uranian-astrology/
https://uranianastrologybooks.com/

The last one reveals that there are many published books on this type of
astrology. However, blindly buying books just on the chance that they
might contain in-text examples of these symbols (to use as examples in
the proposal) is not something I feel inclined to do.

Therefore, I am hoping I can use PDF examples found on the internet, such
as ...
http://uraniansociety.com/USIG_articles/article_history_of_uranian_astrology_michael_feist.pdf
  (page 13)
http://www.astrology-x-files.com/report/Johnny%20Carson-Asteroids.pdf
  (scattered use of symbols)
http://www.witte-verlag.com/media/djcatalog/TNE_1870-2070-Pages_4_5_7_199.pdf
  (not a very good example)
http://www.tonybonin.de/IQ-Jauch.PDF
  (page 10)
http://holestoheavens.com/wp-content/uploads/2012/02/astro-copy.pdf
  (page 2 lists a bunch of symbols)
http://www.rojn-info.com/images/1172753867/urephem2004.pdf
  (tabular data)
http://ridoux.fr/spip/IMG/pdf/-31.pdf
  (tables inside charts, not a very good example)
http://ridoux.fr/spip/IMG/pdf/-33.pdf
  (better tables, like on page 7)

In particular, I have found this page: http://www.astrax.de/download.html,
which contains many downloads for what seems to be a German astrology
magazine, and a quick check reveals that at least most of them contain
in-text examples of the Uranian planet symbols. This magazine might even
have been printed; I can't really tell, since I don't speak German.

Also, I've found a description of a TeX package which has them:
http://ansuz.sooke.bc.ca/astrology/starfont/starfont.pdf

I would like some advice and input on whether this is okay, and if so,
which block should receive these symbols.

Also, Eris:
http://www.moreplutos.com/AstroJournal-SeptOct2014_Eris-corr-opt.pdf
(The symbol used there can be unified with U+29EC)

From asmusf at ix.netcom.com Sat Feb 6 15:54:11 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sat, 6 Feb 2016 13:54:11 -0800
Subject: Uranian Astrology Symbols
In-Reply-To: <1705355011.185822.1454767887083.JavaMail.yahoo@mail.yahoo.com>
References: <1705355011.185822.1454767887083.JavaMail.yahoo.ref@mail.yahoo.com>
 <1705355011.185822.1454767887083.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <56B66B83.1020007@ix.netcom.com>

An HTML attachment was scrubbed...

From davidj_faulks at yahoo.ca Sun Feb 7 14:20:03 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Sun, 7 Feb 2016 20:20:03 +0000 (UTC)
Subject: Uranian Astrology Symbols
References: <425299532.510655.1454876403563.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <425299532.510655.1454876403563.JavaMail.yahoo@mail.yahoo.com>

> On Sun, 2/7/16, Asmus Freytag wrote:

>> On 2/7/2016 4:02 AM, David Faulks wrote:
>> 29EC ⧬ WHITE CIRCLE WITH DOWN ARROW is in
>> *Miscellaneous Mathematical Symbols-B* and has the
>> category Sm, and all of the fonts I have which display it use
>> a glyph identical to the Unicode code charts. The Eris
>> symbol (the one people are using) has a relatively
>> smaller circle, but I thought that that was not considered a good
>> enough reason to justify a new codepoint.

> Yes and no. For mathematical fonts, it's often important
> that different circles relate in size. How does Eris relate to
> Earth in astrological fonts? Is there a clear relation, whether
> same size or one always being smaller? Imagine what would
> happen for a font that covers both mathematical use and
> astrology?
> Would a designer be forced to choose which user
> community to accommodate?

This is somewhat difficult to judge. I don't think astrologers would find
the current glyph for U+29EC unacceptable, but the glyph typically being
used has the circle smaller than Mars, Venus, and either of the two Earth
symbols. Sometimes, an oval is used instead of a circle. However, some
styles for astrology symbols have very large circles.

> So, depending on the facts of how this symbol is used,
> there may well be good reasons to not equate it with the
> mathematical character - but that also means you'll need to
> understand what the latter was encoded for (which you can
> find by searching the document register).

I've found the proposals (from 2000), but many symbols there have no
explained use, including U+29EC.

If the members of this mailing list think a proposal including a separate
Eris symbol is acceptable, I will include it in my proposal.

Along with, perhaps, some additional symbols...

> A./

David

From chris.jacobs at xs4all.nl Sun Feb 7 15:02:54 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Sun, 07 Feb 2016 22:02:54 +0100
Subject: Uranian Astrology Symbols
In-Reply-To: <425299532.510655.1454876403563.JavaMail.yahoo@mail.yahoo.com>
References: <425299532.510655.1454876403563.JavaMail.yahoo.ref@mail.yahoo.com>
 <425299532.510655.1454876403563.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <104b51c4753a1fd5920912bfd32f8661@xs4all.nl>

David Faulks wrote on 2016-02-07 21:20:

> If the members of this mailing list think a proposal including a
> separate Eris symbol is acceptable, I will include it in my proposal.
>
> Along with, perhaps, some additional symbols...
>
>> A./
>
> David

It seems there is no agreement on what the Eris symbol should look like.
This website gives four different shapes, not counting the Discordian
one.
http://www.zanestein.com/Trans-pluto.htm#UB313

Chris

From davidj_faulks at yahoo.ca Sun Feb 7 15:51:48 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Sun, 7 Feb 2016 21:51:48 +0000 (UTC)
Subject: Uranian Astrology Symbols
References: <67654298.510268.1454881908960.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <67654298.510268.1454881908960.JavaMail.yahoo@mail.yahoo.com>

(making sure my response goes to the mailing list)

> On Sun, 2/7/16, Chris Jacobs wrote:

> Seems there is no agreement what the Eris symbol should
> look like. This website gives four different shapes, not counting
> the Discordian one.
> http://www.zanestein.com/Trans-pluto.htm#UB313

> Chris

The situation seems to have settled down now. I have looked at plenty of
astrological charts, and the circle/oval with downwards arrow is all over
the place, including the covers of books.
None of the charts used Zane Stein's symbol, one early chart used the
"Hand of Eris", and one Polish chart used the "Polish Symbol". All of the
others used the circle with downwards arrow, and I have read it described
as "now-standardized".

David

From frederic.grosshans at gmail.com Sun Feb 7 17:09:33 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 8 Feb 2016 00:09:33 +0100
Subject: Shouldn’t the proposed U+23FF OBSERVER EYE SYMBOL be an emoji ?
Message-ID: <56B7CEAD.2040107@gmail.com>

Dear Unicode list readers (cc Simon Griffee, Rick McGowan),

I have some problems with the proposed *U+23FF OBSERVER EYE SYMBOL (named
so in the pipeline http://www.unicode.org/alloc/Pipeline.html and in the
Draft additional repertoire for ISO/IEC 10646:2016 (5th edition) CD.2
http://www.unicode.org/L2/L2015/15339-n4705.pdf).

As far as I understand, this character is intended to be added to Unicode
to represent the eye which is frequently drawn in optics schematics to
stand for the observer. Simon Griffee has proposed this symbol in
L2/15-031R (http://www.unicode.org/L2/L2015/15031r-observer.pdf), with
some more examples provided by Rick McGowan in L2/15-095
(http://www.unicode.org/L2/L2015/15095-observer-examples.pdf).

In a few words (more details below), I think this character is actually
used beyond optics and should be encoded as an emoji with properties (and
aspect) similar to 👁 U+1F441 EYE, with a name like EYE SIDE VIEW. I also
think it would be better if it were moved to an emoji block (1F900–1F9FF
Supplemental Symbols and Pictographs?).

I intend to write and submit a formal document later, and I am writing
this mail in order to gather advice on the best way to proceed.

Frédéric

=== The details of my objections ==

I agree with Simon Griffee that this is a standard symbol used in optics
and related fields, and that it is attested from the 16th to the 21st
century.
It is clearly needed, and I have seen other characters, like e.g.
∢ U+2222 SPHERICAL ANGLE, used on diagrams to replace it. However, I have
not seen it intermixed with plain text, and I don't find the example of
L2/15-031 convincing, but I'm not sure whether this kind of criterion is
relevant in the current "emoji-era".

I think this symbol is better represented by an emoji named EYE SIDE VIEW
(or a similar name).

I. Other common representations of the observer in an optical context are
encoded as emojis. Three examples are 🎥 👁 ☺:

  🎥 U+1F3A5 MOVIE CAMERA (e.g. fig 8 of http://arxiv.org/abs/1502.03809 ,
     http://alexrodgers.co.uk/wp-content/uploads/2014/08/raytracing.png )
  👁 U+1F441 EYE (e.g. fig K page 124 of
     http://www.e-rara.ch/zut/content/titleinfo/290294, or
     http://653fb62b3a129d296422-3019ba142970aa3e5db9c4ca20cb2da4.r64.cf1.rackcdn.com/images/Nioo9nlDeaYP.878x0.Z-Z96KYq.jpg)
  ☺ U+263A WHITE SMILING FACE (e.g.
     http://hevi.info/img/dissertation-images/MSc_Dissertation_Umut_ERTURK_0703851_Ray_Tracing_On_Cell_html_7af52a91.png)

II. This symbol is often used together with other emoji-like symbols on
schematics, for example 🌣 U+1F323 WHITE SUN, 💡 U+1F4A1 ELECTRIC LIGHT
BULB, but also 🌴 U+1F334 PALM TREE:
http://sciences-physiques.ac-dijon.fr/archives/astronomie/Mirages/images/Mirage_chaud1.gif

III. In centuries-old printed publications as well as on recent websites,
it appears both in a really schematic way (like ?) and as a detailed
graphical drawing, including lashes and brows. This recalls the text vs.
emoji variants induced by VS15 and VS16 on emojis.

IV. The symbol itself is an eye seen from the side, hence the name EYE
SIDE VIEW I propose. I think the name OBSERVER SYMBOL is not suitable,
since this symbol is sometimes used with other semantics, and a
disunification is probably not worth the trouble.
Some examples of other meanings include:

a) The eye itself
   http://thumb1.shutterstock.com/display_pic_with_logo/169/169,1197707547,8/stock-vector-eye-drops-symbol-7816558.jpg
   http://www.bigstockphoto.com/fr/image-21796349/stock-vector-contact-lens-and-eye-symbol-sign-and-button
b) Ophthalmology
   https://en.wikisource.org/wiki/Portal:Medicine
c) Sight
   http://johncrowhurst.me/wp-content/uploads/2011/04/istockphoto_5622861-five-senses-icons11.jpg
   https://ehumanbiofield.wikispaces.com/file/view/istockphoto_2307885_senses.jpg/32748849/istockphoto_2307885_senses.jpg
d) The "eye of the mind", as in this 14th century book reproduced here:
   https://twitter.com/Jean_no/status/613387284356964352
e) If another eye-looking symbol is encoded, one can be almost sure it
   will be used as emoji, and it is probably safer to anticipate this use.

From asmusf at ix.netcom.com Sun Feb 7 17:37:49 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Sun, 7 Feb 2016 15:37:49 -0800
Subject: Uranian Astrology Symbols
In-Reply-To: <104b51c4753a1fd5920912bfd32f8661@xs4all.nl>
References: <425299532.510655.1454876403563.JavaMail.yahoo.ref@mail.yahoo.com>
 <425299532.510655.1454876403563.JavaMail.yahoo@mail.yahoo.com>
 <104b51c4753a1fd5920912bfd32f8661@xs4all.nl>
Message-ID: <56B7D54D.6070703@ix.netcom.com>

On 2/7/2016 1:02 PM, Chris Jacobs wrote:
>
> David Faulks wrote on 2016-02-07 21:20:
>>
>> If the members of this mailing list think a proposal including a
>> separate Eris symbol is acceptable, I will include it in my proposal.
>>
>> Along with, perhaps, some additional symbols...
>>
>>> A./
>>
>> David
>
> Seems there is no agreement what the Eris symbol should look like.
> This website gives four different shapes, not counting the Discordian
> one.
> http://www.zanestein.com/Trans-pluto.htm#UB313
>
> Chris
>

Unicode does not so much encode concepts. Neither does it (normally)
attempt to encode for precise shapes.
What it tries to do is to encode elements that are sufficient to
represent text.

If there are many different conventions for representing a concept,
that's similar to different spellings. Unicode normally supplies all the
elements and lets the users choose the spelling. The big exception is
when the unit of spelling (for example, in normal text, that would be a
letter) itself has a range of appearances.

So, the question here would be: are these different shapes of the same
symbol, or different symbols used for the same purpose?

From the website I would think we have at least 4 distinct symbols. Two
of the shapes look like they might be alternate representations of the
same symbol.

A./

From asmus-inc at ix.netcom.com Sun Feb 7 18:57:15 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 7 Feb 2016 16:57:15 -0800
Subject: Re: Shouldn’t the proposed U+23FF OBSERVER EYE SYMBOL be an emoji ?
In-Reply-To: <56B7CEAD.2040107@gmail.com>
References: <56B7CEAD.2040107@gmail.com>
Message-ID: <56B7E7EB.8090607@ix.netcom.com>

An HTML attachment was scrubbed...

From jtauber at jtauber.com Mon Feb 8 11:10:55 2016
From: jtauber at jtauber.com (James Tauber)
Date: Mon, 8 Feb 2016 11:10:55 -0600
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
Message-ID:

The Greek Extended block includes precomposed characters for vowels with
all known combinations of accents, breathing and iota subscript.

It also includes precomposed characters for the vowels alpha, iota and
upsilon with macron. (Those three vowels are ambiguously short or long,
hence the need to mark length in some contexts.)

However, there is no precomposition of vowels with accents and/or
breathing PLUS macron. (Vowels with iota subscript are always long, so
they don't need a macron to indicate length.)

This isn't normally an issue in running polytonic Greek text, where vowel
length is rarely shown, but it does occur in lexicons, grammars, etc.

I'm wondering what potential objections / problems I should be aware of
before trying to put together a proposal for these extra precomposed
characters to be included.

I wrote a blog post about this issue more broadly (not all of which has
to do with Unicode) but which still might be of interest:

http://jktauber.com/2016/01/28/polytonic-greek-unicode-is-still-not-perfect/

James

From otto.stolz at uni-konstanz.de Mon Feb 8 11:26:44 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Mon, 8 Feb 2016 18:26:44 +0100
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
Message-ID: <56B8CFD4.1070105@uni-konstanz.de>

Hello,

I am wondering how U+02B9 MODIFIER LETTER PRIME made its way into the
Unicode repertoire, and how it acquired its comment "transliteration of
mjagkij znak (Cyrillic soft sign: palatalization)".

ISO/R 9:1954 through ISO/R 9:1986 map the mjagkij znak "ь" to the
apostrophe, and so does DIN 1460:1982. The latter clearly depicts the
apostrophe that later became U+02BC, while I am not sure whether ISO/R 9
does so as well, or rather depicts a glyph like U+0027. (All of these
standards predate Unicode, so they just depict glyphs.)

ISO/R 9:1995 maps the mjagkij znak "ь" to the prime, particularly to the
modifier letter U+02B9, in accordance with the comment in the Unicode
charts.

Unicode archeologists, can you shed some light on the history of both
U+02B9 and the mjagkij znak? And linguists, can you tell me how the
mjagkij znak is transliterated normally, as an apostrophe or as a prime?

Thanks for any comments,
  Otto

From doug at ewellic.org Mon Feb 8 12:30:49 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 08 Feb 2016 11:30:49 -0700
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
Message-ID: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>

James Tauber wrote:

> I'm wondering what potential objections / problems I should be aware
> of before trying to put together a proposal for these extra
> precomposed characters to be included.

It sounds from the blog post that the basic rationale for adding
precomposed characters is that existing fonts, input methods, and other
tools don't always work correctly with the combining sequences.

I suppose one potential challenge you might face is to explain why the
following FAQ items, though phrased in terms of Latin base letters,
don't apply equally to Greek:

http://www.unicode.org/faq/char_combmark.html#11
http://www.unicode.org/faq/char_combmark.html#12b

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From jtauber at jtauber.com Mon Feb 8 12:47:30 2016
From: jtauber at jtauber.com (James Tauber)
Date: Mon, 8 Feb 2016 12:47:30 -0600
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
References: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
Message-ID:

On Mon, Feb 8, 2016 at 12:30 PM, Doug Ewell wrote:

> James Tauber wrote:
>
> > I'm wondering what potential objections / problems I should be aware
> > of before trying to put together a proposal for these extra
> > precomposed characters to be included.
>
> It sounds from the blog post that the basic rationale for adding
> precomposed characters is that existing fonts, input methods, and other
> tools don't always work correctly with the combining sequences.
>
> I suppose one potential challenge you might face is to explain why the
> following FAQ items, though phrased in terms of Latin base letters,
> don't apply equally to Greek:
>
> http://www.unicode.org/faq/char_combmark.html#11
> http://www.unicode.org/faq/char_combmark.html#12b
>

Yes, I read those FAQs and hesitated before even posting because of them.

The Greek Extended block already somewhat contradicts that by having the
precomposed characters it does, but I presume that was largely for legacy
reasons and existing font encodings.

There's no doubt the font and input methods can be improved right now
regardless of any change to Unicode.

That said, I still have questions around relative ordering of combining
characters and also interaction of combining characters and precomposed
characters. At the very least I'd like to put together some best
practices for those dealing with polytonic Greek, even before I go to
font foundries and keyboard software developers.

Even with all this, though, my own work includes accentuation and
syllabification algorithms, all of which are made more cumbersome by the
lack of precomposed characters indicating vowel length. I'm currently
leaning towards adding a layer of "character" processing on top of
Python 3's otherwise decent support that effectively treats the relevant
character sequences as single characters even if they aren't (and can't
be precomposed).

I'd be interested if others have tackled similar issues outside of Greek.

James

From markus.icu at gmail.com Mon Feb 8 13:10:20 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 8 Feb 2016 11:10:20 -0800
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To:
References: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
Message-ID:

On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote:

> Even with all this, though, my own work includes accentuation and
> syllabification algorithms, all of which are made more cumbersome by the
> lack of precomposed characters indicating vowel length. I'm currently
> leaning towards adding a layer of "character" processing on top of Python
> 3's otherwise decent support that effectively treats the relevant character
> sequences as single characters even if they aren't (and can't be
> precomposed).
>

I suggest you normalize the text (NFC or NFD), and then look for
"grapheme clusters".
http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

In C++ and Java, you could use an ICU BreakIterator for the latter.

markus

From leob at mailcom.com Mon Feb 8 13:34:14 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Mon, 8 Feb 2016 11:34:14 -0800
Subject: Enclosing BANKNOTE emoji?
Message-ID:

There are

💴 U+01F4B4 Banknote With Yen Sign
💵 U+01F4B5 Banknote With Dollar Sign
💶 U+01F4B6 Banknote With Euro Sign
💷 U+01F4B7 Banknote With Pound Sign

This is clearly an incomplete set. It makes sense to have a generic
"enclosing banknote" emoji character which, when combined with a
currency sign, would produce the corresponding banknote, to forestall
requests for individual emoji for banknotes with remaining currency
signs.

Leo

From davidj_faulks at yahoo.ca Mon Feb 8 13:44:27 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Mon, 8 Feb 2016 19:44:27 +0000 (UTC)
Subject: Uranian Astrology Symbols
References: <1475789143.886571.1454960667619.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1475789143.886571.1454960667619.JavaMail.yahoo@mail.yahoo.com>

> On Sun, 2/7/16, Asmus Freytag wrote:

> Subject: Re: Uranian Astrology Symbols
> To: "Chris Jacobs" , "David Faulks"
> Cc: "Unicode Mailing List"
> Received: Sunday, February 7, 2016, 6:37 PM

>> On 2/7/2016 1:02 PM, Chris Jacobs wrote:
[ text cut ]
>> This website gives four different shapes, not counting
>> the Discordian one.
>> http://www.zanestein.com/Trans-pluto.htm#UB313
>>
>> Chris
[ text cut ]
> So, the question here would be: are these different shapes of the
> same symbol, or different symbols used for the same purpose.

> From the website I would think we have at least 4
> distinct symbols. Two of the shapes look like they might be
> alternate representations of the same symbol.

In addition to Eris, there is also a related issue for Pluto. The
encoding of U+26E2 ⛢, separate from U+2645 ♅, for Uranus, seems to set a
precedent, and there are at least 3 extra symbols for Pluto that are in
use. This has been discussed before. Should these (or at least the most
common one) be encoded as well?

> A./

David

From liz at dijkmat.nl Mon Feb 8 13:29:35 2016
From: liz at dijkmat.nl (Elizabeth Mattijsen)
Date: Mon, 8 Feb 2016 20:29:35 +0100
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To:
References: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
Message-ID:

> On 08 Feb 2016, at 20:10, Markus Scherer wrote:
>
> On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote:
> Even with all this, though, my own work includes accentuation and
> syllabification algorithms, all of which are made more cumbersome by
> the lack of precomposed characters indicating vowel length. I'm
> currently leaning towards adding a layer of "character" processing on
> top of Python 3's otherwise decent support that effectively treats the
> relevant character sequences as single characters even if they aren't
> (and can't be precomposed).
>
> I suggest you normalize the text (NFC or NFD), and then look for
> "grapheme clusters".
> http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
>
> In C++ and Java, you could use an ICU BreakIterator for the latter.

Might I suggest looking at Rakudo Perl 6's implementation of NFG
(Normalization Form Grapheme), which will generate synthetic codepoints
on the fly under the hood.

For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf

Liz

From leob at mailcom.com Mon Feb 8 17:33:50 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Mon, 8 Feb 2016 15:33:50 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:
Message-ID:

I don't see why it is an "emoji exception", and I don't see any
implementation issues given that replacing pairs of regional indicator
symbols with the corresponding flags already works on many platforms.

The rationale for COMBINING BANKNOTE is specifically to avoid the need
for individual banknote emoji for every extant or future currency.
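
For readers unfamiliar with the flag mechanism mentioned above, here is a
minimal Python sketch of how a two-letter region code maps onto a pair of
regional indicator symbols (U+1F1E6..U+1F1FF). The function name
regional_indicator_pair is made up for this illustration, and whether the
pair is actually displayed as a flag depends on the platform's fonts. It
shows only the existing flag convention; no banknote mechanism of this
kind is encoded.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def regional_indicator_pair(region_code):
    """Map a two-letter ASCII region code to two regional indicator symbols.

    Hypothetical helper for illustration only: it does not check that the
    code is an assigned region, only that it is two ASCII letters.
    """
    if len(region_code) != 2 or not (region_code.isascii() and region_code.isalpha()):
        raise ValueError("expected two ASCII letters, got %r" % region_code)
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in region_code.upper()
    )


if __name__ == "__main__":
    for code in ("JP", "US", "GB"):
        pair = regional_indicator_pair(code)
        # Print the code points so the result is readable without a flag font.
        print(code, " ".join("U+%04X" % ord(ch) for ch in pair))
```

The rendering side (turning the pair into one flag glyph) is done by the
font and shaping engine, which is the part an enclosing banknote would
also have to rely on.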

On Mon, Feb 8, 2016 at 12:29 PM, Roozbeh Pournader wrote:

> What's usually ignored in these discussions is how hard it is to actually
> implement such "new" mechanisms in software. I would be against
> such a new mechanism, simply because it's yet another emoji "exception".
>
> If you think there's a need for such emojis (banknote with other
> currencies), please write a proposal for the UTC. I for one would consider
> a proposal with an atomic BANKNOTE WITH RIAL SIGN in a much more positive
> light than one for a COMBINING BANKNOTE.
>
> On Mon, Feb 8, 2016 at 11:34 AM, Leo Broukhis wrote:
>
>> There are
>>
>> 💴 U+01F4B4 Banknote With Yen Sign
>> 💵 U+01F4B5 Banknote With Dollar Sign
>> 💶 U+01F4B6 Banknote With Euro Sign
>> 💷 U+01F4B7 Banknote With Pound Sign
>>
>> This is clearly an incomplete set. It makes sense to have a generic
>> "enclosing banknote" emoji character which, when combined with a
>> currency sign, would produce the corresponding banknote, to forestall
>> requests for individual emoji for banknotes with remaining currency
>> signs.
>>
>> Leo
>>
>>
>

From jtauber at jtauber.com Mon Feb 8 17:59:10 2016
From: jtauber at jtauber.com (James Tauber)
Date: Mon, 8 Feb 2016 17:59:10 -0600
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To:
References: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
Message-ID:

On Mon, Feb 8, 2016 at 1:29 PM, Elizabeth Mattijsen wrote:

> > On 08 Feb 2016, at 20:10, Markus Scherer wrote:
> >
> > On Mon, Feb 8, 2016 at 10:47 AM, James Tauber wrote:
> > Even with all this, though, my own work includes accentuation and
> > syllabification algorithms, all of which are made more cumbersome by the
> > lack of precomposed characters indicating vowel length.
> > I'm currently leaning towards adding a layer of "character" processing
> > on top of Python 3's otherwise decent support that effectively treats
> > the relevant character sequences as single characters even if they
> > aren't (and can't be precomposed).
> >
> > I suggest you normalize the text (NFC or NFD), and then look for
> > "grapheme clusters".
> > http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
> >
> > In C++ and Java, you could use an ICU BreakIterator for the latter.
>
> Might I suggest looking at Rakudo Perl 6's implementation of NFG
> (Normalization Form Grapheme) which will generate synthetic codepoints on
> the fly under the hood.
>
> For an introduction, see http://jnthn.net/papers/2015-spw-nfg.pdf
>

Thanks very much, I'll look into this.

Having done a Python implementation of the UCA, I'm quite looking forward
to doing more Unicode tools for Python.

James

From mark at kli.org Mon Feb 8 18:26:00 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Mon, 8 Feb 2016 19:26:00 -0500
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To:
References: <20160208113049.665a7a7059d7ee80bb4d670165c8327d.c746940bc1.wbe@email03.secureserver.net>
Message-ID: <56B93218.5050401@kli.org>

On 02/08/2016 01:47 PM, James Tauber wrote:
>
> I'd be interested if others have tackled similar issues outside of Greek.
>
> James
>
>

Keep in mind that in pointed Hebrew (or Arabic (or for that matter
Devanagari)), practically every letter is like this, since each vowel is
a diacritical, from a typographical point of view. Though perhaps not
considered in the same way that Greek considers its accented letters.

~mark

From everson at evertype.com Mon Feb 8 19:47:17 2016
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Feb 2016 01:47:17 +0000
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
In-Reply-To: <56B8CFD4.1070105@uni-konstanz.de>
References: <56B8CFD4.1070105@uni-konstanz.de>
Message-ID: <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com>

It's what I was taught as the scientific romanization for Russian and
Slavic in general.

Michael Everson * http://www.evertype.com/

From asmus-inc at ix.netcom.com Mon Feb 8 19:59:55 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 8 Feb 2016 17:59:55 -0800
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
In-Reply-To: <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com>
References: <56B8CFD4.1070105@uni-konstanz.de>
 <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com>
Message-ID: <56B9481B.2030109@ix.netcom.com>

An HTML attachment was scrubbed...

From duerst at it.aoyama.ac.jp Mon Feb 8 20:26:36 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Tue, 9 Feb 2016 11:26:36 +0900
Subject: precomposed polytonic Greek characters with macrons and other
 diacritics
In-Reply-To:
References:
Message-ID: <56B94E5C.7020101@it.aoyama.ac.jp>

On 2016/02/09 02:10, James Tauber wrote:

> http://jktauber.com/2016/01/28/polytonic-greek-unicode-is-still-not-perfect/

Hello James,

I read your article. I just wanted to point out that in your problem 3,
the two sequences aren't normalized to the same form because, if the
acute accent comes first, that is considered a different character,
namely one with the macron *on top of* the accent.

Regards,   Martin.
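
The ordering point above can be checked with Python's standard
unicodedata module. What follows is only a sketch under stated
assumptions: the helper names codepoints() and clusters() are invented
for this illustration, and clusters() is a deliberately rough stand-in
(base character plus following marks with a nonzero combining class)
for the UAX #29 grapheme clusters suggested earlier in the thread, not
an implementation of that algorithm.

```python
import unicodedata

ALPHA = "\u03b1"          # GREEK SMALL LETTER ALPHA
MACRON = "\u0304"         # COMBINING MACRON, combining class 230
ACUTE = "\u0301"          # COMBINING ACUTE ACCENT, combining class 230
YPOGEGRAMMENI = "\u0345"  # COMBINING GREEK YPOGEGRAMMENI, combining class 240


def codepoints(text):
    """Return the code points of text as 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


def clusters(text):
    """Group each base character with the combining marks that follow it.

    Simplification (an assumption of this sketch): a mark is any character
    with a nonzero canonical combining class. Real grapheme cluster
    segmentation per UAX #29 covers many more cases.
    """
    groups = []
    for ch in unicodedata.normalize("NFD", text):
        if groups and unicodedata.combining(ch) != 0:
            groups[-1] += ch   # attach the mark to the preceding base
        else:
            groups.append(ch)  # start a new unit
    return groups


macron_first = ALPHA + MACRON + ACUTE
acute_first = ALPHA + ACUTE + MACRON

if __name__ == "__main__":
    # Macron and acute share a combining class, so normalization keeps
    # their relative order and the two sequences stay distinct.
    for label, seq in (("macron first", macron_first), ("acute first", acute_first)):
        print(label, "NFD:", codepoints(unicodedata.normalize("NFD", seq)))
        print(label, "NFC:", codepoints(unicodedata.normalize("NFC", seq)))

    # Marks with different combining classes are reordered canonically.
    print(codepoints(unicodedata.normalize("NFD", ALPHA + YPOGEGRAMMENI + ACUTE)))

    # Treating base + marks as one unit for later processing.
    print([codepoints(unit) for unit in clusters("\u03bb" + macron_first + "\u03c2")])
```

Working on NFD text and grouping marks with their base is one way to get
the "single character" behaviour without new precomposed code points; a
library that implements UAX #29 would be the more robust choice.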

From ruland at luckymail.com Mon Feb 8 20:39:38 2016
From: ruland at luckymail.com (Charlie Ruland)
Date: Tue, 9 Feb 2016 03:39:38 +0100
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
In-Reply-To: <56B9481B.2030109@ix.netcom.com>
References: <56B8CFD4.1070105@uni-konstanz.de>
 <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com>
 <56B9481B.2030109@ix.netcom.com>
Message-ID: <56B9516A.8090607@luckymail.com>

On 09.02.2016, Asmus Freytag (t) wrote:
> On 2/8/2016 5:47 PM, Michael Everson wrote:
>> It's what I was taught as the scientific romanization for Russian and
>> Slavic in general.
>>
>> Michael Everson * http://www.evertype.com/
>>
>>
>>
> Source?
>
> A./

Look at tables 27.1 (p. 348) and 27.2 (p. 351) of Paul Cubberley's /The
Slavic Alphabets/ (=Peter T. Daniels and William Bright (eds.): /The
World's Writing Systems/, pp. 346–355). Obviously the soft sign is
transliterated as a prime, and the hard sign as a double prime. Also note
that [g?] is Romanized as which can hardly be considered an apostrophe
above .

Charlie

From asmus-inc at ix.netcom.com Mon Feb 8 23:31:13 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 8 Feb 2016 21:31:13 -0800
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
In-Reply-To: <56B9516A.8090607@luckymail.com>
References: <56B8CFD4.1070105@uni-konstanz.de>
 <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com>
 <56B9481B.2030109@ix.netcom.com>
 <56B9516A.8090607@luckymail.com>
Message-ID: <56B979A1.6060700@ix.netcom.com>

An HTML attachment was scrubbed...

From mark at macchiato.com Tue Feb 9 00:25:59 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 9 Feb 2016 07:25:59 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To: References: Message-ID: I would suggest that you first gather statistics and present statistics on how often the current combinations are used compared to other emoji, eg by consulting sources such as: http://www.emojixpress.com/stats/ or http://emojitracker.com/ Mark On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: > There are > > ?? U+01F4B4 Banknote With Yen Sign > ?? U+01F4B5 Banknote With Dollar Sign > ?? U+01F4B6 Banknote With Euro Sign > ?? U+01F4B7 Banknote With Pound Sign > > This is clearly an incomplete set. It makes sense to have a generic > "enclosing banknote" emoji character which, when combined with a > currency sign, would produce the corresponding banknote, to forestall > requests for individual emoji for banknotes with remaining currency > signs. > > Leo > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Tue Feb 9 04:00:55 2016 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 9 Feb 2016 02:00:55 -0800 Subject: Enclosing BANKNOTE emoji? In-Reply-To: References: Message-ID: Thank you for the links, quite mesmerizing! On emojitracker.com (cumulative counts, but only on twitter, AFAICS), U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but 10x more than the lowest counts, and about the same frequency as various individual clock faces). It is quite evident that the dollar banknote emoji serves as a stand-in for at least half a dozen of various currencies. On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ?? 
wrote: > I would suggest that you first gather statistics and present statistics on > how often the current combinations are used compared to other emoji, eg by > consulting sources such as: > > http://www.emojixpress.com/stats/ > or > http://emojitracker.com/ > > Mark > > On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: > >> There are >> >> ?? U+01F4B4 Banknote With Yen Sign >> ?? U+01F4B5 Banknote With Dollar Sign >> ?? U+01F4B6 Banknote With Euro Sign >> ?? U+01F4B7 Banknote With Pound Sign >> >> This is clearly an incomplete set. It makes sense to have a generic >> "enclosing banknote" emoji character which, when combined with a >> currency sign, would produce the corresponding banknote, to forestall >> requests for individual emoji for banknotes with remaining currency >> signs. >> >> Leo >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From davidj_faulks at yahoo.ca Tue Feb 9 07:27:05 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Tue, 9 Feb 2016 13:27:05 +0000 (UTC) Subject: More Astrology Symbols References: <404271703.1156845.1455024425861.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <404271703.1156845.1455024425861.JavaMail.yahoo@mail.yahoo.com> I feel pretty confident in proposing the Uranian Planet symbols, but I am now wondering how far I can go. Astrological symbols are mostly used in charts. Rarely, you will also see a tabular listing of aspects, positions, or midpoints accompanying the chart, and these will have symbols. Even more rarely, astrologers will discuss or mention aspects using symbols instead of words. However, many astrology programs used to produce charts nowadays can also produce tables (in image format because of the symbols) automatically, so any symbol appearing in charts can potentially appear in tables (text). 
These tables are only rarely found in PDFs ( http://www.tonybonin.de/IQ-Jauch.PDF has a good example on pages 9 and 10 ), but you can also find somewhat similar tables embedded inside images on the internet (easy to find using Google image search).

There are plenty of extra symbols I've seen in charts, but for which I otherwise lack text examples (except for one or two) “in use”, as opposed to merely showing what they are.

Transpluto : An “Astrological Planet” invented in 1972, also called “Isis”, “Bacchus”, and so on. Has a well defined symbol. I do have one example from a table, but the other examples are just for showing the symbol or from charts.

Vulcan : This hypothetical intra-mercurian planet may have been disproved by General Relativity, but that has not stopped some astrologers from using it to this day. The symbol is simple enough, but I haven't found anything to unify it with.

Sedna: The only trans-Neptunian object other than Pluto and Eris that has a symbol that astrologers commonly use. People have devised symbols for the other Dwarf Planets and some of the smaller TNOs, but I have not seen them in charts (even when I looked). The following images :
https://wegoastrology.files.wordpress.com/2014/10/sfpage.jpg
http://www.the-dreamweaver.net/portal/images/Lunar%20eclipse%202013%20apr%2025.png
have some info outside the chart proper that includes the Sedna symbol.

Extra Asteroids: Astrologers have devised symbols for asteroids other than Ceres, Pallas, Juno, and Vesta, but the only ones I've seen in charts are Hygeia, Astraea, Lilith, and Sappho. The Sappho symbol is usually identical to U+26A2 DOUBLED FEMALE SIGN (unless you replace the circles with hearts, which is probably just a stylistic variation), but the others are not in Unicode.

“Waldemath’s Moon” aka “Dark Moon Lilith”: Not to be confused with Black Moon Lilith. This is an “Astrological Moon” of Earth.
There is no need for a separate symbol, since it looks like U+2205 EMPTY SET or U+2300 DIAMETER SIGN.

Centaurs : Small Planetoids that orbit between the orbits of Jupiter and Neptune. Chiron (⚷ U+26B7) is one of them, so when other such objects began to be discovered in the 90s, some astrologers started using them. The only ones I have actually come across in charts are symbols for Pholus and Nessus.

Finally, there is some confusion caused by the orbit of the Moon. Astrology uses virtual points calculated from this orbit ( ? ? ? ). Thanks to the sun and the barycentre, the orbit of the moon is rather wobbly, and before the 90s, astrologers typically (but not always) used an approximation. With the advent of astrology software and downloadable NASA/JPL information, accurate virtual points became easy to compute. Versions of ? and ? with “T” inside them can be used to indicate the “true” nodes. Also, there is ?; a reversed glyph is sometimes used to indicate the “True” Black Moon Lilith. I have seen charts with both the regular (mean) and reversed (true) Liliths.

David

From everson at evertype.com Tue Feb 9 07:43:25 2016
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Feb 2016 13:43:25 +0000
Subject: transliteration of mjagkij znak (Cyrillic soft sign)
In-Reply-To: <56B979A1.6060700@ix.netcom.com>
References: <56B8CFD4.1070105@uni-konstanz.de> <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com> <56B9481B.2030109@ix.netcom.com> <56B9516A.8090607@luckymail.com> <56B979A1.6060700@ix.netcom.com>
Message-ID:

On 9 Feb 2016, at 05:31, Asmus Freytag (t) wrote:
> Without scouring the book I don't know whether there's another place in it where something's unquestioningly the prime. In that case we could figure out whether its appearance is simply the way that font does it.
Alternatively, if making double prime look different from two single primes, perhaps that's common enough across fonts, and would help to lay any doubts to rest - but so far, what I see is a spacing acute. Well, Asmus, it isn?t one. We linguists have been taught it?s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/ From mheijdra at Princeton.EDU Tue Feb 9 08:14:51 2016 From: mheijdra at Princeton.EDU (Martin Heijdra) Date: Tue, 9 Feb 2016 14:14:51 +0000 Subject: transliteration of mjagkij znak (Cyrillic soft sign) In-Reply-To: References: <56B8CFD4.1070105@uni-konstanz.de> <8E675D4C-0F35-4FBC-8AD6-3FEE8197472E@evertype.com> <56B9481B.2030109@ix.netcom.com> <56B9516A.8090607@luckymail.com> <56B979A1.6060700@ix.netcom.com> Message-ID: <0001012FBBD4FE40857959B0B65DE95B6E5EC59E@CSGMBX202W.pu.win.princeton.edu> And so it is, also in the library world both before and after Unicode: for miagkii znak the prime is prescribed. The prime is also prescribed for some uses for standard transliteration in Tibetan and Hebrew/Arabic/Persian/Pushto: See:e.g. the relevant tables on https://www.loc.gov/catdir/cpso/roman.html: Tibetan: When two full forms of letters are stacked, as in Sanskritized Tibetan, there is no need to indicate the stacking. However, in the two cases noted here a modified letter prime should be inserted between the two consonants for the purpose of disambiguation. ??? t?sa ?? tsa ??? n?ya ?? nya Hebrew: A single prime ( ? ) is placed between two letters representing two distinct consonantal sounds when the combination might otherwise be read as a digraph. his?hid Persian: When the affix and the word with which it is connected grammatically are written separately in Persian, the two are separated in romanization by a single prime ( ? ). kh?nah?h? 
Martin Heijdra -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Everson Sent: Tuesday, February 09, 2016 8:43 AM To: Unicode Discussion Subject: Re: transliteration of mjagkij znak (Cyrillic soft sign) On 9 Feb 2016, at 05:31, Asmus Freytag (t) > wrote: > Without scouring the book I don't know whether there's another place in it where something's unquestioningly the prime. In that case we could figure out whether its appearance is simply the way that font does it. Alternatively, if making double prime look different from two single primes, perhaps that's common enough across fonts, and would help to lay any doubts to rest - but so far, what I see is a spacing acute. Well, Asmus, it isn?t one. We linguists have been taught it?s the prime. https://en.wikipedia.org/wiki/Prime_(symbol)#Use_in_linguistics Michael Everson * http://www.evertype.com/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtauber at jtauber.com Tue Feb 9 09:13:04 2016 From: jtauber at jtauber.com (James Tauber) Date: Tue, 9 Feb 2016 09:13:04 -0600 Subject: precomposed polytonic Greek characters with macrons and other diacritics In-Reply-To: <56B94E5C.7020101@it.aoyama.ac.jp> References: <56B94E5C.7020101@it.aoyama.ac.jp> Message-ID: On Mon, Feb 8, 2016 at 8:26 PM, Martin J. D?rst wrote: > On 2016/02/09 02:10, James Tauber wrote: > >> >> http://jktauber.com/2016/01/28/polytonic-greek-unicode-is-still-not-perfect/ >> > > Hello James, > > I read your article. I just wanted to point out that in your problem 3, > the two sequences aren't normalized because if the acute accent is first, > that would be considered as a different character, namely with the macron > *on top of* the accent. Thanks. I've updated the post to clarify it's not a problem with Unicode per se. James -------------- next part -------------- An HTML attachment was scrubbed... 

From unicode at acjs.net Tue Feb 9 05:18:33 2016
From: unicode at acjs.net (ACJ Unicode)
Date: Tue, 9 Feb 2016 12:18:33 +0100
Subject: Case for letters j and J with acute
Message-ID: <56B9CB09.8060906@acjs.net>

Hello,

This is my first time posting here, so please forgive me if I don’t get all the etiquette right.

I would like to make a case for an aspect of my native language (Dutch) that has always been problematic in the digital realm. Some context: I’m a (typo)graphic designer with a background in interaction design.

In the Dutch language, acute accents are used to indicate stressed vowels. [1] Also in the Dutch language, the digraph IJ (lowercase ij) is considered a separate letter and a vowel. [2] Hence, when putting emphasis on a word that contains ij, one would put acute accents over the i and the j. [3]

This is taught in writing in primary school in the Netherlands (or at least it was 30 years ago), but this practice is often abandoned soon afterwards, probably because of the technical difficulty. The only way to achieve this digitally appears to be to have LATIN SMALL LETTER I WITH ACUTE (U+00ED) followed by LATIN SMALL LETTER DOTLESS J (U+0237) /and/ COMBINING ACUTE ACCENT (U+0301).

This poses several problems:

* It makes casual user input highly impractical;
* it adds complexity to automating the process of adding emphasis to vowels;
* technical support is understandably lacking;
* it makes it virtually impossible for type designers to address properly and consistently.

To me, the obvious solution to these problems would be to at least add the following characters to the Unicode standard:

* LATIN SMALL LETTER J WITH ACUTE;
* LATIN CAPITAL LETTER J WITH ACUTE.

For completeness' sake, one could also make a case for the following:

* LATIN SMALL LIGATURE IJ WITH ACUTES;
* LATIN CAPITAL LIGATURE IJ WITH ACUTES...

but since the use of the original Unicode ligatures is already discouraged, we could probably go without those.

Sincerely,
Alexander Dekker
deidee

[1] https://en.wikipedia.org/wiki/Acute_accent#Stress
[2] https://en.wikipedia.org/wiki/IJ_(digraph)
[3] https://en.wikipedia.org/wiki/IJ_(digraph)#Stress
-------------- next part --------------
An HTML attachment was scrubbed...

From everson at evertype.com Tue Feb 9 09:58:52 2016
From: everson at evertype.com (Michael Everson)
Date: Tue, 9 Feb 2016 15:58:52 +0000
Subject: Case for letters j and J with acute
In-Reply-To: <56B9CB09.8060906@acjs.net>
References: <56B9CB09.8060906@acjs.net>
Message-ID: <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>

On 9 Feb 2016, at 11:18, ACJ Unicode wrote:

> This is taught in writing in primary school in the Netherlands (or at least it was 30 years ago), but this practice is often abandoned soon afterwards, probably because of the technical difficulty. The only way to achieve this digitally appears to have LATIN SMALL LETTER I WITH ACUTE (U+00ED) be followed by LATIN SMALL LETTER DOTLESS J (U+0237) and COMBINING ACUTE ACCENT (U+0301).

It is a font rendering issue. A pre-composed j́ will not be added to the standard.

> • It makes casual user input highly impractical;

This is dependent on the keyboard layout, not the encoding.

> • it adds complexity to automating the process of adding emphasis to vowels;
> • technical support is understandably lacking;

True, but for technical reasons pre-composed characters will NOT be added to the standard.

> • LATIN SMALL LETTER J WITH ACUTE;
> • LATIN CAPITAL LETTER J WITH ACUTE.

This just won’t ever happen.

> • it makes it virtually impossible for type designers to address properly and consistently.

Well, the specification should be í (or i + combining acute) + j + combining acute. Neither dotless i nor dotless j would be correct.

> For completeness sake, one could also make a case for the following:
>
> • LATIN SMALL LIGATURE IJ WITH ACUTES;
> • LATIN CAPITAL LIGATURE IJ WITH ACUTES.

Or Ĳ (or ĳ) + combining double acute.

Michael Everson * http://www.evertype.com/

From markus.icu at gmail.com Tue Feb 9 10:05:40 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 9 Feb 2016 08:05:40 -0800
Subject: Case for letters j and J with acute
In-Reply-To: <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>
References: <56B9CB09.8060906@acjs.net> <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>
Message-ID:

On Tue, Feb 9, 2016 at 7:58 AM, Michael Everson wrote:

> On 9 Feb 2016, at 11:18, ACJ Unicode wrote:
>
> > This is taught in writing in primary school in the Netherlands (or at
> least it was 30 years ago), but this practice is often abandoned soon
> afterwards, probably because of the technical difficulty. The only way to
> achieve this digitally appears to have LATIN SMALL LETTER I WITH ACUTE
> (U+00ED) be followed by LATIN SMALL LETTER DOTLESS J (U+0237) and COMBINING
> ACUTE ACCENT (U+0301).
>
> It is a font rendering issue. A pre-composed j́ will not be added to the
> standard.
>

The regular 'j' has the Soft_Dotted property, which means that when you add a diacritic-above, the dot should go away.
http://www.unicode.org/reports/tr44/#Soft_Dotted

When the dot does not disappear, please submit an error report for the platform/browser you are using.

> • it adds complexity to automating the process of adding emphasis
> to vowels;
> > • technical support is understandably lacking;
>
> True, but for technical reasons pre-composed characters will NOT be added
> to the standard.
>
> > • LATIN SMALL LETTER J WITH ACUTE;
> > • LATIN CAPITAL LETTER J WITH ACUTE.
>
> This just won’t ever happen.
>

Technical reasons include
http://unicode.org/policies/stability_policy.html#Normalization

markus
-------------- next part --------------
An HTML attachment was scrubbed...
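
As a rough illustration of the normalization stability point in the message above (an editorial sketch, not something posted to the list): NFC composes i + U+0301 into the existing precomposed í, while j + U+0301 stays a two-code-point sequence because no precomposed j with acute exists. The sketch also shows that the dotless j sequence is a different string from the regular j sequence, even after normalization. The helper name `nfc_codepoints` is hypothetical.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT
DOTLESS_J = "\u0237"  # LATIN SMALL LETTER DOTLESS J


def nfc_codepoints(text):
    """Return the code points of `text` after NFC normalization."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", text)]


i_acute = nfc_codepoints("i" + ACUTE)            # a precomposed form exists
j_acute = nfc_codepoints("j" + ACUTE)            # no precomposed form exists
dotless_j_acute = nfc_codepoints(DOTLESS_J + ACUTE)

print("i + acute        ->", i_acute)
print("j + acute        ->", j_acute)
print("dotless j + acute ->", dotless_j_acute)
```

Since the two j sequences never normalize to each other, text that uses the dotless j will not match text that uses the regular j in searches or comparisons, which is one practical reason to prefer the regular j plus combining acute.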

From frederic.grosshans at gmail.com Tue Feb 9 10:05:46 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 9 Feb 2016 17:05:46 +0100
Subject: Case for letters j and J with acute
In-Reply-To: <56B9CB09.8060906@acjs.net>
References: <56B9CB09.8060906@acjs.net>
Message-ID: <56BA0E5A.2020402@gmail.com>

On 09/02/2016 12:18, ACJ Unicode wrote:
> [...]
> To me, the obvious solution to these problems would be to at least add
> the following characters to the Unicode standard:
>
> * LATIN SMALL LETTER J WITH ACUTE;
> * LATIN CAPITAL LETTER J WITH ACUTE.
>
> [...]

Adding new compositions of existing characters to Unicode is no longer done since the introduction of NFC and NFD in the 1990s. You should read http://www.unicode.org/faq/char_combmark.html#11 and following.

Cheers,
Frédéric

From frederic.grosshans at gmail.com Tue Feb 9 10:16:10 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 9 Feb 2016 17:16:10 +0100
Subject: Case for letters j and J with acute
In-Reply-To: <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>
References: <56B9CB09.8060906@acjs.net> <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>
Message-ID: <56BA10CA.7080507@gmail.com>

On 09/02/2016 16:58, Michael Everson wrote:
>> For completeness sake, one could also make a case for the following:
>> >
>> > • LATIN SMALL LIGATURE IJ WITH ACUTES;
>> > • LATIN CAPITAL LIGATURE IJ WITH ACUTES.
> Or Ĳ (or ĳ) + combining double acute.

The rendering of these in a standard font (Ĳ̋ĳ̋) is usually quite bad, whereas the non-ligated characters should render correctly (ÍJ́íj́).

Frédéric

From leob at mailcom.com Tue Feb 9 10:19:48 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Tue, 9 Feb 2016 08:19:48 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:
Message-ID:

A caveat about using emojitracker.com : it doesn't count newer emoji yet (e.g.
U+1F37E bottle with popping cork is absent), thus, when they are added, their counts will be skewed. Leo On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis wrote: > Thank you for the links, quite mesmerizing! > > On emojitracker.com (cumulative counts, but only on twitter, AFAICS), > U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle > of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), > and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but > 10x more than the lowest counts, and about the same frequency as various > individual clock faces). > > It is quite evident that the dollar banknote emoji serves as a stand-in > for at least half a dozen of various currencies. > > On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ?? wrote: > >> I would suggest that you first gather statistics and present statistics >> on how often the current combinations are used compared to other emoji, eg >> by consulting sources such as: >> >> http://www.emojixpress.com/stats/ >> or >> http://emojitracker.com/ >> >> Mark >> >> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: >> >>> There are >>> >>> ?? U+01F4B4 Banknote With Yen Sign >>> ?? U+01F4B5 Banknote With Dollar Sign >>> ?? U+01F4B6 Banknote With Euro Sign >>> ?? U+01F4B7 Banknote With Pound Sign >>> >>> This is clearly an incomplete set. It makes sense to have a generic >>> "enclosing banknote" emoji character which, when combined with a >>> currency sign, would produce the corresponding banknote, to forestall >>> requests for individual emoji for banknotes with remaining currency >>> signs. >>> >>> Leo >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Feb 9 10:51:04 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 9 Feb 2016 17:51:04 +0100 Subject: Enclosing BANKNOTE emoji? In-Reply-To: References: Message-ID: Look at http://www.emojixpress.com/stats/. 
The stats are different, since they collect data from keyboards not twitter posts, but they have a nice button to view only the news emoji. (The numbers on the new ones will be smaller, just because it takes time for systems to support them, and people to start using them. However, they bear out my predication that the most popular would be the eyes-rolling face). Mark On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis wrote: > A caveat about using emojitracker.com : it doesn't count newer emoji yet > (e.g. U+1F37E bottle with popping cork is absent), thus, when they are > added, their counts will be skewed. > > Leo > > On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis wrote: > >> Thank you for the links, quite mesmerizing! >> >> On emojitracker.com (cumulative counts, but only on twitter, AFAICS), >> U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle >> of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), >> and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but >> 10x more than the lowest counts, and about the same frequency as various >> individual clock faces). >> >> It is quite evident that the dollar banknote emoji serves as a stand-in >> for at least half a dozen of various currencies. >> >> On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ?? >> wrote: >> >>> I would suggest that you first gather statistics and present statistics >>> on how often the current combinations are used compared to other emoji, eg >>> by consulting sources such as: >>> >>> http://www.emojixpress.com/stats/ >>> or >>> http://emojitracker.com/ >>> >>> Mark >>> >>> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: >>> >>>> There are >>>> >>>> ?? U+01F4B4 Banknote With Yen Sign >>>> ?? U+01F4B5 Banknote With Dollar Sign >>>> ?? U+01F4B6 Banknote With Euro Sign >>>> ?? U+01F4B7 Banknote With Pound Sign >>>> >>>> This is clearly an incomplete set. 
It makes sense to have a generic >>>> "enclosing banknote" emoji character which, when combined with a >>>> currency sign, would produce the corresponding banknote, to forestall >>>> requests for individual emoji for banknotes with remaining currency >>>> signs. >>>> >>>> Leo >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Tue Feb 9 13:29:38 2016 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 9 Feb 2016 11:29:38 -0800 Subject: Case for letters j and J with acute In-Reply-To: <56B9CB09.8060906@acjs.net> References: <56B9CB09.8060906@acjs.net> Message-ID: On Tue, Feb 9, 2016 at 3:18 AM, ACJ Unicode wrote: > [3] https://en.wikipedia.org/wiki/IJ_(digraph)#Stress > This says "in Unicode it is possible to combine characters into a *j* with an acute accent ? "b???na" ? though this might not be supported or rendered correctly by some fonts or systems. This *??* is the result of the combination of the dotless *?* (U+0237) and the combining acute accent ? (U+0301)." which I am pretty sure is wrong. It should read "in Unicode it is possible to combine characters into a *j* with an acute accent ? "b?j?na" ? though this might not be supported or rendered correctly by some fonts or systems. This *j?* is the result of the combination of the regular *j* and the combining acute accent ? (U+0301)." Could someone with Wikipedia edit experience please fix this? (3 edits in the sentence) markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Tue Feb 9 13:38:15 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 9 Feb 2016 20:38:15 +0100 Subject: Case for letters j and J with acute In-Reply-To: <56BA10CA.7080507@gmail.com> References: <56B9CB09.8060906@acjs.net> <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com> <56BA10CA.7080507@gmail.com> Message-ID: 2016-02-09 17:16 GMT+01:00 Fr?d?ric Grosshans : > Le 09/02/2016 16:58, Michael Everson a ?crit : > >> For completeness sake, one could also make a case for the following: >>> > >>> > ? LATIN SMALL LIGATURE IJ WITH ACUTES; >>> > ? LATIN CAPITAL LIGATURE IJ WITH ACUTES. >>> >> Or ? (or ?) + combining double acute. >> > The rendering of these in a standard font (????) is usually quite bad. > While non ligated character should render correctly (I?J?i?j?). This is only a font problem, not an Unicode problem. For me the IJ (or ij) with combining double accent is correct. Tell this to font authors so they fix their common fonts in later versions (here Microsoft, Adobe, Apple and Google, possibly others, should be hearing your issue for popular OS'es and applications). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Feb 9 13:48:32 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 9 Feb 2016 20:48:32 +0100 Subject: Case for letters j and J with acute In-Reply-To: References: <56B9CB09.8060906@acjs.net> Message-ID: Fixed it in Wikipedia (I used "canonically equivalent" and linked it to the relevant article, instead of the imprecise expression "the result of"). 2016-02-09 20:29 GMT+01:00 Markus Scherer : > On Tue, Feb 9, 2016 at 3:18 AM, ACJ Unicode wrote: > >> [3] https://en.wikipedia.org/wiki/IJ_(digraph)#Stress >> > > This says "in Unicode it is > possible to combine characters > into a *j* with an > acute accent ? "b???na" ? though this might not be supported or rendered > correctly by some fonts or > systems. 
This *??* is the result of the combination of the dotless *?* (U+0237) > and the combining acute accent ? (U+0301)." > > which I am pretty sure is wrong. It should read "in Unicode > it is possible to combine > characters into a *j* with > an acute accent ? "b?j?na" ? though this might not be supported or > rendered correctly by some fonts > or systems. This *j?* is > the result of the combination of the regular *j* and the combining acute > accent ? (U+0301)." > > Could someone with Wikipedia edit experience please fix this? (3 edits in > the sentence) > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mheijdra at Princeton.EDU Tue Feb 9 14:47:05 2016 From: mheijdra at Princeton.EDU (Martin Heijdra) Date: Tue, 9 Feb 2016 20:47:05 +0000 Subject: Case for letters j and J with acute In-Reply-To: References: <56B9CB09.8060906@acjs.net> <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com> <56BA10CA.7080507@gmail.com> Message-ID: <0001012FBBD4FE40857959B0B65DE95B6E5ED23F@CSGMBX202W.pu.win.princeton.edu> Actually, current use (e.g. the Brill font made by John Hudson) says: [cid:image001.png at 01D16351.1F1BC730] The double acute is for languages such as Hungarian etc. \ Martin Heijdra From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy Sent: Tuesday, February 09, 2016 2:38 PM To: Fr?d?ric Grosshans Cc: unicode Unicode Discussion Subject: Re: Case for letters j and J with acute 2016-02-09 17:16 GMT+01:00 Fr?d?ric Grosshans >: Le 09/02/2016 16:58, Michael Everson a ?crit : For completeness sake, one could also make a case for the following: > > ? LATIN SMALL LIGATURE IJ WITH ACUTES; > ? LATIN CAPITAL LIGATURE IJ WITH ACUTES. Or ? (or ?) + combining double acute. The rendering of these in a standard font (????) is usually quite bad. While non ligated character should render correctly (I?J?i?j?). This is only a font problem, not an Unicode problem. 
For me the IJ (or ij) with combining double accent is correct. Tell this to font authors so they fix their common fonts in later versions (here Microsoft, Adobe, Apple and Google, possibly others, should be hearing your issue for popular OS'es and applications). -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 12976 bytes Desc: image001.png URL: From davidj_faulks at yahoo.ca Tue Feb 9 15:23:36 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Tue, 9 Feb 2016 21:23:36 +0000 (UTC) Subject: Case for letters j and J with acute References: <1700995758.1356675.1455053016022.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com> >On Tue, 2/9/16, Philippe Verdy wrote: > This is only a font problem, not an Unicode problem. For > me the IJ (or ij) with combining double accent is correct. > Tell this to font authors so they fix their common fonts in > later versions (here Microsoft, Adobe, Apple and Google, > possibly others, should be hearing your issue for popular > OS'es and applications). Perhaps Unicode could create a ?default position? property for combining characters, and encourage OpenType and other font engines to adopt it for automatic use when no other font information is provided. Adoption would take a while, but I cannot help but think that otherwise, this issue will never go away. David From leob at mailcom.com Tue Feb 9 15:33:58 2016 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 9 Feb 2016 13:33:58 -0800 Subject: Case for letters j and J with acute In-Reply-To: <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com> References: <56B9CB09.8060906@acjs.net> <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com> Message-ID: It isn't just a font rendering issue. 
U+0133 LATIN SMALL LIGATURE IJ doesn't have Soft_Dotted property according to http://www.unicode.org/Public/UCD/latest/ucd/PropList.txt On Tue, Feb 9, 2016 at 7:58 AM, Michael Everson wrote: > On 9 Feb 2016, at 11:18, ACJ Unicode wrote: > >> This is taught in writing in primary school in the Netherlands (or at least it was 30 years ago), but this practice is often abandoned soon afterwards, probably because of the technical difficulty. The only way to achieve this digitally appears to have LATIN SMALL LETTER I WITH ACUTE (U+00ED) be followed by LATIN SMALL LETTER DOTLESS J (U+0237) and COMBINING ACUTE ACCENT (U+0301). > > It is a font rendering issue. A pre-composed j? will not be added to the standard. > >> ? It makes casual user input highly impractical; > > This is dependent on the keyboard layout, not the encoding. > >> ? it adds complexity to automating the process of adding emphasis to vowels; >> ? technical support is understandably lacking; > > True, but for technical reasons pre-composed characters will NOT be added to the standard. > >> ? LATIN SMALL LETTER J WITH ACUTE; >> ? LATIN CAPITAL LETTER J WITH ACUTE. > > This just won?t ever happen. > >> ? it makes it virtually impossible for type designers to address properly and consistently. > > Well, the specification should be ? (or i + combining acute) + j + combining acute. Neither dotless i nor dotless j would be correct. > >> For completeness sake, one could also make a case for the following: >> >> ? LATIN SMALL LIGATURE IJ WITH ACUTES; >> ? LATIN CAPITAL LIGATURE IJ WITH ACUTES. > > Or ? (or ?) + combining double acute. 
>
> Michael Everson * http://www.evertype.com/
>
>

From kent.karlsson14 at telia.com Tue Feb 9 15:34:03 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Tue, 09 Feb 2016 22:34:03 +0100
Subject: Case for letters j and J with acute
In-Reply-To: <9BC33AA8-A390-4A5D-8C65-EC6DD7681372@evertype.com>
Message-ID:

On 2016-02-09 16:58, "Michael Everson" wrote:

> Well, the specification should be í (or i + combining acute) + j +
> combining acute. Neither dotless i nor dotless j would be correct.

While true, using the latter (the dotless ones) tends to render better than the dotted ones. (I.e., the Soft_Dotted property is still not well supported.)

> Or IJ (or ij) + combining double acute.

While I agree that that maybe SHOULD be fine, the ij character has not been given the Soft_Dotted property. Although, as a different matter, using the ij character tends to make automatic case mapping work better for the ij in Dutch...

/Kent K

From kenwhistler at att.net Tue Feb 9 15:36:10 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 9 Feb 2016 13:36:10 -0800
Subject: Case for letters j and J with acute
In-Reply-To: <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com>
References: <1700995758.1356675.1455053016022.JavaMail.yahoo.ref@mail.yahoo.com> <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <56BA5BCA.7060509@att.net>

On 2/9/2016 1:23 PM, David Faulks wrote:
> Perhaps Unicode could create a “default position” property for combining characters, and encourage OpenType and other font engines to adopt it for automatic use when no other font information is provided. Adoption would take a while, but I cannot help but think that otherwise, this issue will never go away.
>
>

It does. General_Category=Mn and ccc=230 indicate that a character is a non-spacing mark positioned *above* its base. Attempting to get more precise than that with a *character* property would be a mistake.
Such interaction in detail between a mark and its base is an attribute of glyphs and their design, and properly belongs to the realm of rendering and fonts. --Ken From asmus-inc at ix.netcom.com Tue Feb 9 16:19:34 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 9 Feb 2016 14:19:34 -0800 Subject: Case for letters j and J with acute In-Reply-To: <56BA5BCA.7060509@att.net> References: <1700995758.1356675.1455053016022.JavaMail.yahoo.ref@mail.yahoo.com> <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com> <56BA5BCA.7060509@att.net> Message-ID: <56BA65F6.10808@ix.netcom.com> An HTML attachment was scrubbed... URL: From leob at mailcom.com Tue Feb 9 16:46:51 2016 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 9 Feb 2016 14:46:51 -0800 Subject: Enclosing BANKNOTE emoji? In-Reply-To: References: Message-ID: The emojiexpress.com site is useful to check which new emoji or combinations people actually use, but the stats are likely skewed by only measuring input from one platform. Another way to look at the emojitracker.com stats: 339M people in the Eurozone : 389K uses of Euro emoji 126M people in Japan : 354K uses of Yen emoji 140M people in UK + Turkey (likely users of the Pound emoji as a stand-in for Lira) : 515K uses of pound emoji The total is 605M people : 1258K uses of non-dollar emoji Assuming the same average frequency of use, 2933K uses of the dollar emoji would be produced by 1411M people, out of which us + canada + mexico + australia (500M) + other countries using $ as (part of) the sign for their currency are way less than a half. This means that substantially more than 500M people are using the dollar emoji by default, instead of emoji of their national currencies. Assuming a lesser frequency of use will result in a greater estimate of the affected population. Leo On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ?? wrote: > Look at http://www.emojixpress.com/stats/. 
The stats are different, since > they collect data from keyboards not twitter posts, but they have a nice > button to view only the news emoji. > > (The numbers on the new ones will be smaller, just because it takes time > for systems to support them, and people to start using them. However, they > bear out my predication that the most popular would be the eyes-rolling > face). > > Mark > > On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis wrote: > >> A caveat about using emojitracker.com : it doesn't count newer emoji yet >> (e.g. U+1F37E bottle with popping cork is absent), thus, when they are >> added, their counts will be skewed. >> >> Leo >> >> On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis wrote: >> >>> Thank you for the links, quite mesmerizing! >>> >>> On emojitracker.com (cumulative counts, but only on twitter, AFAICS), >>> U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle >>> of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), >>> and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but >>> 10x more than the lowest counts, and about the same frequency as various >>> individual clock faces). >>> >>> It is quite evident that the dollar banknote emoji serves as a stand-in >>> for at least half a dozen of various currencies. >>> >>> On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ?? >>> wrote: >>> >>>> I would suggest that you first gather statistics and present statistics >>>> on how often the current combinations are used compared to other emoji, eg >>>> by consulting sources such as: >>>> >>>> http://www.emojixpress.com/stats/ >>>> or >>>> http://emojitracker.com/ >>>> >>>> Mark >>>> >>>> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: >>>> >>>>> There are >>>>> >>>>> ?? U+01F4B4 Banknote With Yen Sign >>>>> ?? U+01F4B5 Banknote With Dollar Sign >>>>> ?? U+01F4B6 Banknote With Euro Sign >>>>> ?? U+01F4B7 Banknote With Pound Sign >>>>> >>>>> This is clearly an incomplete set. 
It makes sense to have a generic >>>>> "enclosing banknote" emoji character which, when combined with a >>>>> currency sign, would produce the corresponding banknote, to forestall >>>>> requests for individual emoji for banknotes with remaining currency >>>>> signs. >>>>> >>>>> Leo >>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Tue Feb 9 17:01:08 2016 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 9 Feb 2016 15:01:08 -0800 Subject: Case for letters j and J with acute In-Reply-To: <56BA65F6.10808@ix.netcom.com> References: <1700995758.1356675.1455053016022.JavaMail.yahoo.ref@mail.yahoo.com> <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com> <56BA5BCA.7060509@att.net> <56BA65F6.10808@ix.netcom.com> Message-ID: <56BA6FB4.3020402@att.net> Asmus, On 2/9/2016 2:19 PM, Asmus Freytag (t) wrote: > On 2/9/2016 1:36 PM, Ken Whistler wrote: >> >> >> On 2/9/2016 1:23 PM, David Faulks wrote: >>> Perhaps Unicode could create a ?default position? property for >>> combining characters, and encourage OpenType and other font engines >>> to adopt it for automatic use when no other font information is >>> provided. Adoption would take a while, but I cannot help but think >>> that otherwise, this issue will never go away. >>> >>> >> >> It does. General_Category=Mn and ccc=230 indicates that a character is >> a non-spacing mark positioned *above* its base. >> >> Attempting to get more precise that that with a *character* property >> would >> be a mistake. Such interaction in detail between a mark and its base is >> an attribute of glyphs and their design, and properly belongs to the >> realm >> of rendering and fonts. > > What about GC=Mn and ccc=0? The *overwhelming* majority of those are for Indic scripts. > > For those, an actual positional property would make sense. 
And, ta da!, we have one: http://www.unicode.org/Public/8.0.0/ucd/IndicPositionalCategory.txt That also encompasses the positional classes for gc=Mc, as well as gc=Mn. > > It wouldn't need to be overly specific. It isn't -- it is designed (and being used) for Indic rendering engines. The outliers which are gc=Mn and ccc=0 but which are not covered by IndicPositionalCategory.txt include: CGJ and variation selectors and one shorthand control: irrelevant, because these aren't displayable marks. Thaana vowels Miao tone marks: irrelevant, because Miao has a very idiosyncratic encoding. Signwriting marks: Irrelevant, because Signwriting has a very idiosyncratic encoding. And I don't think adding a new positional property just to keep track of the fact that two Thaana vowels display below their consonant instead of on top makes sense. If it came to that, Thaana could just be added to IndicPositionalCategory.txt, instead. > > For example, for Unibook, I allow a convention to supply this > information to place a glyph in relation to the dotted circle; it's > described in the help file. There are some special wrinkles there, > because the values are tweaks that get applied to known fonts (that > just happen to not do the right thing when combined with an the > standard dotted circle in the charts). Just adapt IndicPositionalCategory.txt for Unibook, and you've got what you need. --Ken > > However, this approach would seem to indicate that such a scheme is > possible and with just a few values sufficiently differentiated to be > of practical use (= immensely improve on the fallback). > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
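The two character properties cited in this exchange, General_Category and the canonical combining class (ccc), can be looked up with Python's standard unicodedata module. What follows is only a minimal sketch. The thread names the categories (CGJ, variation selectors, Thaana vowels, Indic marks), but the specific code points below are illustrative picks, and the helper name mark_properties is made up for this example.

```python
import unicodedata

# Illustrative sample characters (assumed picks, not taken from the thread).
SAMPLES = {
    "COMBINING ACUTE ACCENT": "\u0301",
    "COMBINING GRAPHEME JOINER": "\u034F",
    "VARIATION SELECTOR-1": "\uFE00",
    "THAANA ABAFILI": "\u07A6",
    "DEVANAGARI VOWEL SIGN U": "\u0941",
}

def mark_properties(ch):
    """Return (General_Category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)

# Print one line per sample so the gc=Mn / ccc=230 versus ccc=0 split is visible.
for name, ch in SAMPLES.items():
    gc, ccc = mark_properties(ch)
    print(f"U+{ord(ch):04X} {name}: gc={gc} ccc={ccc}")
```

The acute accent reports gc=Mn with ccc=230, while the other samples report gc=Mn with ccc=0. Neither value says anything about exact glyph placement, which is the part the message above assigns to fonts and rendering.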
URL: From asmus-inc at ix.netcom.com Tue Feb 9 17:45:20 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 9 Feb 2016 15:45:20 -0800 Subject: Case for letters j and J with acute In-Reply-To: <56BA6FB4.3020402@att.net> References: <1700995758.1356675.1455053016022.JavaMail.yahoo.ref@mail.yahoo.com> <1700995758.1356675.1455053016022.JavaMail.yahoo@mail.yahoo.com> <56BA5BCA.7060509@att.net> <56BA65F6.10808@ix.netcom.com> <56BA6FB4.3020402@att.net> Message-ID: <56BA7A10.1000604@ix.netcom.com> On 2/9/2016 3:01 PM, Ken Whistler wrote: > Just adapt IndicPositionalCategory.txt for Unibook, and you've got > what you need. I see. Not quite as simple; Unibook needs overrides that are specifically able to correct bad fonts, not just "dumb" ones. We may want to honor some part of the positioning. But it would be interesting to see whether we ended up duplicating the IPC values more or less. Next chance I get. A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Wed Feb 10 00:26:31 2016 From: petercon at microsoft.com (Peter Constable) Date: Wed, 10 Feb 2016 06:26:31 +0000 Subject: Enclosing BANKNOTE emoji? In-Reply-To: References: Message-ID: I wish emojitracker had an option to see cumulative stats spanning only the last (say) 7 days, rather than (I assume) all time. This would be more representative of current usage, fixing the problem of recent introductions. Also, comparing the recent and long-term stats would highlight shifting trends. Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Leo Broukhis Sent: Tuesday, February 9, 2016 2:47 PM To: Mark Davis ?? Cc: unicode Unicode Discussion Subject: Re: Enclosing BANKNOTE emoji? The emojiexpress.com site is useful to check which new emoji or combinations people actually use, but the stats are likely skewed by only measuring input from one platform. 
Another way to look at the emojitracker.com stats: 339M people in the Eurozone : 389K uses of Euro emoji 126M people in Japan : 354K uses of Yen emoji 140M people in UK + Turkey (likely users of the Pound emoji as a stand-in for Lira) : 515K uses of pound emoji The total is 605M people : 1258K uses of non-dollar emoji Assuming the same average frequency of use, 2933K uses of the dollar emoji would be produced by 1411M people, out of which us + canada + mexico + australia (500M) + other countries using $ as (part of) the sign for their currency are way less than a half. This means that substantially more than 500M people are using the dollar emoji by default, instead of emoji of their national currencies. Assuming a lesser frequency of use will result in a greater estimate of the affected population. Leo On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ?? > wrote: Look at http://www.emojixpress.com/stats/. The stats are different, since they collect data from keyboards not twitter posts, but they have a nice button to view only the news emoji. (The numbers on the new ones will be smaller, just because it takes time for systems to support them, and people to start using them. However, they bear out my predication that the most popular would be the eyes-rolling face). Mark On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis > wrote: A caveat about using emojitracker.com : it doesn't count newer emoji yet (e.g. U+1F37E bottle with popping cork is absent), thus, when they are added, their counts will be skewed. Leo On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis > wrote: Thank you for the links, quite mesmerizing! On emojitracker.com (cumulative counts, but only on twitter, AFAICS), U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile), and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. 
(around 20%ile, but 10x more than the lowest counts, and about the same frequency as various individual clock faces). It is quite evident that the dollar banknote emoji serves as a stand-in for at least half a dozen of various currencies. On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis wrote: I would suggest that you first gather statistics and present statistics on how often the current combinations are used compared to other emoji, eg by consulting sources such as: http://www.emojixpress.com/stats/ or http://emojitracker.com/ Mark On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote: There are 💴 U+01F4B4 Banknote With Yen Sign 💵 U+01F4B5 Banknote With Dollar Sign 💶 U+01F4B6 Banknote With Euro Sign 💷 U+01F4B7 Banknote With Pound Sign This is clearly an incomplete set. It makes sense to have a generic "enclosing banknote" emoji character which, when combined with a currency sign, would produce the corresponding banknote, to forestall requests for individual emoji for banknotes with remaining currency signs. Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From jknappen at web.de Wed Feb 10 03:38:04 2016 From: jknappen at web.de ("Jörg Knappen") Date: Wed, 10 Feb 2016 10:38:04 +0100 Subject: Aw: Re: Enclosing BANKNOTE emoji? In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From tim at shilohmediainc.com Wed Feb 10 03:58:08 2016 From: tim at shilohmediainc.com (Tim) Date: Wed, 10 Feb 2016 20:58:08 +1100 Subject: Unicode line break issue Message-ID: I have a problem with Unicode in RTF. The Syriac unicode set characters that I am using seem to be breaking characters. If I follow the unicode character word group with \~ as a non breaking space, then the word will still break on the last unicode character. Is there a way that I can stop this from happening? Can I do this by using some character before or after?
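For readers unfamiliar with the notation used in this question: RTF writes a non-ASCII character as \uN, where N is the UTF-16 code unit given as a signed 16-bit decimal number, followed by a fallback character (commonly "?") for readers that do not understand \u. A minimal sketch of that encoding in Python, assuming the default of one fallback character per escape (\uc1). The function name rtf_unicode_escape is made up for illustration.

```python
def rtf_unicode_escape(text, fallback="?"):
    """Encode text as RTF Unicode escapes, one escape per UTF-16 code unit."""
    out = []
    data = text.encode("utf-16-le")  # two bytes per UTF-16 code unit
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "little")
        if unit > 32767:
            unit -= 65536  # RTF expects N as a signed 16-bit decimal
        out.append(f"\\u{unit}{fallback}")
    return "".join(out)

# Four Syriac letters: U+071F, U+072C, U+0712, U+0710.
print(rtf_unicode_escape("\u071F\u072C\u0712\u0710"))
```

For the four Syriac letters this prints \u1823?\u1836?\u1810?\u1808?. Because the fallback exists only for readers without Unicode support, a Unicode-aware reader is expected to skip it rather than display it, so swapping the "?" for another character would probably not change where the line breaks.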
Can I change this by using some placeholder character other than "?" I have tried to use other placeholder characters other than "?" after the unicode number "\u1808?" but it seems that I haven't found the correct placeholder character to stop the line break at this point. Here is a sample of the data string that I need to keep together: \u1823?\u1836?\u1810?\u1808?\~{\cf11\~S10762}\~{\cf2\~Book} The unicode character word breaks after \u1808? even with the nonbreaking space \~ however the letters in the Syriac word do not break. What placeholder character can I use (or other characters) to prevent a line break after \u1808?, or is there another way that I can code this so that it will stay together? Any thoughts and help that you may be able to offer is appreciated, -------------- next part -------------- An HTML attachment was scrubbed... URL: From qsjn4ukr at gmail.com Thu Feb 11 08:05:30 2016 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Thu, 11 Feb 2016 16:05:30 +0200 Subject: transliteration of mjagkij znak (Cyrillic soft sign) In-Reply-To: <56B8CFD4.1070105@uni-konstanz.de> References: <56B8CFD4.1070105@uni-konstanz.de> Message-ID: I can show an example of use both, prime (as soft sign) and apostroph (hemisoft) in Cyrilic-based phonetic transcription (Orthoepic Dictionary of Ukrainian, http://padaread.com/?book=84816&pg=6 http://padaread.com/?book=84816&pg=7) From ritt.ks at gmail.com Thu Feb 11 08:36:51 2016 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Thu, 11 Feb 2016 18:36:51 +0400 Subject: transliteration of mjagkij znak (Cyrillic soft sign) In-Reply-To: References: <56B8CFD4.1070105@uni-konstanz.de> Message-ID: In Ukrainian, for example, both ??? and ?`? are used. ??? is used for softer pronounce of the preceding consonant ( ???????? ), whilst ?`? is used for splitting them, like if they were the first letter in a word, even when the next vowel sounds soft otherwise ( ???`??????? -- the last ??? sounds softer the former one ). 
Regards, Konstantin 2016-02-11 18:05 GMT+04:00 QSJN 4 UKR : > I can show an example of use both, prime (as soft sign) and apostroph > (hemisoft) in Cyrilic-based phonetic transcription (Orthoepic > Dictionary of Ukrainian, http://padaread.com/?book=84816&pg=6 > http://padaread.com/?book=84816&pg=7) > -------------- next part -------------- An HTML attachment was scrubbed... URL: From qsjn4ukr at gmail.com Thu Feb 11 08:38:45 2016 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Thu, 11 Feb 2016 16:38:45 +0200 Subject: transliteration of mjagkij znak (Cyrillic soft sign) In-Reply-To: <56B8CFD4.1070105@uni-konstanz.de> References: <56B8CFD4.1070105@uni-konstanz.de> Message-ID: The prime is used for soft sign transliteration to avoid ambiguity: the apostrophe is used for the apostrophe itself, a common sign in Ukrainian or Belarusian. From asmus-inc at ix.netcom.com Thu Feb 11 09:59:25 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 11 Feb 2016 07:59:25 -0800 Subject: transliteration of mjagkij znak (Cyrillic soft sign) In-Reply-To: References: <56B8CFD4.1070105@uni-konstanz.de> Message-ID: <56BCAFDD.3030307@ix.netcom.com> An HTML attachment was scrubbed... URL: From davidj_faulks at yahoo.ca Sun Feb 14 17:36:37 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Sun, 14 Feb 2016 23:36:37 +0000 (UTC) Subject: Copyleft Symbol References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> Hello, This subject has been discussed before, but I am somewhat uncertain about something: If the copyleft (reversed ©) symbol was proposed for encoding, with examples (from PDF files) showing it being used in a similar way to the copyright © symbol, is it likely to be accepted for encoding? Thanks for any opinions.
David From asmus-inc at ix.netcom.com Sun Feb 14 18:53:33 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 14 Feb 2016 16:53:33 -0800 Subject: Copyleft Symbol In-Reply-To: <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> Message-ID: <56C1218D.3080605@ix.netcom.com> An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Feb 14 19:18:04 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 15 Feb 2016 01:18:04 +0000 Subject: Copyleft Symbol In-Reply-To: <56C1218D.3080605@ix.netcom.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> <56C1218D.3080605@ix.netcom.com> Message-ID: <69FE572B-7286-4D18-994E-E5EFAE469871@evertype.com> On 15 Feb 2016, at 00:53, Asmus Freytag (t) wrote: > > On 2/14/2016 3:36 PM, David Faulks wrote: >> Hello, >> >> This subject has been discussed before, but I am somehwat uncertain about something: >> >> If the copyleft (reversed ?) symbol was proposed for encoding, with examples (from PDF files) showing it being used in a similar way to the copyright ? symbol, it is likely to be accepted for encoding? >> > > The key issue is whether this usage is "established". > > Showing that it has been used a few times is less useful than a good estimate of how widely it is used. No emoji for bacon was ever shown in use. People just wanted it. 
Michael Everson * http://www.evertype.com/ From asmus-inc at ix.netcom.com Sun Feb 14 21:02:52 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 14 Feb 2016 19:02:52 -0800 Subject: Copyleft Symbol In-Reply-To: <69FE572B-7286-4D18-994E-E5EFAE469871@evertype.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> <56C1218D.3080605@ix.netcom.com> <69FE572B-7286-4D18-994E-E5EFAE469871@evertype.com> Message-ID: <56C13FDC.7050809@ix.netcom.com> An HTML attachment was scrubbed... URL: From tuvalkin at gmail.com Sun Feb 14 21:42:52 2016 From: tuvalkin at gmail.com (=?UTF-8?Q?Ant=c3=b3nio_Martins-Tuv=c3=a1lkin?=) Date: Mon, 15 Feb 2016 03:42:52 +0000 Subject: Copyleft Symbol In-Reply-To: <56C1218D.3080605@ix.netcom.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> <56C1218D.3080605@ix.netcom.com> Message-ID: <56C1493C.9040405@gmail.com> On 2016.02.15 00:53, Asmus Freytag (t) wrote: > The key issue is whether this usage is "established". You can always make the case that what ever need is felt/expressed by a community is not enough. While it would be useless to point out that copyleft is more needed (i.e., if encoded would be used way more often) than 99% of the the whole reportoire of Unicode (like U+A66E, which is used in one single word, a weird one, too, and only optionally?), its usage is less massive than the symbols of the Creative Commons licences: the cc-ring symbol itself, and the symbols for its clauses: "share alike", "non-commercial", "attribution", and "no derivative works". See: http://en.wikipedia.org/wiki/Creative_Commons_license#Types_of_licenses I don?t miss these symbols terribly, but then again I never cared for the disunification (or non-unification) of "?" and "?", "?" and "?", and "?" and "?" ? so I calmly use instead "??" (copyleft), "??" 
(creative commons), "??" (share alke), "$?" (non-commercial), "???" (attribution), and "?" (no derivative works), in spite of the inadequate semantics. -- ____. Ant?nio MARTINS-Tuv?lkin | ()| N?o me invejo de quem tem|####| PT-2695-010 Bobadela LRS carros, parelhas e montes | +351 934 821 700, +351 212 463 477 s? me invejo de quem bebe | facebook.com/profile.php?id=744658416 a ?gua em todas as fontes | --------------------------------------------------------------------- De sable uma fonte e bordadura escaqueada de jalde e goles por timbre bandeira por mote o 1? verso acima e por grito de guerra "Mi rajtas!" --------------------------------------------------------------------- From asmus-inc at ix.netcom.com Sun Feb 14 23:33:22 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 14 Feb 2016 21:33:22 -0800 Subject: Copyleft Symbol In-Reply-To: <56C1493C.9040405@gmail.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> <56C1218D.3080605@ix.netcom.com> <56C1493C.9040405@gmail.com> Message-ID: <56C16322.1040709@ix.netcom.com> An HTML attachment was scrubbed... URL: From davidj_faulks at yahoo.ca Mon Feb 15 05:18:25 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Mon, 15 Feb 2016 11:18:25 +0000 (UTC) Subject: Copyleft Symbol References: <254948940.3422556.1455535105174.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <254948940.3422556.1455535105174.JavaMail.yahoo@mail.yahoo.com> > Sun, 2/14/16, Asmus Freytag (t) wrote: > Subject: Re: Copyleft Symbol > To: unicode at unicode.org > Received: Sunday, February 14, 2016, 7:53 PM >> On 2/14/2016 3:36 PM, David Faulks wrote: < text cut> >> If the copyleft (reversed ?) symbol was proposed >> for encoding, with examples (from PDF files) >> showing it being used in a similar way to the >> copyright ? symbol, it is likely to be accepted >> for encoding? > The key issue is whether this usage is "established". 
> > Showing that it has been used a few times is less > useful than a good estimate of how widely it is > used. > > A./ An estimate is difficult, other than usage being rare. The symbol itself is widely known and here to stay (there was actually a discussion about encoding it back in 2000 on this mailing list). A google search for ?copyleft symbol? reveals many results (such as ?[ubuntu] Using copyleft symbol in text - Ubuntu Forums?), so I would say there is demand for this. The samples I have seem to be from people who want to make a statement via an anti-copyright message, are familiar with the term ?copyleft? and the associated symbol, are playful enough to want to use the symbol instead of a more formal message like creative commons (after all, the copyleft symbol has no legal standing),but are willing to go to the extra effort of using a non-standard symbol for a small message that might not even be noticed. David From chris.fynn at gmail.com Mon Feb 15 06:04:42 2016 From: chris.fynn at gmail.com (Christopher Fynn) Date: Mon, 15 Feb 2016 17:49:42 +0545 Subject: Copyleft Symbol In-Reply-To: <254948940.3422556.1455535105174.JavaMail.yahoo@mail.yahoo.com> References: <254948940.3422556.1455535105174.JavaMail.yahoo.ref@mail.yahoo.com> <254948940.3422556.1455535105174.JavaMail.yahoo@mail.yahoo.com> Message-ID: On 15/02/2016, David Faulks wrote: > .....(there was actually a discussion about encoding it back in 2000 on this mailing list). Presumably that indicates at least 15 years of usage - far longer than most emoji. 
- Chris From johannes at bergerhausen.com Mon Feb 15 06:07:37 2016 From: johannes at bergerhausen.com (Johannes Bergerhausen) Date: Mon, 15 Feb 2016 13:07:37 +0100 Subject: Copyleft Symbol In-Reply-To: <56C1218D.3080605@ix.netcom.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> <56C1218D.3080605@ix.netcom.com> Message-ID: On 15.02.2016 at 01:53, Asmus Freytag wrote: > The key issue is whether this usage is "established". It is established as soon as it is part of Unicode :) > Showing that it has been used a few times is less useful than a good estimate of how widely it is used. iOS has about 1 billion active products; Android more than that, so I guess that makes 2 billion possible users. Johannes From ken.shirriff at gmail.com Mon Feb 15 10:15:57 2016 From: ken.shirriff at gmail.com (Ken Shirriff) Date: Mon, 15 Feb 2016 08:15:57 -0800 Subject: Copyleft Symbol In-Reply-To: <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> References: <2061096046.3293196.1455492997304.JavaMail.yahoo.ref@mail.yahoo.com> <2061096046.3293196.1455492997304.JavaMail.yahoo@mail.yahoo.com> Message-ID: My advice: The most important thing is to have enough examples of the symbol in use in running text (i.e. not an icon or logo). Real published documents that demonstrate a user community are important. I recommend studying Unicode's Criteria for Encoding Symbols carefully. The rules for emoji are totally different, so saying "but emoji..." is meaningless. The proposal to add the power symbol to Unicode is a good example that you can use as a model. As for the copyleft symbol, it's well-defined (has a Wikipedia page) and a web search shows demand for the symbol. It is used in running text and has semantic meaning. You found it goes back to 2000, so it's not a transient fad.
I think a proposal would have a good chance of success if you can find a number of good examples of usage. This is my personal advice - I don't speak for anyone - but I've had a couple symbols accepted so these guidelines work for me. Ken On Sun, Feb 14, 2016 at 3:36 PM, David Faulks wrote: > Hello, > > This subject has been discussed before, but I am somehwat uncertain about > something: > > If the copyleft (reversed ?) symbol was proposed for encoding, with > examples (from PDF files) showing it being used in a similar way to the > copyright ? symbol, it is likely to be accepted for encoding? > > Thanks for any opinions. > > David > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Feb 15 11:32:01 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Feb 2016 10:32:01 -0700 Subject: Copyleft Symbol Message-ID: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> Asmus Freytag wrote: > with the non-standard symbols like the copyleft, there's the desire to > not encode stuff based on "passing activism". David Faulks wrote: > The samples I have seem to be from people who want to make a statement > via an anti-copyright message The lengthy thread from 2000, and the shorter one from 2012, show that the objections at those times fell into three main categories: (1) Lack of (sufficient) evidence of use as an element of running text, as opposed to a logo. There's an interesting passage on the FSF page "What is Copyleft?" about this symbol: "It is a legal mistake to use a backwards C in a circle instead of a copyright symbol. Copyleft is based legally on copyright, so the work should have a copyright notice. A copyright notice requires either the copyright symbol (a C in a circle) or the word 'Copyright'. [ ... ] A backwards C in a circle has no special legal significance, so it doesn't make a copyright notice." (2) Concern that the symbol was a passing fad. 
Christopher and Ken noted that the fact we are talking about it again 15 years later probably answers that concern. (3) The social-statement aspect. António wrote in 2012, referring to the copyleft symbol plus the others he just cited (e.g. Creative Commons): "I am convinced that they were not accepted for encoding (if they were ever even formally proposed) due purely to ideological reasons." However, I checked the UTC document register going back to 2000 and could not find a proposal with the word "copyleft" in its title, so perhaps these have not been proposed after all. The recent acceptance by UTC of BITCOIN SIGN, which is also often perceived as a logo and also sometimes associated with a social movement, might indicate greater willingness of UTC to encode the copyleft symbol, even discounting the effects of the Emoji Revolution. But as always, at least for non-emoji characters, a formal proposal is probably mandatory. -- Doug Ewell | http://ewellic.org | Thornton, CO From asmus-inc at ix.netcom.com Mon Feb 15 13:29:25 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 15 Feb 2016 11:29:25 -0800 Subject: Copyleft Symbol In-Reply-To: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> References: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> Message-ID: <56C22715.7060406@ix.netcom.com> An HTML attachment was scrubbed... URL: From rwhlk142 at gmail.com Mon Feb 15 19:08:01 2016 From: rwhlk142 at gmail.com (Robert Wheelock) Date: Mon, 15 Feb 2016 20:08:01 -0500 Subject: Copyleft Symbol In-Reply-To: <56C22715.7060406@ix.netcom.com> References: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> <56C22715.7060406@ix.netcom.com> Message-ID: Hi! Shouldn't the COPYLEFT SIGN be a small circled L?! It's something to think about... Thank You!
On Mon, Feb 15, 2016 at 2:29 PM, Asmus Freytag (t) wrote: > On 2/15/2016 9:32 AM, Doug Ewell wrote: > > Asmus Freytag wrote: > > > with the non-standard symbols like the copyleft, there's the desire to > not encode stuff based on "passing activism". > > David Faulks wrote: > > > The samples I have seem to be from people who want to make a statement > via an anti-copyright message > > The lengthy thread from 2000, and the shorter one from 2012, show that > the objections at those times fell into three main categories: > > (1) Lack of (sufficient) evidence of use as an element of running text, > as opposed to a logo. > > I take it that this has been addressed (modulo the usual difficulties > about proving that for > unencoded symbols). > > There's an interesting passage on the FSF page "What is Copyleft?" about > this symbol: > > "It is a legal mistake to use a backwards C in a circle instead of a > copyright symbol. Copyleft is based legally on copyright, so the work > should have a copyright notice. A copyright notice requires either the > copyright symbol (a C in a circle) or the word 'Copyright'. [ ... ] A > backwards C in a circle has no special legal significance, so it doesn't > make a copyright notice." > > > Unicode has always recognized usage over official status. So this should > not be an issue. > > > (2) Concern that the symbol was a passing fad. Christopher and Ken noted > that the fact we are talking about it again 15 years later probably > answers that concern. > > Very good point. > > > (3) The social-statement aspect. > > Ant?nio wrote in 2012, referring to the copyleft symbol plus the others > he just cited (e.g. Creative Commons): "I am convinced that they were > not accepted for encoding (if they were ever even formally proposed) due > purely to ideological reasons." 
However, I checked the UTC document > register going back to 2000 and could not find a proposal with the word > "copyleft" in its title, so perhaps these have not been proposed after > all. > > A proposal is needed, discussion on this list is useful only as far as a > proposer wants to get some suggestions on how to proceed. > > > The recent acceptance by UTC of BITCOIN SIGN, which is also often > perceived as a logo and also sometimes associated with a social > movement, might indicate greater willingness of UTC to encode the > copyleft symbol, even discounting the effects of the Emoji Revolution. > > But as always, at least for non-emoji characters, a formal proposal is > probably mandatory. > > Delete "probably". > > A./ > > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Mon Feb 15 20:29:04 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 15 Feb 2016 18:29:04 -0800 Subject: Copyleft Symbol In-Reply-To: References: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> <56C22715.7060406@ix.netcom.com> Message-ID: <56C28970.4000902@ix.netcom.com> An HTML attachment was scrubbed... URL: From mats.gbproject at gmail.com Mon Feb 15 17:32:49 2016 From: mats.gbproject at gmail.com (Mats Blakstad) Date: Tue, 16 Feb 2016 00:32:49 +0100 Subject: Possible to add new precomposed characters for local language in Togo? Message-ID: I've worked to upload a keyboard for local languages in Togo to XKB project, it is a combination keyboard based on French keyboard and extended to make it possible to write all the local languages in Togo. However many of the languages have several tones and even use combined tones. However when I tried to update the composer to make it work it seems like the composer only can give back a precomposed character and not a string with combined characters. 
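Whether a base letter plus a tone mark already has a precomposed form can be screened with Python's standard unicodedata module: if NFC normalization collapses the pair to one code point, a precomposed character exists and is used by NFC. A minimal sketch follows. The helper name has_precomposed is made up for this example, and the sample letters (plain e, U+025B, U+0254) are illustrative picks.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

def has_precomposed(base, mark):
    """True if NFC composes base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

print(has_precomposed("e", ACUTE))       # plain e + acute
print(has_precomposed("\u025B", ACUTE))  # U+025B + acute
print(has_precomposed("\u0254", ACUTE))  # U+0254 + acute
```

A False result only says that NFC does not compose the pair, so treat this as a quick screen rather than a search of every encoded character.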
I now wonder, generally, is it best to add new precomposed characters to Unicode? Should there be a unicode symbol for each combination used? What is best practise? I ask because I see some unicodes are precomposed characters, I'm not sure why they are useful, but if they are maybe we also should add these? For reference here are the combinations needed, as you can see there are many! I've tried to check over, I don't think there exists precomposed characters for any of them. ? / epsilon = U025B : "??" LATIN SMALL LETTER EPSILON WITH ACUTE : "??" LATIN SMALL LETTER EPSILON WITH GRAVE : "??" LATIN SMALL LETTER EPSILON WITH CIRCUMFLEX : "??" LATIN SMALL LETTER EPSILON WITH CARON : "??" LATIN SMALL LETTER EPSILON WITH MACRON : "??" LATIN SMALL LETTER EPSILON WITH TILDE : "???" LATIN SMALL LETTER EPSILON WITH TILDE AND ACUTE : "???" LATIN SMALL LETTER EPSILON WITH TILDE AND GRAVE ? / EPSILON = U0190 : "??" LATIN CAPITAL LETTER EPSILON WITH ACUTE : "??" LATIN CAPITAL LETTER EPSILON WITH GRAVE : "??" LATIN CAPITAL LETTER EPSILON WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER EPSILON WITH CARON : "??" LATIN CAPITAL LETTER EPSILON WITH MACRON : "??" LATIN CAPITAL LETTER EPSILON WITH TILDE : "???" LATIN CAPITAL LETTER EPSILON WITH TILDE AND ACUTE : "???" LATIN CAPITAL LETTER EPSILON WITH TILDE AND GRAVE ? / iota = U0269 : "??" LATIN SMALL LETTER IOTA WITH ACUTE : "??" LATIN SMALL LETTER IOTA WITH GRAVE : "??" LATIN SMALL LETTER IOTA WITH CIRCUMFLEX : "??" LATIN SMALL LETTER IOTA WITH CARON : "??" LATIN SMALL LETTER IOTA WITH MACRON ? / IOTA = U0196 : "??" LATIN CAPITAL LETTER IOTA WITH ACUTE : "??" LATIN CAPITAL LETTER IOTA WITH GRAVE : "??" LATIN CAPITAL LETTER IOTA WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER IOTA WITH CARON : "??" LATIN CAPITAL LETTER IOTA WITH MACRON ? / open o = U0254 : "??" LATIN SMALL LETTER OPEN O WITH ACUTE : "??" LATIN SMALL LETTER OPEN O WITH GRAVE : "??" LATIN SMALL LETTER OPEN O WITH CIRCUMFLEX : "??" LATIN SMALL LETTER OPEN O WITH CARON : "??" 
LATIN SMALL LETTER OPEN O WITH MACRON : "??" LATIN SMALL LETTER OPEN O WITH TILDE : "???" LATIN SMALL LETTER OPEN O WITH TILDE AND ACUTE : "???" LATIN SMALL LETTER OPEN O WITH TILDE AND GRAVE ? / OPEN O = U0186 : "??" LATIN CAPITAL LETTER OPEN O WITH ACUTE : "??" LATIN CAPITAL LETTER OPEN O WITH GRAVE : "??" LATIN CAPITAL LETTER OPEN O WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER OPEN O WITH CARON : "??" LATIN CAPITAL LETTER OPEN O WITH MACRON : "??" LATIN CAPITAL LETTER OPEN O WITH TILDE : "???" LATIN CAPITAL LETTER OPEN O WITH TILDE AND ACUTE : "???" LATIN CAPITAL LETTER OPEN O WITH TILDE AND GRAVE ? / turned e = U01DD : "??" LATIN SMALL LETTER TURNED E WITH ACUTE : "??" LATIN SMALL LETTER TURNED E WITH GRAVE : "??" LATIN SMALL LETTER TURNED E WITH CIRCUMFLEX : "??" LATIN SMALL LETTER TURNED E WITH CARON : "??" LATIN SMALL LETTER TURNED E WITH MACRON : "??" LATIN SMALL LETTER TURNED E WITH TILDE : "???" LATIN SMALL LETTER TURNED E WITH TILDE AND ACUTE : "???" LATIN SMALL LETTER TURNED E WITH TILDE AND GRAVE ? / TURNED E = U018E : "??" LATIN CAPITAL LETTER TURNED E WITH ACUTE : "??" LATIN CAPITAL LETTER TURNED E WITH GRAVE : "??" LATIN CAPITAL LETTER TURNED E WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER TURNED E WITH CARON : "??" LATIN CAPITAL LETTER TURNED E WITH MACRON : "??" LATIN CAPITAL LETTER TURNED E WITH TILDE : "???" LATIN CAPITAL LETTER TURNED E WITH TILDE AND ACUTE : "???" LATIN CAPITAL LETTER TURNED E WITH TILDE AND GRAVE ? / v with hook = U028B : "??" LATIN SMALL LETTER V WITH HOOK WITH ACUTE : "??" LATIN SMALL LETTER V WITH HOOK WITH GRAVE : "??" LATIN SMALL LETTER V WITH HOOK WITH CIRCUMFLEX : "??" LATIN SMALL LETTER V WITH HOOK WITH CARON : "??" LATIN SMALL LETTER V WITH HOOK WITH MACRON ? / V WITH HOOK = U01B2 : "??" LATIN CAPITAL LETTER V WITH HOOK WITH ACUTE : "??" LATIN CAPITAL LETTER V WITH HOOK WITH GRAVE : "??" LATIN CAPITAL LETTER V WITH HOOK WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER V WITH HOOK WITH CARON : "??" 
LATIN CAPITAL LETTER V WITH HOOK WITH MACRON ? / upsilon = U028A : "??" LATIN SMALL LETTER UPSILON WITH ACUTE : "??" LATIN SMALL LETTER UPSILONK WITH GRAVE : "??" LATIN SMALL LETTER UPSILON WITH CIRCUMFLEX : "??" LATIN SMALL LETTER UPSILON WITH CARON : "??" LATIN SMALL LETTER UPSILON WITH MACRON ? / UPSILON = U01B1 : "??" LATIN CAPITAL LETTER UPSILON WITH ACUTE : "??" LATIN CAPITAL LETTER UPSILONK WITH GRAVE : "??" LATIN CAPITAL LETTER UPSILON WITH CIRCUMFLEX : "??" LATIN CAPITAL LETTER UPSILON WITH CARON : "??" LATIN CAPITAL LETTER UPSILON WITH MACRON a : "a??" LATIN SMALL LETTER A WITH TILDE AND ACUTE : "a??" LATIN SMALL LETTER A WITH TILDE AND GRAVE A : "A??" LATIN CAPITAL LETTER A WITH TILDE AND ACUTE : "A??" LATIN CAPITAL LETTER A WITH TILDE AND GRAVE e : "e??" LATIN SMALL LETTER E WITH TILDE AND ACUTE : "e??" LATIN SMALL LETTER E WITH TILDE AND GRAVE E : "E??" LATIN CAPITAL LETTER E WITH TILDE AND ACUTE : "E??" LATIN CAPITAL LETTER E WITH TILDE AND GRAVE i : "i??" LATIN SMALL LETTER I WITH TILDE AND ACUTE : "i??" LATIN SMALL LETTER I WITH TILDE AND GRAVE I : "I??" LATIN CAPITAL LETTER I WITH TILDE AND ACUTE : "I??" LATIN CAPITAL LETTER I WITH TILDE AND GRAVE o : "o??" LATIN SMALL LETTER O WITH TILDE AND GRAVE O : "O??" LATIN CAPITAL LETTER O WITH TILDE AND GRAVE u : "u??" LATIN SMALL LETTER U WITH TILDE AND GRAVE U : "U??" LATIN CAPITAL LETTER U WITH TILDE AND GRAVE m : "m?" LATIN SMALL LETTER M WITH GRAVE M : "M?" LATIN CAPITAL LETTER M WITH GRAVE ? / eng = U014B : "??" LATIN SMALL LETTER ENG WITH ACUTE : "??" LATIN SMALL LETTER ENG WITH GRAVE ? / ENG = U014A : "??" LATIN CAPITAL LETTER ENG WITH ACUTE : "??" LATIN CAPITAL LETTER ENG WITH GRAVE -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Mon Feb 15 22:46:28 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 15 Feb 2016 20:46:28 -0800 Subject: Possible to add new precomposed characters for local language in Togo? 
In-Reply-To: References: Message-ID: <56C2A9A4.8060300@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Feb 16 02:00:26 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 16 Feb 2016 09:00:26 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: Message-ID: 2016-02-16 0:32 GMT+01:00 Mats Blakstad : > I've worked to upload a keyboard for local languages in Togo to XKB > project, it is a combination keyboard based on French keyboard and extended > to make it possible to write all the local languages in Togo. However many > of the languages have several tones and even use combined tones. However > when I tried to update the composer to make it work it seems like the > composer only can give back a precomposed character and not a string with > combined characters. > > I now wonder, generally, is it best to add new precomposed characters to > Unicode? Should there be a unicode symbol for each combination used? What > is best practise? I ask because I see some unicodes are precomposed > characters, I'm not sure why they are useful, but if they are maybe we also > should add these? > You don't need that. Keyboard layouts MUST generate the combining sequence. (It's then up to the text editors and softwares to adapt themselves to the possibility that a single keystroke could generate multiple characters/code points, and to handle themselves the case of text selection and corrections by grapheme cluster rather than by single character/code point: this is already done in many softwares, including for the Latin script). However, Unicode could standardize names for these "common" combinations (without assigning new code points, which is not needed). There's already a supplementary datafile for them. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Tue Feb 16 02:14:12 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 16 Feb 2016 09:14:12 +0100 Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: Message-ID: Note that I have also produced my own keyboard several years ago containing almost all characters or sequences needed for African languages in the Latin script, also based on an extension of the French (AZERTY) keyboard. It contains also additions for other European languages (notably German, Dutch, Spanish, Scholar Latin, Czech, Serbian Latin...), or romanizations of other languages (notably Japanese), so it also includes the macron (for Japanese Romaji), breve (for Scholar Latin), caron (Slavic languages), dot below (for Maltese), and additional letters (ij/IJ ligatures for Dutch, o/O with stroke, and other letters used in IPA). I've not extended it though for Chinese romanizations (there are several conventions for tone marks), or Vietnamese (two diacritics needed also for tone marks in addition to vowel modifiers). 2016-02-16 9:00 GMT+01:00 Philippe Verdy : > 2016-02-16 0:32 GMT+01:00 Mats Blakstad : > >> I've worked to upload a keyboard for local languages in Togo to XKB >> project, it is a combination keyboard based on French keyboard and extended >> to make it possible to write all the local languages in Togo. However many >> of the languages have several tones and even use combined tones. However >> when I tried to update the composer to make it work it seems like the >> composer only can give back a precomposed character and not a string with >> combined characters. >> >> I now wonder, generally, is it best to add new precomposed characters to >> Unicode? Should there be a unicode symbol for each combination used? What >> is best practise? I ask because I see some unicodes are precomposed >> characters, I'm not sure why they are useful, but if they are maybe we also >> should add these? 
>> > > You don't need that. > > Keyboard layouts MUST generate the combining sequence. (It's then up to > the text editors and softwares to adapt themselves to the possibility that > a single keystroke could generate multiple characters/code points, and to > handle themselves the case of text selection and corrections by grapheme > cluster rather than by single character/code point: this is already done in > many softwares, including for the Latin script). > > However, Unicode could standardize names for these "common" combinations > (without assigning new code points, which is not needed). There's already a > supplementary datafile for them. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Feb 16 02:29:45 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 16 Feb 2016 09:29:45 +0100 Subject: Copyleft Symbol In-Reply-To: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> References: <20160215103201.665a7a7059d7ee80bb4d670165c8327d.2257360314.wbe@email03.secureserver.net> Message-ID: 2016-02-15 18:32 GMT+01:00 Doug Ewell : > The recent acceptance by UTC of BITCOIN SIGN, which is also often > perceived as a logo and also sometimes associated with a social > movement, might indicate greater willingness of UTC to encode the > copyleft symbol, even discounting the effects of the Emoji Revolution. 
> The bitcoin is more than just a social movement, given that the currency was given an official status in international bank systems (first time in Germany I think), meaning that it is now legal to have accounts in that highly speculative foreign currency, and to trade it on legal change markets, and also list prices in that currency (for transactions outside domestic markets where prices in the national currency are still required in almost all countries), even if there's no physical currency and no bank of emission (replaced by an informal organisation and a group of private banks accepting or trading it). -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Feb 16 08:01:13 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Feb 2016 15:01:13 +0100 (CET) Subject: Possible to add new precomposed characters for local language in Togo? In-Reply-To: References: Message-ID: <1099467789.7736.1455631273655.JavaMail.www@wwinf2226> On 2/15/2016 3:32 PM, Mats Blakstad wrote: [?] > I now wonder, generally, is it best to add new precomposed characters to Unicode? Should there be a unicode symbol for each combination used? What is best practise? I ask because I see some unicodes are precomposed characters, I'm not sure why they are useful, but if they are maybe we also should add these? [?] On Mon, 15 Feb 2016 20:46:28 -0800, Asmus Freytag (t) answered?: [?] > However, precomposing these is simply out. Unicode locked that door and threw away the key (short answer). The long answer will come along shortly. Existing precomposed characters have been proposed before the deadline, i.e. in the past millennium, and encoded for backwards compatibility. Therefore, the scripts of many Latin-writing countries, including Vietnam, can be represented both in NFD *and* NFC, but this is purely fortuitous. 
Since the well-known Unicode encoding scheme is based on _combining diacritics_, part of implementing it consists in supporting these at all stages of data processing, including input. The big oopsie that you stumbled upon is that Windows keyboard layout drivers (as opposed to Linux) cannot generate more than one single UTF-16 code unit by dead keys. Supposedly this is due to a gap in keyboard standardization. When ISO/IEC 9995 was published in 1994, after a decade of work (and when Unicode had already been thriving for a couple of years), the standard provided nothing to cater for Unicode implementation. A bit later, the Windows keyboard APIs were frozen, for backwards compatibility. Indeed there _is_ a problem. But there are solutions. On Tue, 16 Feb 2016 09:00:26 +0100, Philippe Verdy answered: [...] > Keyboard layouts MUST generate the combining sequence. [...] Indeed Unicode states that "it is straightforward to adapt such a system" of dead keys to output combining sequences as well, and that was the idea when ISO/IEC 9995-11 was added last year. That last and most recent part of the standard specifies the algorithm of an IME that uses the NormalizeString function or the String Normalize method provided by the OS. You may wish to look up the long description in French Wikipédia [1]. On Windows there is however no need of a *new* and ISO/IEC-conformant IME, as Keyman keyboard layouts are already able to generate whatever sequence is required, from whatever input is specified, with dead keys or visible on screen. If you checked the Pan Africa (Deadkeys) layout that is suitable for Togo and many other African countries, as well as the official SIL Pan Africa keyboard, and they don't match your requirements (because diacritics are entered _after_ the base letter, even to get existing precomposed letters output), you may wish to create a layout that outputs combining sequences entered by dead keys, using Keyman Developer.
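The one-code-unit limit on dead keys can be checked against the combinations requested for Togo. The following is only a minimal sketch, assuming Python's standard unicodedata module; the helper names nfc and utf16_units are made up for the illustration. It shows that e + acute composes to a single code point under NFC, while the open e (U+025B, called epsilon in the list) + acute has no precomposed form and therefore always needs more than one UTF-16 code unit.

```python
import unicodedata

def nfc(text: str) -> str:
    """Return the NFC (canonically composed) form of text."""
    return unicodedata.normalize("NFC", text)

def utf16_units(text: str) -> int:
    """Count the UTF-16 code units needed to encode text."""
    return len(text.encode("utf-16-le")) // 2

# e + COMBINING ACUTE ACCENT composes to the precomposed U+00E9.
print(utf16_units(nfc("e\u0301")))

# Open e (U+025B) + acute: no precomposed character exists, NFC leaves the sequence alone.
print(utf16_units(nfc("\u025b\u0301")))

# Open e + tilde + acute, one of the combined-tone cases from the list.
print(utf16_units(nfc("\u025b\u0303\u0301")))
```

Because the normalization stability policy prevents new canonical compositions from being added, a layout for these languages has to emit the sequences themselves, whatever mechanism is used to do so.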
Experience shows however that training on dead key layouts, as used for French, can be extended to the use of combining diacritics entered after the base letter, with an appropriate keyboard layout driver. These combining characters being actually the most useful form of most diacritics, it is recommended that they be generated when the space bar is hit after a dead key, if dead keys are present. More obviously, all needed diacritics are allocated to key positions, so that they can be added to any letter by means of a single keystroke. One example is the keyboard layout for Bamanankan and French on the /Mali Pense/ site that Don Osborn's /Beyond Niamey/ blog links to [2]. Anyway, entering diacritics _after_ the base letter is the most up-to-date way to input composed characters, because it is very intuitive, and because it realizes the spirit of the character representation scheme of Unicode. I hope that helps too. Best regards, Marcel [1] https://fr.wikipedia.org/wiki/ISO/CEI_9995#ISO.2FCEI_9995-11_-_Les_touches_mortes [2] Don Osborn. Beyond Niamey: Writing Bambara right. (2014, November 25). Retrieved October 22, 2015, from http://niamey.blogspot.fr/2014/11/writing-bambara-right.html From leob at mailcom.com Wed Feb 17 16:11:06 2016 From: leob at mailcom.com (Leo Broukhis) Date: Wed, 17 Feb 2016 14:11:06 -0800 Subject: "I love you" hand gesture emoji? Message-ID: Not that I need it a lot, but I'm curious if an emoji for https://en.wikipedia.org/wiki/ILY_sign has ever been requested. Leo From simon at simon-cozens.org Wed Feb 17 16:38:32 2016 From: simon at simon-cozens.org (Simon Cozens) Date: Thu, 18 Feb 2016 09:38:32 +1100 Subject: "I love you" hand gesture emoji? In-Reply-To: References: Message-ID: <56C4F668.7010407@simon-cozens.org> On 18/02/2016 09:11, Leo Broukhis wrote: > Not that I need it a lot, but I'm curious if an emoji for > https://en.wikipedia.org/wiki/ILY_sign has ever been requested. I don't think this has been proposed but I'd love to see it.
That Wikipedia article has an American focus but I've seen it used by Deaf people all over the world; it's independent of the finger spelling alphabet in use, and has become a broader cultural symbol. Would be great to have this available for Deaf users. Here's a dekome implementation: http://mjf.jp/view.php?fid=deco4923f9c8f05ec From davidj_faulks at yahoo.ca Thu Feb 18 17:39:25 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Thu, 18 Feb 2016 23:39:25 +0000 (UTC) Subject: Astrology Symbols Again References: <1649409503.5227096.1455838765420.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1649409503.5227096.1455838765420.JavaMail.yahoo@mail.yahoo.com> I've put together a PDF file that contains some "samples" of astrological symbols not yet in Unicode, linked below: https://drive.google.com/file/d/0B8Yuf2UKJsLHN2JvOUNZV3o0THc/view?usp=sharing It should be noted that while almost all astrology software can produce images that contain listings of positions, aspects, and midpoints with whatever symbols might be found in a chart, it is rare to actually post these listings, so "in-text" use of astrological symbols is far rarer than their use in a horoscope chart. Some of these symbols I only have from a chart so far. I don't think I will be able to propose all of these for Unicode. I would, however, appreciate any comments, especially about the various symbols for Pluto: the creation of U+26E2 ⛢ seems to indicate that different symbols for the same object should be encoded separately, but I have at least 4 extra symbols here (perhaps more, due to very variable glyphs).
David From lokedhs at gmail.com Sat Feb 20 04:23:13 2016 From: lokedhs at gmail.com (Elias Mårtenson) Date: Sat, 20 Feb 2016 18:23:13 +0800 Subject: Character folding in text editors Message-ID: Hello Unicode, I have been involved in a rather long discussion on the Emacs-devel mailing list[1] concerning the right way to do character folding and we've reached a point where input from Unicode experts would be welcome. The problem is the implementation of equivalence when searching for characters. For example, if I have a buffer containing the following characters (both using the precomposed and canonical forms): o ? ? ? n ? The character folding feature in Emacs allows a search for "o" to match some or even all of these characters. The discussion on the mailing list has revolved around both the fact that the correct behaviour here is locale-dependent, and also the correct way to implement this matching absent any locale-specific exceptions. An English speaker would probably expect a search for "o" to match the first 4 characters and a search for "n" to match the latter two. A Spanish speaker would expect that n and ? be different but otherwise have the same behaviour as the English user. A Swedish user would definitely expect o and ? to compare differently, but ? and ? to compare the same. I have been reading the materials on unicode.org trying to see if this has been specifically addressed anywhere by the Unicode Consortium, but my results are inconclusive at best. What is the "correct" way to do this from Unicode's perspective? There is clearly an aspect of locale-dependence here, but how far can the Unicode data help? In particular, as far as I can see there is no way that the Unicode charts can allow me to write an algorithm where o and ? are seen as similar (as would be expected by an English user).
[1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Sat Feb 20 11:11:03 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sat, 20 Feb 2016 18:11:03 +0100 Subject: Character folding in text editors In-Reply-To: References: Message-ID: <20160220181103.33145759rv1b1nvr@mail.mimuw.edu.pl> Quote/Cytat - Elias Mårtenson (Sat 20 Feb 2016 11:23:13 AM CET): > Hello Unicode, > > I have been involved in a rather long discussion on the Emacs-devel mailing > list[1] concerning the right way to do character folding and we've reached > a point where input from Unicode experts would be welcome. > > The problem is the implementation of equivalence when searching for > characters. For example, if I have a buffer containing the following > characters (both using the precomposed and canonical forms): > > o ? ? ? n ? > > The character folding feature in Emacs allows a search for "o" to match some > or even all of these characters. The discussion on the mailing list has > circulated around both the fact that the correct behaviour here is > locale-dependent, and also on the correct way to implement this matching > absent any locale-specific exceptions. What about just using the POSIX equivalence classes in regular expressions? From http://www.regular-expressions.info/posixbrackets.html A POSIX locale can define character equivalents that indicate that certain characters should be considered as identical for sorting. In French, for example, accents are ignored when ordering words. élève comes before être which comes before événement. é and ê are all the same as e, but l comes before t which comes before v. With the locale set to French, a POSIX-compliant regular expression engine matches e, é, è and ê when you use the collating sequence [=e=] in the bracket expression [[=e=]]. Regards Janusz (an Emacs user) -- Prof. dr hab. Janusz S. Bień
- Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From verdy_p at wanadoo.fr Sat Feb 20 11:27:41 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 20 Feb 2016 18:27:41 +0100 Subject: Character folding in text editors In-Reply-To: References: Message-ID: Unless we have case folding tailored by language, you cannot do that based on the Unicode database alone. However CLDR provides tailored data about collation. >From my point of view, it is just a matter or selecting the collation strength to use for searches using collation. All collations in CLDR are locale-dependant (the search algorithm must be using either a language preselection, or detect the default language used by the document, or set explicitly in specific fragments of the document, or use some hints to guess what could be the effective language), even if CLDR also defines a "root" locale for use in language-neutral contexts, or when the language cannot be determined from the document or its metadata. 2016-02-20 11:23 GMT+01:00 Elias M?rtenson : > Hello Unicode, > > I have been involved in a rather long discussion on the Emacs-devel > mailing list[1] concerning the right way to do character folding and we've > reached a point where input from Unicode experts would be welcome. > > The problem is the implementation of equivalence when searching for > characters. For example, if I have a buffer containing the following > characters (both using the precomposed and canonical forms): > > o ? ? ? n ? > > The character folding feature in Emacs allows a search for "o" to mach > some or even all of these characters. The discussion on the mailing list > has circulated around both the fact that the correct behaviour here is > locale-dependent, and also on the correct way to implement this matching > absent any locale-specific exceptions. 
> > An English speaker would probably expect a search for "o" to match the > first 4 characters and a search for "n" to match the latter two. > > A Spanish speaker would expect that n and ? be different but otherwise > have the same behaviour as the English user. > > A Swedish user would definitely expect o and ? to compare differently, but > ? and ? to compare the same. > > I have been reading the materials on unicode.org trying to see if this > has been specifically addressed anywhere by the Unicode Consortium, but my > results are inconclusive at best. > > What is the "correct" way to do this from Unicode's perspective? There is > clearly an aspect of locale-dependence here, but how far can the Unicode > data help? > > In particular, as far as I can see there is no way that the Unicode charts > can allow me to write an algorithm where o and ? are seen as similar (as > would be expected by an English user). > > [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sat Feb 20 11:56:26 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sat, 20 Feb 2016 19:56:26 +0200 Subject: Character folding in text editors In-Reply-To: (message from Philippe Verdy on Sat, 20 Feb 2016 18:27:41 +0100) References: Message-ID: <83lh6fnu45.fsf@gnu.org> > From: Philippe Verdy > Date: Sat, 20 Feb 2016 18:27:41 +0100 > Cc: unicode Unicode Discussion > > Unless we have case folding tailored by language, you cannot do that based on the Unicode database alone. What about language-independent character-folding: where in the Unicode database is the data for that? From jsbien at mimuw.edu.pl Sat Feb 20 12:03:31 2016 From: jsbien at mimuw.edu.pl (Janusz S. 
Bien) Date: Sat, 20 Feb 2016 19:03:31 +0100 Subject: Character folding in text editors In-Reply-To: References: Message-ID: <20160220190331.558859b47tao6dmb@mail.mimuw.edu.pl> Quote/Cytat - Philippe Verdy (Sat 20 Feb 2016 06:27:41 PM CET): > Unless we have case folding tailored by language, you cannot do that based > on the Unicode database alone. > > However CLDR provides tailored data about collation. > > From my point of view, it is just a matter or selecting the collation > strength to use for searches using collation. Exactly. The POSIX equivalent classes are defined by the locale collation. Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From mark at macchiato.com Sat Feb 20 13:29:36 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 20 Feb 2016 20:29:36 +0100 Subject: Character folding in text editors In-Reply-To: References: Message-ID: Yes, that can be used. Easiest is using ICU. Create a collator, using the "search" keyword. That can be used to search for text, using settings you want for the strength (primary differences, secondary, etc). You can also access the collation keys from the ICU API, and build a mapping yourself of characters to collation keys that you can use for searching with your own algorithm. That mapping can also be used to build an equivalence class of characters that you can pick a representative from. If you don't use ICU, you can also use the CLDR data directly, but you'll have to parse it yourself. You'd start with the root locale, then add in the mappings from the children (eg de.xml). The parsing is not trivial, but since you are only looking for equivalences (not ordering), it is somewhat simpler. 
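The "map characters to keys, then build equivalence classes" idea above can be sketched as follows. This is only an illustration under loud assumptions: it does not call ICU or read CLDR data. The function search_key is a hypothetical stand-in for a real primary-strength collation key, and the KEEP_DISTINCT table is toy data built from the English, Spanish and Swedish expectations mentioned in this thread.

```python
import unicodedata
from collections import defaultdict

# Toy stand-in for locale tailorings (assumption, not CLDR data):
# characters that a locale treats as letters in their own right.
KEEP_DISTINCT = {
    "en": set(),
    "es": {"\u00f1"},   # n with tilde
    "sv": {"\u00f6"},   # o with diaeresis
}

def search_key(ch: str, locale: str) -> str:
    """Hypothetical primary-strength key: the base letter, unless the locale keeps ch distinct."""
    ch = unicodedata.normalize("NFC", ch)
    if ch in KEEP_DISTINCT.get(locale, set()):
        return ch
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def equivalence_classes(chars, locale):
    """Group characters that share the same search key in the given locale."""
    classes = defaultdict(list)
    for ch in chars:
        classes[search_key(ch, locale)].append(ch)
    return dict(classes)

sample = ["o", "\u00f6", "n", "\u00f1"]
for loc in ("en", "es", "sv"):
    print(loc, equivalence_classes(sample, loc))
```

With real collation keys in place of search_key, the dictionary key of each class can serve as the representative that a search pattern is folded to.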
Mark On Sat, Feb 20, 2016 at 6:27 PM, Philippe Verdy wrote: > Unless we have case folding tailored by language, you cannot do that based > on the Unicode database alone. > > However CLDR provides tailored data about collation. > > From my point of view, it is just a matter or selecting the collation > strength to use for searches using collation. All collations in CLDR are > locale-dependant (the search algorithm must be using either a language > preselection, or detect the default language used by the document, or set > explicitly in specific fragments of the document, or use some hints to > guess what could be the effective language), even if CLDR also defines a > "root" locale for use in language-neutral contexts, or when the language > cannot be determined from the document or its metadata. > > > > 2016-02-20 11:23 GMT+01:00 Elias M?rtenson : > >> Hello Unicode, >> >> I have been involved in a rather long discussion on the Emacs-devel >> mailing list[1] concerning the right way to do character folding and we've >> reached a point where input from Unicode experts would be welcome. >> >> The problem is the implementation of equivalence when searching for >> characters. For example, if I have a buffer containing the following >> characters (both using the precomposed and canonical forms): >> >> o ? ? ? n ? >> >> The character folding feature in Emacs allows a search for "o" to mach >> some or even all of these characters. The discussion on the mailing list >> has circulated around both the fact that the correct behaviour here is >> locale-dependent, and also on the correct way to implement this matching >> absent any locale-specific exceptions. >> >> An English speaker would probably expect a search for "o" to match the >> first 4 characters and a search for "n" to match the latter two. >> >> A Spanish speaker would expect that n and ? be different but otherwise >> have the same behaviour as the English user. >> >> A Swedish user would definitely expect o and ? 
to compare differently, >> but ? and ? to compare the same. >> >> I have been reading the materials on unicode.org trying to see if this >> has been specifically addressed anywhere by the Unicode Consortium, but my >> results are inconclusive at best. >> >> What is the "correct" way to do this from Unicode's perspective? There is >> clearly an aspect of locale-dependence here, but how far can the Unicode >> data help? >> >> In particular, as far as I can see there is no way that the Unicode >> charts can allow me to write an algorithm where o and ? are seen as similar >> (as would be expected by an English user). >> >> [1] https://lists.gnu.org/archive/html/emacs-devel/2016-02/msg00506.html >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat Feb 20 15:43:15 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 20 Feb 2016 14:43:15 -0700 Subject: Character folding in text editors In-Reply-To: References: Message-ID: <13F29763611C43C0B24B0FDADB58C062@DougEwell> Eli Zaretskii wrote: > What about language-independent character-folding: where in the > Unicode database is the data for that? The OP kind of alluded to that: there is no such thing really as language-independent character folding. About the closest approximation you can get using Unicode data alone (not CLDR) is to normalize to NFD, then ignore the combining diacritics. But that still doesn't work for a character like ?, which doesn't decompose to o + anything, and more importantly, it still won't meet expectations because of the n/? and o/?/? language-dependency problems. As Mark and Philippe said, the real solution is to use CLDR, because that is where language-dependent information like this lives. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
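The NFD-based approximation described above can be sketched in a few lines. This is a minimal illustration assuming Python's standard unicodedata module; the function name fold_diacritics is made up for the example and is not an API of Emacs or of any Unicode library.

```python
import unicodedata

def fold_diacritics(text: str) -> str:
    """Normalize to NFD, then drop every character with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Precomposed and decomposed o-with-diaeresis both fold to plain "o".
print(fold_diacritics("\u00f6") == "o")
print(fold_diacritics("o\u0308") == "o")

# n-with-tilde folds to "n", whether or not a Spanish reader wants that.
print(fold_diacritics("\u00f1") == "n")

# o-with-stroke (U+00F8) has no canonical decomposition, so it is left untouched.
print(fold_diacritics("\u00f8") == "\u00f8")
```

The last two cases are exactly the limits mentioned in the message: the folding is language-blind, and letters such as U+00F8 need extra data (for example from CLDR) to be related to "o" at all.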
From asmus-inc at ix.netcom.com Sat Feb 20 16:10:04 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 20 Feb 2016 14:10:04 -0800 Subject: Character folding in text editors In-Reply-To: <83lh6fnu45.fsf@gnu.org> References: <83lh6fnu45.fsf@gnu.org> Message-ID: <56C8E43C.4010200@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Feb 20 17:19:19 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 21 Feb 2016 00:19:19 +0100 Subject: Character folding in text editors In-Reply-To: <13F29763611C43C0B24B0FDADB58C062@DougEwell> References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> Message-ID: It should also be noted that some kinds of "folding" described/desired by Elias will likely fail to meet his expectations, even when using collation data in CLDR tailored per language. Notably, this data, even if it is used at its weakest strength (the primary collation level only, discarding other differences at higher strength levels), will most often not collate many digrams/trigrams that are frequently used in the locale for which the data is designed. The reason for that is that most of these digrams/trigrams (used in the orthography to note a single phoneme) are highly context-dependent and could in fact cover several distinct phonemes. E.g. "on" in French is a digram for the nasal o. There are also mute letters (consonants) following it in the same phoneme. But if the consonant is followed by a vowel, then there's a possible syllable break between "on" and the following consonant. However that vowel may also be mute (if it is a final "e"), in which case there's a single syllable. If the digram "on" is followed by a vowel, it is no longer a digram and there's a syllable break between "o" and "n", but if "on" is followed by a mute vowel (final "e"), that syllable break disappears, but the digram "on" is still two distinct phonemes.
"on" may also be followed by another "n" and a vowel (possibly mute), in which case "on" is never a single phoneme. There are similar issues with other digrams/trigrams in French such as "ein", "aint", "un". There are distinct difficulties with "gu", "ge" and "qu", and more difficulties with "ch" (also in English and other languages). Different difficulties with "ai"... Determining which digrams/trigrams are a single phoneme requires parsing words for syllable breaks. But there are many exceptions (notably because languages borrow lots of words from other languages with their original orthography, and the phonetics are only slightly altered). There exist some algorithms that try to use those weak "equivalences", based on the apparent orthography, to infer some basic phonetics from it. They are used for performing approximate searches in arbitrary plain text, even in cases where it may contain some orthographic typos. Look for example at the SOUNDEX function (you'll first need to detect word breaks for some implementations). Trying to use dictionary data for determining the syllable breaks may be useful, but you need a lot of data (and all dictionaries are incomplete). For disambiguating some cases, you'll in fact need to determine the actual phonetics by using a phonetic dictionary (data resources for that are difficult to find; even serious linguistic dictionaries only include a part of the phonetics, and ignore the variants for derived orthographic forms). 2016-02-20 22:43 GMT+01:00 Doug Ewell : > Eli Zaretskii wrote: > > What about language-independent character-folding: where in the >> Unicode database is the data for that? >> > > The OP kind of alluded to that: there is no such thing really as > language-independent character folding. > > About the closest approximation you can get using Unicode data alone (not > CLDR) is to normalize to NFD, then ignore the combining diacritics.
But > that still doesn't work for a character like ?, which doesn't decompose to > o + anything, and more importantly, it still won't meet expectations > because of the n/? and o/?/? language-dependency problems. > > As Mark and Philippe said, the real solution is to use CLDR, because that > is where language-dependent information like this lives. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lokedhs at gmail.com Sun Feb 21 00:35:29 2016 From: lokedhs at gmail.com (=?UTF-8?Q?Elias_M=C3=A5rtenson?=) Date: Sun, 21 Feb 2016 14:35:29 +0800 Subject: Character folding in text editors In-Reply-To: <56C8E43C.4010200@ix.netcom.com> References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> Message-ID: On 21 February 2016 at 06:10, Asmus Freytag (t) wrote: Unicode, even CLDR, doesn't nearly have enough data for the purpose. > (and as a corollary of what Elias points out, it's likely to annoy users > of every language, in that it would fold essential and non-essential > distinctions indiscriminately). > > I've been working on this problem in the context of international > top-level domain names, where the aim of the project is to identify labels > that are seen as "the same" by users of a given script (but, in cases of > identical appearance, we also include those seen as identical by users > across scripts). > > None of the working groups in this project has felt like turning to CLDR > for this purpose, and so far, each has approached the issue in a way that > is not linked to sorting. > > Finally, none has seen folding of diacritics as useful; however, in the > case of Arabic, where optional combining marks simply are not supported (so > as to avoid having to define a folding). > > (see > https://www.icann.org/sites/default/files/lgr/lgr-1-arabic-script-01dec15-en.html > ) > Thank you, and everybody else who contributed information. This was very useful to me. 
I have never actually looked at the CLDR in detail, and I now realise that I have some reading to do. We will see where this goes on the Emacs-devel list. Regards, Elias -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sun Feb 21 04:47:28 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 21 Feb 2016 11:47:28 +0100 Subject: Character folding in text editors In-Reply-To: <56C8E43C.4010200@ix.netcom.com> References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> Message-ID: On Sat, Feb 20, 2016 at 11:10 PM, Asmus Freytag (t) wrote: > Unicode, even CLDR, doesn't nearly have enough data for the purpose. > (and as a corollary of what Elias points out, it's likely to annoy users > of every language, in that it would fold essential and non-essential > distinctions indiscriminately). > > I've been working on this problem in the context of international > top-level domain names, where the aim of the project is to identify labels > that are seen as "the same" by users of a given script (but, in cases of > identical appearance, we also include those seen as identical by users > across scripts). > > None of the working groups in this project has felt like turning to CLDR > for this purpose, and so far, each has approached the issue in a way that > is not linked to sorting. > > Finally, none has seen folding of diacritics as useful; however, in the > case of Arabic, where optional combining marks simply are not supported (so > as to avoid having to define a folding). > > (see > https://www.icann.org/sites/default/files/lgr/lgr-1-arabic-script-01dec15-en.html > ) > ?It depends on what the folding is being used for: there are many different purposes. 
For some purposes, the goal of "is seen as the same" ?is appropriate, while for others a broader scope is appropriate?typically because someone wants a quick filter to get to a relatively small set of strings which can then be processed in a more CPU-intensive fashion. In whatever case, one can only get an approximation; the question is whether that approximation is sufficient for whatever the task is at hand. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Sun Feb 21 10:21:24 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:21:24 +0200 Subject: Character folding in text editors In-Reply-To: <13F29763611C43C0B24B0FDADB58C062@DougEwell> References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> Message-ID: <831t86niez.fsf@gnu.org> > From: "Doug Ewell" > Date: Sat, 20 Feb 2016 14:43:15 -0700 > > > What about language-independent character-folding: where in the > > Unicode database is the data for that? > > The OP kind of alluded to that: there is no such thing really as > language-independent character folding. Emacs is currently looking for a useful approximation, given that the language of the text is in general unknown. The folding can be toggled off (either as a global default, or for the current search), for those use cases where it is undesirable or gets in the way. > About the closest approximation you can get using Unicode data alone > (not CLDR) is to normalize to NFD, then ignore the combining diacritics. This is what Emacs currently does, IIUC what you say. The NFD normalization uses the decomposition data included with UnicodeData.txt. Is this what you mean? > But that still doesn't work for a character like ?, which doesn't > decompose to o + anything Why doesn't it, btw? Same question about ?. I've heard an opinion that UnicodeData.txt only included decompositions when the combining mark's glyphs don't overlap those of the basic character. Is that correct? 
> and more importantly, it still won't meet expectations because of > the n/? and o/?/? language-dependency problems. Given that the feature can be turned off easily, do you think that it will nonetheless be useful, even though language-dependent parts are not available? From eliz at gnu.org Sun Feb 21 10:22:07 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:22:07 +0200 Subject: Character folding in text editors In-Reply-To: <56C8E43C.4010200@ix.netcom.com> (asmus-inc@ix.netcom.com) References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> Message-ID: <83ziuum3tc.fsf@gnu.org> > From: "Asmus Freytag (t)" > Date: Sat, 20 Feb 2016 14:10:04 -0800 > > > What about language-independent character-folding: where in the > > Unicode database is the data for that? > > > > > Unicode, even CLDR, doesn't nearly have enough data for the purpose. This seems to contradict what others said: they said CLDR includes the necessary data. What is missing from CLDR, and how bad will the omissions affect searching? > (and as a corollary of what Elias points out, it's likely to annoy users of every language, in that it would fold essential and non-essential distinctions indiscriminately). Users can easily turn the folding off if they don't like it or if it gets in the way. The important question is: will Emacs with this feature be more or less useful than without it? Another important question is whether character folding in searches should be turned on or off by default. IOW, should we expect more users wanting to turn it off than on? AFAIU, the very least that should be provided is being able to find decomposed characters when a composed one is searched for. The data for this, AFAIU, is in UnicodeData.txt in the form of the canonical decompositions. Is this correct? > none has seen folding of diacritics as useful Really? So you are saying that, based on your experience, being able to ignore diacritics in searches is not a useful feature? 
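A small sketch of the "very least" mentioned above: finding a decomposed sequence when the precomposed character is searched for, and vice versa, by normalizing both strings to NFD before comparing. It uses Python's standard unicodedata module; find_canonical is a hypothetical helper name, and the returned index refers to the NFD form of the text, not the original.

```python
import unicodedata


def find_canonical(haystack: str, needle: str) -> int:
    """Return the index of needle in the NFD form of haystack, or -1 if absent."""
    nfd_haystack = unicodedata.normalize("NFD", haystack)
    nfd_needle = unicodedata.normalize("NFD", needle)
    return nfd_haystack.find(nfd_needle)


if __name__ == "__main__":
    composed = "ma\u00f1ana"       # precomposed U+00F1
    decomposed = "man\u0303ana"    # "n" followed by U+0303 COMBINING TILDE
    print(find_canonical(decomposed, composed))  # found, despite different encodings
    print(find_canonical("manana", composed))    # not found: no tilde in the text
```

One caveat of this naive substring approach: a needle ending in a bare base letter can match the first half of a decomposed pair (searching for "an" finds the "a" plus the base "n" of the decomposed n-with-tilde), so a real implementation would also need to check what follows the match.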
From eliz at gnu.org Sun Feb 21 10:23:04 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:23:04 +0200 Subject: Character folding in text editors In-Reply-To: (message from Philippe Verdy on Sun, 21 Feb 2016 00:19:19 +0100) References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> Message-ID: <83y4aem3rr.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 21 Feb 2016 00:19:19 +0100 > Cc: unicode Unicode Discussion > > It should also be noted that some kind of "folding" described/desired by > Elias will likely fail his expectations, even when using collation data in > CLDR tailored per language. I don't think the issue at hand is how to implement the "ultimate" character-folding feature. As I wrote elsewhere in this thread, Emacs has only made its first step on this long road; if the way to reach the final goal is still foggy even for the experts, then Emacs is in good company ;-) What matters for us at this stage is whether what has been implemented, however partial and incomplete, will be useful, and whether it is deemed to be useful enough to be turned on by default. Please keep in mind that Emacs currently doesn't even have language-dependent case tables, and its sorting commands use comparison by Unicode codepoints. (A function that compares text by locale-dependent collation rules was added only recently, and, since it relies on the underlying libc for collation order, you must change the locale to use the rules for another language.) So these capabilities are really only starting to emerge, and until there's a reliable way of determining the language of a given chunk of text, the solutions will continue to be clunky at best. We are not looking for the ultimate solutions, we are looking for useful evolutionary initial steps. 
From eliz at gnu.org Sun Feb 21 10:28:05 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:28:05 +0200 Subject: Character folding in text editors In-Reply-To: (message from Philippe Verdy on Sun, 21 Feb 2016 00:19:19 +0100) References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> Message-ID: <83twl2m3je.fsf@gnu.org> > From: Philippe Verdy > Date: Sun, 21 Feb 2016 00:19:19 +0100 > Cc: unicode Unicode Discussion > > Unless we have case folding tailored by language, you cannot do that based > on the Unicode database alone. > > However CLDR provides tailored data about collation. > > From my point of view, it is just a matter or selecting the collation > strength to use for searches using collation. All collations in CLDR are > locale-dependant (the search algorithm must be using either a language > preselection, or detect the default language used by the document, or set > explicitly in specific fragments of the document, or use some hints to > guess what could be the effective language), even if CLDR also defines a > "root" locale for use in language-neutral contexts, or when the language > cannot be determined from the document or its metadata. Emacs doesn't (yet) have the notion of the "current language". Being a multi-lingual environment, where different languages are routinely mixed in the same editing buffer, this is a hard problem that doesn't yet have a solution. Emacs does know the "charset" which the given text came from, if the original was encoded in some telltale encoding, like iso-2022-jp; it can also know the script of the text (by looking at the Unicode block of the characters). In some cases, this is enough to deduce the language. But in general, and notably with languages that use the Latin script, this is not enough. Using the locale in which Emacs was started is insufficient in this age of global communications. 
Therefore, the goal of what is currently implemented in what will become Emacs 25.1 in a few months was deliberately limited to begin with: support only "language-independent" folding. In a nutshell, this means ignoring all the collating weights except the primary. The implementation basically uses the decomposition data in UnicodeData.txt. How different is that from the "root locale" data that is part of CLDR? What are the differences? Does the implementation based on decomposition data have any merit, or is it completely useless/wrong? From eliz at gnu.org Sun Feb 21 10:28:56 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:28:56 +0200 Subject: Character folding in text editors In-Reply-To: (message from Mark Davis =?utf-8?B?4piV77iP?= on Sun, 21 Feb 2016 11:47:28 +0100) References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> Message-ID: <83si0mm3hz.fsf@gnu.org> > From: Mark Davis ?? > Date: Sun, 21 Feb 2016 11:47:28 +0100 > Cc: Unicode Public > > If you don't use ICU, you can also use the CLDR data directly, but you'll > have to parse it yourself. You'd start with the root locale, then add in > the mappings from the children (eg de.xml). The parsing is not trivial, but > since you are only looking for equivalences (not ordering), it is somewhat > simpler. What about using allkeys.txt from the UCA database? Is that equivalent to the root locale in CLDR, as far as equivalence for searching is concerned? If not, how do these two differ? (I've read http://www.unicode.org/reports/tr35/tr35-collation.html#Root_Collation, but it left me not sure whether what it describes affects search matches when secondary weights are ignored.) Also, what is the consensus here about using UCA's decomps.txt for folding characters when ignoring secondary and tertiary weights? 
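To make "ignoring secondary and tertiary weights" concrete, here is a hedged sketch that groups characters by their primary weights only, reading lines laid out like allkeys.txt (code points, a semicolon, then bracketed collation elements). The three sample lines and their weight values are invented for illustration and are not copied from any DUCET release; primary_key and group_by_primary are hypothetical names. It does not answer how allkeys.txt differs from the CLDR root collation.

```python
import re
from collections import defaultdict

# Illustrative lines in the allkeys.txt layout; the weight values are made up.
SAMPLE = """\
@version 0.0.0
0061 ; [.1000.0020.0002] # LATIN SMALL LETTER A
00E1 ; [.1000.0020.0002][.0000.0024.0002] # LATIN SMALL LETTER A WITH ACUTE
0062 ; [.1001.0020.0002] # LATIN SMALL LETTER B
"""

# One collation element: "[", a "." or "*" marker, then primary.secondary.tertiary.
ELEMENT = re.compile(r"\[[.*]([0-9A-F]+)\.([0-9A-F]+)\.([0-9A-F]+)")


def primary_key(elements_field: str) -> tuple:
    """Keep only the non-zero primary weights of every collation element."""
    primaries = [int(m.group(1), 16) for m in ELEMENT.finditer(elements_field)]
    return tuple(p for p in primaries if p != 0)


def group_by_primary(text: str) -> dict:
    """Map each primary-weight key to the characters that share it."""
    groups = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("@"):   # skip blanks and directives
            continue
        codepoints, elements = line.split(";", 1)
        chars = "".join(chr(int(cp, 16)) for cp in codepoints.split())
        groups[primary_key(elements)].append(chars)
    return dict(groups)


if __name__ == "__main__":
    for key, members in group_by_primary(SAMPLE).items():
        print(key, members)
```

In this sample, the accented letter contributes a second element whose primary weight is zero, so it lands in the same group as the plain letter; characters in one group would match each other in a primary-strength search.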
From eliz at gnu.org Sun Feb 21 10:29:51 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:29:51 +0200 Subject: Character folding in text editors In-Reply-To: (message from Mark Davis =?utf-8?B?4piV77iP?= on Sun, 21 Feb 2016 11:47:28 +0100) References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> Message-ID: <83r3g6m3gg.fsf@gnu.org> Btw, are there any editors out there which support similar features? If so, can someone please point to them, and perhaps provide a short summary of the features they provide and how they are implemented? Thanks. From eliz at gnu.org Sun Feb 21 10:32:44 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 18:32:44 +0200 Subject: Additional decompositions in decomps.txt Message-ID: <83oabam3bn.fsf@gnu.org> This question is separate from, though related to, the "Character folding in text editors" thread. The UCA database includes the file decomps.txt, which is said to be based on the normative properties: # The decompositions used in the generation of DUCET are loosely based # on the normative decomposition mappings defined in UnicodeData.txt # in the Unicode Character Database. An examination of this data listing # clearly shows the close relationship to the decomposition mappings. # However, those decomposition mappings are adjusted as part of the input # to the generation of DUCET, in order to produce default weights more # appropriate for collation. Those adjusted # decompositions fall into several classes: # # 1. In some cases a decomposition mapping from UnicodeData.txt is # suppressed. # # 2. In some cases a decomposition mapping from UnicodeData.txt is # modified. # # 3. In some cases a new decomposition is added for a character which # has no decomposition mapping in UnicodeData.txt. In this third case, # a new decomposition tag "<sort>" is introduced, to distinguish these # introduced decompositions from those derived from UnicodeData.txt.
However, I see in decomps.txt entries that seem to belong to none of the 3 classes described above. Here are 2 notable examples: 00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE => LATIN SMALL LETTER O + COMBINING LONG SOLIDUS OVERLAY 0142;;006C 0335 # LATIN SMALL LETTER L WITH STROKE => LATIN SMALL LETTER L + COMBINING SHORT STROKE OVERLAY In both these cases, UnicodeData.txt defines no decomposition properties, but the "<sort>" tag I expected to see is absent from decomps.txt. Is there something I'm missing here? From doug at ewellic.org Sun Feb 21 11:53:23 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 21 Feb 2016 10:53:23 -0700 Subject: Character folding in text editors In-Reply-To: <831t86niez.fsf@gnu.org> References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> <831t86niez.fsf@gnu.org> Message-ID: Eli Zaretskii wrote: >> About the closest approximation you can get using Unicode data alone >> (not CLDR) is to normalize to NFD, then ignore the combining >> diacritics. > > This is what Emacs currently does, IIUC what you say. The NFD > normalization uses the decomposition data included with > UnicodeData.txt. Is this what you mean? Yes, the sixth field from the left. For 00F1 this is 006E 0303, so you ignore the 0303 and fold 00F1 to 006E. Remember that the decompositions in UnicodeData.txt may contain other precomposed characters, so you have to apply this process iteratively: 1EA8 -> 00C2 0309; 00C2 -> 0041 0302; so you fold 1EA8 to 0041. >> But that still doesn't work for a character like ø, which doesn't >> decompose to o + anything > > Why doesn't it, btw? Same question about ?. > > I've heard an opinion that UnicodeData.txt only included > decompositions when the combining mark's glyphs don't overlap those of > the basic character. Is that correct? This sounds like a great question for Ken Whistler. ? >> and more importantly, it still won't meet expectations because of >> the n/ñ and o/?/? language-dependency problems.
> > Given that the feature can be turned off easily, do you think that it > will nonetheless be useful, even though language-dependent parts are > not available? It's probably a lot better than no folding. Just be prepared for the inevitable complaints from speakers of language X. Users tend to expect features like this to be perfect, even when you warn them. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Sun Feb 21 12:32:15 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 21 Feb 2016 10:32:15 -0800 Subject: Character folding in text editors In-Reply-To: <83ziuum3tc.fsf@gnu.org> References: <83lh6fnu45.fsf@gnu.org> <56C8E43C.4010200@ix.netcom.com> <83ziuum3tc.fsf@gnu.org> Message-ID: <56CA02AF.70801@ix.netcom.com> An HTML attachment was scrubbed... URL: From eliz at gnu.org Sun Feb 21 14:27:06 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Sun, 21 Feb 2016 22:27:06 +0200 Subject: Character folding in text editors In-Reply-To: References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> <831t86niez.fsf@gnu.org> Message-ID: <83egc5n71h.fsf@gnu.org> > From: "Doug Ewell" > Cc: > Date: Sun, 21 Feb 2016 10:53:23 -0700 > > > Given that the feature can be turned off easily, do you think that it > > will nonetheless be useful, even though language-dependent parts are > > not available? > > It's probably a lot better than no folding. Just be prepared for the > inevitable complaints from speakers of language X. Users tend to expect > features like this to be perfect, even when you warn them. Tell me about it ;-) Thanks for the feedback. From mark at macchiato.com Mon Feb 22 01:59:26 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 22 Feb 2016 08:59:26 +0100 Subject: CLDR v29 Beta available for review Message-ID: The CLDR v29 beta is available for review. Information on the release and a summary of the main changes are available at http://cldr.unicode.org/index/downloads/cldr-29. 
Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From tuvalkin at gmail.com Mon Feb 22 08:59:52 2016 From: tuvalkin at gmail.com (=?UTF-8?Q?Ant=c3=b3nio_Martins-Tuv=c3=a1lkin?=) Date: Mon, 22 Feb 2016 14:59:52 +0000 Subject: "Q" shaped as mirrored "P" Message-ID: <56CB2268.3050808@gmail.com> On this photo [ http://web.archive.org/web/20140219141054/http://miranda_do_douro.voila.net/aldeia/aguasvivas_texte.jpg ] of a funereal monument in Portugal, dated 1892; stone carved capitals under the sculpure of a skull read: ?Repara y ?(sic!)? se consideras o estado em que eu estou?:? eu j? fui quem tu ?s e tu ser?s q.? eu sou?. All three "Q"s are shaped as a mirrored "P". This is in the village of ?guas Vivas / Augas Bibas, mun. Miranda de l Douro / Miranda do Douro (land of "?"s?), approximately at http://www.google.pt/maps/@41.5708666,-6.3145678,496m/data=!3m1!1e3 see context in archived page at http://web.archive.org/web/20130302070135/http://miranda_do_douro.voila.net/aguas_vivas.htm The text means ?Look and consider the state I?m in: I was once what/who you are and you?ll be what/who I am.?, says the skull. -- ____. Ant?nio MARTINS-Tuv?lkin | ()| N?o me invejo de quem tem|####| PT-2695-010 Bobadela LRS carros, parelhas e montes | +351 934 821 700, +351 212 463 477 s? me invejo de quem bebe | facebook.com/profile.php?id=744658416 a ?gua em todas as fontes | --------------------------------------------------------------------- De sable uma fonte e bordadura escaqueada de jalde e goles por timbre bandeira por mote o 1? verso acima e por grito de guerra "Mi rajtas!" 
--------------------------------------------------------------------- From kenwhistler at att.net Mon Feb 22 12:10:35 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 22 Feb 2016 10:10:35 -0800 Subject: Additional decompositions in decomps.txt In-Reply-To: <83oabam3bn.fsf@gnu.org> References: <83oabam3bn.fsf@gnu.org> Message-ID: <56CB4F1B.70108@att.net> Eli, You're not missing anything. This is a bug in the documentation of decomps.txt. Initially, added decompositions for the DUCET default weights were all tagged as <sort>. This results in a distinct *tertiary* weight in the initial collation weight values in DUCET. Later on, there turned up cases where an added decomposition for the DUCET input worked better *without* a distinct tertiary weight. In particular, this applies to the large collection of combining marks whose secondary weights are now collapsed into a smaller set of distinct values. It also applies to the o with stroke character you cite below. The documentation for decomps.txt just needs to be updated to reflect that new pattern. --Ken On 2/21/2016 8:32 AM, Eli Zaretskii wrote: > # 3. In some cases a new decomposition is added for a character which > # has no decomposition mapping in UnicodeData.txt. In this third case, > # a new decomposition tag "<sort>" is introduced, to distinguish these > # introduced decompositions from those derived from UnicodeData.txt. > > However, I see in decomps.txt entries that seem to belong to neither > of the 3 classes described above. Here are 2 notable examples: > > 00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE => LATIN SMALL LETTER O + COMBINING LONG SOLIDUS OVERLAY > 0142;;006C 0335 # LATIN SMALL LETTER L WITH STROKE => LATIN SMALL LETTER L + COMBINING SHORT STROKE OVERLAY > > In both these cases, UnicodeData.txt defines no decomposition > properties, but the "<sort>" tag I expected to see is absent from > decomps.txt. Is there something I'm missing here?
> From eliz at gnu.org Mon Feb 22 13:10:54 2016 From: eliz at gnu.org (Eli Zaretskii) Date: Mon, 22 Feb 2016 21:10:54 +0200 Subject: Additional decompositions in decomps.txt In-Reply-To: <56CB4F1B.70108@att.net> (message from Ken Whistler on Mon, 22 Feb 2016 10:10:35 -0800) References: <83oabam3bn.fsf@gnu.org> <56CB4F1B.70108@att.net> Message-ID: <83r3g4k1c1.fsf@gnu.org> > Cc: unicode at unicode.org > From: Ken Whistler > Date: Mon, 22 Feb 2016 10:10:35 -0800 > > You're not missing anything. This is a bug in the documentation of > decomps.txt. Initially, added decompositions for the DUCET default > weights were all tagged as <sort>. This results in a distinct *tertiary* > weight in the initial collation weight values in DUCET. Later on, > there turned up cases where an added decomposition for the DUCET > input worked better *without* a distinct tertiary weight. In > particular, this applies to the large collection of combining marks > whose secondary weights are now collapsed into a smaller set of > distinct values. It also applies to the o with stroke character you > cite below. The documentation for decomps.txt just needs to be > updated to reflect that new pattern. OK, thanks. So conceptually, all those additional decompositions are all in the same class as those tagged "<sort>", in that they don't originate from the UCD, but were added for collation purposes, is that correct? From kenwhistler at att.net Mon Feb 22 13:59:24 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 22 Feb 2016 11:59:24 -0800 Subject: Additional decompositions in decomps.txt In-Reply-To: <83r3g4k1c1.fsf@gnu.org> References: <83oabam3bn.fsf@gnu.org> <56CB4F1B.70108@att.net> <83r3g4k1c1.fsf@gnu.org> Message-ID: <56CB689C.4080906@att.net> Yes, that is correct. --Ken On 2/22/2016 11:10 AM, Eli Zaretskii wrote: > OK, thanks.
So conceptually, all those additional decompositions are > all in the same class as those tagged "<sort>", in that they don't > originate from the UCD, but were added for collation purposes, is that > correct? From kenwhistler at att.net Mon Feb 22 15:19:56 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 22 Feb 2016 13:19:56 -0800 Subject: Just so story: Why isn't o-slash decomposed? (was: Re: Character folding in text editors) In-Reply-To: References: <13F29763611C43C0B24B0FDADB58C062@DougEwell> <831t86niez.fsf@gnu.org> Message-ID: <56CB7B7C.8030903@att.net> On 2/21/2016 9:53 AM, Doug Ewell wrote: > > > >>> But that still doesn't work for a character like ø, which doesn't >>> decompose to o + anything >> >> Why doesn't it, btw? Same question about ?. >> >> I've heard an opinion that UnicodeData.txt only included >> decompositions when the combining mark's glyphs don't overlap those of >> the basic character. Is that correct? > > This sounds like a great question for Ken Whistler. ? Well, with a softball pitch like that one... ;-) The basics are described in TUS 8.0, Section 2.12, Equivalent Sequences, on p. 65, in "Non-decomposition of Certain Diacritics." As to the inevitable "why?" question: Well, the UTC had to draw a line *somewhere* between clearly independent graphical combining marks applied to clearly distinct bases, versus completely idiosyncratic adjustment of base letter shape to create new letters. (For an example of the latter, think U+025E LATIN SMALL LETTER CLOSED REVERSED OPEN E, as a "sorta e-like character".) The decision was made by the original architects of Unicode, back at the point when the concept of decomposition was getting formalized (circa 1991), to draw the line thus: A. Clearly detached marks, plus a few attached marks at the "periphery" of the base that have predictable positions and do not distort the base letter shape (e.g., cedilla, ogonek, the Vietnamese horn). B.
Overlaid marks (bars, slashes) and various hooks, curls, and the Cyrillic descenders. These have fairly unpredictable positions, so fallback displays tend to look bad, and the effect on the base letter shape is also unpredictable for the hooks and curls types of "diacritic" letter formation. Also in this category were any turned, rotated, reversed, or flipped letters. Note that this line is not exactly the same as what the early drafts (and the eventual Unicode 1.0) encoded for combining marks, because a few of the most productive Latin overlaid and attached combining marks were separately encoded. This tends to be the root of most current confusion about the topic for people coming at an attempt to understand the Unicode Standard long after the initial decisions were all engraved in stone. Having the overlaid diacritics (and at least the phonetic hooks) separately encoded enabled some productive use of them before further surveys resulted in filling out the atomic encoding of Latin letters with bars and hooks (see, e.g. Latin Extended-C and the Phonetic Extensions Supplement for many examples). But actually, having separate encoding of the overlaid diacritics, hooks, etc., is also useful for other purposes -- for collation, for example, they provide natural targets for assigning the secondary weights, which then can be used for the artificially introduced decompositions of letters with bars, letters with slashes, letters with hooks, etc., either for the DUCET or for tailorings which want to treat such combinations as having secondary diacritic weights, rather than as primary weight-distinct atomic letters. --Ken From mats.gbproject at gmail.com Mon Feb 22 18:53:54 2016 From: mats.gbproject at gmail.com (Mats Blakstad) Date: Tue, 23 Feb 2016 01:53:54 +0100 Subject: Possible to add new precomposed characters for local language in Togo? 
In-Reply-To: <1099467789.7736.1455631273655.JavaMail.www@wwinf2226>
References: <1099467789.7736.1455631273655.JavaMail.www@wwinf2226>
Message-ID:

Thanks for all the useful feedback and ideas!

Exactly where should these combinations be documented?

2016-02-16 15:01 GMT+01:00 Marcel Schneider :

>
> Experience shows however that training on dead key layouts as used for
> French, can be extended to the use of combining diacritics entered after
> the base letter, with an appropriate keyboard layout driver. These
> combining characters being actually the most useful form of most
> diacritics, it is recommended that they be generated when the space bar is
> hit after a dead key if such are present. More obviously all needed
> diacritics are allocated to key positions, so that they can be added to any
> letter by the means of a single keystroke. One example is the keyboard
> layout for Bamanankan and French on the /Mali Pense/ site that Don Osborn's
> /Beyond Niamey/ blog links to [2]. Anyway, entering diacritics _after_ the
> base letter is the most up-to-date way to input composed characters,
> because it is very intuitive, and because it realizes the spirit of the
> character representation scheme of Unicode.
>
>

Thanks for this info, however: how much difference is there if people add the diacritics before or after the letter? If people are used to adding diacritics before the letter, would it not be pedagogically a better idea to continue that logic on a new keyboard? What we tried to do is to make a keyboard that simply extends the French keyboard (which is by far the most used in Togo), and then people can get more keys on a keyboard they already know. There are also other keyboards used locally by linguists, but people tend not to learn them, and it can be a barrier when people need to click to change keyboard from "French" to a "Local languages keyboard" all the time; I guess people prefer a keyboard that they can use to write both.
Anyway thanks a lot for these really useful ideas that I will keep in mind!

From charupdate at orange.fr Tue Feb 23 04:21:46 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 23 Feb 2016 11:21:46 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <1099467789.7736.1455631273655.JavaMail.www@wwinf2226>
Message-ID: <655233309.225.1456222906078.JavaMail.www@wwinf1g38>

On Tue, 23 Feb 2016 01:53:54 +0100, Mats Blakstad wrote:

> Exactly where should these combinations be documented?

In the _Named Sequences_ part of the UCD. Its 8.0.0 version is found at:
http://unicode.org/Public/8.0.0/ucd/NamedSequences.txt

> How much difference is there if
> people add the diacritics before or after the letter?

Technically, the main difference between entering diacritics before the base letter vs after the base letter appears only if the representation of the target character in NFC is more than one single code unit [let alone that in spite of what I seemed to suggest, every text stream can be represented both in NFC and NFD]. And then that difference is only relevant on Windows, not on Linux; as for Mac OS X I've no knowledge. But again, your question being about XKB, you should get it to work by adding the combining diacritics after the base letter on output side in a locally customized XCompose configuration file.

> If people are used to
> adding diacritics before the letter, would it not be pedagogically a better
> idea to continue that logic on a new keyboard? What we tried to do is to
> make a keyboard that simply extends the French keyboard (which is by far
> the most used in Togo), and then people can get more keys on a keyboard
> they already know. There are also other keyboards used locally by
> linguists, but people tend not to learn them, and it can be a barrier
> when people need to click to change keyboard from "French" to a "Local
> languages keyboard" all the time; I guess people prefer a keyboard that
> they can use to write both.

Indeed the use of several keyboard layouts for one single script -- Latin script -- in the same country is inefficient. To make complete language support widespread, your method of extending the main layout is likely to be the only useful one. A similar effort is actually on-going in France on governmental demand. I'm asking for extension to cover all Latin script written languages and transliteration systems, some of what is still an option, beyond full coverage of European Latin script written languages.

I feel that people coming from -- or studying languages of -- countries and communities on other continents should become able to type their language in that script on any computer in France as well as in any other Latin script using countries, given that the issue is only to add some more characters of the same script. Latin script -- as opposed to all other scripts AFAIK -- stays still chopped into "sub-scripts" and subsets. That brings but counter-productive complications, while the implementation of the whole script is technically feasible even on keyboard level. The only difference between keyboard layouts of Latin script using countries should be varying accessibility depending on frequencies of use.

As for not altering user experience in Togo, on non-Linux/Unix systems I see two options:

1) the keyboard layout on the whole is provided as an IME, ideally in synergy with the French default layout;

2) the dead key handling is carried over to an IME in conformance with ISO/IEC 9995-11, in synergy with a keyboard layout where all dead key characters are combining diacritics -- at risk of slightly altering user experience when writing French and the IME is turned off or unavailable.
The first solution has the advantage of being already working, in use, widespread, and fully sustained by SIL, while I can't see anything of this in the second solution.

I missed the point that your request is about an XKB layout file. As this obviously targets Unix-based OSes, i.e. Linux, e.g. Ubuntu, output of combining sequences as a result of dead key lead key press sequences is no problem. However combining sequences are actually not found in XCompose as far as I could see on Xlib Compose / X.org web site, but that again may be considered as purely fortuitous.

So on Linux -- as opposed to Windows -- the goal is a bit easier to reach in that combining sequences can be generated by dead keys. But it is a bit harder in that, for a given keyboard layout, there is no DLL with a complete and customized dead trans list in it. I believe that an installation script could be extended to replace the default composer configuration file by a complete one.

In any case, XCompose needs some thorough update because support for numerous characters used in African Latin script based writing systems is very weak, if not purely missing, in XCompose. Same issue should lead to revise ISO/IEC 9995 again, this year best. I do hope that your request as well as all equivalent demands from other countries shall be centralized to achieve a thorough overhaul of ISO/IEC 9995 and its international and national implementations to solve at once all pending keyboard problems.

Thanks in turn for your valuable feedback and suggestions!

Marcel

From charupdate at orange.fr Tue Feb 23 04:51:19 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 23 Feb 2016 11:51:19 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
Message-ID: <422937272.909.1456224679930.JavaMail.www@wwinf1g38>

Sorry I see that my e-mail has been altered on the road (?!CRLF ? not found in outbox), I try to send it again (with corrections btw; I was careful, but this time I couldn't leave the draft overnight :)

On Tue, 23 Feb 2016 01:53:54 +0100, Mats Blakstad wrote:

> Exactly where should these combinations be documented?

In the _Unicode Named Character Sequences_ part of the UCD. Its 8.0.0 version is found at:
http://unicode.org/Public/8.0.0/ucd/NamedSequences.txt

> How much difference is there if
> people add the diacritics before or after the letter?

Technically, the main difference between entering diacritics before the base letter vs after the base letter appears only if the representation of the target character in NFC is more than one single code unit. And then that difference is only relevant on Windows, not on Linux; as for Mac OS X I've no knowledge. But your question being about XKB (I missed that point), you should get it to work by adding the combining diacritics after the base letter on output side in a locally customized XCompose configuration file.

> If people are used to
> adding diacritics before the letter, would it not be pedagogically a better
> idea to continue that logic on a new keyboard? What we tried to do is to
> make a keyboard that simply extends the French keyboard (which is by far
> the most used in Togo), and then people can get more keys on a keyboard
> they already know. There are also other keyboards used locally by
> linguists, but people tend not to learn them, and it can be a barrier
> when people need to click to change keyboard from "French" to a "Local
> languages keyboard" all the time; I guess people prefer a keyboard that
> they can use to write both.

Indeed the use of several keyboard layouts for one single script -- Latin script -- in the same country is inefficient. To make complete language support widespread, your method of extending the main layout is likely to be the only useful one. A similar effort is actually on-going in France on governmental demand. I'm asking for extension to cover all Latin script written languages and transliteration systems, some of what is still an option, beyond full coverage of European Latin script written languages.

I feel that people coming from -- or studying languages of -- countries and communities on other continents should become able to type their language in that script on any computer in France as well as in any other Latin script using countries, given that the issue is only to add some more characters of the same script. Latin script -- as opposed to all other scripts AFAIK -- stays still chopped into "sub-scripts" and subsets. That brings but counter-productive complications, while the implementation of the whole script is technically feasible even on keyboard level. The only difference between keyboard layouts of Latin script using countries should be varying accessibility depending on frequencies of use.

As for not altering user experience in Togo, on non-Linux/Unix systems I see two options:

1) the keyboard layout on the whole is provided as an IME, ideally in synergy with the French default layout;

2) the dead key handling is carried over to an IME in conformance with ISO/IEC 9995-11, in synergy with a keyboard layout where all dead key characters are combining diacritics -- at risk of slightly altering user experience when writing French and the IME is turned off or unavailable.

The first solution has the advantage of being already working, in use, widespread, and fully sustained by SIL, while I can't see anything of this in the second solution.

I missed the point that your request is about an XKB layout file. As this obviously targets Unix-based OSes, i.e. Linux, e.g. Ubuntu, output of combining sequences as a result of dead key lead key press sequences is no problem. However combining sequences are actually not found in XCompose as far as I could see on Xlib Compose / X.org web site, but that again may be considered as purely fortuitous.
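
The remark above about NFC length can be checked with a short sketch. This is only an illustration, assuming Python's standard unicodedata module; the helper name nfc_utf16_units is made up for this example, and open e (U+025B) is picked here merely as one letter that has no precomposed form with an acute accent.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def nfc_utf16_units(base: str, mark: str) -> int:
    """Hypothetical helper: UTF-16 code units in the NFC form of base + mark."""
    composed = unicodedata.normalize("NFC", base + mark)
    # Each UTF-16 code unit is two bytes in the little-endian encoding.
    return len(composed.encode("utf-16-le")) // 2


# e + combining acute composes to the single code point U+00E9,
# so a dead key can emit it as one code unit.
print(nfc_utf16_units("e", ACUTE))

# Open e (U+025B) + combining acute has no precomposed form,
# so NFC keeps the two-code-unit combining sequence.
print(nfc_utf16_units("\u025b", ACUTE))
```

Under these assumptions the first call reports one code unit and the second reports two, which is the case where entering the diacritic before or after the base letter starts to matter.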

So on Linux -- as opposed to Windows -- the goal is a bit easier to reach in that combining sequences can be generated by dead keys. But it is a bit harder in that, for a given keyboard layout, there is no DLL with a complete and customized dead trans list in it. I believe that an installation script could be extended to replace the default composer configuration file by a complete one.

In any case, XCompose needs some thorough update because support for numerous characters used in African Latin script based writing systems is very weak, if not purely missing, in XCompose. Same issue should lead to revise ISO/IEC 9995 again, this year best. I do hope that your request as well as all equivalent demands from other countries shall be centralized to achieve a thorough overhaul of ISO/IEC 9995 and its international and national implementations to solve at once all pending keyboard problems.

Thanks in turn for your valuable feedback and suggestions!

Marcel

From verdy_p at wanadoo.fr Tue Feb 23 05:10:51 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 23 Feb 2016 12:10:51 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <655233309.225.1456222906078.JavaMail.www@wwinf1g38>
References: <1099467789.7736.1455631273655.JavaMail.www@wwinf2226> <655233309.225.1456222906078.JavaMail.www@wwinf1g38>
Message-ID:

2016-02-23 11:21 GMT+01:00 Marcel Schneider :

> I feel that people coming from -- or studying languages of -- countries and
> communities on other continents should become able to type their language
> in that script on any computer in France as well as in any other Latin
> script using countries, given that the issue is only to add some more
> characters of the same script. Latin script -- as opposed to all other scripts
> AFAIK -- stays still chopped into "sub-scripts" and subsets. That brings but
> counter-productive complications, while the implementation of the whole
> script is technically feasible even on keyboard level. The only difference
> between keyboard layouts of Latin script using countries should be varying
> accessibility depending on frequencies of use.
>

There will remain a resistance for the base layout of letters (basically QWERTY vs. AZERTY vs QWERTZ) and basic punctuation. For all other characters (including shifted or non-shifted digits, because this is only an issue on mechanical keyboards, not touch-on-screen keyboards, and mechanical keyboards almost always have a numeric keypad anyway), people can adapt easily, provided that the less frequent but essential punctuation (parentheses, apostrophe, hyphen) can be found on the key labels, as well as the location of dead keys for all the essential diacritics.

Indeed, if there's a new standard for French, there will be new physical keyboards placing the labels correctly for the essential punctuation, plus the essential letters combined with diacritics with a single keystroke: but the latter letters are language-dependent and not script-dependent, so people writing in other languages for the same script may not find them useful, but should be able to locate the dead keys to get the full coverage they need. If a standard is adopted, the set of essential letters combined with diacritics should be located on a small part of the keyboard that is the same across all languages of the script, but tuned specifically for a language (or a few languages of one country).

There will remain keyboard layouts per country differing only on those locations in this small part, probably reduced to only 5 language-dependent keys (only designed for ease of access, e.g. "?????" in French are very frequent and will be located in that part, but Italians would like to have all vowels with acute, Spanish will want to have the "?" in this part).

From doug at ewellic.org Tue Feb 23 11:25:09 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 23 Feb 2016 10:25:09 -0700
Subject: Possible to add new precomposed characters for local language in Togo?
Message-ID: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net>

Philippe Verdy wrote:

> There will remain a resistance for the base layout of letters
> (basically QWERTY vs. AZERTY vs QWERTZ) and basic punctuation

Philippe is absolutely right here. Most of us on this list are character-set and i18n wonks, and some of us have customized our own keyboard layouts, but we should not delude ourselves into thinking we represent ordinary users. Many people are emotionally tied to a particular keyboard layout and become very confused when faced with something different. Trying to persuade them to adopt a "universal" keyboard, so they can type characters in a language they may not know, is an exercise in social frustration.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue Feb 23 18:38:59 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 24 Feb 2016 01:38:59 +0100
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net>
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net>
Message-ID:

And this is demonstrated since long by the experience of alternate "ergonomic" layouts, used by very few people. Even without fundamentally changing the layout (e.g. with keyboards in two parts in V), many people don't like them and prefer the traditional one.
There are still some devices using a pure alphabetic layout but many people criticized it (including in France, when the early Minitel, a text-only terminal for online services and precursor of today's Internet in France, used it in its first version: even if its keyboard was really horrible and made on purpose to type slowly, it rapidly changed to the AZERTY layout).

The only revolution did not come from mechanical keyboards but from touch-on-screen layouts, to reduce the number of visible keys to leave space on the screen. Before that, there was the T9 keyboard on the first smartphones, but here again this was not a success, as it was really bad and extremely slow to type without errors. Touch-screen layouts have also kept the basic layout of letters of mechanical keyboards, even if they've suppressed many keys by using a new "mode" key for typing digits, punctuation, symbols or emojis.

However touch-screen layouts have successfully integrated all the characters that people wanted, including letters with diacritics that were missing on mechanical keyboards, because their layout is dynamic and all labels are visible. But even on these layouts, typing the common letters with diacritics is still slow compared to traditional keyboards, even with dictionary-based wizards that propose words (it is still a nightmare to type a strong password, where dictionary-based assistants offer little help or create errors, or to program something in a computing language, because it needs much more frequent punctuation and symbols or because not everything is a true linguistic word).

It is then acceptable to slightly adapt a layout with minor changes, provided that basic letters and common punctuation are still at the same place and don't require pressing new combinations of keys (only acceptable for less frequent letters).

We'll continue to live for long with the 3 basic layouts for Latin (QWERTY, AZERTY, QWERTZ). And nothing will really change without a strong national standard that will convince manufacturers to propose it at normal prices, and force software vendors to include it in the builtin layouts for their OSes.

2016-02-23 18:25 GMT+01:00 Doug Ewell :

> Philippe Verdy wrote:
>
> > There will remain a resistance for the base layout of letters
> > (basically QWERTY vs. AZERTY vs QWERTZ) and basic punctuation
>
> Philippe is absolutely right here. Most of us on this list are
> character-set and i18n wonks, and some of us have customized our own
> keyboard layouts, but we should not delude ourselves into thinking we
> represent ordinary users. Many people are emotionally tied to a
> particular keyboard layout and become very confused when faced with
> something different. Trying to persuade them to adopt a "universal"
> keyboard, so they can type characters in a language they may not know,
> is an exercise in social frustration.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>

From charupdate at orange.fr Thu Feb 25 02:35:25 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 25 Feb 2016 09:35:25 +0100 (CET)
Subject: Possible to add new precomposed characters for local language in Togo?
In-Reply-To:
References: <20160223102509.665a7a7059d7ee80bb4d670165c8327d.2a091675e5.wbe@email03.secureserver.net>
Message-ID: <1368697159.3867.1456389325215.JavaMail.www@wwinf1p04>

On Tue, 23 Feb 2016 12:10:51 +0100, Philippe Verdy wrote:

> 2016-02-23 11:21 GMT+01:00 Marcel Schneider :
>
> > I feel that people coming from -- or studying languages of -- countries and
> > communities on other continents should become able to type their language
> > in that script on any computer in France as well as in any other Latin
> > script using countries, [...]
> > The only difference
> > between keyboard layouts of Latin script using countries should be varying
> > accessibility depending on frequencies of use.
> >
>
> There will remain a resistance for the base layout of letters (basically
> QWERTY vs. AZERTY vs QWERTZ) and basic punctuation.
> For all other characters (including shifted or non-shifted digits, because
> this is only an issue on mechanical keyboards, not touch-on-screen
> keyboards, and mechanical keyboards almost always have a numeric keypad
> anyway), people can adapt easily, provided that the less frequent but
> essential punctuation (parentheses, apostrophe, hyphen) can be found on the
> key labels, as well as the location of dead keys for all the essential
> diacritics.
>
> Indeed, if there's a new standard for French, there will be new physical
> keyboards placing the labels correctly for the essential punctuation, plus
> the essential letters combined with diacritics with a single keystroke:
> but the latter letters are language-dependent and not script-dependent, so
> people writing in other languages for the same script may not find them
> useful, but should be able to locate the dead keys to get the full coverage
> they need. If a standard is adopted, the set of essential letters combined
> with diacritics should be located on a small part of the keyboard that is
> the same across all languages of the script, but tuned specifically for a
> language (or a few languages of one country).
> There will remain keyboard layouts per country differing only on those
> locations in this small part, probably reduced to only 5 language-dependent
> keys (only designed for ease of access, e.g. "?????" in French are very
> frequent and will be located in that part, but Italians would like to have
> all vowels with acute, Spanish will want to have the "?" in this part).

On Tue, 23 Feb 2016 10:25:09 -0700, Doug Ewell replied:

> Philippe Verdy wrote:
>
> > There will remain a resistance for the base layout of letters
> > (basically QWERTY vs. AZERTY vs QWERTZ) and basic punctuation
>
> Philippe is absolutely right here. Most of us on this list are
> character-set and i18n wonks, and some of us have customized our own
> keyboard layouts, but we should not delude ourselves into thinking we
> represent ordinary users. Many people are emotionally tied to a
> particular keyboard layout and become very confused when faced with
> something different. Trying to persuade them to adopt a "universal"
> keyboard, so they can type characters in a language they may not know,
> is an exercise in social frustration.

On Wed, 24 Feb 2016 01:38:59 +0100, Philippe Verdy replied:

> And this is demonstrated since long by the experience of alternate
> "ergonomic" layouts, used by very few people.
> [...]
>
> We'll continue to live for long with the 3 basic layouts for Latin (QWERTY,
> AZERTY, QWERTZ). And nothing will really change without a strong national
> standard that will convince manufacturers to propose it at normal prices,
> and force software vendors to include it in the builtin layouts for their
> OSes.

When I wrote: "The only difference [...] should be [...]", I swapped over into an ideal world... let alone that the historic swap from QWERTY to AZERTY was triggered by an "accessibility" issue based "on frequencies of use". My purpose being not to *enforce* ergonomics as for the alphabetical layout, I fully agree with Mats Blakstad, whose "method of extending the main layout is likely to be the only useful one" as I wrote in the same e-mail -- and with Doug Ewell and Philippe Verdy, whose valuable contributions came on to sustain.

All parts of the Latin script as provided by Unicode, that are not used to write local and national languages e.g. of Togo, or of France, may be hidden as on keytops, but accessible on software side, i.e. in the layout driver or in the configuration files.

One other challenge in Togo would be how to give easy access to the seven supplemental letters ?, ?, ?, ?, ?, ? and ?, while the five French precomposed letters are to be maintained, let alone ? and ? -- the latter being rather seldom in French however -- that are part of the new governmental requirements in France, among other characters like the angle quotation marks, called guillemets-chevrons[1].

Generally speaking, I can't help believing that providing the ability to type any Latin script using language on any Latin keyboard would be a good idea. Again, that is feasible without overloading the keyboard with dead keys, just providing the most frequently used ones, six in Togo as I can see.

Marcel

[1] Vers une norme française pour les claviers informatiques - Langue française et langues de France - Ministère de la Culture et de la Communication. (2016, January 15). Retrieved January 19, 2016, from http://www.culturecommunication.gouv.fr/Politiques-ministerielles/Langue-francaise-et-langues-de-France/Politiques-de-la-langue/Langues-et-numerique/Les-technologies-de-la-langue-et-la-normalisation/Vers-une-norme-francaise-pour-les-claviers-informatiques

From public at khwilliamson.com Sun Feb 28 23:04:27 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Sun, 28 Feb 2016 22:04:27 -0700
Subject: Girl, 12, charged for threatening her school with emojis
Message-ID: <56D3D15B.4070705@khwilliamson.com>

http://abc27.com/2016/02/27/girl-12-charged-for-threatening-emojis/

From asmus-inc at ix.netcom.com Mon Feb 29 00:27:28 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 28 Feb 2016 22:27:28 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <56D3D15B.4070705@khwilliamson.com>
References: <56D3D15B.4070705@khwilliamson.com>
Message-ID: <56D3E4D0.9030902@ix.netcom.com>

An HTML attachment was scrubbed...

From textexin at xencraft.com Mon Feb 29 01:14:23 2016
From: textexin at xencraft.com (Tex Texin)
Date: Sun, 28 Feb 2016 23:14:23 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <56D3E4D0.9030902@ix.netcom.com>
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com>
Message-ID: <003901d172c0$d1626970$74273c50$@xencraft.com>

Apparently, "killing" isn't threatening but the emojis are. I think this is because the schools and police believe the emojis are functions on their cell phones which might actually detonate, shoot or stab.

The problem is with the reporter who headlined with the use of emojis as if that were significant and not that she specifically stated killing and hid behind another kid's email account. It should say she was charged with threatening killing.

But yeah arresting a 12 year old is as idiotic as arresting everyone who ever threatened (or wrote) "I'll kill you" when they were angry.

However, how any of this belongs on the Unicode list is beyond me. Surely we do not need to comment on every use of emoji that occurs in the media.

tex

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
Sent: Sunday, February 28, 2016 10:27 PM
To: unicode at unicode.org
Subject: Re: Girl, 12, charged for threatening her school with emojis

On 2/28/2016 9:04 PM, Karl Williamson wrote:

http://abc27.com/2016/02/27/girl-12-charged-for-threatening-emojis/

"The mother says the girl shouldn't have been charged."

In civilized countries 12-year-olds would be considered too young to be dragged into the courts.

A./

From asmus-inc at ix.netcom.com Mon Feb 29 03:18:37 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 29 Feb 2016 01:18:37 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <003901d172c0$d1626970$74273c50$@xencraft.com>
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com>
Message-ID: <56D40CED.2080406@ix.netcom.com>

On 2/28/2016 11:14 PM, Tex Texin wrote:
>
> However, how any of this belongs on the Unicode list is beyond me.
> Surely we do not need to comment on every use of emoji that occurs in
> the media.
>
>

But there you are mistaken, my dear sir!

We are constantly told that the discussions on this list have no official status, and cannot affect the UTC's deliberations, so the only useful topics left are of the "facebook"-post ilk.

A./

PS: now what was the "tongue-in-cheek" emoji character code again?

From verdy_p at wanadoo.fr Mon Feb 29 15:55:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 29 Feb 2016 22:55:12 +0100
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <56D40CED.2080406@ix.netcom.com>
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com>
Message-ID:

This discussion is not official, but visibly a court takes emojis seriously and wants to assign them a legal meaning... Well emojis were initially designed to track emotions and form a sort of new language, but the court will have to explain what is the meaning of these 3 characters (not really images, just a handful of bytes in a short message) in legal terms. These 3 characters are very far from being convincing, even if they are interpreted as 3 words.
This is very low for accusing someone so young of threatening someone with dangerous words. Take a linguistic dictionary, it is full of these dangerous words. Take a catalog of firearms from the NRA, it is largely more threatening, but the NRA is not charged in a US court... And these characters are really fake firearms, very virtual.

So it's not the meaning, nor the technical means by which these terms were sent which is essential, the court will in fact want to judge about the intent and the effective psychological nature of this threat. What is the real intent of a 12-year old girl? There are not enough elements in the short message to judge and given her age she does not really realize that this could have so dramatic an effect (nobody has experienced that before based on only three words which are not even evident personal insults).

We'll have to bring to the fire many old famous comics (intended for children) showing similar images in bubbles instead of slang words, or label them "only for adults".

2016-02-29 10:18 GMT+01:00 Asmus Freytag (t) :

> On 2/28/2016 11:14 PM, Tex Texin wrote:
>
> However, how any of this belongs on the Unicode list is beyond me. Surely
> we do not need to comment on every use of emoji that occurs in the media.
>
> But there you are mistaken, my dear sir!
>
> We are constantly told that the discussions on this list have no official
> status, and cannot affect the UTC's deliberations, so the only useful topics
> left are of the "facebook"-post ilk.
>
> A./
>
> PS: now what was the "tongue-in-cheek" emoji character code again?
>
From asmus-inc at ix.netcom.com Mon Feb 29 16:24:32 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 29 Feb 2016 14:24:32 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com>
Message-ID: <56D4C520.20709@ix.netcom.com>

On 2/29/2016 1:55 PM, Philippe Verdy wrote:
> . Well, emojis were initially designed to track emotions and form a
> sort of new language,

E-moji means "picture-character" in Japanese; it has nothing to do (at first) with emotions.

A./

From leoboiko at namakajiri.net Mon Feb 29 16:37:00 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Mon, 29 Feb 2016 19:37:00 -0300
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <56D4C520.20709@ix.netcom.com>
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>
Message-ID:

It's a picture-character, sure; but I'd think that, like kaomoji before them, they've been used since the beginning to express the attitude of the writer, a kind of "emotion" (in linguistic terms, the "mood" of the utterance).

For example, consider the ubiquitous ♥ sign, which also predates cellphone emoji; it's long been used in manga to denote a mood of flirtatiousness, fondness, cuteness, playfulness and so on. Likewise, the "veins popping" sign in manga ( http://tvtropes.org/pmwiki/pmwiki.php/Main/CrossPoppingVeins ) may be a drawing of veins; but it's used quite abstractly to denote an angry mood, and can even be used among text, in speech balloons.

2016-02-29 19:24 GMT-03:00 Asmus Freytag (t):
> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
>
> . Well, emojis were initially designed to track emotions and form a sort of
> new language,
>
>
> E-moji means "picture-character" in Japanese, has nothing to do (at first)
> with emotions.
>
> A./

From verdy_p at wanadoo.fr Mon Feb 29 18:04:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 1 Mar 2016 01:04:07 +0100
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To: <56D4C520.20709@ix.netcom.com>
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>
Message-ID:

Today's Japanese emojis are (for most of them) recent inventions; maybe there are some earlier traces in Japanese comics, but you may as well find them in comics of America or Europe since about the 1940s.

All these icons were *later* renamed emojis in English and Unicode, but there's a long history of using icons for such emotions. Look at the little heart drawn near the signature on a handwritten letter or discreet messages, or similar symbols carved by lovers on walls and trees. Or long before, as a sign of recognition, such as the fish for the first Christians in the Roman Empire, or even before in some hieroglyphic inscriptions in ancient Egyptian, Mayan, and Chinese civilizations since the Bronze Age or before.

In fact you could also add all the symbols (not necessarily with religious meaning) found on graves for expressing that the remaining family or friends miss the deceased.
You could also add the similar symbols on jewelry for showing we love someone, or warrior paintings on faces.

The modern Japanese emojis were not the first pictographic signs to express emotions (even if they have now been extended to many other things and are now spreading across the rest of the world with these extensions).
Still, their main usage remains for emotions; starting in the 1970s these were ASCII art symbols such as the famous :-)

2016-02-29 23:24 GMT+01:00 Asmus Freytag (t):
> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
>
> . Well, emojis were initially designed to track emotions and form a sort of
> new language,
>
>
> E-moji means "picture-character" in Japanese, has nothing to do (at first)
> with emotions.
>
> A./

From gwalla at gmail.com Mon Feb 29 18:25:23 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Mon, 29 Feb 2016 16:25:23 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>
Message-ID:

Some are used to express emotions but many are not: food items, animals, landmarks, activities, etc. I think the majority do not have clear emotional referents. The original set introduced in Unicode 6.0 included things like ROASTED SWEET POTATO and TOKYO TOWER.

On Mon, Feb 29, 2016 at 4:04 PM, Philippe Verdy wrote:
> Today's Japanese emojis are (for most of them) recent inventions; maybe
> there are some earlier traces in Japanese comics, but you may as well find
> them in comics of America or Europe since about the 1940s.
>
> All these icons were *later* renamed emojis in English and Unicode, but
> there's a long history of using icons for such emotions. Look at the little
> heart drawn near the signature on a handwritten letter or discreet
> messages, or similar symbols carved by lovers on walls and trees. Or long
> before, as a sign of recognition, such as the fish for the first Christians
> in the Roman Empire, or even before in some hieroglyphic inscriptions in
> ancient Egyptian, Mayan, and Chinese civilizations since the Bronze Age or
> before.
>
> In fact you could also add all the symbols (not necessarily with religious
> meaning) found on graves for expressing that the remaining family or friends
> miss the deceased.
> You could also add the similar symbols on jewelry for showing we love
> someone, or warrior paintings on faces.
>
> The modern Japanese emojis were not the first pictographic signs to express
> emotions (even if they have now been extended to many other things and are
> now spreading across the rest of the world with these extensions). Still,
> their main usage remains for emotions; starting in the 1970s these were
> ASCII art symbols such as the famous :-)
>
>
>
> 2016-02-29 23:24 GMT+01:00 Asmus Freytag (t):
>>
>> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
>>
>> . Well, emojis were initially designed to track emotions and form a sort of
>> new language,
>>
>>
>> E-moji means "picture-character" in Japanese, has nothing to do (at first)
>> with emotions.
>>
>> A./

From asmus-inc at ix.netcom.com Mon Feb 29 20:11:26 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 29 Feb 2016 18:11:26 -0800
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>
Message-ID: <56D4FA4E.9040204@ix.netcom.com>

An HTML attachment was scrubbed...