From frederic.grosshans at gmail.com Tue Mar 1 04:14:22 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 1 Mar 2016 11:14:22 +0100
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com>
Message-ID: <56D56B7E.9020702@gmail.com>

On 29/02/2016 22:55, Philippe Verdy wrote:
> So it's not the meaning, nor the technical mean by which these terms
> were sent which is essential, the court will in fact want to judge
> about the intent and the effective psychological nature of this
> threat. What is the real intent of a 12-year old girl? There's not
> enough element in the short message to judge and given her age she
> does not really realize that this could have a so dramatic effect
> (nobody has experienced that before based on only three words which
> are not even evident personal insults).
>
> We'll have to bring to the fire many old famous comics (intended to
> children) showing similar images in bubbles instead of slang words, or
> label them "only for adults".
>
🔫 🔪 💣 indeed recall some of the symbols proposed by Karl Pentzlin in
2010 in L2/10-402, "Proposal to encode some additional Comic Style
Symbols" (http://www.unicode.org/L2/L2010/10402-comic-symbols.pdf).
They really look like comics-style swearwords to me.

Fred

From leob at mailcom.com Tue Mar 1 12:10:53 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Tue, 1 Mar 2016 10:10:53 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:

Message-ID:

I have a less disruptive proposal than to encode an unprecedented
combining emoji.
How about adding variation sequences + U+FE0F VS16 to signify BANKNOTE with ?

Leo

On Wed, Feb 10, 2016 at 1:38 AM, "Jörg Knappen" wrote:

> For the pound emoji, throw in ~90M Egyptians.
>
> --Jörg Knappen
>
> *Sent:* Tuesday, 09 February 2016 at 23:46
> *From:* "Leo Broukhis"
> *To:* "Mark Davis ☕️"
> *Cc:* "unicode Unicode Discussion"
> *Subject:* Re: Enclosing BANKNOTE emoji?
> The emojiexpress.com site is useful to check which new emoji or
> combinations people actually use, but the stats are likely skewed by only
> measuring input from one platform.
>
> Another way to look at the emojitracker.com stats:
>
> 339M people in the Eurozone : 389K uses of Euro emoji
> 126M people in Japan : 354K uses of Yen emoji
> 140M people in UK + Turkey (likely users of the Pound emoji as a stand-in
> for Lira) : 515K uses of pound emoji
>
> The total is 605M people : 1258K uses of non-dollar emoji
> Assuming the same average frequency of use, 2933K uses of the dollar emoji
> would be produced by 1411M people, out of which us + canada + mexico +
> australia (500M) + other countries using $ as (part of) the sign for
> their currency are way less than a half. This means that substantially more
> than 500M people are using the dollar emoji by default, instead of emoji of
> their national currencies. Assuming a lesser frequency of use will result
> in a greater estimate of the affected population.
>
> Leo
>
>
> On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ☕️ wrote:
>>
>> Look at http://www.emojixpress.com/stats/. The stats are different,
>> since they collect data from keyboards not twitter posts, but they have a
>> nice button to view only the new emoji.
>>
>> (The numbers on the new ones will be smaller, just because it takes time
>> for systems to support them, and people to start using them. However, they
>> bear out my prediction that the most popular would be the eyes-rolling
>> face).
>>
>>
>> Mark
>>
>>
>> On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis wrote:
>>>
>>> A caveat about using emojitracker.com : it doesn't count newer emoji
>>> yet (e.g. U+1F37E bottle with popping cork is absent), thus, when they are
>>> added, their counts will be skewed.
>>>
>>> Leo
>>>
>>> On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis wrote:
>>>>
>>>> Thank you for the links, quite mesmerizing!
>>>>
>>>> On emojitracker.com (cumulative counts, but only on twitter, AFAICS),
>>>> U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle
>>>> of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile),
>>>> and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but
>>>> 10x more than the lowest counts, and about the same frequency as various
>>>> individual clock faces).
>>>>
>>>> It is quite evident that the dollar banknote emoji serves as a stand-in
>>>> for at least half a dozen of various currencies.
>>>>
>>>> On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ☕️
>>>> wrote:
>>>>
>>>>> I would suggest that you first gather statistics and present
>>>>> statistics on how often the current combinations are used compared to other
>>>>> emoji, eg by consulting sources such as:
>>>>>
>>>>> http://www.emojixpress.com/stats/
>>>>> or
>>>>> http://emojitracker.com/
>>>>>
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis
>>>>> wrote:
>>>>>>
>>>>>> There are
>>>>>>
>>>>>> 💴 U+01F4B4 Banknote With Yen Sign
>>>>>> 💵 U+01F4B5 Banknote With Dollar Sign
>>>>>> 💶 U+01F4B6 Banknote With Euro Sign
>>>>>> 💷 U+01F4B7 Banknote With Pound Sign
>>>>>>
>>>>>> This is clearly an incomplete set. It makes sense to have a generic
>>>>>> "enclosing banknote" emoji character which, when combined with a
>>>>>> currency sign, would produce the corresponding banknote, to forestall
>>>>>> requests for individual emoji for banknotes with remaining currency
>>>>>> signs.
>>>>>>
>>>>>> Leo
>>>>>>

From chris.jacobs at xs4all.nl Tue Mar 1 12:31:35 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Tue, 01 Mar 2016 19:31:35 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:

Message-ID: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>

How would the system distinguish between US and Canada dollar?

Both would be <$> + U+FE0F VS16

Chris

Leo Broukhis wrote on 2016-03-01 19:10:

> I have a less disruptive proposal than to encode an unprecedented combining emoji.
> How about adding variation sequences + U+FE0F VS16 to signify BANKNOTE with ?
>
> Leo

From leob at mailcom.com Tue Mar 1 12:35:12 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Tue, 1 Mar 2016 10:35:12 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
References:

<95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
Message-ID:

It doesn't have to.

How does the system distinguish between US and Canada dollar in plain
text? Both are <$>.

Leo

On Tue, Mar 1, 2016 at 10:31 AM, Chris Jacobs wrote:

> How would the system distinguish between US and Canada dollar?
>
> Both would be <$> + U+FE0F VS16
>
> Chris
>
>
> Leo Broukhis wrote on 2016-03-01 19:10:
>
> I have a less disruptive proposal than to encode an unprecedented
> combining emoji.
> How about adding variation sequences + U+FE0F VS16 to
> signify BANKNOTE with ?
>
> Leo

From doug at ewellic.org Tue Mar 1 12:49:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Mar 2016 11:49:29 -0700
Subject: Girl, 12, charged for threatening her school with emojis
Message-ID: <20160301114929.665a7a7059d7ee80bb4d670165c8327d.336797b96f.wbe@email03.secureserver.net>

Asmus Freytag wrote:

>> . Well emojis were initially designed to track amotions and form a
>> sort of new language,
>
> E-moji means "picture-character" in Japanese, has nothing to do (at
> first) with emotions.

I wonder if it would help some folks to remember that "mojibake," a
term many of us are familiar with, contains the same root "moji"
("character").

Exercise: consider "emojibake." ????

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From leoboiko at namakajiri.net Tue Mar 1 13:44:48 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Tue, 1 Mar 2016 16:44:48 -0300
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>

Message-ID:

Ah but that is a "majority" by a dictionary/type count. Due to Zipf's
Law, in language matters we should always distinguish dictionary counts
from actual usage.

E.g. Twitter is very popular in Japan, and I think we'll all agree that
the top used kanji are predominantly modal: http://emojitracker.com/

Thomas Dimson's great distributional analysis for Instagram gives us
hashtags that are equivalent to emoji; again, I think it's clear that
their primary use is for modality.
http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

What's more, a lot of emoji which seem to have no "clear emotional
referent" are appropriated for modal purposes. For example, this
thread's 🔫 🔪 💣 are graphical depictions of objects, but I think you'll
all agree that the girl was expressing a mood; she wasn't saying "gun,
knife, bomb". I'm told that U+1F481, INFORMATION DESK PERSON 💁, was
taken to be "sassy girl" or "hair flick", and from that it became a
modality indicator for sassiness, sarcasm, fabulousness etc.

(I suspect that another major use of emoji, besides modality, is
deictic: "I'm at Tokyo Tower" + Tokyo Tower emoji, "Merry Christmas" +
Christmas-related emoji. Emotional mood still seems to me to be clearly
the dominant use.)

2016-02-29 21:25 GMT-03:00 Garth Wallace :

> Some are used to express emotions but many are not: food items,
> animals, landmarks, activities, etc. I think the majority do not have
> clear emotional referents. The original set introduced in Unicode 6.0
> included things like ROASTED SWEET POTATO and TOKYO TOWER.
>
> On Mon, Feb 29, 2016 at 4:04 PM, Philippe Verdy wrote:
> > Today's Japanese emojis are (for most of them) recent inventions; may be
> > there are some earlier tracks in Japanese comics, but you may as well find
> > them in comics of America or Europe since the about the 1940's.
> >
> > All these icons were *later* renamed emojis in English and Unicode, but
> > there's a long history of using icons for such emotions Look at the little
> > heart drawn near the signature on an handwritten letter or discrete
> > messages, or similar symbols carved by lovers on walls and trees. Or long
> > before as a sign of recognition such as the fish for the first Christians in
> > the Roman Empire, or even before in some hieroglyphic inscriptions in antic
> > Egyptian, Mayan, and Chinese civilizations since Bronze Age or before.
> >
> > In fact you could also add all the symbols (not necessarily with religious
> > meaning) found on graves for expressing that the remaining family of friend
> > is missing the defunct.
> > You could also add the similar symbols on jewelry for showing we love
> > someone, or warrior paintings on faces.
> >
> > The modern Japanese Emojis were not the first pictograpic signs to express
> > emotions (even if now they have been extended to many other things and they
> > are now widespreading the rest of the world with these extensions). Still
> > their main usage remains for emotions ; starting in the 1970's these were
> > ASCII art symbols such as the famous :-)
> >
> >
> >
> > 2016-02-29 23:24 GMT+01:00 Asmus Freytag (t) :
> >>
> >> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
> >>
> >> . Well emojis were initially designed to track amotions and form a sort of
> >> new language,
> >>
> >>
> >> E-moji means "picture-character" in Japanese, has nothing to do (at first)
> >> with emotions.
> >>
> >> A./
> >
> >

From doug at ewellic.org Wed Mar 2 09:49:17 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 02 Mar 2016 08:49:17 -0700
Subject: Enclosing BANKNOTE emoji?
Message-ID: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>

On February 8, Leo Broukhis wrote:

> This is clearly an incomplete set. It makes sense to have a generic
> "enclosing banknote" emoji character which, when combined with a
> currency sign, would produce the corresponding banknote, to forestall
> requests for individual emoji for banknotes with remaining currency
> signs.

I'm not wildly opposed to these -- maybe more so to the more recent idea
of variation selectors to transform currency symbols into emoji -- but I
wonder if there is really a demand for such images, especially at the
small size normally associated with emoji, or if this is simply
speculation. At least in principle, "expected usage level" is supposed
to be one factor that speaks for or against encoding.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From leob at mailcom.com Wed Mar 2 10:34:40 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Wed, 2 Mar 2016 08:34:40 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>
References: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>
Message-ID:

Per se, the level of use is quite respectable. On emojitracker (not yet
updated with newer emoji), :dollar: is at #330/845, and the lowest of
the group, :yen:, is #688.

My calculations based on the usage count and population of the countries
using corresponding signs show that :dollar: is way out of proportion,
which means that it is used by default quite a lot.

Speaking of "enclosing banknote" vs variation selector, the shorthands
(:dollar:, :yen:, etc.) suggest that Twitter treats the banknote emoji
as emoji-style of the currency signs, and a new character would be
superfluous.

Leo

On Wed, Mar 2, 2016 at 7:49 AM, Doug Ewell wrote:

> On February 8, Leo Broukhis wrote:
>
> > This is clearly an incomplete set. It makes sense to have a generic
> > "enclosing banknote" emoji character which, when combined with a
> > currency sign, would produce the corresponding banknote, to forestall
> > requests for individual emoji for banknotes with remaining currency
> > signs.
>
> I'm not wildly opposed to these -- maybe more so to the more recent idea
> of variation selectors to transform currency symbols into emoji -- but I
> wonder if there is really a demand for such images, especially at the
> small size normally associated with emoji, or if this is simply
> speculation. At least in principle, "expected usage level" is supposed
> to be one factor that speaks for or against encoding.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From verdy_p at wanadoo.fr Wed Mar 2 17:35:30 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Mar 2016 00:35:30 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:

<95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
Message-ID:

Both are $ in plain text yes, but they are in textual context. Emojis
are to be used alone and interpreted mostly by themselves. They are also
highly pictographic and represent the actual object in a realistic way.

So a neutral "$" sign in a banknote emoji would not distinguish the
(green) US dollar from the Canadian dollar. In fact that banknote emoji
for the US dollar would typically not use the "$" currency sign itself
(not alone), but would be an actual green banknote (it will probably be
encoded by itself, just like the one for the yen), or as a fallback, a
small version of the country flag, or the letters "US" inside (just like
country flag icons).

You can still have a banknote emoji based on the currency sign but it
will only represent that currency sign and not the actual currency unit
(except those currency units whose symbols are strongly tied to the
currency, such as the euro sign, or the symbols for the new shekel, or
the new ruppiah, not used for other currencies...).

Using variation selectors would not be a solution. In my opinion it
would be best to combine a generic/blank banknote emoji with the other
symbol representing a currency sign or country flag, tied together using
the same techniques as those used for emojis representing people or
groups of people, i.e. a format control hinting the presence of a
ligature.

2016-03-01 19:35 GMT+01:00 Leo Broukhis :

> It doesn't have to.
>
> How does the system distinguish between US and Canada dollar in plain
> text? Both are <$>.
>
> Leo
>
>
> On Tue, Mar 1, 2016 at 10:31 AM, Chris Jacobs
> wrote:
>
>> How would the system distinguish between US and Canada dollar?
>>
>> Both would be <$> + U+FE0F VS16
>>
>> Chris
>>
>>
>> Leo Broukhis wrote on 2016-03-01 19:10:
>>
>> I have a less disruptive proposal than to encode an unprecedented
>> combining emoji.
>> How about adding variation sequences + U+FE0F VS16 to
>> signify BANKNOTE with ?
>>
>> Leo

From mandel59 at gmail.com Thu Mar 3 13:42:02 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 04:42:02 +0900
Subject: Failure on Japanese dolls emoji
Message-ID: <56D8938A.2090000@gmail.com>

Hello, Unicode

3rd March is hina-matsuri (???; Doll's Day) in Japan, and there is an
emoji for it: 🎎 Japanese Dolls. I wrote an article on failures of that
emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437

Some vendors ship Japanese Dolls emoji that don't seem to be
hina-matsuri dolls. I wish difficulty of implementation of
culture-dependent emoji be given wider publicity by this post.

Thanks,
Ryusei

From olopierpa at gmail.com Thu Mar 3 16:59:33 2016
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Thu, 3 Mar 2016 23:59:33 +0100
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8938A.2090000@gmail.com>
References: <56D8938A.2090000@gmail.com>
Message-ID:

On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>
> Hello, Unicode
>
> 3rd March is hina-matsuri (???; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>
> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls. I wish difficulty of implementation of culture-dependent emoji be given wider publicity by this post.

But, the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
you are expecting a particular visual, which is not promised anywhere.

It's a bit like if I complained that some "MOUNTAIN" emojis are wrong
because they don't look like Monte Bianco.

Cheers

From mandel59 at gmail.com Thu Mar 3 18:57:29 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 09:57:29 +0900
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DA94.4040509@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com>
Message-ID: <56D8DD79.9070102@gmail.com>

On 2016/03/04 7:59, Pierpaolo Bernardi wrote:
> On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>> Hello, Unicode
>>
>> 3rd March is hina-matsuri (???; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>>
>> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls. I wish difficulty of implementation of culture-dependent emoji be given wider publicity by this post.
> But, the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
> you are expecting a particular visual, which is not promised anywhere.
>
> Is a bit like if I complained that some "MOUNTAIN" emojis are wrong
> because they don't look like Monte Bianco.
>
> Cheers

JAPANESE DOLLS in Unicode is collected from the character sets of KDDI
and SoftBank, Japanese telecom companies, and the emoji is named as ???
or ???? (both are hina-matsuri) in these specs. Here is a capture of
Chart with FPDAM8 data and glyphs

via https://sites.google.com/site/unicodesymbols/Home/emoji-symbols

And the NamesList.txt of Unicode Character Database gives the
description: Japanese Hinamatsuri or girls' doll festival. Aren't they
the authorities to let the emoji look like hina-matsuri?

Ryusei

From alolita.sharma at gmail.com Thu Mar 3 19:14:18 2016
From: alolita.sharma at gmail.com (Alolita Sharma)
Date: Thu, 3 Mar 2016 17:14:18 -0800
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DD79.9070102@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID:

Hi Ryusei,

I provided your useful feedback to the Emoji design team at Twitter and they will update the twemoji for Japanese dolls. Thanks for providing excellent examples in your post.

Best,
Alolita

From olopierpa at gmail.com Thu Mar 3 19:20:08 2016
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Fri, 4 Mar 2016 02:20:08 +0100
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DD79.9070102@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID:

On Fri, Mar 4, 2016 at 1:57 AM, Ryusei Yamaguchi wrote:
> And the NamesList.txt of Unicode Character Database gives the description: Japanese Hinamatsuri or girls' doll festival. Aren't they the authorities to let the emoji look like hina-matsuri?

OK. Then you are right in your complaint!

Cheers

From mandel59 at gmail.com Thu Mar 3 21:12:15 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 12:12:15 +0900
Subject: Failure on Japanese dolls emoji
In-Reply-To:
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID: <56D8FD0F.7080503@gmail.com>

Thank you, Alolita :)

Ryusei

From doug at ewellic.org Fri Mar 4 10:51:38 2016
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 04 Mar 2016 09:51:38 -0700
Subject: Failure on Japanese dolls emoji
Message-ID: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>

Pierpaolo Bernardi wrote:

>> And the NamesList.txt of Unicode Character Database gives the
>> description: Japanese Hinamatsuri or girls' doll festival. Aren't
>> they the authorities to let the emoji look like hina-matsuri?
>
> OK. Then you are right in your complaint!
FWIW, I agree that annotations in NamesList.txt are a better justification for prescribing the glyph design of a Unicode character, even an emoji, than tribal knowledge about the history or origin of the character.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From asmus-inc at ix.netcom.com Fri Mar 4 11:19:33 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 4 Mar 2016 09:19:33 -0800
Subject: Failure on Japanese dolls emoji
In-Reply-To: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>
References: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>
Message-ID: <56D9C3A5.7080606@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From prosfilaes at gmail.com Sun Mar 6 22:56:02 2016
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 07 Mar 2016 04:56:02 +0000
Subject: Mammal emoji
Message-ID:

Seeing the presence of foxes on the upcoming emoji list, I remembered the Audubon Mammals (North America) app has silhouettes of mammals on the browse by shape tab. So let's see if they're covered:

Armored Mammals(-): Okay, we're off to a bad start. The image here is sort of porcupine-ish, and there's two distinct creatures under the label, the porcupine(-) and armadillo(-). Neither of which are in Unicode.
Bats(N): In the new list. Which is good; they're sort of iconic.
Bears(+)
Cats(+): Several varieties
Chipmunks, Squirrels and Prairie Dogs(+): Breaking down more than icons the app uses, there is a Chipmunk(+) emoji, no Squirrel(-) emoji; that might be an oversight. Prairie dogs(-) probably aren't.
Hoofed Mammals(+): Breaking it down more: Bison(-), Sheep(+), Reindeer(-) (and that's sort of surprising), Peccary(-), Deer(N), Moose(-) (aka Elk in Europe), Ox(+) (actually Muskox ... and I'm pretty sure that's a distinction Unicode doesn't want to worry about), Pronghorn(-) (nor antelope(-), or the actually related giraffe(-) and okapi(-). Probably covered by the unrelated deer.), Boar(+), Horse(+)
Large Rodents(-): Beaver(-), Muskrat(-), Marmot(-), Nutria(-)
Marine Mammals(+): Dolphin(+), Whale(+), Seal(-), Sea Lion(-), Walrus(-), Manatee(-)
Mice and Rats(+): Mouse(+), Rat(+)
Opossum(-):
Otters(-):
Rabbits and Hares(+):
Raccoons and Their Kin(-):
Shrews and Moles(-):
Voles, Lemmings, Pikas, and Pocket Gophers(-):
Weasels, Skunks and Their Kin(-): While a disparate group, badgers(-), skunks(-), ferrets(-), weasels(-) and wolverines(-) all have arguments for encoding.
Wolves, Foxes, and Coyote(+): Fox(+), Dog(+), Wolf(+), Coyote(-)

So nine icons out of the 17 have a reasonable encoding in Unicode. To cover the set would need an armadillo or porcupine, a beaver, a possum, an otter, a raccoon, a shrew, a lemming, and a weasel or skunk. Beavers (O Canada!), raccoons, ferrets/weasel (popular pet) and skunk (emoji uses abound) probably have the best encoding arguments there.

(This is not an actual proposal, but feel free to forward it on if anyone might want to make one. Just a discussion of a set of icons in the reflection of emoji.)

From petercon at microsoft.com Mon Mar 7 14:58:45 2016
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 7 Mar 2016 20:58:45 +0000
Subject: Mammal emoji
In-Reply-To:
References:
Message-ID:

I know you're not proposing anything and just providing info for discussion. I want to make sure it's clear to others that there is no requirement for encoded emoji in Unicode to provide comprehensive coverage (by any measure) of any semantic or conceptual domain. So, if there isn't any raccoon emoji in Unicode, that doesn't imply that there must or ever will be a raccoon emoji.
Peter

From asmus-inc at ix.netcom.com Mon Mar 7 15:11:31 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 7 Mar 2016 13:11:31 -0800
Subject: Mammal emoji
In-Reply-To:
References:
Message-ID: <56DDEE83.5080504@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Mon Mar 7 19:02:28 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 8 Mar 2016 02:02:28 +0100
Subject: Mammal emoji
In-Reply-To: <56DDEE83.5080504@ix.netcom.com>
References: <56DDEE83.5080504@ix.netcom.com>
Message-ID:

2016-03-07 22:11 GMT+01:00 Asmus Freytag (t) :
> Sometimes looking at semantic domains points out candidates to consider. The ultimate reason for requesting another mammal emoji would rest on the need to included in communications

Is there an emoji for the concept of "overbooked/too much work"? That is the last state (and cause) before either:
- at best (which is not unexpected if people care about each other and themselves!), suddenly abandoning the work to do something else, or
- at worst (if it was not personally anticipated, and other people didn't care), a personal breakdown (with deep, costly and lasting consequences).

This kind of breakdown caused by earlier overbooking at work now has a popular name, "burnout" (another candidate emoji, but more difficult to represent graphically, as you could represent the state of someone depressed, but not its cause).
It is becoming popular today with the (ongoing) regulation of working conditions and prevention of risks by organisations (basically through better distribution of responsibilities, better scheduling of tasks, protection of workers' personal time, choosing to delay some work, and accepting that not everything can be done with existing resources, even if it could "pay" in the short term).

I think that many Unicoders on this list may be at this early stage: they have trouble following everything in the pipeline of incoming requests or proposals, and the UTC is probably under-resourced.

From ori at avtalion.name Wed Mar 9 15:17:17 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Wed, 9 Mar 2016 23:17:17 +0200
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
Message-ID:

Unicode includes the following symbols as "Go Markers":
* U+2686 ⚆ WHITE CIRCLE WITH DOT RIGHT
* U+2687 ⚇ WHITE CIRCLE WITH TWO DOTS
* U+2688 ⚈ BLACK CIRCLE WITH WHITE DOT RIGHT
* U+2689 ⚉ BLACK CIRCLE WITH TWO WHITE DOTS

It is unclear what they are for. I hope someone could explain.

1) I could not find any Go notation that uses dots inside the stones.
2) Why are there no symbols for white/black stones without dots?
3) An earlier proposal [1] suggested additional symbols:
* GRAY CIRCLE WITH GRAY DOT RIGHT
* GRAY CIRCLE WITH GRAY TWO DOTS
* GRAY FILLED CIRCLE WITH WHITE DOT RIGHT
* GRAY FILLED CIRCLE WITH WHITE TWO DOTS
What was their purpose? And why are Go Markers proposed as "Mathematical symbols"? Are they meant for mathematical research of the game of Go and not for actual notation?

[1] http://www.unicode.org/L2/L2001/01067-n2318-mathadd4.pdf

Thanks in advance!
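
For readers who want to check the four character names listed in the message above, here is a minimal sketch using Python's standard `unicodedata` module. It assumes a Python build whose bundled Unicode database includes U+2686 to U+2689; the helper name `describe` and the variable `go_marker_lines` are made up for this illustration.

```python
import unicodedata

# The four "Go Markers" discussed in the thread: U+2686 through U+2689.
GO_MARKERS = range(0x2686, 0x268A)


def describe(code_point):
    """Return a 'U+XXXX NAME (category)' line for one code point."""
    ch = chr(code_point)
    name = unicodedata.name(ch)          # raises ValueError if unnamed
    category = unicodedata.category(ch)  # general category, e.g. 'So'
    return f"U+{code_point:04X} {name} ({category})"


go_marker_lines = [describe(cp) for cp in GO_MARKERS]
for line in go_marker_lines:
    print(line)
```

This only confirms names and general categories; it says nothing about the intended use of the symbols, which is the actual question in the thread.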
From kenwhistler at att.net Wed Mar 9 16:52:34 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 9 Mar 2016 14:52:34 -0800
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References:
Message-ID: <56E0A932.8010909@att.net>

I don't know the answer to this. But I suspect that the source was one of the collection of fonts associated with the STIX project research that led to the collection of mathematical symbols additions noted in L2/01-067 (superseded by L2/01-142), as well as the earlier mathematical symbols proposals with the bulk of the symbols that were added to Unicode 3.2.

Given that context, it is, indeed, most likely that the symbols were associated with some publication(s) in game theory, rather than with professional Go notations per se. See, for example, Mathematical Go: Chilling Gets the Last Point:

http://www.amazon.com/Mathematical-Go-Chilling-Gets-Point/dp/1568810326

I don't see black/white circles with dots in the bit of that publication scanned on Amazon, but it does use a black circle with a delta symbol as part of the game notation for discussion, as well as black and white circles with numbers, denoting sequences of stone placements.

But to know for sure, you would probably have to get confirmation of original sources from Barbara Beeton and/or Patrick Ion, who collected together symbol candidates from a multitude of print sources back in the 1998 - 2001 time frame.

--Ken

From jtauber at jtauber.com Wed Mar 9 17:17:55 2016
From: jtauber at jtauber.com (James Tauber)
Date: Wed, 9 Mar 2016 18:17:55 -0500
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID:

Black (and white) circle with "delta"/triangle is common in general Go books, as is black and white circle with numbers (up into the hundreds). I've also seen circles and squares inside the black and white circle.

A quick look at the 10 or so printed Go books I have doesn't turn up any examples of those 4 Go Markers U+2686 to U+2689.

James

--
James Tauber
http://jtauber.com/
@jtauber on Twitter

From duerst at it.aoyama.ac.jp Thu Mar 10 01:00:57 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 10 Mar 2016 16:00:57 +0900
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID: <56E11BA9.4030703@it.aoyama.ac.jp>

On 2016/03/10 07:52, Ken Whistler wrote:
> Given that context, it is, indeed, most likely that the symbols were
> associated with some publication(s) in game theory, rather than
> with professional Go notations per se. See, for example,
> Mathematical Go: Chilling Gets the Last Point:
>
> http://www.amazon.com/Mathematical-Go-Chilling-Gets-Point/dp/1568810326

I own and have read the actual book. For examples of the characters mentioned, please see e.g. pp. 17, 21,.... I think the grey stones in the earlier proposal were left out because in the book, there are board diagrams with e.g. 1/4 of a stone gray,...

So yes, these symbols are used for mathematical research of the game of Go, and not as far as I know for actual notation. The research is in combinatorial game theory, where very weird infinitesimal numbers (e.g. greater than 0 but smaller than any positive real number!) are often used. These numbers are part of the 'Surreal Numbers' introduced in Donald Knuth's 1974 book of the same name. And while I have only seen the symbols in mathematical work, that theory can be highly relevant in actual endgames, and at least professional players should be aware of it (the theory, not the symbols), because often games can be decided by the last point won or lost in the endgame.

> I don't see black/white circles with dots in the bit of that publication
> scanned on Amazon, but it does use a black circle with a delta
> symbol as part of the game notation for discussion, as well as
> black and white circles with numbers, denoting sequences of stone
> placements.

As James said, the circles with numbers are extremely widely used; it's the basic way to show games. Because stones are not moved around and only very rarely removed from the board, the main notation for Go is not a list of moves with coordinates (as e.g. in Chess), but just a diagram of the final (or intermediate) board position with every move labeled with a number. But because these numbers can go up to the 200s, it doesn't make sense to register them all as characters (one would need over 500!).

Regards, Martin.

From andrewcwest at gmail.com Thu Mar 10 03:17:14 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 09:17:14 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E11BA9.4030703@it.aoyama.ac.jp>
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 07:00, Martin J. Dürst wrote:
>
> So yes, these symbols are used for mathematical research of the game of
> Go, and not as far as I know for actual notation.

Which indicates how absurd the proposal to emojify these four characters is.

http://www.unicode.org/L2/L2016/16021-game-pieces-emoji.pdf

Andrew

From andrewcwest at gmail.com Thu Mar 10 05:26:05 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 11:26:05 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E11BA9.4030703@it.aoyama.ac.jp>
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 07:00, Martin J. Dürst wrote:
>
> because these numbers can go up to the 200s, it doesn't make sense to
> register them all as characters (one would need over 500!).

I don't get why that would make no sense. We already have CIRCLED NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and these characters are widely used (in East Asian contexts, at least) for representing note numbers in text.
In my opinion it would be eminently sensible to extend both series up to 999, which would cover the needs of Go notation as well as note numbering for the vast majority of users.

Andrew

From leoboiko at gmail.com Thu Mar 10 05:34:30 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Thu, 10 Mar 2016 08:34:30 -0300
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?

From andrewcwest at gmail.com Thu Mar 10 06:00:38 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 12:00:38 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 11:34, Leonardo Boiko wrote:
> Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?

Of course that approach is possible, but it is quite problematic, both from the perspective of the font developer and the end user, because the circle would have to be able to combine with an indefinite number of preceding characters, and it is not easy to either determine where the boundary is (in the font) or specify the boundary (by the end user). For example, given a text string of "1234", what does the combining circle combine with? Unitary characters would be just way simpler and more reliable.

Andrew

From everson at evertype.com Thu Mar 10 06:17:24 2016
From: everson at evertype.com (Michael Everson)
Date: Thu, 10 Mar 2016 12:17:24 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID: <2A330A52-29E8-4A40-837C-C8979171C670@evertype.com>

On 10 Mar 2016, at 11:26, Andrew West wrote:
>
> In my opinion it would be eminently sensible to extend both series up to 999, which would cover the needs of Go notation as well as note numbering for the vast majority of users.

Good ideas don't always get past the UTC. Remember when we wanted to encode 256 two-letter codes for the country flags? That was replaced by the "emoji flag alphabet". Now some people want to use combinations for currency emojis, and evidently that (with some combining emoji banknote character) would have been easier with 256 atomic codes than it would be with the flag alphabet.
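
The "emoji flag alphabet" mentioned above presumably refers to the 26 regional indicator symbols (U+1F1E6 to U+1F1FF), where a flag is written as the pair of indicators spelling a two-letter region code. A minimal sketch of that pairing follows; the function name `flag_sequence` is hypothetical, and whether a given pair actually renders as a flag depends on the font and platform.

```python
# First regional indicator symbol: REGIONAL INDICATOR SYMBOL LETTER A.
REGIONAL_INDICATOR_A = 0x1F1E6


def flag_sequence(region_code):
    """Map a two-letter region code (e.g. 'JP') to its indicator pair."""
    code = region_code.upper()
    if len(code) != 2 or not all("A" <= c <= "Z" for c in code):
        raise ValueError(f"expected two ASCII letters, got {region_code!r}")
    # Each letter A..Z maps to the indicator at the same alphabetic offset.
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in code
    )


jp = flag_sequence("JP")
print([f"U+{ord(ch):04X}" for ch in jp])
```

The design trade-off raised in the thread is visible here: the pair is two code points rather than one atomic character, so anything meant to combine with "a flag" has to deal with a sequence.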
Michael Everson * http://www.evertype.com/

From oren.watson at gmail.com Wed Mar 9 21:08:14 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Wed, 9 Mar 2016 22:08:14 -0500
Subject: Gaps in Mathematical Alphanumeric Symbols
Message-ID:

I was surprised to find out that there are gaps in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The gaps are associated with the inclusion of similar symbols in other blocks, chiefly the Letterlike Symbols block. Examples of such gaps include U+1D49D, U+1D506, etc. But as a matter of convenience and simplicity, these missing code points could have been defined as decomposing directly to the equivalents in Letterlike Symbols, in the same manner that the Ångström sign decomposes to the letter Å. That would make these ranges contiguous.

Is there a policy about leaving gaps in otherwise contiguous ranges of code points?

--Oren Watson

From ori at avtalion.name Thu Mar 10 10:35:16 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Thu, 10 Mar 2016 18:35:16 +0200
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID:

On Thu, Mar 10, 2016 at 12:52 AM, Ken Whistler wrote:
> But to know for sure, you would probably have to get confirmation
> of original sources from Barbara Beeton and/or Patrick Ion,
> who collected together symbol candidates from a multitude
> of print sources back in the 1998 - 2001 time frame.

I have emailed Barbara with the question, and pointed to this thread. Will report back when I get a response.

On Twitter, someone pointed out an example of the two-dot notation, and even a center-dot notation (instead of the "right dot" of U+2686 ⚆). See page 4 (printed page number 206) of this PDF: http://library.msri.org/books/Book29/files/kim.pdf

From asmus-inc at ix.netcom.com Thu Mar 10 10:43:29 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 10 Mar 2016 08:43:29 -0800
Subject: Gaps in Mathematical Alphanumeric Symbols
In-Reply-To:
References:
Message-ID: <56E1A431.2030005@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From oren.watson at gmail.com Thu Mar 10 13:09:05 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Thu, 10 Mar 2016 14:09:05 -0500
Subject: Gaps in Mathematical Alphanumeric Symbols
In-Reply-To: <56E1A431.2030005@ix.netcom.com>
References: <56E1A431.2030005@ix.netcom.com>
Message-ID:

Thank you for the detailed explanation, Asmus.

Is there a standard denoting which characters are part of each "mathematical variable alphabet"? There is a table on Wikipedia <https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols#Latin_letters> but the placement of characters into the gaps is unsourced. Perhaps I'm overthinking this, but I don't think it's necessarily obvious that the character BLACK-LETTER CAPITAL C should be used as the nonexistent character *MATHEMATICAL FRAKTUR CAPITAL C. Is there a document clarifying this?

From andrewcwest at gmail.com Thu Mar 10 13:48:03 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 19:48:03 +0000
Subject: Gaps in Mathematical Alphanumeric Symbols
In-Reply-To:
References: <56E1A431.2030005@ix.netcom.com>
Message-ID:

On 10 March 2016 at 19:09, Oren Watson wrote:
>
> Is there a standard denoting which characters are part of each "mathematical
> variable alphabet"? There is a table on Wikipedia
> but the placement of characters into the gaps is unsourced.
Perhaps I'm > overthinking this, but I don't think it's necessarily obvious that the > character BLACK-LETTER CAPITAL C should be used as the nonexistent character > *MATHEMATICAL FRAKTUR CAPITAL C. Is there a document clarifying this? Yes, the code charts in the Unicode Standard: http://www.unicode.org/charts/PDF/U1D400.pdf The annotation for each reserved code point refers to the character that logically belongs there. Andrew From doug at ewellic.org Thu Mar 10 14:49:17 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 10 Mar 2016 13:49:17 -0700 Subject: Gaps in Mathematical Alphanumeric Symbols Message-ID: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Andrew West replied to Oren Watson: >> Is there a standard denoting which characters are part of each >> "mathematical variable alphabet"? There is a table on Wikipedia [...] > > Yes, the code charts in the Unicode Standard: > > http://www.unicode.org/charts/PDF/U1D400.pdf > > The annotation for each reserved code point refers to the character > that logically belongs there. NamesList.txt also has this information, and unlike the others, it's both official and machine-readable:
1D505 MATHEMATICAL FRAKTUR CAPITAL B
    # 0042 latin capital letter b
1D506
    x (black-letter capital c - 212D)
-- Doug Ewell | http://ewellic.org | Thornton, CO From andrewcwest at gmail.com Thu Mar 10 15:00:46 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 10 Mar 2016 21:00:46 +0000 Subject: Gaps in Mathematical Alphanumeric Symbols In-Reply-To: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Message-ID: On 10 March 2016 at 20:49, Doug Ewell wrote: > >> >> http://www.unicode.org/charts/PDF/U1D400.pdf >> >> The annotation for each reserved code point refers to the character >> that logically belongs there.
> > NamesList.txt also has this information, and unlike the others, it's > both official and machine-readable: It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is machine-readable, although the file specifically warns that "this file should not be parsed for machine-readable information". Andrew From kenwhistler at att.net Thu Mar 10 15:40:47 2016 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 10 Mar 2016 13:40:47 -0800 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Message-ID: <56E1E9DF.9060405@att.net> On 3/10/2016 1:00 PM, Andrew West wrote: > It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is > machine-readable, although the file specifically warns that "this file > should not be parsed for machine-readable information". > NamesList.txt is just a structured text file, so of course it is "machine-readable". The problem is that because it is machine-readable, people tend to jump to the conclusion that all the information they need can simply be reliably parsed out of that file. It can't be. The reason is that NamesList.txt is itself the result of a complicated merge of code point, name, and decomposition mapping information from UnicodeData.txt, of listings of standardized variation sequences from StandardizedVariants.txt, and then a very long list of annotational material, including names list subhead material, etc., maintained in other sources. If people actually want to get reliably parsed data on code points, names, and decomposition mappings, they should get that directly from UnicodeData.txt. Likewise for information about standardized variation sequences, from StandardizedVariants.txt. The *reason* that NamesList.txt exists at all is to drive the tool, unibook, that formats the full Unicode code charts for posting. 
It is only posted in the Unicode Character Database at all as a matter of convenience, to give people access to a text only version of the names list that appears in the fully formatted pdf versions of the code charts that contain all the representative glyphs. NamesList.txt should *not* be data mined. Well, nobody can stop people from attempting to do so, of course, but they tend to end up confused and disappointed, because their assumptions going in don't match the editorial realities that affect the development of the annotational content added to the names list and the actual use for which NamesList.txt was created in the first place. --Ken From doug at ewellic.org Thu Mar 10 15:48:10 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 10 Mar 2016 14:48:10 -0700 Subject: Gaps in Mathematical Alphanumeric Symbols Message-ID: <20160310144810.665a7a7059d7ee80bb4d670165c8327d.acb8ba6be9.wbe@email03.secureserver.net> Andrew West wrote: > It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is > machine-readable, although the file specifically warns that "this file > should not be parsed for machine-readable information". Yes, I saw that mattress tag. I could not find any other files in the UCD proper that reference the unassigned code points within the MAS block, except for DerivedGeneralCategory.txt, which simply says the code points are unassigned. MathClass-*.txt and MathClassEx-*.txt have the information in question in a machine-readable format, if only in comments:
1D505;A
#1D506=212D;A
1D505;A;𝔅;Bfr;ISOMFRK;;MATHEMATICAL FRAKTUR CAPITAL B
#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C
These files are not part of the UCD, and aren't updated with every Unicode release, but might be a better reference. Perhaps UTC members can offer a recommendation here. -- Doug Ewell | http://ewellic.org | Thornton, CO
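The commented-out gap entries quoted in the message above (lines of the form `#1D506=212D;A;...`) could be collected with a few lines of Python. This is only a sketch under stated assumptions: the name `parse_gap_lines` and the regular expression are illustrative and not part of any Unicode tool, and the pattern is inferred solely from the sample lines shown in this thread, so the real MathClass files may contain variations it does not handle.

```python
import re

# Hypothetical pattern, inferred from the sample lines quoted in this thread:
# a commented-out entry starts with "#", then the reserved code point, "=",
# the code point of the character actually encoded elsewhere, then ";".
GAP_LINE = re.compile(r"^#([0-9A-Fa-f]{4,6})=([0-9A-Fa-f]{4,6});")


def parse_gap_lines(lines):
    """Map each reserved code point to the code point used in its place."""
    mapping = {}
    for line in lines:
        match = GAP_LINE.match(line.strip())
        if match:
            reserved, actual = (int(group, 16) for group in match.groups())
            mapping[reserved] = actual
    return mapping


# Sample lines copied from the message above (not read from the real files).
SAMPLE = [
    "1D505;A",
    "#1D506=212D;A",
    "#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C",
]

gaps = parse_gap_lines(SAMPLE)
for reserved, actual in sorted(gaps.items()):
    print(f"U+{reserved:04X} -> U+{actual:04X}")
```

Ordinary entries and free-form comments are skipped because they do not match the `#XXXX=YYYY;` shape. A human should still review the result, since the mapping lives in comment fields rather than in normative data.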
From doug at ewellic.org Thu Mar 10 16:14:09 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 10 Mar 2016 15:14:09 -0700 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) Message-ID: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> Ken Whistler wrote: > NamesList.txt should *not* be data mined. And yet it was the only Unicode data file utilized by MSKLC. There are many possible reasons for this approach, which we will probably never know. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Thu Mar 10 19:05:43 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 10 Mar 2016 17:05:43 -0800 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> References: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> Message-ID: <56E219E7.7030708@ix.netcom.com> An HTML attachment was scrubbed... URL: From js_choi at icloud.com Thu Mar 10 19:49:52 2016 From: js_choi at icloud.com (=?utf-8?Q?=22J=2E=C2=A0S=2E_Choi=22?=) Date: Thu, 10 Mar 2016 19:49:52 -0600 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: <56E1E9DF.9060405@att.net> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> Message-ID: <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> > On Mar 10, 2016, at 3:40 PM, Ken Whistler wrote: > > On 3/10/2016 1:00 PM, Andrew West wrote: >> It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is >> machine-readable, although the file specifically warns that "this file >> should not be parsed for machine-readable information". >> > > NamesList.txt is just a structured text file, so of course it is "machine-readable". 
> The problem is that because it is machine-readable, people tend to jump > to the conclusion that all the information they need can simply be > reliably parsed out of that file. > > It can't be. > > The reason is that NamesList.txt is itself the result of a complicated merge > of code point, name, and decomposition mapping information from > UnicodeData.txt, of listings of standardized variation sequences from > StandardizedVariants.txt, and then a very long list of annotational > material, including names list subhead material, etc., maintained in > other sources. > > If people actually want to get reliably parsed data on code points, names, > and decomposition mappings, they should get that directly from > UnicodeData.txt. Likewise for information about standardized variation > sequences, from StandardizedVariants.txt. > > The *reason* that NamesList.txt exists at all is to drive the tool, unibook, > that formats the full Unicode code charts for posting. It is only > posted in the Unicode Character Database at all as a matter of > convenience, to give people access to a text only version of the > names list that appears in the fully formatted pdf versions of the code charts > that contain all the representative glyphs. > > NamesList.txt should *not* be data mined. Well, nobody can stop > people from attempting to do so, of course, but they tend to end > up confused and disappointed, because their assumptions going in > don't match the editorial realities that affect the development of > the annotational content added to the names list and the actual > use for which NamesList.txt was created in the first place. > > --Ken > > On Mar 10, 2016, at 7:05 PM, Asmus Freytag (t) wrote: > > On 3/10/2016 2:14 PM, Doug Ewell wrote: >> Ken Whistler wrote: >> >> >>> NamesList.txt should *not* be data mined. >>> >> And yet it was the only Unicode data file utilized by MSKLC. >> >> There are many possible reasons for this approach, which we will >> probably never know. 
>> >> > > Extracting information from namelist.txt that was added to that file based on information from the UCD is plain folly - not least because it uses a secondary source instead of a primary source. What may not have come across from Ken's description is that the process for incorporating this data is under editorial control - and some values or entries may be suppressed for readability. There is explicitly not guarantee for completeness. > > There is some information that *only* exists in the nameslist.txt file. This includes, informal aliases for character names, cross references, etc.. The problem with extracting this information blindly (that is, not mediated by a human) is, again, that the level of consistency of presentation is that appropriate for a human reader, not for an extraction algorithm. > > For example, to reduce clutter, cross references are not symmetric or transitive, even though the relationship that gave rise to the cross reference in te first place (e.g. similarity) would normally be one that is symmetric and transitive. The human reader can be trusted to determine that, for example "<" is the "main" entry and that from there all the other, same or similar characters are referenced, but by not listing the reverse direction everywhere, the level of clutter in the rest of the nameslist is reduced, making additional cross references stand out more. > > Those are just the intentional inconsistencies. > > There is a historical development in the annotations - over time, more characters get annotated. However, annotations are not always backported, so the level of annotations can be inconsistent for reasons of incremental development. > > Now, for the x-refs on gaps, a human reader could extract and verify the set, but relying blindly on an algorithm to extract the data is fraught with peril. (Other gaps may have slightly different origin and status, yet also carry an annotation). 
> > Using the mathematical data files for this is a step up, because the data there is focused on a single use case. The downside is that the information is in a comment field. > > A./ One thing about NamesList.txt is that, as far as I have been able to tell, it?s the only machine-readable, parseable source of those annotations and cross-references. As part of the Unicode Standard and the UCD, the name lists? annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification?s chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people?i.e., using screen-reader-friendly HTML rather than PDF?while making clear that the annotations are merely references to the original, normative Standard?s actual code charts and name lists. What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources? data have been published, then, for better or for worse, the name list is all that is available for much information on many code points? usage. Sincerely, J. S. Choi From asmusf at ix.netcom.com Thu Mar 10 20:13:21 2016 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 10 Mar 2016 18:13:21 -0800 Subject: NamesList.txt as data source In-Reply-To: <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> Message-ID: <56E229C1.6020105@ix.netcom.com> On 3/10/2016 5:49 PM, "J. S. Choi" wrote: > One thing about NamesList.txt is that, as far as I have been able to tell, it?s the only machine-readable, parseable source of those annotations and cross-references. 
There are explanations about character use that are only maintained in the PDF of the core specification, where this information is packaged in a way that can be understood by a human reader, but is not amenable to extraction by machine. While the annotations, comments, cross references etc. in Namelist.txt appear, formally, to be machine extractable, the way they are created and managed makes them just as much "human-accessible" only as the core specification. The goal of getting a complete and machine-readable description of character behavior is illusory. > > As part of the Unicode Standard and the UCD, the name lists' annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification's chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people (i.e., using screen-reader-friendly HTML rather than PDF), while making clear that the annotations are merely references to the original, normative Standard's actual code charts and name lists. This is a different issue. The nameslist.txt is a reasonable source for driving other _formatting_ programs than just Unibook. In fact, the possibility of reuse in this context was probably among the unstated rationales for making the information and syntax available in the first place.
Let's understand this properly: using the file to translate it into a "human-readable" output format is a proper use of this data, even if that translation is done using a mechanical tool, as long as the format is a) a format that benefits from the special shortcuts taken in selecting the information present in the namelist.txt file, b) a format intended to be interpreted by an observant and intelligent human reader, and not c) a format intended as direct input to any text-processing algorithm, or any algorithm that "understands" the contents. > > What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources' data have been published, then, for better or for worse, the name list is all that is available for much information on many code points' usage. See my first through third paragraph. A./ From oren.watson at gmail.com Fri Mar 11 11:37:38 2016 From: oren.watson at gmail.com (Oren Watson) Date: Fri, 11 Mar 2016 12:37:38 -0500 Subject: NamesList.txt as data source In-Reply-To: <56E229C1.6020105@ix.netcom.com> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> <56E229C1.6020105@ix.netcom.com> Message-ID: Ok, so let me see if I understand this correctly. Suppose I'm writing an editor for math equations, and I want the user to be able to press a "Doublestruck" button and then type a C or D to get a ℂ or 𝔻 respectively. There is apparently no official source containing a machine-readable table of the doublestruck equivalents of each character that has such an equivalent. Such a table might also include { -> ⦃ and such. This seems like something that would be very convenient to have centralized and standardized.
--Oren Watson From kenwhistler at att.net Fri Mar 11 12:24:29 2016 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 11 Mar 2016 10:24:29 -0800 Subject: NamesList.txt as data source In-Reply-To: References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> <56E229C1.6020105@ix.netcom.com> Message-ID: <56E30D5D.3080006@att.net> On 3/11/2016 9:37 AM, Oren Watson wrote: > Ok, so let me see if I understand this correctly. Suppose I'm writing > an editor for math equations, and I want the user to be able to press a > "Doublestruck" button and then type a C or D to get a ℂ or 𝔻 > respectively. There is apparently no official source containing a > machine-readable table of the doublestruck equivalents of each > character that has such an equivalent. Such a table might also include > { -> ⦃ and such. > > This seems like something that would be very convenient to have > centralized and standardized. > O.k., it is taking more time to talk about this than to just make the lists. See attached list, which took about 5 minutes to cull. That lists the 24 "unifications" mentioned on page 7 of UTR #25, Unicode Support for Mathematics: http://www.unicode.org/reports/tr25/ It matches the 24 explicit cross-references listed in the Unicode names list. If the ability to pull out such a list and make it "machine-readable" in a few minutes doesn't suffice, and you need something which counts as a more "official source", then the best way forward would be to engage with the UTC during the next update cycle for UTR #25, when its associated data table needs to be checked for the 9.0 repertoire additions, and advocate that some further documentation be made explicitly for those 24 mappings.
BTW, all 24 *are* already present in MathClassEx-14.txt: http://www.unicode.org/Public/math/revision-14/MathClassEx-14.txt as commented-out entry lines. So an even faster way to get a centralized (if not "official") list is to take MathClassEx-14.txt and
% grep '#1D' MathClassEx-14.txt | grep reserved > maplistout.txt
See also attached. As for starting down the road of suggesting additional equivalences, e.g. for double-struck parentheses, that is certainly something somebody could do, and might be interesting content to add to UTR #25 -- but it goes beyond the formal unification issue for the 24 mathematical alphabet letters already encoded in the 2100 block. --Ken
-------------- next part --------------
1D455 ; 210E # planck constant
1D49D ; 212C # script capital b
1D4A0 ; 2130 # script capital e
1D4A1 ; 2131 # script capital f
1D4A3 ; 210B # script capital h
1D4A4 ; 2110 # script capital i
1D4A7 ; 2112 # script capital l
1D4A8 ; 2133 # script capital m
1D4AD ; 211B # script capital r
1D4BA ; 212F # script small e
1D4BC ; 210A # script small g
1D4C4 ; 2134 # script small o
1D506 ; 212D # black-letter capital c
1D50B ; 210C # black-letter capital h
1D50C ; 2111 # black-letter capital i
1D515 ; 211C # black-letter capital r
1D51D ; 2128 # black-letter capital z
1D53A ; 2102 # double-struck capital c
1D53F ; 210D # double-struck capital h
1D545 ; 2115 # double-struck capital n
1D547 ; 2119 # double-struck capital p
1D548 ; 211A # double-struck capital q
1D549 ; 211D # double-struck capital r
1D551 ; 2124 # double-struck capital z
-------------- next part --------------
#1D455=210E;N;;;;;ITALIC SMALL H
#1D49D=212C;A;;Bscr;ISOMSCR;;SCRIPT CAPITAL B
#1D4A0=2130;A;;Escr;ISOMSCR;;SCRIPT CAPITAL E
#1D4A1=2131;A;;Fscr;ISOMSCR;;SCRIPT CAPITAL F
#1D4A3=210B;A;;Hscr;ISOMSCR;;SCRIPT CAPITAL H
#1D4A4=2110;A;;Iscr;ISOMSCR;;SCRIPT CAPITAL I
#1D4A7=2112;A;;Lscr;ISOMSCR;;SCRIPT CAPITAL L
#1D4A8=2133;A;;Mscr;ISOMSCR;;SCRIPT CAPITAL M
#1D4AD=211B;A;;Rscr;ISOMSCR;;SCRIPT CAPITAL R
#1D4BA=212F;A;;escr;ISOMSCR;;SCRIPT SMALL E
#1D4BC=210A;A;;gscr;ISOMSCR;;SCRIPT SMALL G
#1D4C4=2134;A;;oscr;ISOMSCR;;SCRIPT SMALL O
#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C
#1D50B=210C;A;;Hfr;ISOMFRK;;FRAKTUR CAPITAL H
#1D50C=2111;A;;Ifr;ISOMFRK;;FRAKTUR CAPITAL I
#1D515=211C;A;;Rfr;ISOMFRK;;FRAKTUR CAPITAL R
#1D51D=2128;A;;Zfr;ISOMFRK;;FRAKTUR CAPITAL Z
#1D53A=2102;A;;Copf;ISOMOPF;;DOUBLE-STRUCK CAPITAL C
#1D53F=210D;A;;Hopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL H
#1D545=2115;A;;Nopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL N
#1D547=2119;A;;Popf;ISOMOPF;;DOUBLE-STRUCK CAPITAL P
#1D548=211A;A;;Qopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL Q
#1D549=211D;A;;Ropf;ISOMOPF;;DOUBLE-STRUCK CAPITAL R
#1D551=2124;A;;Zopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL Z
From verdy_p at wanadoo.fr Fri Mar 11 20:35:30 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 12 Mar 2016 03:35:30 +0100 Subject: Easter island inscriptions Message-ID: What is the encoding status of this script, found on inscriptions of Easter Island? http://www.jps.auckland.ac.nz/document?wid=115 From charupdate at orange.fr Sat Mar 12 18:29:02 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 01:29:02 +0100 (CET) Subject: NamesList.txt as data source In-Reply-To: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> References: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> Message-ID: <1736223179.14515.1457828942560.JavaMail.www@wwinf1f18> On Thu, 10 Mar 2016 15:14:09 -0700, Doug Ewell wrote: > Ken Whistler wrote: > > > NamesList.txt should *not* be data mined. > > And yet it was the only Unicode data file utilized by MSKLC. > > There are many possible reasons for this approach, which we will > probably never know. Sadly it is too late to ask Michael Kaplan the question.
To add one more answer in his place: I never doubted that NamesList.txt was the best choice for MSKLC, which parses the file for code points and character names to generate a human-readable display and output as defined by Asmus Freytag on Thu, 10 Mar 2016 18:13:21 -0800. This would have been similarly achieved by parsing UnicodeData.txt. However, the main difference between using NamesList and UnicodeData in MSKLC, as I see it, is the cultural benefit for the end-user. Consistently, the Names List is shipped in the root directory of MSKLC, beside a copy of the EULA, and then copied to a safe location at %User%\AppData\Local\MSKLC (where I recently updated it to some 8.0.0 version of its French translation), so that the user can view it, and even alter it, without disturbing the tool. It's sort of a pocket version of the Code Charts' textual information, thus likely to satisfy both the (human) keyboard editor and the creator (software). Extrapolating from my case, I believe that the >2 million downloads of MSKLC [1] surely contributed to some extent to spreading the knowledge about Unicode, and to giving people the desire to learn more, because indeed, Ken Whistler warned on Thu, 10 Mar 2016 13:40:47 -0800, and the Code Charts Disclaimer clearly states that they "do not provide all the information needed to fully support individual scripts using the Unicode Standard." And they can't even. On Thu, 10 Mar 2016 18:13:21 -0800, Asmus Freytag wrote: > The goal of getting a complete and machine-readable description of > character behavior is illusory. Marcel [1] Kaplan, M. S. (2013, October 4). The story of MSKLC | Sorting it all Out, v2!
Retrieved August 18, 2015, from http://www.siao2.com/2013/10/04/10454264.aspx From charupdate at orange.fr Sat Mar 12 18:35:23 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 01:35:23 +0100 (CET) Subject: Easter island inscriptions In-Reply-To: References: Message-ID: <2023910531.14519.1457829323896.JavaMail.www@wwinf1f18> On Sat, 12 Mar 2016 03:35:30 +0100, Philippe Verdy wrote: > What is the encoding status of this script, found on inscriptions of Easter > Island ? > > http://www.jps.auckland.ac.nz/document?wid=115 It is in the pipeline and already has a planned code space, but I guess that there must also be some more _actual_ scripts not yet encoded. Kind regards, Marcel From charupdate at orange.fr Sat Mar 12 19:42:27 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 02:42:27 +0100 (CET) Subject: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX? Message-ID: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> AFAICS the iconic representation of U+202F NNBSP for use on keyboards and in keyboard documentation is not yet encoded. In the 2010-08-27[1] and 2012-09-07[2] proposals to encode symbols for use on on-screen keyboards and in documentation, *U+2432 was intended for this symbol. Actual usage probably relies on formatting, PUA, or icons; the latter especially for on-screen keyboards as I imagine them, because local applications usually have icon libraries and thus need not rely on plain text only. Why mapping invisible characters to plain text symbols for local use should be any easier than mapping them to the already standardized icons is beyond my understanding. Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0?
[For the representative glyph, the question is about giving it the same width as that of U+237D by lengthening the shoulders, or reducing the overall width to the same extent as the width of the box. In the first case it might be called SHOULDERED NARROW OPEN BOX, in the second case rather NARROW SHOULDERED OPEN BOX. ISO/IEC 9995-7 standardized the first glyph.] My personal opinion is quite in favor of a symbol to represent the narrow no-break space. By contrast, though I'm using part of the keyboard symbols, I'm not likely to utilize the whole mass of symbols (partly of little obvious use) introduced by ISO/IEC 9995-7:2009 and its 2012 amendment, because for most of them either I can't see any reality behind them, or I feel they are way too confusing for end-users, or they are better replaced with more concrete representations, e.g. a letter with hook instead of a symbol for hook applicator, as for dead keys I generally prefer real letters instead of isolated diacritics whatever they are represented with. Let alone that we never can have usable keyboards with all those dead keys on them, so that we have to rely on compose sequences, which can be documented in natural language and are far more mnemonic, e.g. "compose }" for palatal hook; "compose {" for retroflex hook; "compose ]" for hook above; "compose [" for a hook as on U+0187..U+0188. A model of practicality are Keyman keyboard layouts, which may use ASCII characters only to enter whatever letters and diacritics. So I believe that if the NNBSP symbol hadn't been buried in a bunch of other late ISO/IEC 9995-7 symbols, it would now be a part of Unicode. BTW U+202F NNBSP had been encoded three years before the release of ISO/IEC 9995-7:2002.
Best regards, Marcel [1] http://www.unicode.org/L2/L2010/10351-n3897-jtc1sc35n1579.pdf [2] http://www.unicode.org/L2/L2012/12302-wg1-%209995-7-n4317.pdf From gwalla at gmail.com Sat Mar 12 19:52:47 2016 From: gwalla at gmail.com (Garth Wallace) Date: Sat, 12 Mar 2016 17:52:47 -0800 Subject: Easter island inscriptions In-Reply-To: References: Message-ID: On Fri, Mar 11, 2016 at 6:35 PM, Philippe Verdy wrote: > What is the encoding status of this script, found on inscriptions of Easter > Island ? > > http://www.jps.auckland.ac.nz/document?wid=115 It's called Rongorongo, and according to the Roadmap about 40 columns have been provisionally set aside in the SMP but no proposal has been submitted yet. From charupdate at orange.fr Sun Mar 13 00:13:37 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 07:13:37 +0100 (CET) Subject: Easter island inscriptions In-Reply-To: References:

Message-ID: <1077012845.181.1457849617600.JavaMail.www@wwinf1m08> On Sat, 12 Mar 2016 17:52:47 -0800, Garth Wallace wrote: > On Fri, Mar 11, 2016 at 6:35 PM, Philippe Verdy wrote: > > What is the encoding status of this script, found on inscriptions of Easter > > Island ? > > > > http://www.jps.auckland.ac.nz/document?wid=115 > > It's called Rongorongo, and according to the Roadmap about 40 columns > have been provisionally set aside in the SMP but no proposal has been > submitted yet. Though no _formal_ proposal is found, the draft proposal from Michael Everson [1][2] has been ready to submit for a long time. From there and the reserved code space I concluded that it is "in the pipeline", but perhaps I was somewhat too optimistic. Marcel [1] http://www.unicode.org/L2/L1999/rongorongo.pdf [2] http://www.evertype.com/standards/iso10646/pdf/rongorongo.pdf From jsbien at mimuw.edu.pl Sun Mar 13 00:55:24 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bień) Date: Sun, 13 Mar 2016 07:55:24 +0100 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <56E1E9DF.9060405@att.net> (Ken Whistler's message of "Thu, 10 Mar 2016 13:40:47 -0800") References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> Message-ID: <864mcaeujn.fsf_-_@mimuw.edu.pl> On Thu, Mar 10 2016 at 22:40 CET, kenwhistler at att.net writes: > The *reason* that NamesList.txt exists at all is to drive the tool, > unibook, that formats the full Unicode code charts for posting. [...] On Fri, Mar 11 2016 at 3:13 CET, asmusf at ix.netcom.com writes: > On 3/10/2016 5:49 PM, "J. S. Choi" wrote: >> One thing about NamesList.txt is that, as far as I have been able to >> tell, it's the only machine-readable, parseable source of those >> annotations and cross-references. [...] > This is a different issue. The nameslist.txt is a reasonable source > for driving other formatting programs than just Unibook. Exactly.
A student of mine wrote a font sampling program producing output in a Unibook-like form. For this purpose he also wrote a converter from NamesList format to XML: https://github.com/ppablo28/fntsample_ucd_comments https://github.com/ppablo28/ucd_xml_parser I use the XML version of NamesList to provide my own comments to characters (work in progress): https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf Other examples of NamesList.txt use are http://www.fileformat.info/info/unicode/ https://codepoints.net/ Although these are not exactly formatting programs, in my opinion they also constitute a valid use. > In fact, the possibility of reuse in this context was probably among the > unstated rationales for making the information and syntax available in > the first place. I understand there is no intention to make an official XML version of the file as it would require changes in Unibook? [...] >> What are these other primary sources that maintain these other >> annotation data; are they publicly available? If the name list is the >> only place where these sources' data have been published, then, for >> better or for worse, the name list is all that is available for much >> information on many code points' usage. > See my first through third paragraph. You wrote: [...] > There are explanations about character use that are only maintained in > the PDF of the core specification, where this information is packaged > in a way that can be understood by a human reader, but is not amenable > to extraction by machine. > > While the annotations, comments, cross references etc. in Namelist.txt > appear, formally, to be machine extractable, the way they are created > and managed makes them just as much "human-accessible" only as the core > specification. I'm afraid it's not clear to me. Let's take an example.
Sometime ago I inquired about a controversial alias for U+018D: http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html Can I really find anything about "reversed Polish-hook o" in the core specification which is not a literal copy of the information from NamesList.txt? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From eric.muller at efele.net Sun Mar 13 10:14:33 2016 From: eric.muller at efele.net (Eric Muller) Date: Sun, 13 Mar 2016 08:14:33 -0700 Subject: Emoji Feminism - The New York Times Message-ID: <56E583D9.2070708@efele.net> http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0 From bortzmeyer at nic.fr Sun Mar 13 10:31:56 2016 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Sun, 13 Mar 2016 16:31:56 +0100 Subject: Emoji Feminism - The New York Times In-Reply-To: <56E583D9.2070708@efele.net> References: <56E583D9.2070708@efele.net> Message-ID: <20160313153156.GA4440@nic.fr> On Sun, Mar 13, 2016 at 08:14:33AM -0700, Eric Muller wrote a message of 1 lines which said: > http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0 Funny (I love penguins), but the New York Times should read UTR #51, section 2.1 http://www.unicode.org/reports/tr51/tr51-3.html#Gender From charupdate at orange.fr Sun Mar 13 12:13:28 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 18:13:28 +0100 (CET) Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <864mcaeujn.fsf_-_@mimuw.edu.pl> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <864mcaeujn.fsf_-_@mimuw.edu.pl> Message-ID: <109301831.6849.1457889208287.JavaMail.www@wwinf1d36> On Sun, 13 Mar 2016 07:55:24 +0100, Janusz S. Bie? 
wrote: > For this purpose he also wrote a converter from NamesList format to XML That goes straight in the direction I suggested last year as a beta feedback item[1], but I never thought that it could be so simple. > I understand there is no intention to make an official XML version of the file as it would require changes in Unibook? The difference, however, between homemade databases and official ones is that the latter raise much higher expectations. Asmus Freytag outlined in this thread, as well as in his comments on my feedback, that *no* "complete" UCD version, regardless of how complete it effectively might be, can ever meet the assumptions people would inevitably make about it. Further, experience shows that the information actually provided is already far more than most people are able to mentally process. E.g. most online character information providers do not display the formal aliases, so that in the best case some aware users add that information using the comment facility. I don't cite any: these are free tools and platforms that must not be criticized. When we imagine a hypothetical UCD containing detailed information about the usage of any existing language, not only Polish but also Czech, Romanian, Portuguese, Vietnamese, Devanagari, Tirhuta, just to cite a few, the result would be a mass of data of which I'm not sure that it would pay back the cost of collecting it, nor that it would really be useful. For the NamesList, the TXT format is superior to XML at least in that it keeps one from forgetting that NamesList.txt is the source of the Code Charts. No less, no more.
Marcel [1] http://www.unicode.org/review/pri297/feedback.html Date/Time: Sat May 2 07:10:09 CDT 2015 Opt Subject: PRI #297: UnicodeXData.txt Date/Time: Wed May 6 08:03:04 CDT 2015 Opt Subject: PRI #297: feedback on XML files From c933103 at gmail.com Sun Mar 13 13:39:46 2016 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 14 Mar 2016 02:39:46 +0800 Subject: Proposed Unicode 10.0 emoji U+1F961 Takeout Box In-Reply-To: References:

Message-ID: Its sample glyph and emoji description say oyster pail, which according to Wikipedia seems to be a mostly American thing. Would it be better to create emoji for other takeout boxes like the Chinese rice box (not the American-style one), Japanese bento, and pizza box, or alternatively to provide a selector for the emoji to change it to a different style? From doug at ewellic.org Sun Mar 13 14:03:20 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 13 Mar 2016 13:03:20 -0600 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: References: Message-ID: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> My point is that of J.S. Choi and Janusz Bień: the problem with declaring NamesList off-limits is that it does contain information that is either: - not available in any other UCD file, or - available, but only in comments (like the MAS mappings), which aren't supposed to be parsed either. Ken wrote: > [ .. ] NamesList.txt is itself the result of a complicated merge > of code point, name, and decomposition mapping information from > UnicodeData.txt, of listings of standardized variation sequences from > StandardizedVariants.txt, and then a very long list of annotational > material, including names list subhead material, etc., maintained in > other sources. But sometimes an implementer really does need a piece of information that exists only in those "other sources." When that happens, sometimes the only choices are to resort to NamesList or to create one's own data file, as Ken did by parsing the comment lines from the math file. Both of these are equally distasteful when trying to be conformant. -- Doug Ewell | http://ewellic.org | Thornton, CO
From doug at ewellic.org Sun Mar 13 16:24:55 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 13 Mar 2016 15:24:55 -0600 Subject: Proposed Unicode 10.0 emoji U+1F961 Takeout Box Message-ID: <07573F85A1A945ACA6C37EE3973B58E0@DougEwell> gfb hjjhjh wrote: > or alternatively provide > selector for the emoji to change it to different style? http://www.unicode.org/reports/tr52/ -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Sun Mar 13 21:14:05 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 14 Mar 2016 03:14:05 +0100 (CET) Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> Message-ID: <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote: > My point is that of J.S. Choi and Janusz Bie?: the problem with > declaring NamesList off-limits is that it does contain information that > is either: > > ? not available in any other UCD file, or > ? available, but only in comments (like the MAS mappings), which aren't > supposed to be parsed either. > > Ken wrote: > > > [ .. ] NamesList.txt is itself the result of a complicated merge > > of code point, name, and decomposition mapping information from > > UnicodeData.txt, of listings of standardized variation sequences from > > StandardizedVariants.txt, and then a very long list of annotational > > material, including names list subhead material, etc., maintained in > > other sources. > > But sometimes an implementer really does need a piece of information > that exists only in those "other sources." When that happens, sometimes > the only choices are to resort to NamesList or to create one's own data > file, as Ken did by parsing the comment lines from the math file. Both > of these are equally distasteful when trying to be conformant. 
If so, then extending the XML UCD with all the information that is currently missing from it while available in the Code Charts and NamesList.txt ends up being a good idea. But it still remains that such a step would exponentially increase the amount of data, because items that were not meant to be systematically provided would have to be. Further, I see that once this is completed, other requirements could call for tackling the same job on the core specs. The point would be to know whether in Unicode implementation and i18n, those needs are frequent. E.g. the last Apostrophe thread showed that full automation is sometimes impossible anyway. Marcel From asmus-inc at ix.netcom.com Sun Mar 13 22:32:03 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 13 Mar 2016 20:32:03 -0700 Subject: annotations In-Reply-To: <864mcaeujn.fsf_-_@mimuw.edu.pl> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <864mcaeujn.fsf_-_@mimuw.edu.pl> Message-ID: <56E630B3.5000405@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 14 02:23:18 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 14 Mar 2016 08:23:18 +0100 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> Message-ID: Is the term "exponentially" really appropriate? The NamesList file is not so large, and the growth would remain linear. Anyway, this file (current CSV format or XML format) does not need to be part of the core UCD files; it can be in a separate download for people needing it. One benefit I would see is that this conversion to XML using an automated tool could ensure that it is properly formatted.
But I believe that Unibook is already parsing it to produce consistent code charts, so its format is already checked, and this advantage is not really effective. But the main benefit would be that the file could be edited and updated using standard tools. XML is not the only choice available: JSON today is simpler to parse and easier for humans to read (and even edit), and it can embed indentation whitespace (outside quoted strings) that won't be considered part of the data (unlike XML, where it "pollutes" the DOM with extra text nodes). In fact I believe that the old CSV formats used in the original UCD may be deprecated in favor of JSON (the old format could be automatically generated for applications that want it). It could unify all formats with a single parser in all tools. Files in older CSV or tabulated formats would be in a separate derived collection. Then users would choose which format they prefer (legacy now derived, JSON, or XML if people really want it). The advantage of XML, however, is the stability for later updates that may need to insert additional data or annotations (with JSON or CSV/tabulated formats, the number of columns is fixed, and every column must be filled with at least an empty value, even if it is not significant). (Note that legacy formats also have comments after hash signs, but many comments found at the end of data lines also have some parsable meaning, so they are structured, and may be followed by an extra hash sign for a real comment.) The advantage of existing CSV/tabulated formats is that they are extremely easy to import into a spreadsheet for easier use by a human (I won't request the UTC to provide these files in XLS/XLSX or ODS format...). But JSON and XML could be imported as well, provided that each data file remains structured as a 2D grid without substructures within cells (otherwise you need to provide an explicit schema).
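As a minimal sketch of what such a derived JSON form could look like, the following turns one semicolon-separated record in the UnicodeData.txt layout into a JSON object. The key names in FIELDS and the function record_to_dict are hypothetical labels chosen for this illustration, not names defined by the UCD. Empty columns are simply omitted, and the one list-valued cell (the decomposition) becomes a small nested structure instead of a space-separated string.

```python
import json

# Hypothetical key names for the 15 semicolon-separated columns.
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]


def record_to_dict(line):
    """Convert one UnicodeData-style line to a dict, dropping empty columns."""
    values = line.rstrip("\n").split(";")
    record = {key: value for key, value in zip(FIELDS, values) if value != ""}
    decomposition = record.get("decomposition")
    if decomposition:
        parts = decomposition.split()
        # An optional leading <tag> marks a compatibility decomposition.
        tag = parts.pop(0) if parts[0].startswith("<") else None
        record["decomposition"] = {"tag": tag, "code_points": parts}
    return record


line = "00A0;NO-BREAK SPACE;Zs;0;CS;<noBreak> 0020;;;;N;NON-BREAKING SPACE;;;;"
print(json.dumps(record_to_dict(line), indent=2))
```

Going back from such an object to the legacy line is mechanical as long as the column order is kept in one place (here, FIELDS), which is what would allow the old files to be regenerated as a derived collection.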
But note that some columns are frequently structured: the column containing the code point key frequently specifies a code range using an additional separator, as do those whose value is an ordered list of code points, using a space separator and possibly a leading subtag (such as decomposition data): in XML you would translate them into separate subelements or into additional attributes, and in JSON, you'll need to structure these structured cells using subarrays. So the data is *already* not strictly 2D (converting it to a pure 2D format, for relational use, would require adding additional key or referencing "ID" columns, and those converted files would be much less easy to read/edit by humans, in *any* format: CSV/tabular, JSON or XML). Other candidate formats also include Turtle (commonly used for RDF/OWL data, replacing the XML envelope format with a tabulated "2.5D" format which is much easier than XML to read/edit, much more compact than XML-based formats, and easier to parse)... 2016-03-14 3:14 GMT+01:00 Marcel Schneider : > On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote: > > > My point is that of J.S. Choi and Janusz Bień: the problem with > > declaring NamesList off-limits is that it does contain information that > > is either: > > > > - not available in any other UCD file, or > > - available, but only in comments (like the MAS mappings), which aren't > > supposed to be parsed either. > > > > Ken wrote: > > > > > [ .. ] NamesList.txt is itself the result of a complicated merge > > > of code point, name, and decomposition mapping information from > > > UnicodeData.txt, of listings of standardized variation sequences from > > > StandardizedVariants.txt, and then a very long list of annotational > > > material, including names list subhead material, etc., maintained in > > > other sources. > > > > But sometimes an implementer really does need a piece of information > > that exists only in those "other sources."
When that happens, sometimes > > the only choices are to resort to NamesList or to create one's own data > > file, as Ken did by parsing the comment lines from the math file. Both > > of these are equally distasteful when trying to be conformant. > > > If so, then extending the XML UCD with all the information that is > actually missing in it while available in the Code Charts and > NamesList.txt, ends up being a good idea. But it still remains that such a > step would exponentially increase the amount of data, because items that > were not meant to be systematically provided, must be. > > Further I see that once this is completed, other requirements could need > to tackle the same job on the core specs. > > The point would be to know whether in Unicode implementation and i18n, > those needs are frequent. E.g. the last Apostrophe thread showed that full > automatization is sometimes impossible anyway. > > Marcel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 14 11:19:35 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 14 Mar 2016 09:19:35 -0700 Subject: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX? In-Reply-To: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> References: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> Message-ID: <56E6E497.1000902@att.net> U+23FF is already assigned to OBSERVER EYE SYMBOL, which is already under ballot for 10646 (and approved by the UTC). http://www.unicode.org/alloc/Pipeline.html Please always first check that page before suggesting code points for prospective new characters. --Ken On 3/12/2016 5:42 PM, Marcel Schneider wrote: > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0? 
> > From kenwhistler at att.net Mon Mar 14 12:01:46 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 14 Mar 2016 10:01:46 -0700 Subject: annotations In-Reply-To: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> Message-ID: <56E6EE7A.9010109@att.net> On 3/13/2016 12:03 PM, Doug Ewell wrote: > My point is that of J.S. Choi and Janusz Bień: the problem with > declaring NamesList off-limits is that it does contain information > that is either: > > - not available in any other UCD file, or > - available, but only in comments (like the MAS mappings), which aren't > supposed to be parsed either. NamesList.txt is not "off-limits". The information in it is there because it is useful for publication in the Unicode code charts, to help with the identification and interpretation of the characters in the standard. And because NamesList.txt itself is published as part of the UCD, nobody is going to stop you (or anybody else) from parsing information out of it. The trick is this: the status of annotational data in NamesList.txt is different than that of normative data like the code points, names, formal name aliases, decomposition mappings, and standardized variation sequences. Annotations are -- well, annotational -- and there are no guarantees about their completeness or stability, and so on. They emerge from a kind of ongoing rugby scrum between the UTC members, national body comments on 10646 amendments, public suggestions via feedback and email lists, and the ability of editors to accommodate reasonable suggestions that might help the readability and usefulness of the names list without larding it up too heavily with extraneous information that would make it *harder* to use. People who parse NamesList.txt for data almost inevitably and immediately end up expecting it to do things it does not (and cannot reasonably) do. See this thread right here for pertinent examples.
*That* is the problem I see, because it then tends to lead to frustrated clamoring for NamesList.txt to be "fixed" to do things and carry information that it wasn't (and isn't) designed to do. --Ken From doug at ewellic.org Mon Mar 14 13:22:14 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 14 Mar 2016 11:22:14 -0700 Subject: annotations Message-ID: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Ken Whistler wrote: > The trick is this: the status of annotational data in NamesList.txt > is different than that of normative data like the code points, names, > formal name aliases, decomposition mappings, and standardized > variation sequences. I get that. I am FAR more comfortable with that type of guideline: ? the data isn't normative (at least not all of it) ? the format isn't set in stone ? don't ask for additions or changes ? caveat emptor than with any sort of blanket statement about "don't parse this file." I hereby promise to use NamesList.txt responsibly and with all of the above conditions in mind. Hopefully others will too. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From jsbien at mimuw.edu.pl Mon Mar 14 13:33:24 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 14 Mar 2016 19:33:24 +0100 Subject: annotations In-Reply-To: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Message-ID: <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> Quote/Cytat - Doug Ewell (pon, 14 mar 2016, 19:22:14): > Ken Whistler wrote: > >> The trick is this: the status of annotational data in NamesList.txt >> is different than that of normative data like the code points, names, >> formal name aliases, decomposition mappings, and standardized >> variation sequences. > > I get that. I am FAR more comfortable with that type of guideline: > > ? 
the data isn't normative (at least not all of it) > ? the format isn't set in stone > ? don't ask for additions or changes What about reporting possible mistakes? Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From doug at ewellic.org Mon Mar 14 14:30:59 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 14 Mar 2016 12:30:59 -0700 Subject: annotations Message-ID: <20160314123059.665a7a7059d7ee80bb4d670165c8327d.5035fe7d06.wbe@email03.secureserver.net> Janusz S. Bie? wrote: >> ? don't ask for additions or changes > > What about reporting possible mistakes? I'd assume that egregious, demonstrable errors, such as misspelled character names or incorrect individual code points, could be reported, and anything beyond that probably should not. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Mon Mar 14 16:05:05 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 14 Mar 2016 14:05:05 -0700 Subject: annotations In-Reply-To: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Message-ID: <56E72781.3070807@ix.netcom.com> On 3/14/2016 11:22 AM, Doug Ewell wrote: > Ken Whistler wrote: > >> The trick is this: the status of annotational data in NamesList.txt >> is different than that of normative data like the code points, names, >> formal name aliases, decomposition mappings, and standardized >> variation sequences. > I get that. I am FAR more comfortable with that type of guideline: > > ? the data isn't normative (at least not all of it) > ? the format isn't set in stone > ? 
don't ask for additions or changes Additions and changes to annotations are considered all the time. There's just no implication that these must satisfy some arbitrary criteria of completeness and internal consistency. They are added when the editorial committee feels that the benefit outweighs the cost (bloat & clutter). The nature of all of these is more akin to comments - except that they are not presented using a comment syntax (and the xrefs look structured, instead of "see also code point XXXX"). Totally a perception issue. > ? caveat emptor Always! > > than with any sort of blanket statement about "don't parse this file." > > I hereby promise to use NamesList.txt responsibly and with all of the > above conditions in mind. Hopefully others will too. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > From asmus-inc at ix.netcom.com Mon Mar 14 16:05:27 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 14 Mar 2016 14:05:27 -0700 Subject: annotations In-Reply-To: <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> Message-ID: <56E72797.90005@ix.netcom.com> On 3/14/2016 11:33 AM, Janusz S. Bien wrote: > Quote/Cytat - Doug Ewell (pon, 14 mar 2016, 19:22:14): > >> Ken Whistler wrote: >> >>> The trick is this: the status of annotational data in NamesList.txt >>> is different than that of normative data like the code points, names, >>> formal name aliases, decomposition mappings, and standardized >>> variation sequences. >> >> I get that. I am FAR more comfortable with that type of guideline: >> >> ? the data isn't normative (at least not all of it) >> ? the format isn't set in stone >> ? don't ask for additions or changes > > What about reporting possible mistakes? 
see my reply to Doug > > Regards > > Janusz > From doug at ewellic.org Tue Mar 15 09:42:30 2016 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Mar 2016 07:42:30 -0700 Subject: annotations Message-ID: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> Asmus Freytag wrote: >> ? don't ask for additions or changes > > Additions and changes to annotations are considered all the time. Well, yes. I meant additions and changes to the scope of the file. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Tue Mar 15 10:34:09 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 15 Mar 2016 08:34:09 -0700 Subject: annotations In-Reply-To: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> References: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> Message-ID: <56E82B71.6030008@ix.netcom.com> An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Tue Mar 15 16:21:51 2016 From: andrewcwest at gmail.com (Andrew West) Date: Tue, 15 Mar 2016 21:21:51 +0000 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> Message-ID: On 15 March 2016 at 19:48, K.C.Saff wrote: > > I often see numbers roll over at 100, displayed on a new board, so even just > the full set of two digit forms adds a lot of utility for go games. This > seems to be a standard practice at Wikipedia ( > https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol#Game_4 ), Sensei's > Library and a lot of books that I've worked through. That's certainly true, although it is not hard to find examples which go over 100 (http://www.babelstone.co.uk/Ludus/Weiqi/FamousGames_279.jpg), and even the AlphaGo vs Lee Sedol Wikipedia page shows one game diagram that goes into the 200s. 
> Completing both sets > up to 99, adding "00", and including the most common markers (triangle, > square, etc.) seems like a good, useful compromise. Possibly. I certainly have very little expectation that a proposal to complete both sets to 999 (or even 399) would have any chance of success. I am currently working on a proposal for the triangle and square go markers, and am still considering the best approach to the circled numbers. Any feedback would be most welcome. http://www.babelstone.co.uk/Unicode/GoNotation.pdf Andrew From ori at avtalion.name Tue Mar 15 17:11:06 2016 From: ori at avtalion.name (Ori Avtalion) Date: Wed, 16 Mar 2016 00:11:06 +0200 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> Message-ID: I have received a response from Barbara Beeton, along with an approval to post it here. I have redacted the intro and outro where she admits it's "certainly not a real answer", but IMO it's still useful for documentation. Response from Barbara Beeton: (Date: Tue, 15 Mar 2016 16:49:14 -0400) On Thu, 10 Mar 2016, Ori Avtalion wrote: I'm trying to find an answer to the question in the subject line. I posted it to the Unicode mailing list [1], and was suggested to contact you, as you are one of the authors who proposed the symbols. [1] http://unicode.org/pipermail/unicode/2016-March/003412.html I can find no use of dots in common Go notations of games. What is the origin of the dots on the Go markers and what are they used for? I have researched the records of the STIX project and find the following. All the "regular" sources of symbols were recorded in a "master table" that has been kept up to date, but there have been few additions since about 2007. 
A somewhat earlier version, dated October 2006, can be found here: http://www.ams.org/STIX/bnb/stix-tbl.ascii-2006-10-20 Since this is simply a huge, column aligned ascii table, a layout guide is provided, which lists sources and other information including when codes were added: http://www.ams.org/STIX/bnb/stix-tbl.layout-2006-05-15 For the code range in question -- U+2686 - U+2689 -- the date of addition was 2000/02/01; in the same group are the six die faces, U+2680 - U+2685. As you can see, no sources are listed. Since there were also other, "irregular" sources, for which records exist only on paper, I also dug through those files. (Which is why it has taken so long to answer.) The only reference I can find is a document submitted to WG2 that includes that range: ISO/IEC JTC1/SC2/WG2 N2336 2001-04-02 The only mention of the range consists of a grid for 2680-26FF, blanked out except for the 10 symbols, and a page listing them in the form appropriate for inclusion in the Unicode charts; the content of that page is identical to what is in the chart for the 26xx range of Unicode 8.0 except for two comments (for 2680 and 2687). There may be an earlier document in the WG2 archives, probably dated in late 1999 or pre-February 1, 2000, that has more information, but I don't have a copy. The fact that die faces and (purported) go symbols were added at the same time may be helpful. What I surmise happened is that someone requested that these symbols be added to a submission-in-progress; since the collection of math symbols was rather diverse, a few more wouldn't be noticed, but it's unfortunate that nobody seems to have kept records. Perhaps someone who was active in the UTC at the time may have a memory; all I can attest to is that the request did *not* originate with the STIX project. From davidj_faulks at yahoo.ca Tue Mar 15 22:14:15 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Wed, 16 Mar 2016 03:14:15 +0000 (UTC) Subject: Variations and Unifications ? 
References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> As part of my investigations into astrological symbols, I'm beginning to wonder if glyph variations are justifications for separate encoding of symbols I would have previously considered the same or unifiable with symbols already in Unicode. For example, the semisquare aspect is usually shown with a glyph that is identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint? The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a typographical kludge since astrological fonts often show it this way. There is also contra-parallel, which sometimes is shown like ∦ (U+2226 NOT PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is often horizontal). The "part of fortune" is sometimes a circled ×, or sometimes a circled +. Would it be better to have dedicated characters than to assume unifications in these cases? From frederic.grosshans at gmail.com Wed Mar 16 08:35:54 2016 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Wed, 16 Mar 2016 14:35:54 +0100 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>

Message-ID: <56E9613A.4030605@gmail.com> On 15/03/2016 22:21, Andrew West wrote: > > Possibly. I certainly have very little expectation that a proposal to > complete both sets to 999 (or even 399) would have any chance of > success. And then, there are also the historical examples of ideographic numbers used for the same purpose in historic texts (like here http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ or here http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod ). The above were found with a quick Google search, and I have no idea whether these symbols were used in the running text or not. Frédéric From asmus-inc at ix.netcom.com Wed Mar 16 12:34:54 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 16 Mar 2016 10:34:54 -0700 Subject: Variations and Unifications ? In-Reply-To: <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> Message-ID: <56E9993E.1090902@ix.netcom.com> An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Wed Mar 16 19:45:26 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 17 Mar 2016 00:45:26 +0000 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: <56E9613A.4030605@gmail.com> References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>

<56E9613A.4030605@gmail.com> Message-ID: Hi Frédéric, The historic use of ideographic numbers for marking Go moves is discussed in the latest draft of my document: http://www.babelstone.co.uk/Unicode/GoNotation.pdf Andrew On 16 March 2016 at 13:35, Frédéric Grosshans wrote: > On 15/03/2016 22:21, Andrew West wrote: >> >> >> Possibly. I certainly have very little expectation that a proposal to >> complete both sets to 999 (or even 399) would have any chance of >> success. > > And then, there are also the historical examples of ideographic numbers used > for the same purpose in historic texts (like here > http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ > or here > http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod > ). > > The above were found with a quick Google search, and I have no idea > whether these symbols were used in the running text or not. > > Frédéric > From charupdate at orange.fr Wed Mar 16 20:00:35 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 17 Mar 2016 02:00:35 +0100 (CET) Subject: Proposal for *U+2427 NARROW SHOULDERED OPEN BOX (was: Re: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX?) In-Reply-To: <56E6E497.1000902@att.net> References: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> <56E6E497.1000902@att.net> Message-ID: <297423993.28545.1458176435992.JavaMail.www@wwinf1e16> On Mon, 14 Mar 2016 09:19:35 -0700, Ken Whistler wrote: > U+23FF is already assigned to OBSERVER EYE SYMBOL, which is > already under ballot for 10646 (and approved by the UTC). > > http://www.unicode.org/alloc/Pipeline.html > > Please always first check that page before suggesting code points > for prospective new characters. > > --Ken > > On 3/12/2016 5:42 PM, Marcel Schneider wrote: > > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0? > > Thank you.
I remember OBSERVER EYE but didn't notice its code point and forgot to do a search for "23[F[F]]" on the Pipeline page. Sorry. Now I see that *U+2427 would be even better as it is both in the block of U+2423 OPEN BOX and in the originally intended block, except that I have now dropped the other symbols and am left with just the NNBSP symbol to propose for the next free contiguous scalar value. I really hope that such a new or, more accurately, third proposal would be accepted, as the NARROW NO-BREAK SPACE is so important that it must have its symbol encoded at some point, similarly to SPACE and NO-BREAK SPACE. About the proposed name: first I changed it to the glyph-descriptive one, as preferred in Unicode, rather than SYMBOL FOR NARROW NO-BREAK SPACE. Then I made it more analogous to the name of the symbolized character, by inverting "SHOULDERED" and "NARROW". The original proposer cannot simply resume on that "narrow" basis, being committed to consistency with ISO/IEC 9995-7, so might an individual like me be a good choice to send the proposal? Generally, however, it would be better done by an NB, all the more as this belongs to the international keyboard standard. Other countries that have a multilingual standard layout, and/or a national layout including U+202F, might be interested. Another scenario would be that the French NB re-proposes a reduced set of additional symbols, which IMHO should comprise at least the NARROW SHOULDERED OPEN BOX, but ideally once it has completed the revision of most parts of ISO/IEC 9995, including part 7. Best regards, Marcel From verdy_p at wanadoo.fr Thu Mar 17 01:11:33 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Mar 2016 07:11:33 +0100 Subject: Variations and Unifications ? 
In-Reply-To: <56E9993E.1090902@ix.netcom.com> References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> Message-ID: "Disunification may be an answer?" We should avoid it as well. We have other solutions in Unicode: - variation selectors (often used for sinograms when their unified shapes must be distinguished in some contexts such as people names or toponyms or trademark names or in other specific contexts), - or combining sequences (including in Arabic or Hebrew where many combining characters are not always represented visually, the same occurring as well in Latin with accents not always presented over capitals), - or sequences of multiple characters (like in Emojis for skin color variants, or sequences for encoding flags), - or other sequences using joiners (e.g. in South Asian scripts). Disunification is only acceptable when - there's a complete disunification of concepts and the "similar" shapes are also different even if one originates from the other (e.g. the Latin slashed o disunified from the Latin o, even if there's also the sequence o+combining slash, almost never used as its rendering is too approximate in most cases) - or there's a clear distinction of semantics and properties (e.g. the Latin AE ligature, which is not appropriately represented by the two separate letters, not even with a "hinting" joiner, and that has specific properties as a plain letter, e.g. with mappings) Before disunifying a character, we should first study the alternative of representing it as a sequence. 2016-03-16 18:34 GMT+01:00 Asmus Freytag (t) : > On 3/15/2016 8:14 PM, David Faulks wrote: > > As part of my investigations into astrological symbols, I'm beginning to wonder if glyph variations are justifications for separate encoding of symbols I would have previously considered the same or unifiable with symbols already in Unicode. 
> > For example, the semisquare aspect is usually shown with a glyph that is identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint? > > The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a typographical kludge since astrological fonts often show it this way. > There is also contra-parallel, which sometimes is shown like ∦ (U+2226 NOT PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is often horizontal). > > The "part of fortune" is sometimes a circled ×, or sometimes a circled +. > > Would it be better to have dedicated characters than to assume unifications in these cases? > > > > My take is that for symbols there's always that tension between encoding > the "concept" or encoding the shape. In my view, it is often impossible to > answer the question whether the different angles (for example) are merely > different "shapes" of one and the same "symbol", or whether it isn't the > case that there are different "conventions" (using different symbols for > the same concept). > > Disunification is useful, whenever different concepts require distinct > symbol shapes (even if there are some general similarities). If other > concepts make use of the same shapes interchangeably, it is then up to the > author to fix the convention by selecting one or the other shape. > Conceptually, that is similar to the decimal point: it can be either a > period, or a comma, depending on locale (read: depending on the convention > the author follows). > > Sometimes, concepts use multiple symbol shapes, but all of these shapes > map to the same concept (and other uses are not known). In that case, > unifying the shapes might be acceptable. The selection of shape is then a > matter of the font (and may not always be under the control of the author). 
> Conceptually, that is similar to the integral sign, which can be slanted or > upright. The choice is one of style. While authors or readers may prefer > one look over the other, the identity of the symbol is not in question, and > there's no impact on transmission of the contents of the text. > > Whenever we have the former case, that is, multiple conventional > presentations that are symbols in their own right in other contexts, then > encoding an additional "generic" shape should be avoided. Unicode > explicitly did not encode a generic "decimal point". If the convention that > is used matters, the author is better off being able to select a specific > shape. The results will be more predictable. The downside is that a search > will have to cover all the conventions. Conceptually, that is no different > from having to search for both "color" and "colour". > > The final case is where a convention for depicting a concept uses a symbol > that itself has some variability (for example when representing some other > concepts), such that some of its forms make it less than ideal for the > conventional use intended for the concept in question. Unicode has > historically not always been able to provide a solution. In some of these > cases, plain text (that is, without a fixed font association) may simply > not give the desired answer. If specialized fonts for the convention (e.g. > astrological fonts) do not usually exist or can't be expected, then > disunifying the symbol's shapes may be an answer. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Mar 17 02:20:06 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 17 Mar 2016 00:20:06 -0700 Subject: Variations and Unifications ? 
In-Reply-To: References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> Message-ID: <56EA5AA6.6040202@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Mar 17 02:47:26 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Mar 2016 08:47:26 +0100 Subject: Variations and Unifications ? In-Reply-To: <56EA5AA6.6040202@ix.netcom.com> References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> <56EA5AA6.6040202@ix.netcom.com> Message-ID: One problem caused by disunification is the added complexity of algorithms handling text. I forgot an important case where disunification also occurred: combining sequences are the "normal" encoding, but legacy charsets encoded the precomposed character separately and Unicode had to map them for round-trip compatibility purposes. This had a consequence: the creation of additional properties (i.e. for "canonical equivalences") in order to reconcile the two sets of encodings and allow some form of equivalence. In fact this is general: each time we disunify a character, we have to add new properties, and possibly update the algorithms to take these properties into account and find some form of equivalence. So disunification solves one problem but creates others. We have to weigh the benefits and costs of using the disunified characters against those of using the "normal" characters (possibly in sequences). But given the number of cases where we have to support sequences (even if it's only combining sequences for canonical equivalences), we should really disfavor disunifying characters: if it's possible with sequences, don't disunify. 
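[Editorial illustration, not part of the original message.] The canonical-equivalence point above can be sketched with Python's standard unicodedata module. This is only a minimal sketch: the choice of "é" as the example character and the helper name canonically_equivalent are assumptions made for illustration.

```python
import unicodedata

# One precomposed code point (kept for round-trip compatibility with legacy
# charsets) versus the equivalent base letter + combining mark sequence.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"     # "e" followed by COMBINING ACUTE ACCENT


def canonically_equivalent(a, b):
    """Hypothetical helper: compare two strings after NFD normalization."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A naive code-point comparison treats the two spellings as different...
print(precomposed == sequence)
# ...while a comparison that uses the decomposition property does not.
print(canonically_equivalent(precomposed, sequence))
# The property that links the two encodings:
print(unicodedata.decomposition(precomposed))
```

Any process that compares, searches or sorts text has to go through this kind of normalization step, which is the extra algorithmic cost the message refers to.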
A famous example (based on a legacy decision which was bad in my opinion, as the cost was not considered) was the disunification of Latin/Greek letters for mathematical purposes, only to force a specific style. But the alternative representation using sequences (using variation selectors for example, as the addition of specific modifiers for "styles" like "bold", "italic" or "monospace" was rejected with good reasons) was not really analyzed in terms of benefits and costs, using the algorithms we already have (and that could have been updated). But mathematical symbols are (normally...) not used at all in the same context as plain alphabetic letters (even if there's absolutely no guarantee that they will always be distinguishable from them when they occur in some linguistic text rendered with the same style...). The naive thinking that disunification will make things simpler is completely wrong (given that an application that would ignore all character properties and would use only isolated characters would break legitimate rules in many cases, even for rendering purposes). It is in fact simpler to keep the possible sequences that are already encoded (or that could be extended to cover more cases: e.g. add new variation sequences, introduce some new modifiers, not just new combining characters, and so on). We were strongly told: Unicode encodes characters, not glyphs. This should be remembered (and the argument of costs caused by disunification of distinct glyphs is also a good one against it). 2016-03-17 8:20 GMT+01:00 Asmus Freytag (t) : > On 3/16/2016 11:11 PM, Philippe Verdy wrote: > > "Disunification may be an answer?" We should avoid it as well. > > Disunification is only acceptable when > - there's a complete disunification of concepts.... > > > I think answering this question depends on the understanding of "concept", > and on understanding what it is that Unicode encodes. 
> > When it comes to *symbols*, which is where the discussion originated, > it's not immediately obvious what Unicode encodes. For example, I posit > that Unicode does not encode the "concept" for specific mathematical > operators, but the individual "symbols" that are used for them. > > For example PRIME and DOUBLE PRIME can be used for minutes and seconds > (both of time and arc) as well as for other purposes. Unicode correctly > does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it > up to the notational convention to relate the concept and the symbol. > > Thus we have a case where multiple concepts match a single symbol. For the > converse, we take the well-known case of COMMA and FULL STOP which can both > be used to separate a decimal fraction. > > Only in those cases where a single concept is associated so exclusively > with a given symbol, do we find the situation that it makes sense to treat > variations in shape of that symbol as the same symbol, but with different > glyphs. > > For some astrological symbols that is the case, but for others it is not. > Therefore, the encoding model for astrological text cannot be uniform. > Where symbols have exclusive association with a concept, the natural > encoding is to encode symbols with an understood set of variant glyphs. > Where concepts are denoted with symbols that are also used otherwise, then > the association of concept to symbol must become a matter of notational > convention and cannot form the basis of encoding: the code elements have to > be on a lower level, and by necessity represent specific symbol shapes. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dzo at bisharat.net Thu Mar 17 11:43:33 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 12:43:33 -0400 Subject: =?UTF-8?Q?Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= Message-ID: <56EADEB5.9000406@bisharat.net> Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "?". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode). Out of curiosity, did a web search on "interna?onal" and got over 11k hits, apparently all PDFs. Anyone have any idea what's going on? Am assuming this is not a deliberate choice by diverse people creating PDFs and wanting "ti" ligatures for stylistic reasons. Note the document linked above is current, so this is not (just) an issue with older documents. Don Osborn From olopierpa at gmail.com Thu Mar 17 12:26:56 2016 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 17 Mar 2016 18:26:56 +0100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: <56EADEB5.9000406@bisharat.net> References: <56EADEB5.9000406@bisharat.net> Message-ID: That document displays correctly for me using both the pdf viewer built into chrome and the standalone Acrobat reader v.11. The problem could be in your PDF viewer? What are you viewing the document with? On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: > Odd result when copy/pasting text from a PDF: For some reason "ti" in the > (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". Looking more closely at the original text, it does appear > that the glyph is a "ti" ligature (which afaik is not coded as such in > Unicode). 
> > Out of curiosity, did a web search on "interna?onal" and got over 11k hits, > apparently all PDFs. > > Anyone have any idea what's going on? Am assuming this is not a deliberate > choice by diverse people creating PDFs and wanting "ti" ligatures for > stylistic reasons. Note the document linked above is current, so this is not > (just) an issue with older documents. > > Don Osborn From leoboiko at namakajiri.net Thu Mar 17 12:37:05 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 17 Mar 2016 14:37:05 -0300 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: The PDF *displays* correctly. But try copying the string 'ti' from the text another application outside of your PDF viewer, and you'll see that the thing that *displays* as 'ti' is *coded* as ?, as Don Osborn said. 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi : > That document displays correctly for me using both the pdf viewer > built into chrome and the standalone Acrobat reader v.11. The problem > could be in your PDF viewer? What are you viewing the document with? > > On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: >> Odd result when copy/pasting text from a PDF: For some reason "ti" in the >> (English) text of the document at >> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >> is coded as "?". Looking more closely at the original text, it does appear >> that the glyph is a "ti" ligature (which afaik is not coded as such in >> Unicode). >> >> Out of curiosity, did a web search on "interna?onal" and got over 11k hits, >> apparently all PDFs. >> >> Anyone have any idea what's going on? Am assuming this is not a deliberate >> choice by diverse people creating PDFs and wanting "ti" ligatures for >> stylistic reasons. Note the document linked above is current, so this is not >> (just) an issue with older documents. 
>> >> Don Osborn > From dzo at bisharat.net Thu Mar 17 12:45:34 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 13:45:34 -0400 Subject: =?UTF-8?Q?Re:_Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net>

Message-ID: <56EAED3E.1080601@bisharat.net> Thanks Leonardo, that is my initial observation. And it has implications for web searches. And there's more. Apparently this is one of a number of such substitutions, which taken together begin to look like the old pre-Unicode hacks of 8-bit fonts. And I found some of them via web search in a number of Google Books and pages on issuu.com. Evidently some kind of font issue, and not random assignments. From the same document: ff ligature = ? fl ligature = ? ft ligature = ? tt ligature = ? And perhaps others. Seems to defeat the intent of Unicode, as these documents and pages will not come up in typical web search on the normal spellings (unless maybe Google is incorporating an algorithm to include results for say "interna?onal" in a search on the term "international"?). Don On 3/17/2016 1:37 PM, Leonardo Boiko wrote: > The PDF *displays* correctly. But try copying the string 'ti' from > the text another application outside of your PDF viewer, and you'll > see that the thing that *displays* as 'ti' is *coded* as ?, as Don > Osborn said. > > > 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi : >> That document displays correctly for me using both the pdf viewer >> built into chrome and the standalone Acrobat reader v.11. The problem >> could be in your PDF viewer? What are you viewing the document with? >> >> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the >>> (English) text of the document at >>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>> is coded as "?". Looking more closely at the original text, it does appear >>> that the glyph is a "ti" ligature (which afaik is not coded as such in >>> Unicode). >>> >>> Out of curiosity, did a web search on "interna?onal" and got over 11k hits, >>> apparently all PDFs. >>> >>> Anyone have any idea what's going on? 
Am assuming this is not a deliberate >>> choice by diverse people creating PDFs and wanting "ti" ligatures for >>> stylistic reasons. Note the document linked above is current, so this is not >>> (just) an issue with older documents. >>> >>> Don Osborn From jknappen at web.de Thu Mar 17 12:57:15 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Thu, 17 Mar 2016 18:57:15 +0100 Subject: =?UTF-8?Q?Aw=3A_Joined_=22ti=22_coded_as_=22=C6=9F=22_in_PDF?= In-Reply-To: <56EADEB5.9000406@bisharat.net> References: <56EADEB5.9000406@bisharat.net> Message-ID: An HTML attachment was scrubbed... URL: From olopierpa at gmail.com Thu Mar 17 13:02:19 2016 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 17 Mar 2016 19:02:19 +0100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net>

Message-ID: On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko wrote: > The PDF *displays* correctly. But try copying the string 'ti' from > the text another application outside of your PDF viewer, and you'll > see that the thing that *displays* as 'ti' is *coded* as ?, as Don > Osborn said. Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about unicode. It uses the encoding of the fonts used. The ti ligature is a glyph in the font used in that document. Its code has nothing to do with anything unicode. It looks like a pre-unicode hack because unicode says nothing about font technologies, and hence nothing has changed in PDF because of unicode (nor could have, unicode does not mandate how to encode ligatures). From leoboiko at namakajiri.net Thu Mar 17 13:06:22 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 17 Mar 2016 15:06:22 -0300 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: Yeah, I've stumbled upon this a lot in academic Japanese/Chinese texts. I try to copy some Chinese character, only to find out that it's really a string of random ASCII characters. Is there only one of those crap PDF pseudo-encodings? If so, I'll use a conversor next time... 2016-03-17 14:57 GMT-03:00 "J?rg Knappen" : > I inspected the pdf file, and its font encoding is termed "Identity-H". I > couldn't reveal much about this encoding, but it seems to be a private > encoding of Adobe used especially for Asian fonts. > > --J?rg Knappen > > Gesendet: Donnerstag, 17. M?rz 2016 um 17:43 Uhr > Von: "Don Osborn" > An: unicode at unicode.org > Betreff: Joined "ti" coded as "?" in PDF > Odd result when copy/pasting text from a PDF: For some reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". 
Looking more closely at the original text, it does > appear that the glyph is a "ti" ligature (which afaik is not coded as > such in Unicode). > > Out of curiosity, did a web search on "internaƟonal" and got over 11k > hits, apparently all PDFs. > > Anyone have any idea what's going on? Am assuming this is not a > deliberate choice by diverse people creating PDFs and wanting "ti" > ligatures for stylistic reasons. Note the document linked above is > current, so this is not (just) an issue with older documents. > > Don Osborn From doug at ewellic.org Thu Mar 17 13:11:44 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 17 Mar 2016 11:11:44 -0700 Subject: Joined "ti" coded as =?UTF-8?Q?=22=C6=9F=22=20in=20PDF?= Message-ID: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> Don Osborn wrote: > Odd result when copy/pasting text from a PDF: For some reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "Ɵ". Looking more closely at the original text, it does > appear that the glyph is a "ti" ligature (which afaik is not coded as > such in Unicode). When I copy and paste the PDF text in question into BabelPad, I get: > Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By > invita??on only) The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use character. Truncating this character to 16 bits, which is a Bad Thing™, yields U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either Don's clipboard or the editor he pasted it into is not fully Unicode-compliant. Don's point about using alternative characters to implement ligatures, thereby messing up web searches, remains valid. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
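[Editorial illustration, not part of the original message.] The truncation described above can be reproduced in a few lines of Python. This is a minimal sketch: the helper names truncate_to_16_bits and utf16_code_units are hypothetical, and the assumption is that the faulty step simply keeps the low 16 bits of the code point.

```python
import unicodedata


def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits, as a non-compliant 16-bit pipeline might."""
    return code_point & 0xFFFF


def utf16_code_units(code_point):
    """Return the UTF-16 code units a compliant encoder would produce."""
    data = chr(code_point).encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


pua = 0x10019F                      # the code point reported from BabelPad
truncated = truncate_to_16_bits(pua)

# General category "Co" means private use.
print(unicodedata.category(chr(pua)))
# The truncated value lands on an ordinary Latin letter.
print(f"U+{truncated:04X}", unicodedata.name(chr(truncated)))
# A compliant UTF-16 encoder emits a surrogate pair instead of truncating.
print([f"{unit:04X}" for unit in utf16_code_units(pua)])
```

The last line shows why the character needs two 16-bit units in UTF-16: software that handles only one unit per character cannot carry it intact.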
From steve at swales.us Thu Mar 17 13:17:12 2016 From: steve at swales.us (Steve Swales) Date: Thu, 17 Mar 2016 11:17:12 -0700 Subject: =?utf-8?Q?Re=3A_Joined_=22ti=22_coded_as_=22=C6=9F=22_in_PDF?= In-Reply-To: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> Message-ID: <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless. -steve > On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: > > Don Osborn wrote: > >> Odd result when copy/pasting text from a PDF: For some reason "ti" in >> the (English) text of the document at >> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >> is coded as "?". Looking more closely at the original text, it does >> appear that the glyph is a "ti" ligature (which afaik is not coded as >> such in Unicode). > > When I copy and paste the PDF text in question into BabelPad, I get: > >> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >> invita??on only) > > The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use > character. > > Truncating this character to 16 bits, which is a Bad Thing?, yields > U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either > Don's clipboard or the editor he pasted it into is not fully > Unicode-compliant. > > Don's point about using alternative characters to implement ligatures, > thereby messing up web searches, remains valid. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? 
> > From charupdate at orange.fr Thu Mar 17 14:00:54 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 17 Mar 2016 20:00:54 +0100 (CET) Subject: =?UTF-8?Q?Re:_Joined_"ti"_coded_as_"=C6=9F"_in_PDF?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net>

Message-ID: <364670549.29314.1458241254719.JavaMail.www@wwinf1p21> On Thu, Mar 17, 2016 at 19:02:19, Pierpaolo Bernardi wrote: > unicode says nothing about font technologies It does mention them a little, however, in the core specification: http://www.unicode.org/versions/Unicode8.0.0/ch23.pdf#G23126 > unicode does not mandate how to encode ligatures Probably because Unicode specifies that "it is the task of the rendering system" to select ligature glyphs on the basis of characteristic sequences of characters in the text stream. Having found some of the mentioned oddities in an old PDF file (ffi ligature ending up as Y, ffl ligature as Z), I'm now really puzzled about current practice. Marcel From verdy_p at wanadoo.fr Thu Mar 17 15:18:35 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Mar 2016 21:18:35 +0100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net>

Message-ID: 2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi : > On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko > wrote: > > The PDF *displays* correctly. But try copying the string 'ti' from > > the text another application outside of your PDF viewer, and you'll > > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don > > Osborn said. > > Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about > unicode. It uses the encoding of the fonts used. > That's correct; however, the PDF specs contain guidelines for naming glyphs in fonts in such a way that the encoding can be deciphered. This is needed for example in applications such as PDF forms where user input is expected. When those PDFs are generated from rich text, the fonts used may be built with TrueType (without any glyph name in them, only mappings of sequences of codepoints) or OpenType or PostScript. When OpenType fonts contain PostScript glyphs, their names may be completely arbitrary (it does not even matter if the font used was mapped to Unicode or if it used a legacy or proprietary encoding). If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used to produce it did not follow these guidelines (or did not specify any glyph name, in which case this is a sort of OCR algorithm that attempts to decipher the glyph: the "ti" ligature is visually extremely near to the "Ɵ", and an OCR has a lot of difficulty distinguishing them, unless it also uses some linguistic dictionary searches and some hints about the script used in surrounding characters to enhance the guess). 
Note that PDFs (or DjVu files) are not required to contain only text; they could just embed a scanned and compressed bitmap image (if you want to see how an OCR can be wrong, look at how it fails with lots of errors, for example in the decoding projects for Wikibooks, working with scanned bitmaps of old books: OCR is just a helper, but there's still a lot of work to correct what has been guessed and re-encode the correct text; even if humans are smarter than OCR, this is a lot of work to perform manually: encoding the text of a single scanned old book still takes one or two months for an experienced editor, and there are still many errors to review later by someone else). Most PDFs were not created with the idea of later decoding their rendered text. In fact they were intended to be read or printed "as is", including their styles, colors, and decorations of fonts everywhere or text over photos. They were even created to be non-modifiable and then used for archival. Some PDF tools will also clean up from the PDF the additional metadata such as the original fonts used; instead these PDFs will locally embed pseudo-fonts containing sets of glyphs from various fonts (in mixed styles), in random order or sorted by frequency of use in the document or by order of occurrence in the original text. These embedded fonts are generated on the fly to contain only the necessary glyphs for the document. When those embedded fonts are generated, there's a compression algorithm that drops lots of things from the original font, including its metadata such as the original "PostScript" glyph names. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From dzo at bisharat.net Thu Mar 17 15:44:19 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 16:44:19 -0400 Subject: =?UTF-8?Q?Re:_Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= In-Reply-To: <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> Message-ID: <56EB1723.7030301@bisharat.net> Thanks all for the feedback. Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor. So, when I did a web search on "interna?onal," as previously mentioned, and come up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode compliant conversions by others? A web search on what you came up with - "Interna??onal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless. > > -steve > >> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >> >> Don Osborn wrote: >> >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in >>> the (English) text of the document at >>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>> is coded as "?". Looking more closely at the original text, it does >>> appear that the glyph is a "ti" ligature (which afaik is not coded as >>> such in Unicode). >> When I copy and paste the PDF text in question into BabelPad, I get: >> >>> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >>> invita??on only) >> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >> character. 
>>
>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>> Don's clipboard or the editor he pasted it into is not fully
>> Unicode-compliant.
>>
>> Don's point about using alternative characters to implement ligatures,
>> thereby messing up web searches, remains valid.
>>
>> --
>> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>>
>>
>

From lang.support at gmail.com Thu Mar 17 18:34:04 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 18 Mar 2016 10:34:04 +1100
Subject: Re: Joined "ti" coded as "Ɵ" in PDF
In-Reply-To: <56EB1723.7030301@bisharat.net>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net>
Message-ID:

There are a few things going on.

In the first instance, it may be the font itself that is the source of the problem.

My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences.

I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated.

It is worth noting, though, that in OpenType fonts not all glyphs will have mappings in the cmap table.

The remedy is to extensively tag the PDF and add ActualText attributes to the tags.

But the PDF specs leave it up to the developer to decide what happens if there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF.

At least that's my current understanding.
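As a rough illustration of the two effects discussed above, here is a minimal Python sketch. It is not how any particular PDF tool works: the glyph IDs, the TO_UNICODE table, and the function names are all hypothetical, and a real PDF keeps this information in a ToUnicode CMap stream rather than a dict. It shows a 1-many entry for a "ti" ligature glyph, what extraction might look like when that entry is missing, and the 16-bit truncation of U+10019F that Doug describes.

```python
import unicodedata

# Hypothetical glyph IDs for a subsetted font (assumption for illustration).
# A real PDF stores this information in a ToUnicode CMap stream.
TO_UNICODE = {
    1: "I", 2: "n", 3: "t", 4: "e", 5: "r", 6: "a", 7: "o", 8: "l",
    9: "ti",  # 1-many entry: one ligature glyph maps to two code points
}
LIGATURE_TI = 9
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER for unmapped glyphs


def extract_text(glyph_ids, to_unicode):
    """Map a glyph sequence back to text, marking unmapped glyphs."""
    return "".join(to_unicode.get(g, REPLACEMENT) for g in glyph_ids)


def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits of a code point (a lossy operation)."""
    return code_point & 0xFFFF


# "International" drawn with a single "ti" ligature glyph.
GLYPHS = [1, 2, 3, 4, 5, 2, 6, LIGATURE_TI, 7, 2, 6, 8]

# With the ligature mapped, copy/paste and search see ordinary text.
print(extract_text(GLYPHS, TO_UNICODE))  # International

# Without the mapping, the ligature is lost on extraction.
broken = {g: s for g, s in TO_UNICODE.items() if g != LIGATURE_TI}
print(extract_text(GLYPHS, broken))

# Truncating the private-use code point U+10019F to 16 bits.
truncated = truncate_to_16_bits(0x10019F)
print(f"U+{truncated:04X} {unicodedata.name(chr(truncated))}")
# U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE
```

A viewer that substitutes a space or a private-use character for the unmapped glyph, instead of U+FFFD as in this sketch, would give the other results reported in this thread.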

Andrew

On 18 Mar 2016 7:47 am, "Don Osborn" wrote:

> Thanks all for the feedback.
>
> Doug, It may well be my clipboard (running Windows 7 on this particular
> laptop). Get same results pasting into Word and EmEditor.
>
> So, when I did a web search on "internaƟonal," as previously mentioned,
> and come up with a lot of results (mostly PDFs), were those also a
> consequence of many not fully Unicode compliant conversions by others?
>
> A web search on what you came up with - "Interna??onal" - yielded many
> more (82k+) results, again mostly PDFs, with terms like "interna onal"
> (such as what Steve noted) and "interna [...] nature of, or how Google
> interprets, the private use character?).
>
> Searching within the PDF document already mentioned, "international" comes
> up with nothing (which is a major fail as far as usability). Searching the
> PDF in a Firefox browser window, only "internaƟonal" finds the occurrences
> of what displays as "international." However after downloading the document
> and searching it in Acrobat, only a search for "interna??onal" will find
> what displays as "international."
>
> A separate web search on "E?ects" came up with 300+ results, including
> some GoogleBooks which in the texts display "effects" (as far as I
> checked). So this is not limited to Adobe?
>
> Jörg, With regard to "Identity H," a quick search gives the impression
> that this encoding has had a fairly wide and not so happy impact, even if
> on the surface level it may have facilitated display in a particular style
> of font in ways that no one complains about.
>
> Altogether a mess, from my limited encounter with it. There must have been
> a good reason for or saving grace of this solution?
>
> Don
>
> On 3/17/2016 2:17 PM, Steve Swales wrote:
>
>> Yes, it seems like your mileage varies with the PDF
>> viewer/interpreter/converter. Text copied from Preview on the Mac replaces
>> the ti ligature with a space. Certainly not a Unicode problem, per se, but
>> an interesting problem nevertheless.
>>
>> -steve
>>
>> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote:
>>>
>>> Don Osborn wrote:
>>>
>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>>>> the (English) text of the document at
>>>>
>>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>>>> is coded as "Ɵ". Looking more closely at the original text, it does
>>>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>>>> such in Unicode).
>>>>
>>> When I copy and paste the PDF text in question into BabelPad, I get:
>>>
>>> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By
>>>> invita??on only)
>>>>
>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use
>>> character.
>>>
>>> Truncating this character to 16 bits, which is a Bad Thing™, yields
>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either
>>> Don's clipboard or the editor he pasted it into is not fully
>>> Unicode-compliant.
>>>
>>> Don's point about using alternative characters to implement ligatures,
>>> thereby messing up web searches, remains valid.
>>>
>>> --
>>> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>>>
>>>
>>>
>>
>

From gwalla at gmail.com Thu Mar 17 23:18:38 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Thu, 17 Mar 2016 21:18:38 -0700
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>

<56E9613A.4030605@gmail.com>
Message-ID:

There's another strategy for dealing with enclosed numbers, which is taken by the font Quivira in its PUA: encoding separate left-half-circle-enclosed and right-half-circle-enclosed digits. This would require 20 characters to cover the double digit range 00–99. Enclosed three digit numbers would require an additional 30 for left, center, and right thirds, though it may be possible to reuse the left and right half circle enclosed digits and assume that fonts will provide left half-center third-right half ligatures (Quivira provides "middle parts" though the result is a stadium instead of a true circle). It should be possible to do the same for enclosed ideographic numbers, I think.

The problems I can see with this are confusability with the already encoded atomic enclosed numbers, and breaking in vertical text.

On Wed, Mar 16, 2016 at 5:45 PM, Andrew West wrote:
> Hi Frédéric,
>
> The historic use of ideographic numbers for marking Go moves are
> discussed in the latest draft of my document:
>
> http://www.babelstone.co.uk/Unicode/GoNotation.pdf
>
> Andrew
>
>
> On 16 March 2016 at 13:35, Frédéric Grosshans
> wrote:
>> On 15/03/2016 22:21, Andrew West wrote:
>>>
>>>
>>> Possibly. I certainly have very little expectation that a proposal to
>>> complete both sets to 999 (or even 399) would have any chance of
>>> success.
>>
>> And then, there are also the historical example of ideographic numbers used
>> for the same purpose in historic texts (like here
>> http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/
>> or here
>> http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod
>> ).
>>
>> The above has been found with a quick google search, and I have no idea
>> whether these symbols were used in the running text or not.
>>
>> Frédéric
>>
>

From d3ck0r at gmail.com Fri Mar 18 01:28:18 2016
From: d3ck0r at gmail.com (J Decker)
Date: Thu, 17 Mar 2016 23:28:18 -0700
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>

<56E9613A.4030605@gmail.com>