From frederic.grosshans at gmail.com Tue Mar 1 04:14:22 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 1 Mar 2016 11:14:22 +0100
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com>
Message-ID: <56D56B7E.9020702@gmail.com>

On 29/02/2016 22:55, Philippe Verdy wrote:
> So it's not the meaning, nor the technical means by which these terms
> were sent, that is essential; the court will in fact want to judge
> the intent and the effective psychological nature of this
> threat. What is the real intent of a 12-year-old girl? There are not
> enough elements in the short message to judge, and given her age she
> does not really realize that this could have so dramatic an effect
> (nobody has experienced that before based on only three words which
> are not even evident personal insults).
>
> We'll have to bring to the fire many old famous comics (intended for
> children) showing similar images in bubbles instead of slang words, or
> label them "only for adults".
>
🔫 🔪 💣 indeed recall some of the symbols proposed by Karl Pentzlin in 2010 in
L2/10-402, Proposal to encode some additional Comic Style Symbols
(http://www.unicode.org/L2/L2010/10402-comic-symbols.pdf).
It really looks like comics-style swearwords to me.

Fred

From leob at mailcom.com Tue Mar 1 12:10:53 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Tue, 1 Mar 2016 10:10:53 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:
Message-ID:

I have a less disruptive proposal than to encode an unprecedented
combining emoji.
How about adding variation sequences <currency sign> + U+FE0F VS16 to
signify BANKNOTE with <currency sign>?

Leo

On Wed, Feb 10, 2016 at 1:38 AM, "Jörg Knappen" wrote:

> For the pound emoji, throw in ~90M Egyptians.
>
> --Jörg Knappen
>
> *Sent:* Tuesday, 09 February 2016 at 23:46
> *From:* "Leo Broukhis"
> *To:* "Mark Davis ☕️"
> *Cc:* "unicode Unicode Discussion"
> *Subject:* Re: Enclosing BANKNOTE emoji?
> The emojiexpress.com site is useful to check which new emoji or
> combinations people actually use, but the stats are likely skewed by only
> measuring input from one platform.
>
> Another way to look at the emojitracker.com stats:
>
> 339M people in the Eurozone : 389K uses of Euro emoji
> 126M people in Japan : 354K uses of Yen emoji
> 140M people in UK + Turkey (likely users of the Pound emoji as a stand-in
> for Lira) : 515K uses of pound emoji
>
> The total is 605M people : 1258K uses of non-dollar emoji
> Assuming the same average frequency of use, 2933K uses of the dollar emoji
> would be produced by 1411M people, out of which US + Canada + Mexico +
> Australia (500M) + other countries using $ as (part of) the sign for
> their currency are way less than a half. This means that substantially more
> than 500M people are using the dollar emoji by default, instead of emoji of
> their national currencies. Assuming a lesser frequency of use will result
> in a greater estimate of the affected population.
>
> Leo
>
>
> On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ☕️ wrote:
>>
>> Look at http://www.emojixpress.com/stats/. The stats are different,
>> since they collect data from keyboards, not Twitter posts, but they have a
>> nice button to view only the new emoji.
>>
>> (The numbers on the new ones will be smaller, just because it takes time
>> for systems to support them, and people to start using them. However, they
>> bear out my prediction that the most popular would be the eyes-rolling
>> face).
>>
>>
>> Mark
>>
>>
>> On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis wrote:
>>>
>>> A caveat about using emojitracker.com: it doesn't count newer emoji
>>> yet (e.g. U+1F37E bottle with popping cork is absent), thus, when they are
>>> added, their counts will be skewed.
>>>
>>> Leo
>>>
>>> On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis wrote:
>>>>
>>>> Thank you for the links, quite mesmerizing!
>>>>
>>>> On emojitracker.com (cumulative counts, but only on Twitter, AFAICS),
>>>> U+1F4B5 ($) had quite a respectable count of 2932622 (well above the middle
>>>> of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around 30%ile),
>>>> and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around 20%ile, but
>>>> 10x more than the lowest counts, and about the same frequency as various
>>>> individual clock faces).
>>>>
>>>> It is quite evident that the dollar banknote emoji serves as a stand-in
>>>> for at least half a dozen various currencies.
>>>>
>>>> On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ☕️ wrote:
>>>>
>>>>> I would suggest that you first gather and present
>>>>> statistics on how often the current combinations are used compared to other
>>>>> emoji, e.g. by consulting sources such as:
>>>>>
>>>>> http://www.emojixpress.com/stats/
>>>>> or
>>>>> http://emojitracker.com/
>>>>>
>>>>>
>>>>> Mark
>>>>>
>>>>>
>>>>> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis wrote:
>>>>>>
>>>>>> There are
>>>>>>
>>>>>> 💴 U+01F4B4 Banknote With Yen Sign
>>>>>> 💵 U+01F4B5 Banknote With Dollar Sign
>>>>>> 💶 U+01F4B6 Banknote With Euro Sign
>>>>>> 💷 U+01F4B7 Banknote With Pound Sign
>>>>>>
>>>>>> This is clearly an incomplete set. It makes sense to have a generic
>>>>>> "enclosing banknote" emoji character which, when combined with a
>>>>>> currency sign, would produce the corresponding banknote, to forestall
>>>>>> requests for individual emoji for banknotes with remaining currency
>>>>>> signs.
>>>>>>
>>>>>> Leo

From chris.jacobs at xs4all.nl Tue Mar 1 12:31:35 2016
From: chris.jacobs at xs4all.nl (Chris Jacobs)
Date: Tue, 01 Mar 2016 19:31:35 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References:
Message-ID: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>

How would the system distinguish between US and Canada dollar?

Both would be <$> + U+FE0F VS16

Chris

Leo Broukhis wrote on 2016-03-01 19:10:

> I have a less disruptive proposal than to encode an unprecedented
> combining emoji.
> How about adding variation sequences <currency sign> + U+FE0F VS16 to
> signify BANKNOTE with <currency sign>?
>
> Leo

From leob at mailcom.com Tue Mar 1 12:35:12 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Tue, 1 Mar 2016 10:35:12 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
References: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
Message-ID:

It doesn't have to.

How does the system distinguish between US and Canada dollar in plain text? Both are <$>.

Leo

On Tue, Mar 1, 2016 at 10:31 AM, Chris Jacobs wrote:

> How would the system distinguish between US and Canada dollar?
>
> Both would be <$> + U+FE0F VS16
>
> Chris

From doug at ewellic.org Tue Mar 1 12:49:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Mar 2016 11:49:29 -0700
Subject: Girl, 12, charged for threatening her school with emojis
Message-ID: <20160301114929.665a7a7059d7ee80bb4d670165c8327d.336797b96f.wbe@email03.secureserver.net>

Asmus Freytag wrote:

>> . Well emojis were initially designed to track emotions and form a
>> sort of new language,
>
> E-moji means "picture-character" in Japanese, has nothing to do (at
> first) with emotions.

I wonder if it would help some folks to remember that "mojibake," a term
many of us are familiar with, contains the same root "moji"
("character").

Exercise: consider "emojibake." ????

--
Doug Ewell | http://ewellic.org | Thornton, CO ????

From leoboiko at namakajiri.net Tue Mar 1 13:44:48 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Tue, 1 Mar 2016 16:44:48 -0300
Subject: Girl, 12, charged for threatening her school with emojis
In-Reply-To:
References: <56D3D15B.4070705@khwilliamson.com> <56D3E4D0.9030902@ix.netcom.com> <003901d172c0$d1626970$74273c50$@xencraft.com> <56D40CED.2080406@ix.netcom.com> <56D4C520.20709@ix.netcom.com>
Message-ID:

Ah, but that is a "majority" by a dictionary/type count. Due to Zipf's Law, in language matters we should always distinguish dictionary counts from actual usage. E.g. Twitter is very popular in Japan, and I think we'll all agree that the top used kanji are predominantly modal: http://emojitracker.com/

Thomas Dimson's great distributional analysis for Instagram gives us hashtags that are equivalent to emoji; again, I think it's clear that their primary use is for modality. http://instagram-engineering.tumblr.com/post/117889701472/emojineering-part-1-machine-learning-for-emoji

What's more, a lot of emoji which seem to have no "clear emotional referent" are appropriated for modal purposes. For example, this thread's 🔫 🔪 💣
are graphical depictions of objects, but I think you'll all agree that the girl was expressing a mood; she wasn't saying "gun, knife, bomb". I'm told that U+1F481, INFORMATION DESK PERSON 💁, was taken to be "sassy girl" or "hair flick", and from that it became a modality indicator for sassiness, sarcasm, fabulousness etc.

(I suspect that another major use of emoji, besides modality, is deictic: "I'm at Tokyo Tower" + Tokyo Tower emoji, "Merry Christmas" + Christmas-related emoji. Emotional mood still seems to me to be clearly the dominant use.)

2016-02-29 21:25 GMT-03:00 Garth Wallace:

> Some are used to express emotions but many are not: food items,
> animals, landmarks, activities, etc. I think the majority do not have
> clear emotional referents. The original set introduced in Unicode 6.0
> included things like ROASTED SWEET POTATO and TOKYO TOWER.
>
> On Mon, Feb 29, 2016 at 4:04 PM, Philippe Verdy wrote:
> > Today's Japanese emojis are (for most of them) recent inventions; maybe
> > there are some earlier traces in Japanese comics, but you may as well find
> > them in comics of America or Europe since about the 1940s.
> >
> > All these icons were *later* renamed emojis in English and Unicode, but
> > there's a long history of using icons for such emotions. Look at the little
> > heart drawn near the signature on a handwritten letter or discreet
> > messages, or similar symbols carved by lovers on walls and trees. Or long
> > before, as a sign of recognition, such as the fish for the first Christians in
> > the Roman Empire, or even before in some hieroglyphic inscriptions in ancient
> > Egyptian, Mayan, and Chinese civilizations since the Bronze Age or before.
> >
> > In fact you could also add all the symbols (not necessarily with religious
> > meaning) found on graves for expressing that the remaining family or friends
> > are missing the deceased.
> > You could also add the similar symbols on jewelry for showing we love
> > someone, or warrior paintings on faces.
> >
> > The modern Japanese emojis were not the first pictographic signs to express
> > emotions (even if they have now been extended to many other things and
> > are now spreading to the rest of the world with these extensions). Still,
> > their main usage remains for emotions; starting in the 1970s these were
> > ASCII art symbols such as the famous :-)
> >
> > 2016-02-29 23:24 GMT+01:00 Asmus Freytag (t):
> >>
> >> On 2/29/2016 1:55 PM, Philippe Verdy wrote:
> >>
> >> . Well emojis were initially designed to track emotions and form a sort of
> >> new language,
> >>
> >>
> >> E-moji means "picture-character" in Japanese, has nothing to do (at first)
> >> with emotions.
> >>
> >> A./

From doug at ewellic.org Wed Mar 2 09:49:17 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 02 Mar 2016 08:49:17 -0700
Subject: Enclosing BANKNOTE emoji?
Message-ID: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>

On February 8, Leo Broukhis wrote:

> This is clearly an incomplete set. It makes sense to have a generic
> "enclosing banknote" emoji character which, when combined with a
> currency sign, would produce the corresponding banknote, to forestall
> requests for individual emoji for banknotes with remaining currency
> signs.

I'm not wildly opposed to these -- maybe more so to the more recent idea
of variation selectors to transform currency symbols into emoji -- but I
wonder if there is really a demand for such images, especially at the
small size normally associated with emoji, or if this is simply
speculation. At least in principle, "expected usage level" is supposed
to be one factor that speaks for or against encoding.

--
Doug Ewell | http://ewellic.org | Thornton, CO ????
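
The usage-based population estimate quoted earlier in this thread (605M people : 1258K uses of non-dollar banknote emoji, scaled to 2933K uses of the dollar emoji) can be checked with a short sketch. It uses only the rounded figures given in the thread; the variable and function names are illustrative and not from any message, and the calculation assumes, as the thread does, the same average frequency of use in every population.

```python
# Rounded figures quoted in the thread: populations in millions,
# emojitracker.com usage counts in thousands.
# All names here are hypothetical, chosen for illustration only.
POPULATION_M = {"euro": 339, "yen": 126, "pound": 140}
USES_K = {"euro": 389, "yen": 354, "pound": 515}
DOLLAR_USES_K = 2933


def implied_population_m(uses_k, ref_uses_k, ref_population_m):
    """Scale a reference population by the ratio of usage counts.

    Assumes the same average frequency of use in both groups.
    """
    return uses_k * ref_population_m / ref_uses_k


total_population_m = sum(POPULATION_M.values())  # non-dollar populations
total_uses_k = sum(USES_K.values())              # non-dollar uses
dollar_population_m = implied_population_m(
    DOLLAR_USES_K, total_uses_k, total_population_m
)

print(total_population_m, total_uses_k, round(dollar_population_m))
```

With these inputs the totals come out to 605 and 1258 and the implied dollar-emoji population rounds to 1411 (millions), matching the figures in the thread. A lower assumed frequency of use for the dollar emoji would raise that estimate.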
From leob at mailcom.com Wed Mar 2 10:34:40 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Wed, 2 Mar 2016 08:34:40 -0800
Subject: Enclosing BANKNOTE emoji?
In-Reply-To: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>
References: <20160302084917.665a7a7059d7ee80bb4d670165c8327d.92d12b70ac.wbe@email03.secureserver.net>
Message-ID:

Per se, the level of use is quite respectable. On emojitracker (not yet updated with newer emoji), :dollar: is at #330/845, and the lowest of the group, :yen:, is #688. My calculations based on the usage count and population of the countries using corresponding signs show that :dollar: is way out of proportion, which means that it is used by default quite a lot.

Speaking of "enclosing banknote" vs variation selector, the shorthands (:dollar:, :yen:, etc.) suggest that Twitter treats the banknote emoji as emoji-style of the currency signs, and a new character would be superfluous.

Leo

On Wed, Mar 2, 2016 at 7:49 AM, Doug Ewell wrote:

> I'm not wildly opposed to these -- maybe more so to the more recent idea
> of variation selectors to transform currency symbols into emoji -- but I
> wonder if there is really a demand for such images, especially at the
> small size normally associated with emoji, or if this is simply
> speculation. At least in principle, "expected usage level" is supposed
> to be one factor that speaks for or against encoding.

From verdy_p at wanadoo.fr Wed Mar 2 17:35:30 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Mar 2016 00:35:30 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
Message-ID:

Both are $ in plain text, yes, but they are in textual context. Emojis are to be used alone and interpreted mostly by themselves. They are also highly pictographic and represent the actual object in a realistic way. So a neutral <$> in a banknote emoji would not distinguish the (green) US dollar from the Canadian dollar. In fact that banknote would typically not use the <$> symbol itself, but the actual banknote, or as a fallback, a small version of the country flag, or the letters "US" inside (just like country flag icons).

You can still have a banknote emoji based on the currency sign, but it will only represent that currency sign and not the actual currency unit (except those currency units whose symbols are strongly tied to the currency, such as the euro sign, or the symbols for the new shekel, or the new ruppiah, not used for other currencies...).

Using variation selectors would not be a solution. In my opinion it would be best to combine a generic currency sign with the other symbol, tied together using the same techniques as those used for emojis representing people or groups of people, i.e. a format control hinting the presence of a ligature.

2016-03-01 19:35 GMT+01:00 Leo Broukhis:

> It doesn't have to.
>
> How does the system distinguish between US and Canada dollar in plain
> text? Both are <$>.
>
> Leo

From verdy_p at wanadoo.fr Wed Mar 2 17:39:16 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 3 Mar 2016 00:39:16 +0100
Subject: Enclosing BANKNOTE emoji?
In-Reply-To:
References: <95ca449690e088a9c5f276d8e16e196e@xs4all.nl>
Message-ID:

Both are $ in plain text, yes, but they are in textual context. Emojis are to be used alone and interpreted mostly by themselves. They are also highly pictographic and represent the actual object in a realistic way. So a neutral "$" sign in a banknote emoji would not distinguish the (green) US dollar from the Canadian dollar. In fact that banknote emoji for the US dollar would typically not use the "$" currency sign itself (not alone), but would be the actual green banknote (it will probably be encoded by itself, just like the one for the yen), or as a fallback, a small version of the country flag, or the letters "US" inside (just like country flag icons).

You can still have a banknote emoji based on the currency sign, but it will only represent that currency sign and not the actual currency unit (except those currency units whose symbols are strongly tied to the currency, such as the euro sign, or the symbols for the new shekel, or the new ruppiah, not used for other currencies...).

Using variation selectors would not be a solution.
In my opinion it would be best to combine a generic/blank banknote emoji with the other symbol representing a currency sign or country flag, tied together using the same technics as those used for emojis representing people or group or people, i.e. a format control hinting the presence of a ligature. 2016-03-01 19:35 GMT+01:00 Leo Broukhis : > It doesn't have to. > > How does the system distinguish between US and Canada dollar in plain > text? Both are <$>. > > Leo > > > On Tue, Mar 1, 2016 at 10:31 AM, Chris Jacobs > wrote: > >> How would the system distinguish between US and Canada dollar? >> >> Both would be <$> + U+FE0F VS16 >> >> Chris >> >> >> Leo Broukhis schreef op 2016-03-01 19:10: >> >> I have a less disruptive proposal than to encode an unprecedented >> combining emoji. >> How about adding variation sequences + U+FE0F VS16 to >> signify BANKNOTE with ? >> >> Leo >> >> On Wed, Feb 10, 2016 at 1:38 AM, "J?rg Knappen" wrote: >> >>> For the pound emoji, throw in ~90M Egyptians. >>> >>> --J?rg Knappen >>> >>> *Gesendet:* Dienstag, 09. Februar 2016 um 23:46 Uhr >>> *Von:* "Leo Broukhis" >>> *An:* "Mark Davis ??" >>> *Cc:* "unicode Unicode Discussion" >>> *Betreff:* Re: Enclosing BANKNOTE emoji? >>> The emojiexpress.com site is useful to check which new emoji or >>> combinations people actually use, but the stats are likely skewed by only >>> measuring input from one platform. 
>>> >>> Another way to look at the emojitracker.com stats: >>> >>> 339M people in the Eurozone : 389K uses of Euro emoji >>> 126M people in Japan : 354K uses of Yen emoji >>> 140M people in UK + Turkey (likely users of the Pound emoji as a >>> stand-in for Lira) : 515K uses of pound emoji >>> >>> The total is 605M people : 1258K uses of non-dollar emoji >>> Assuming the same average frequency of use, 2933K uses of the dollar >>> emoji would be produced by 1411M people, out of which us + canada + mexico >>> + australia (500M) + other countries using $ as (part of) the sign for >>> their currency are way less than a half. This means that substantially more >>> than 500M people are using the dollar emoji by default, instead of emoji of >>> their national currencies. Assuming a lesser frequency of use will result >>> in a greater estimate of the affected population. >>> >>> Leo >>> >>> >>> On Tue, Feb 9, 2016 at 8:51 AM, Mark Davis ?? >>> wrote: >>>> >>>> Look at http://www.emojixpress.com/stats/. The stats are different, >>>> since they collect data from keyboards not twitter posts, but they have a >>>> nice button to view only the news emoji. >>>> >>>> (The numbers on the new ones will be smaller, just because it takes >>>> time for systems to support them, and people to start using them. However, >>>> they bear out my predication that the most popular would be the >>>> eyes-rolling face). >>>> >>>> >>>> Mark >>>> >>>> >>>> On Tue, Feb 9, 2016 at 5:19 PM, Leo Broukhis wrote: >>>>> >>>>> A caveat about using emojitracker.com : it doesn't count newer emoji >>>>> yet (e.g. U+1F37E bottle with popping cork is absent), thus, when they are >>>>> added, their counts will be skewed. >>>>> >>>>> Leo >>>>> >>>>> On Tue, Feb 9, 2016 at 2:00 AM, Leo Broukhis >>>>> wrote: >>>>>> >>>>>> Thank you for the links, quite mesmerizing! 
>>>>>>
>>>>>> On emojitracker.com (cumulative counts, but only on twitter,
>>>>>> AFAICS), U+1F4B5 ($) had quite a respectable count of 2932622 (well above
>>>>>> the middle of the page, around 70%ile), U+1F4B7 (pound) had 514536 (around
>>>>>> 30%ile), and U+1F4B4 and U+1F4B6 had around 353K and 388K resp. (around
>>>>>> 20%ile, but 10x more than the lowest counts, and about the same frequency
>>>>>> as various individual clock faces).
>>>>>>
>>>>>> It is quite evident that the dollar banknote emoji serves as a
>>>>>> stand-in for at least half a dozen of various currencies.
>>>>>>
>>>>>> On Mon, Feb 8, 2016 at 10:25 PM, Mark Davis ☕️
>>>>>> wrote:
>>>>>>
>>>>>>> I would suggest that you first gather statistics and present
>>>>>>> statistics on how often the current combinations are used compared to other
>>>>>>> emoji, eg by consulting sources such as:
>>>>>>>
>>>>>>> http://www.emojixpress.com/stats/
>>>>>>> or
>>>>>>> http://emojitracker.com/
>>>>>>>
>>>>>>>
>>>>>>> Mark
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Feb 8, 2016 at 8:34 PM, Leo Broukhis
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> There are
>>>>>>>>
>>>>>>>> 💴 U+01F4B4 Banknote With Yen Sign
>>>>>>>> 💵 U+01F4B5 Banknote With Dollar Sign
>>>>>>>> 💶 U+01F4B6 Banknote With Euro Sign
>>>>>>>> 💷 U+01F4B7 Banknote With Pound Sign
>>>>>>>>
>>>>>>>> This is clearly an incomplete set. It makes sense to have a generic
>>>>>>>> "enclosing banknote" emoji character which, when combined with a
>>>>>>>> currency sign, would produce the corresponding banknote, to
>>>>>>>> forestall
>>>>>>>> requests for individual emoji for banknotes with remaining currency
>>>>>>>> signs.
>>>>>>>>
>>>>>>>> Leo
>>>>>>>>

From mandel59 at gmail.com Thu Mar 3 13:42:02 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 04:42:02 +0900
Subject: Failure on Japanese dolls emoji
Message-ID: <56D8938A.2090000@gmail.com>

Hello, Unicode

3rd March is hina-matsuri (雛祭り; Doll's Day) in Japan, and there is an
emoji for it: 🎎 Japanese Dolls. I wrote an article on failures of that
emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437

Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri
dolls. I hope this post gives wider publicity to the difficulty of
implementing culture-dependent emoji.

Thanks,
Ryusei

From olopierpa at gmail.com Thu Mar 3 16:59:33 2016
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Thu, 3 Mar 2016 23:59:33 +0100
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8938A.2090000@gmail.com>
References: <56D8938A.2090000@gmail.com>
Message-ID:

On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>
> Hello, Unicode
>
> 3rd March is hina-matsuri (雛祭り; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>
> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls. I hope this post gives wider publicity to the difficulty of implementing culture-dependent emoji.

But the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
you are expecting a particular visual, which is not promised anywhere.

It is a bit like complaining that some "MOUNTAIN" emoji are wrong
because they don't look like Monte Bianco.

Cheers

From mandel59 at gmail.com Thu Mar 3 18:57:29 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 09:57:29 +0900
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DA94.4040509@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com>
Message-ID: <56D8DD79.9070102@gmail.com>

On 2016/03/04 7:59, Pierpaolo Bernardi wrote:
> On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>> Hello, Unicode
>>
>> 3rd March is hina-matsuri (雛祭り; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>>
>> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls. I hope this post gives wider publicity to the difficulty of implementing culture-dependent emoji.
> But, the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
> you are expecting a particular visual, which is not promised anywhere.
>
> Is a bit like if I complained that some "MOUNTAIN" emojis are wrong
> because they don't look like Monte Bianco.
>
> Cheers

JAPANESE DOLLS in Unicode was collected from the character sets of KDDI
and SoftBank, two Japanese telecom companies, and the emoji is named ?? ?
or ???? (both are hina-matsuri) in these specs. Here is a capture of the
chart with FPDAM8 data and glyphs, via
https://sites.google.com/site/unicodesymbols/Home/emoji-symbols

And the NamesList.txt of the Unicode Character Database gives the
description: Japanese Hinamatsuri or girls' doll festival. Aren't these
authoritative enough to say that the emoji should look like hina-matsuri?

Ryusei

From alolita.sharma at gmail.com Thu Mar 3 19:14:18 2016
From: alolita.sharma at gmail.com (Alolita Sharma)
Date: Thu, 3 Mar 2016 17:14:18 -0800
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DD79.9070102@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID:

Hi Ryusei,

I provided your useful feedback to the Emoji design team at Twitter and
they will update the twemoji for Japanese dolls.
Thanks for providing excellent examples in your post.

Best,
Alolita

On Thu, Mar 3, 2016 at 4:57 PM, Ryusei Yamaguchi wrote:

> On 2016/03/04 7:59, Pierpaolo Bernardi wrote:
>
> On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>
> Hello, Unicode
>
> 3rd March is hina-matsuri (雛祭り; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>
> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls. I hope this post gives wider publicity to the difficulty of implementing culture-dependent emoji.
>
> But, the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
> you are expecting a particular visual, which is not promised anywhere.
>
> Is a bit like if I complained that some "MOUNTAIN" emojis are wrong
> because they don't look like Monte Bianco.
>
> Cheers
>
>
> JAPANESE DOLLS in Unicode is collected from the character sets of KDDI and
> SoftBank, Japanese telecom companies, and the emoji is named as ??? or ????
> (both are hina-matsuri) in these specs. Here is a capture of Chart with
> FPDAM8 data and glyphs
>
> via
> https://sites.google.com/site/unicodesymbols/Home/emoji-symbols
>
>
> And the NamesList.txt of Unicode Character Database gives the description:
> Japanese Hinamatsuri or girls' doll festival. Aren't they the authorities
> to let the emoji look like hina-matsuri?
>
> Ryusei
>

From olopierpa at gmail.com Thu Mar 3 19:20:08 2016
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Fri, 4 Mar 2016 02:20:08 +0100
Subject: Failure on Japanese dolls emoji
In-Reply-To: <56D8DD79.9070102@gmail.com>
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID:

On Fri, Mar 4, 2016 at 1:57 AM, Ryusei Yamaguchi wrote:

> And the NamesList.txt of Unicode Character Database gives the description: Japanese Hinamatsuri or girls' doll festival. Aren't they the authorities to let the emoji look like hina-matsuri?

OK. Then you are right in your complaint!

Cheers

From mandel59 at gmail.com Thu Mar 3 21:12:15 2016
From: mandel59 at gmail.com (Ryusei Yamaguchi)
Date: Fri, 4 Mar 2016 12:12:15 +0900
Subject: Failure on Japanese dolls emoji
In-Reply-To:
References: <56D8938A.2090000@gmail.com> <56D8DA94.4040509@gmail.com> <56D8DD79.9070102@gmail.com>
Message-ID: <56D8FD0F.7080503@gmail.com>

Thank you, Alolita :)

Ryusei

On 2016/03/04 10:14, Alolita Sharma wrote:
> Hi Ryusei,
>
> I provided your useful feedback to the Emoji design team at Twitter and
> they will update the twemoji for Japanese dolls.
> Thanks for providing excellent examples in your post.
>
> Best,
> Alolita
>
>
>
> On Thu, Mar 3, 2016 at 4:57 PM, Ryusei Yamaguchi
> wrote:
>
> On 2016/03/04 7:59, Pierpaolo Bernardi wrote:
>> On Thu, Mar 3, 2016 at 8:42 PM, Ryusei Yamaguchi wrote:
>>> Hello, Unicode
>>>
>>> 3rd March is hina-matsuri (雛祭り; Doll's Day) in Japan, and there is an emoji for it: Japanese Dolls. I wrote an article on failures of that emoji: http://mandel59.hateblo.jp/entry/2016/03/04/041437
>>>
>>> Some vendors ship Japanese Dolls emoji that don't seem to be hina-matsuri dolls.
>>> I hope this post gives wider publicity to the difficulty of implementing culture-dependent emoji.
>> But, the name of the emoji is "JAPANESE DOLLS", not hina-matsuri, so
>> you are expecting a particular visual, which is not promised anywhere.
>>
>> Is a bit like if I complained that some "MOUNTAIN" emojis are wrong
>> because they don't look like Monte Bianco.
>>
>> Cheers
>
> JAPANESE DOLLS in Unicode is collected from the character sets of
> KDDI and SoftBank, Japanese telecom companies, and the emoji is
> named as ??? or ???? (both are hina-matsuri) in these specs.
> Here is a capture of Chart with FPDAM8 data and glyphs
>
> via
> https://sites.google.com/site/unicodesymbols/Home/emoji-symbols
>
>
> And the NamesList.txt of Unicode Character Database gives the
> description: Japanese Hinamatsuri or girls' doll festival. Aren't
> they the authorities to let the emoji look like hina-matsuri?
>
> Ryusei
>
>

From doug at ewellic.org Fri Mar 4 10:51:38 2016
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 04 Mar 2016 09:51:38 -0700
Subject: Failure on Japanese dolls emoji
Message-ID: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>

Pierpaolo Bernardi wrote:

>> And the NamesList.txt of Unicode Character Database gives the
>> description: Japanese Hinamatsuri or girls' doll festival. Aren't
>> they the authorities to let the emoji look like hina-matsuri?
>
> OK. Then you are right in your complaint!

FWIW, I agree that annotations in NamesList.txt are a better
justification for prescribing the glyph design of a Unicode character,
even an emoji, than tribal knowledge about the history or origin of the
character.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From asmus-inc at ix.netcom.com Fri Mar 4 11:19:33 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 4 Mar 2016 09:19:33 -0800
Subject: Failure on Japanese dolls emoji
In-Reply-To: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>
References: <20160304095138.665a7a7059d7ee80bb4d670165c8327d.fc7bd41270.wbe@email03.secureserver.net>
Message-ID: <56D9C3A5.7080606@ix.netcom.com>

An HTML attachment was scrubbed...

From prosfilaes at gmail.com Sun Mar 6 22:56:02 2016
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 07 Mar 2016 04:56:02 +0000
Subject: Mammal emoji
Message-ID:

Seeing the presence of foxes on the upcoming emoji list, I remembered that
the Audubon Mammals (North America) app has silhouettes of mammals on the
browse-by-shape tab. So let's see if they're covered:

Armored Mammals (-): Okay, we're off to a bad start. The image here is
sort of porcupine-ish, and there are two distinct creatures under the
label, the porcupine(-) and armadillo(-), neither of which is in Unicode.

Bats (N): In the new list. Which is good; they're sort of iconic.

Bears(+)

Cats(+): Several varieties

Chipmunks, Squirrels and Prairie Dogs(+): Breaking down more than the
icons the app uses, there is a Chipmunk(+) emoji, no Squirrel(-) emoji;
that might be an oversight. Prairie dogs(-) probably aren't.

Hoofed Mammals(+): Breaking it down more: Bison(-), Sheep(+), Reindeer (-)
(and that's sort of surprising), Peccary(-), Deer(N), Moose(-) (aka Elk in
Europe), Ox(+) (actually Muskox ... and I'm pretty sure that's a
distinction Unicode doesn't want to worry about), Pronghorn (-) (nor
antelope(-), or the actually related giraffe(-) and okapi(-). Probably
covered by the unrelated deer.)
Boar(+), Horse(+)

Large Rodents(-): Beaver(-), Muskrat(-), Marmot(-), Nutria(-)

Marine Mammals(+): Dolphin(+), Whale(+), Seal (-), Sea Lion(-), Walrus(-),
Manatee(-)

Mice and Rats(+): Mouse(+), Rat(+)

Opossum(-):

Otters(-):

Rabbits and Hares(+):

Raccoons and Their Kin(-):

Shrews and Moles(-):

Voles, Lemmings, Pikas, and Pocket Gophers(-):

Weasels, Skunks and Their Kin(-): While a disparate group, badgers(-),
skunks(-), ferrets(-), weasels(-) and wolverines(-) all have arguments for
encoding.

Wolves, Foxes, and Coyote(+): Fox(+), Dog(+), Wolf(+), Coyote(-)

So nine icons out of the 17 have a reasonable encoding in Unicode. To
cover the set would need an armadillo or porcupine, a beaver, a possum, an
otter, a raccoon, a shrew, a lemming, and a weasel or skunk. Beavers (O
Canada!), raccoons, ferrets/weasel (popular pet) and skunk (emoji uses
abound) probably have the best encoding arguments there.

(This is not an actual proposal, but feel free to forward it on if anyone
might want to make one. Just a discussion of a set of icons in the
reflection of emoji.)

From petercon at microsoft.com Mon Mar 7 14:58:45 2016
From: petercon at microsoft.com (Peter Constable)
Date: Mon, 7 Mar 2016 20:58:45 +0000
Subject: Mammal emoji
In-Reply-To:
References:
Message-ID:

I know you're not proposing anything and just providing info for
discussion. I want to make sure it's clear to others that there is no
requirement for encoded emoji in Unicode to provide comprehensive coverage
(by any measure) of any semantic or conceptual domain. So, if there isn't
any raccoon emoji in Unicode, that doesn't imply that there must or ever
will be a raccoon emoji.

Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Starner
Sent: Sunday, March 6, 2016 8:56 PM
To: Unicode Mailing List
Subject: Mammal emoji

Seeing the presence of foxes on the upcoming emoji list, I remembered the
Audubon Mammals (North America) app has silhouettes of mammals on the
browse by shape tab. So let's see if they're covered:

Armored Mammals (-): Okay, we're off to a bad start. The image here is
sort of porcupine-ish, and there's two distinct creatures under the label,
the porcupine(-) and armadillo(-). Neither of which are in Unicode.

Bats (N): In the new list. Which is good; they're sort of iconic.

Bears(+)

Cats(+): Several varieties

Chipmunks, Squirrels and Prairie Dogs(+): Breaking down more than icons
the app uses, there is a Chipmunk(+) emoji, no Squirrel(-) emoji; that
might be an oversight. Prairie dogs(-) probably aren't.

Hoofed Mammals(+): Breaking it down more Bison(-), Sheep(+), Reindeer (-)
(and that's sort of surprising), Peccary(-), Deer(N), Moose(-) (aka Elk in
Europe), Ox(+) (actually Muskox ... and I'm pretty sure that's a
distinction Unicode doesn't want to worry about), Pronghorn (-) (nor
antelope(-), or the actually related giraffe(-) and okapi(-). Probably
covered by the unrelated deer.) Boar(+), Horse(+)

Large Rodents(-): Beaver(-), Muskrat(-), Marmot(-), Nutria(-)

Marine Mammals(+): Dolphin(+), Whale(+), Seal (-), Sea Lion(-), Walrus(-),
Manatee(-)

Mice and Rats(+): Mouse(+), Rat(+)

Opossum(-):

Otters(-):

Rabbits and Hares(+):

Raccoons and Their Kin(-):

Shrews and Moles(-):

Voles, Lemmings, Pikas, and Pocket Gophers(-):

Weasels, Skunks and Their Kin(-): While a disparate group, badgers(-),
skunks(-), ferrets(-), weasels(-) and wolverines(-) all have arguments for
encoding.

Wolves, Foxes, and Coyote(+): Fox(+), Dog(+), Wolf(+), Coyote(-)

So nine icons out of the 17 have a reasonable encoding in Unicode.
To cover the set would need an armadillo or porcupine, a beaver, a possum,
an otter, a raccoon, a shrew, a lemming, and a weasel or skunk. Beavers (O
Canada!), raccoons, ferrets/weasel (popular pet) and skunk (emoji uses
abound) probably have the best encoding arguments there.

(This is not an actual proposal, but feel free to forward it on if anyone
might want to make one. Just a discussion of a set of icons in the
reflection of emoji.)

From asmus-inc at ix.netcom.com Mon Mar 7 15:11:31 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 7 Mar 2016 13:11:31 -0800
Subject: Mammal emoji
In-Reply-To:
References:
Message-ID: <56DDEE83.5080504@ix.netcom.com>

An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Mon Mar 7 19:02:28 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 8 Mar 2016 02:02:28 +0100
Subject: Mammal emoji
In-Reply-To: <56DDEE83.5080504@ix.netcom.com>
References: <56DDEE83.5080504@ix.netcom.com>
Message-ID:

2016-03-07 22:11 GMT+01:00 Asmus Freytag (t) :

> Sometimes looking at semantic domains points out candidates to consider.
> The ultimate reason for requesting another mammal emoji would rest on
> the need to be included in communications

Is there an emoji for the concept of "overbooked/too much work"? That is
the last state (and cause) before either:

- at best (which is not unexpected if people care about each other and
  themselves!), suddenly abandoning the work to do something else, or
- at worst (if it was not personally anticipated, and other people didn't
  care), a personal breakdown (with deep, costly and lasting
  consequences).

This kind of breakdown caused by earlier overbooking at work now has a
popular name, "burnout" (another candidate emoji, but more difficult to
represent graphically, as you could represent the state of someone
depressed, but not its cause).
The term is becoming popular today with the (ongoing) regulation of
working conditions and the prevention of risks by organisations (basically
through better distribution of responsibilities, better scheduling of
tasks, preservation of workers' personal time, the choice to delay some
work, and accepting that not everything can be done with existing
resources, even if it could "pay" in the short term).

I think that many Unicoders on this list may be at this early stage: they
have trouble following everything in the pipeline of incoming requests or
proposals, and the UTC is probably under-resourced.

From ori at avtalion.name Wed Mar 9 15:17:17 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Wed, 9 Mar 2016 23:17:17 +0200
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
Message-ID:

Unicode includes the following symbols as "Go Markers":
* U+2686 ⚆ WHITE CIRCLE WITH DOT RIGHT
* U+2687 ⚇ WHITE CIRCLE WITH TWO DOTS
* U+2688 ⚈ BLACK CIRCLE WITH WHITE DOT RIGHT
* U+2689 ⚉ BLACK CIRCLE WITH TWO WHITE DOTS

It is unclear what they are for. I hope someone can explain.

1) I could not find any Go notation that uses dots inside the stones.
2) Why are there no symbols for white/black stones without dots?
3) An earlier proposal [1] suggested additional symbols:
* GRAY CIRCLE WITH GRAY DOT RIGHT
* GRAY CIRCLE WITH GRAY TWO DOTS
* GRAY FILLED CIRCLE WITH WHITE DOT RIGHT
* GRAY FILLED CIRCLE WITH WHITE TWO DOTS
What was their purpose? And why are Go Markers proposed as
"Mathematical symbols"? Are they meant for mathematical research of
the game of Go and not for actual notation?

[1] http://www.unicode.org/L2/L2001/01067-n2318-mathadd4.pdf

Thanks in advance!
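
[Editorial sketch, not part of the original post.] The four code points
listed in the question can be inspected programmatically. The snippet
below is a minimal illustration that assumes Python's standard
unicodedata module; the helper name describe() is made up for this
example. It only reports each character's formal name and general
category, and says nothing about the intended use, which is what the
question is really about.

```python
import unicodedata

# The four "Go Markers" named in the question: U+2686 through U+2689.
GO_MARKERS = range(0x2686, 0x268A)


def describe(cp: int) -> str:
    """Return 'U+XXXX NAME [category]' for one code point (hypothetical helper)."""
    ch = chr(cp)
    return f"U+{cp:04X} {unicodedata.name(ch)} [{unicodedata.category(ch)}]"


for cp in GO_MARKERS:
    print(describe(cp))
```

All four come back in general category So (Symbol, other); nothing in the
character properties marks them as mathematical or as game notation.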

From kenwhistler at att.net Wed Mar 9 16:52:34 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Wed, 9 Mar 2016 14:52:34 -0800
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References:
Message-ID: <56E0A932.8010909@att.net>

I don't know the answer to this. But I suspect that the source
was one of the collections of fonts associated with the STIX
project research that led to the collection of mathematical symbols
additions noted in L2/01-067 (superseded by L2/01-142), as well
as the earlier mathematical symbols proposals with the bulk of
the symbols that were added to Unicode 3.2.

Given that context, it is, indeed, most likely that the symbols were
associated with some publication(s) in game theory, rather than
with professional Go notations per se. See, for example,
Mathematical Go: Chilling Gets the Last Point:

http://www.amazon.com/Mathematical-Go-Chilling-Gets-Point/dp/1568810326

I don't see black/white circles with dots in the bit of that publication
scanned on Amazon, but it does use a black circle with a delta
symbol as part of the game notation for discussion, as well as
black and white circles with numbers, denoting sequences of stone
placements.

But to know for sure, you would probably have to get confirmation
of original sources from Barbara Beeton and/or Patrick Ion,
who collected together symbol candidates from a multitude
of print sources back in the 1998 - 2001 time frame.

--Ken

On 3/9/2016 1:17 PM, Ori Avtalion wrote:
> Unicode includes the following symbols as "Go Markers":
> * U+2686 ⚆ WHITE CIRCLE WITH DOT RIGHT
> * U+2687 ⚇ WHITE CIRCLE WITH TWO DOTS
> * U+2688 ⚈ BLACK CIRCLE WITH WHITE DOT RIGHT
> * U+2689 ⚉ BLACK CIRCLE WITH TWO WHITE DOTS
>
> It is unclear what they are for. I hope someone could explain.
>
> 1) I could not find any Go notation that uses dots inside the stones.
> 2) Why are there no symbols for white/black stones without dots?
> 3) An earlier proposal [1] suggested additional symbols:
> * GRAY CIRCLE WITH GRAY DOT RIGHT
> * GRAY CIRCLE WITH GRAY TWO DOTS
> * GRAY FILLED CIRCLE WITH WHITE DOT RIGHT
> * GRAY FILLED CIRCLE WITH WHITE TWO DOTS
> what was their purpose? Any why are Go Markers proposed as
> "Mathematical symbols"? Are they meant for mathematical research of
> the game of Go and not for actual notation?
>
> [1] http://www.unicode.org/L2/L2001/01067-n2318-mathadd4.pdf
>
> Thanks in advance!
>
>

From jtauber at jtauber.com Wed Mar 9 17:17:55 2016
From: jtauber at jtauber.com (James Tauber)
Date: Wed, 9 Mar 2016 18:17:55 -0500
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID:

A black (or white) circle with a "delta"/triangle is common in general Go
books, as are black and white circles with numbers (up into the hundreds).
I've also seen circles and squares inside the black and white circles.

A quick look through the 10 or so printed Go books I have turns up no
examples of those 4 Go Markers U+2686 to U+2689.

James

On Wed, Mar 9, 2016 at 5:52 PM, Ken Whistler wrote:

> I don't know the answer to this. But I suspect that that the source
> was from one of the collection of fonts associated with the STIX
> project research that led to the collection of mathematical symbols
> additions noted in L2/01-067 (superseded by L2/01-142), as well
> as the earlier mathematical symbols proposals with the bulk of
> the symbols that were added to Unicode 3.2.
>
> Given that context, it is, indeed, most likely that the symbols were
> associated with some publication(s) in game theory, rather than
> with professional Go notations per se.
> See, for example,
> Mathematical Go: Chilling Gets the Last Point:
>
> http://www.amazon.com/Mathematical-Go-Chilling-Gets-Point/dp/1568810326
>
> I don't see black/white circles with dots in the bit of that publication
> scanned on Amazon, but it does use a black circle with a delta
> symbol as part of the game notation for discussion, as well as
> black and white circles with numbers, denoting sequences of stone
> placements.
>
> But to know for sure, you would probably have to get confirmation
> of original sources from Barbara Beeton and/or Patrick Ion,
> who collected together symbol candidates from a multitude
> of print sources back in the 1998 - 2001 time frame.
>
> --Ken
>
>
> On 3/9/2016 1:17 PM, Ori Avtalion wrote:
>
>> Unicode includes the following symbols as "Go Markers":
>> * U+2686 ⚆ WHITE CIRCLE WITH DOT RIGHT
>> * U+2687 ⚇ WHITE CIRCLE WITH TWO DOTS
>> * U+2688 ⚈ BLACK CIRCLE WITH WHITE DOT RIGHT
>> * U+2689 ⚉ BLACK CIRCLE WITH TWO WHITE DOTS
>>
>> It is unclear what they are for. I hope someone could explain.
>>
>> 1) I could not find any Go notation that uses dots inside the stones.
>> 2) Why are there no symbols for white/black stones without dots?
>> 3) An earlier proposal [1] suggested additional symbols:
>> * GRAY CIRCLE WITH GRAY DOT RIGHT
>> * GRAY CIRCLE WITH GRAY TWO DOTS
>> * GRAY FILLED CIRCLE WITH WHITE DOT RIGHT
>> * GRAY FILLED CIRCLE WITH WHITE TWO DOTS
>> what was their purpose? Any why are Go Markers proposed as
>> "Mathematical symbols"? Are they meant for mathematical research of
>> the game of Go and not for actual notation?
>>
>> [1] http://www.unicode.org/L2/L2001/01067-n2318-mathadd4.pdf
>>
>> Thanks in advance!
>>
>>
>>
>

--
James Tauber
http://jtauber.com/
@jtauber on Twitter

From duerst at it.aoyama.ac.jp Thu Mar 10 01:00:57 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 10 Mar 2016 16:00:57 +0900
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID: <56E11BA9.4030703@it.aoyama.ac.jp>

On 2016/03/10 07:52, Ken Whistler wrote:
> I don't know the answer to this. But I suspect that that the source
> was from one of the collection of fonts associated with the STIX
> project research that led to the collection of mathematical symbols
> additions noted in L2/01-067 (superseded by L2/01-142), as well
> as the earlier mathematical symbols proposals with the bulk of
> the symbols that were added to Unicode 3.2.
>
> Given that context, it is, indeed, most likely that the symbols were
> associated with some publication(s) in game theory, rather than
> with professional Go notations per se. See, for example,
> Mathematical Go: Chilling Gets the Last Point:
>
> http://www.amazon.com/Mathematical-Go-Chilling-Gets-Point/dp/1568810326

I own and have read the actual book. For examples of the characters
mentioned, please see e.g. pp. 17, 21,.... I think the grey stones in
the earlier proposal were left out because in the book, there are board
diagrams with e.g. 1/4 of a stone gray,...

So yes, these symbols are used for mathematical research of the game
of Go, and not, as far as I know, for actual notation. The research is in
combinatorial game theory, where very weird infinitesimal numbers (e.g.
greater than 0 but smaller than any positive real number!) are often
used. These numbers are part of the 'Surreal Numbers' introduced in
Donald Knuth's 1974 book of the same name.

And while I have only seen the symbols in mathematical work, that theory
can be highly relevant in actual endgames, and at least professional
players should be aware of it (the theory, not the symbols), because
games can often be decided by the last point won or lost in the endgame.

> I don't see black/white circles with dots in the bit of that publication
> scanned on Amazon, but it does use a black circle with a delta
> symbol as part of the game notation for discussion, as well as
> black and white circles with numbers, denoting sequences of stone
> placements.

As James said, the circles with numbers are extremely widely used; it's
the basic way to show games. Because stones are not moved around and
only very rarely removed from the board, the main notation for Go is not
a list of moves with coordinates (as e.g. in Chess), but just a diagram
of the final (or intermediate) board position with every move labeled
with a number. But because these numbers can go up to the 200s, it
doesn't make sense to register them all as characters (one would need
over 500!).

Regards, Martin.

> But to know for sure, you would probably have to get confirmation
> of original sources from Barbara Beeton and/or Patrick Ion,
> who collected together symbol candidates from a multitude
> of print sources back in the 1998 - 2001 time frame.
>
> --Ken
>
> On 3/9/2016 1:17 PM, Ori Avtalion wrote:
>> Unicode includes the following symbols as "Go Markers":
>> * U+2686 ⚆ WHITE CIRCLE WITH DOT RIGHT
>> * U+2687 ⚇ WHITE CIRCLE WITH TWO DOTS
>> * U+2688 ⚈ BLACK CIRCLE WITH WHITE DOT RIGHT
>> * U+2689 ⚉ BLACK CIRCLE WITH TWO WHITE DOTS
>>
>> It is unclear what they are for. I hope someone could explain.
>>
>> 1) I could not find any Go notation that uses dots inside the stones.
>> 2) Why are there no symbols for white/black stones without dots?
>> 3) An earlier proposal [1] suggested additional symbols:
>> * GRAY CIRCLE WITH GRAY DOT RIGHT
>> * GRAY CIRCLE WITH GRAY TWO DOTS
>> * GRAY FILLED CIRCLE WITH WHITE DOT RIGHT
>> * GRAY FILLED CIRCLE WITH WHITE TWO DOTS
>> what was their purpose? Any why are Go Markers proposed as
>> "Mathematical symbols"? Are they meant for mathematical research of
>> the game of Go and not for actual notation?
>>
>> [1] http://www.unicode.org/L2/L2001/01067-n2318-mathadd4.pdf
>>
>> Thanks in advance!
>>
>>
>

From andrewcwest at gmail.com Thu Mar 10 03:17:14 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 09:17:14 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E11BA9.4030703@it.aoyama.ac.jp>
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 07:00, Martin J. Dürst wrote:
>
> So yes, these symbols are used for for mathematical research of the game of
> Go, and not as far as I know for actual notation.

Which indicates how absurd the proposal to emojify these four characters is.

http://www.unicode.org/L2/L2016/16021-game-pieces-emoji.pdf

Andrew

From andrewcwest at gmail.com Thu Mar 10 05:26:05 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 11:26:05 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E11BA9.4030703@it.aoyama.ac.jp>
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 07:00, Martin J. Dürst wrote:
>
> because these numbers can go up to the 200s, it doesn't make sense to
> register them all as characters (one would need over 500!).

I don't get why that would make no sense. We already have CIRCLED
NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and
these characters are widely used (in East Asian contexts, at least)
for representing note numbers in text.
In my opinion it would be
eminently sensible to extend both series up to 999, which would cover
the needs of Go notation as well as note numbering for the vast
majority of users.

Andrew

From leoboiko at gmail.com Thu Mar 10 05:34:30 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Thu, 10 Mar 2016 08:34:30 -0300
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?

2016/03/10 8:30 "Andrew West" :

> On 10 March 2016 at 07:00, Martin J. Dürst wrote:
> >
> > because these numbers can go up to the 200s, it doesn't make sense to
> > register them all as characters (one would need over 500!).
>
> I don't get why that would make no sense. We already have CIRCLED
> NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and
> these characters are widely used (in East Asian contexts, at least)
> for representing note numbers in text. In my opinion it would be
> eminently sensible to extend both series up to 999, which would cover
> the needs of Go notation and as well as note numbering for the vast
> majority of users.
>
> Andrew
>
>

From andrewcwest at gmail.com Thu Mar 10 06:00:38 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Thu, 10 Mar 2016 12:00:38 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID:

On 10 March 2016 at 11:34, Leonardo Boiko wrote:
> Isn't it better to use some sort of COMBINING ENCLOSING CIRCLE?
Of course that approach is possible, but it is quite problematic, both from the perspective of the font developer and the end user, because the circle would have to be able to combine with an indefinite number of preceding characters, and it is not easy to either determine where the boundary is (in the font) or specify the boundary (by the end user). For example, given a text string of "1234", what does the combining circle combine with? Unitary characters would be just way simpler and more reliable.

Andrew

From everson at evertype.com Thu Mar 10 06:17:24 2016
From: everson at evertype.com (Michael Everson)
Date: Thu, 10 Mar 2016 12:17:24 +0000
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID: <2A330A52-29E8-4A40-837C-C8979171C670@evertype.com>

On 10 Mar 2016, at 11:26, Andrew West wrote:
>
> On 10 March 2016 at 07:00, Martin J. Dürst wrote:
>>
>> because these numbers can go up to the 200s, it doesn't make sense to register them all as characters (one would need over 500!).
>
> I don't get why that would make no sense. We already have CIRCLED NUMBER 1 through 50, and NEGATIVE CIRCLED NUMBER 1 through 20, and these characters are widely used (in East Asian contexts, at least) for representing note numbers in text. In my opinion it would be eminently sensible to extend both series up to 999, which would cover the needs of Go notation and as well as note numbering for the vast majority of users.

Good ideas don't always get past the UTC. Remember when we wanted to encode 256 two-letter codes for the country flags? That was replaced by the "emoji flag alphabet". Now some people want to use combinations for currency emojis, and evidently that (with some combining emoji banknote character) would have been easier with 256 atomic codes than it would be with the flag alphabet.
Michael Everson * http://www.evertype.com/

From oren.watson at gmail.com Wed Mar 9 21:08:14 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Wed, 9 Mar 2016 22:08:14 -0500
Subject: Gaps in Mathematical Alphanumeric Symbols
Message-ID:

I was surprised to find out that there are gaps in the Mathematical Alphanumeric Symbols block (U+1D400 to U+1D7FF). The gaps are associated with the inclusion of similar symbols in other blocks, chiefly the Letterlike Symbols block. Examples of such gaps include U+1D49D, U+1D506, etc.

But as a matter of convenience and simplicity, these missing code points could have been defined as decomposing directly to the equivalents in Letterlike Symbols, in the same manner that the Ångström sign decomposes to the letter Å. That would make these ranges contiguous.

Is there a policy about leaving gaps in otherwise contiguous ranges of code points?

--Oren Watson

From ori at avtalion.name Thu Mar 10 10:35:16 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Thu, 10 Mar 2016 18:35:16 +0200
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To: <56E0A932.8010909@att.net>
References: <56E0A932.8010909@att.net>
Message-ID:

On Thu, Mar 10, 2016 at 12:52 AM, Ken Whistler wrote:
> But to know for sure, you would probably have to get confirmation
> of original sources from Barbara Beeton and/or Patrick Ion,
> who collected together symbol candidates from a multitude
> of print sources back in the 1998 - 2001 time frame.

I have emailed Barbara with the question, and pointed to this thread. Will report back when I get a response.

On Twitter, someone pointed out an example of the two-dot notation, and even a center-dot notation (instead of the "right dot" of U+2686 ⚆).
See page 4 (printed page number 206) of this PDF: http://library.msri.org/books/Book29/files/kim.pdf From asmus-inc at ix.netcom.com Thu Mar 10 10:43:29 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 10 Mar 2016 08:43:29 -0800 Subject: Gaps in Mathematical Alphanumeric Symbols In-Reply-To: References: Message-ID: <56E1A431.2030005@ix.netcom.com> An HTML attachment was scrubbed... URL: From oren.watson at gmail.com Thu Mar 10 13:09:05 2016 From: oren.watson at gmail.com (Oren Watson) Date: Thu, 10 Mar 2016 14:09:05 -0500 Subject: Gaps in Mathematical Alphanumeric Symbols In-Reply-To: <56E1A431.2030005@ix.netcom.com> References: <56E1A431.2030005@ix.netcom.com> Message-ID: Thank you for the detailed explanation, Asmus. Is there a standard denoting which characters are part of each "mathematical variable alphabet"? There is a table on Wikipedia < https://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols#Latin_letters> but the placement of characters into the gaps is unsourced. Perhaps I'm overthinking this, but I don't think it's necessarily obvious that the character BLACK-LETTER CAPITAL C should be used as the nonexistent character *MATHEMATICAL FRAKTUR CAPITAL C. Is there a document clarifying this? -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu Mar 10 13:48:03 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 10 Mar 2016 19:48:03 +0000 Subject: Gaps in Mathematical Alphanumeric Symbols In-Reply-To: References: <56E1A431.2030005@ix.netcom.com> Message-ID: On 10 March 2016 at 19:09, Oren Watson wrote: > > Is there a standard denoting which characters are part of each "mathematical > variable alphabet"? There is a table on Wikipedia > > but the placement of characters into the gaps is unsourced. 
Perhaps I'm > overthinking this, but I don't think it's necessarily obvious that the > character BLACK-LETTER CAPITAL C should be used as the nonexistent character > *MATHEMATICAL FRAKTUR CAPITAL C. Is there a document clarifying this? Yes, the code charts in the Unicode Standard: http://www.unicode.org/charts/PDF/U1D400.pdf The annotation for each reserved code point refers to the character that logically belongs there. Andrew From doug at ewellic.org Thu Mar 10 14:49:17 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 10 Mar 2016 13:49:17 -0700 Subject: Gaps in Mathematical Alphanumeric Symbols Message-ID: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Andrew West replied to Oren Watson: >> Is there a standard denoting which characters are part of each >> "mathematical variable alphabet"? There is a table on Wikipedia [...] > > Yes, the code charts in the Unicode Standard: > > http://www.unicode.org/charts/PDF/U1D400.pdf > > The annotation for each reserved code point refers to the character > that logically belongs there. NamesList.txt also has this information, and unlike the others, it's both official and machine-readable: 1D505 MATHEMATICAL FRAKTUR CAPITAL B # 0042 latin capital letter b 1D506 x (black-letter capital c - 212D) -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From andrewcwest at gmail.com Thu Mar 10 15:00:46 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 10 Mar 2016 21:00:46 +0000 Subject: Gaps in Mathematical Alphanumeric Symbols In-Reply-To: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Message-ID: On 10 March 2016 at 20:49, Doug Ewell wrote: > >> >> http://www.unicode.org/charts/PDF/U1D400.pdf >> >> The annotation for each reserved code point refers to the character >> that logically belongs there. 
> > NamesList.txt also has this information, and unlike the others, it's > both official and machine-readable: It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is machine-readable, although the file specifically warns that "this file should not be parsed for machine-readable information". Andrew From kenwhistler at att.net Thu Mar 10 15:40:47 2016 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 10 Mar 2016 13:40:47 -0800 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> Message-ID: <56E1E9DF.9060405@att.net> On 3/10/2016 1:00 PM, Andrew West wrote: > It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is > machine-readable, although the file specifically warns that "this file > should not be parsed for machine-readable information". > NamesList.txt is just a structured text file, so of course it is "machine-readable". The problem is that because it is machine-readable, people tend to jump to the conclusion that all the information they need can simply be reliably parsed out of that file. It can't be. The reason is that NamesList.txt is itself the result of a complicated merge of code point, name, and decomposition mapping information from UnicodeData.txt, of listings of standardized variation sequences from StandardizedVariants.txt, and then a very long list of annotational material, including names list subhead material, etc., maintained in other sources. If people actually want to get reliably parsed data on code points, names, and decomposition mappings, they should get that directly from UnicodeData.txt. Likewise for information about standardized variation sequences, from StandardizedVariants.txt. The *reason* that NamesList.txt exists at all is to drive the tool, unibook, that formats the full Unicode code charts for posting. 
It is only posted in the Unicode Character Database at all as a matter of convenience, to give people access to a text-only version of the names list that appears in the fully formatted PDF versions of the code charts that contain all the representative glyphs.

NamesList.txt should *not* be data mined. Well, nobody can stop people from attempting to do so, of course, but they tend to end up confused and disappointed, because their assumptions going in don't match the editorial realities that affect the development of the annotational content added to the names list and the actual use for which NamesList.txt was created in the first place.

--Ken

From doug at ewellic.org Thu Mar 10 15:48:10 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 10 Mar 2016 14:48:10 -0700
Subject: Gaps in Mathematical Alphanumeric Symbols
Message-ID: <20160310144810.665a7a7059d7ee80bb4d670165c8327d.acb8ba6be9.wbe@email03.secureserver.net>

Andrew West wrote:

> It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is
> machine-readable, although the file specifically warns that "this file
> should not be parsed for machine-readable information".

Yes, I saw that mattress tag.

I could not find any other files in the UCD proper that reference the unassigned code points within the MAS block, except for DerivedGeneralCategory.txt, which simply says the code points are unassigned.

MathClass-*.txt and MathClassEx-*.txt have the information in question in a machine-readable format, if only in comments:

1D505;A
#1D506=212D;A

1D505;A;𝔅;Bfr;ISOMFRK;;MATHEMATICAL FRAKTUR CAPITAL B
#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C

These files are not part of the UCD, and aren't updated with every Unicode release, but might be a better reference. Perhaps UTC members can offer a recommendation here.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
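
For readers who want to try the approach described in the message above, here is a minimal sketch of pulling the commented-out "reserved=unified" pairs out of MathClass-style lines. It assumes only the line shapes quoted in that message; the function name `parse_gap_mappings` and the `sample` list are hypothetical, and real MathClass files may contain variations that this pattern does not handle.

```python
import re

# Matches commented-out entries such as "#1D506=212D;A" or
# "#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C" (shapes taken from the
# message above; anything else is ignored).
GAP_LINE = re.compile(r"^#([0-9A-F]{4,6})=([0-9A-F]{4,6});")


def parse_gap_mappings(lines):
    """Return {reserved code point: unified code point} as integers."""
    mappings = {}
    for line in lines:
        match = GAP_LINE.match(line.strip())
        if match:
            reserved, unified = (int(group, 16) for group in match.groups())
            mappings[reserved] = unified
    return mappings


# Hypothetical sample built from the lines quoted in the message.
sample = [
    "1D505;A",
    "#1D506=212D;A",
    "#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C",
]

for reserved, unified in parse_gap_mappings(sample).items():
    print(f"U+{reserved:04X} -> U+{unified:04X}")
```

Because the pairs live in comment lines rather than in a data field, a sketch like this is best treated as a convenience and checked by hand against the code charts.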
From doug at ewellic.org Thu Mar 10 16:14:09 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 10 Mar 2016 15:14:09 -0700 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) Message-ID: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> Ken Whistler wrote: > NamesList.txt should *not* be data mined. And yet it was the only Unicode data file utilized by MSKLC. There are many possible reasons for this approach, which we will probably never know. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Thu Mar 10 19:05:43 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 10 Mar 2016 17:05:43 -0800 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> References: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net> Message-ID: <56E219E7.7030708@ix.netcom.com> An HTML attachment was scrubbed... URL: From js_choi at icloud.com Thu Mar 10 19:49:52 2016 From: js_choi at icloud.com (=?utf-8?Q?=22J=2E=C2=A0S=2E_Choi=22?=) Date: Thu, 10 Mar 2016 19:49:52 -0600 Subject: NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols) In-Reply-To: <56E1E9DF.9060405@att.net> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> Message-ID: <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> > On Mar 10, 2016, at 3:40 PM, Ken Whistler wrote: > > On 3/10/2016 1:00 PM, Andrew West wrote: >> It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is >> machine-readable, although the file specifically warns that "this file >> should not be parsed for machine-readable information". >> > > NamesList.txt is just a structured text file, so of course it is "machine-readable". 
> The problem is that because it is machine-readable, people tend to jump > to the conclusion that all the information they need can simply be > reliably parsed out of that file. > > It can't be. > > The reason is that NamesList.txt is itself the result of a complicated merge > of code point, name, and decomposition mapping information from > UnicodeData.txt, of listings of standardized variation sequences from > StandardizedVariants.txt, and then a very long list of annotational > material, including names list subhead material, etc., maintained in > other sources. > > If people actually want to get reliably parsed data on code points, names, > and decomposition mappings, they should get that directly from > UnicodeData.txt. Likewise for information about standardized variation > sequences, from StandardizedVariants.txt. > > The *reason* that NamesList.txt exists at all is to drive the tool, unibook, > that formats the full Unicode code charts for posting. It is only > posted in the Unicode Character Database at all as a matter of > convenience, to give people access to a text only version of the > names list that appears in the fully formatted pdf versions of the code charts > that contain all the representative glyphs. > > NamesList.txt should *not* be data mined. Well, nobody can stop > people from attempting to do so, of course, but they tend to end > up confused and disappointed, because their assumptions going in > don't match the editorial realities that affect the development of > the annotational content added to the names list and the actual > use for which NamesList.txt was created in the first place. > > --Ken > > On Mar 10, 2016, at 7:05 PM, Asmus Freytag (t) wrote: > > On 3/10/2016 2:14 PM, Doug Ewell wrote: >> Ken Whistler wrote: >> >> >>> NamesList.txt should *not* be data mined. >>> >> And yet it was the only Unicode data file utilized by MSKLC. >> >> There are many possible reasons for this approach, which we will >> probably never know. 
>>
>>
>
> Extracting information from namelist.txt that was added to that file based on information from the UCD is plain folly - not least because it uses a secondary source instead of a primary source. What may not have come across from Ken's description is that the process for incorporating this data is under editorial control - and some values or entries may be suppressed for readability. There is explicitly no guarantee of completeness.
>
> There is some information that *only* exists in the nameslist.txt file. This includes informal aliases for character names, cross references, etc. The problem with extracting this information blindly (that is, not mediated by a human) is, again, that the level of consistency of presentation is that appropriate for a human reader, not for an extraction algorithm.
>
> For example, to reduce clutter, cross references are not symmetric or transitive, even though the relationship that gave rise to the cross reference in the first place (e.g. similarity) would normally be one that is symmetric and transitive. The human reader can be trusted to determine that, for example "<" is the "main" entry and that from there all the other, same or similar characters are referenced, but by not listing the reverse direction everywhere, the level of clutter in the rest of the nameslist is reduced, making additional cross references stand out more.
>
> Those are just the intentional inconsistencies.
>
> There is a historical development in the annotations - over time, more characters get annotated. However, annotations are not always backported, so the level of annotations can be inconsistent for reasons of incremental development.
>
> Now, for the x-refs on gaps, a human reader could extract and verify the set, but relying blindly on an algorithm to extract the data is fraught with peril. (Other gaps may have slightly different origin and status, yet also carry an annotation).
>
> Using the mathematical data files for this is a step up, because the data there is focused on a single use case. The downside is that the information is in a comment field.
>
> A./

One thing about NamesList.txt is that, as far as I have been able to tell, it's the only machine-readable, parseable source of those annotations and cross-references.

As part of the Unicode Standard and the UCD, the name lists' annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification's chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people (i.e., using screen-reader-friendly HTML rather than PDF), while making clear that the annotations are merely references to the original, normative Standard's actual code charts and name lists.

What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources' data have been published, then, for better or for worse, the name list is all that is available for much information on many code points' usage.

Sincerely,
J. S. Choi

From asmusf at ix.netcom.com Thu Mar 10 20:13:21 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 10 Mar 2016 18:13:21 -0800
Subject: NamesList.txt as data source
In-Reply-To: <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com>
References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com>
Message-ID: <56E229C1.6020105@ix.netcom.com>

On 3/10/2016 5:49 PM, "J. S. Choi" wrote:
> One thing about NamesList.txt is that, as far as I have been able to tell, it's the only machine-readable, parseable source of those annotations and cross-references.
There are explanations about character use that are only maintained in the PDF of the core specification, where this information is packaged in a way that can be understood by a human reader, but is not amenable to extraction by machine.

While the annotations, comments, cross references etc. in Namelist.txt appear, formally, to be machine extractable, the way they are created and managed makes them just as much "human-accessible" only as the core specification.

The goal of getting a complete and machine-readable description of character behavior is illusory.

>
> As part of the Unicode Standard and the UCD, the name lists' annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification's chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people (i.e., using screen-reader-friendly HTML rather than PDF), while making clear that the annotations are merely references to the original, normative Standard's actual code charts and name lists.

This is a different issue. The nameslist.txt is a reasonable source for driving other _formatting_ programs than just Unibook. In fact, the possibility of reuse in this context is probably among the unstated rationales for making the information and syntax available in the first place.
Let's understand this properly: using the file to translate it into a "human-readable" output format is a proper use of this data, even if that translation is done using a mechanical tool, as long as the format is

a) a format that benefits from the special shortcuts taken in selecting the information present in the namelist.txt file,
b) a format intended to be interpreted by an observant and intelligent human reader, and not
c) a format intended as direct input to any text-processing algorithm, or any algorithm that "understands" the contents.

>
> What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources' data have been published, then, for better or for worse, the name list is all that is available for much information on many code points' usage.

See my first through third paragraph.

A./

From oren.watson at gmail.com Fri Mar 11 11:37:38 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Fri, 11 Mar 2016 12:37:38 -0500
Subject: NamesList.txt as data source
In-Reply-To: <56E229C1.6020105@ix.netcom.com>
References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> <56E229C1.6020105@ix.netcom.com>
Message-ID:

Ok, so let me see if I understand this correctly. Suppose I'm writing an editor for math equations, and I want the user to be able to press a "Doublestruck" button and then type a C or D to get a ℂ or 𝔻 respectively. There is apparently no official source containing a machine-readable table of the doublestruck equivalents of each character that has such an equivalent. Such a table might also include { -> ⦃ and such.

This seems like something that would be very convenient to have centralized and standardized.
--Oren Watson -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Fri Mar 11 12:24:29 2016 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 11 Mar 2016 10:24:29 -0800 Subject: NamesList.txt as data source In-Reply-To: References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <45F6A345-9A0E-4605-BCFB-3BD5A8D0A2BE@icloud.com> <56E229C1.6020105@ix.netcom.com> Message-ID: <56E30D5D.3080006@att.net> On 3/11/2016 9:37 AM, Oren Watson wrote: > Ok, so let me see if I understand this correctly. Suppose I'm writing > a editor for math equations, and I want the user to be able to press a > "Doublestruck" button and then type an C or D to get a ? or ?? > respectively. There is apparently no official source containing a > machine-readable table of the doublestruck equivalents of each > character that has such an equivalent. Such a table might also include > { -> ? and such. > > This seems like something that would be very convenient to have > centralized and standardized. > O.k., it is taking more time to talk about this than to just make the lists. See attached list, which took about 5 minutes to cull. That lists the 24 "unifications" mentioned on page 7 of UTR #25, Unicode Support for Mathematics: http://www.unicode.org/reports/tr25/ It matches the 24 explicit cross-references listed in the Unicode names list. If the ability to pull out such a list and make it "machine-readable" in a few minutes doesn't suffice, and you need something which counts as a more "official source", then the best way forward would be to engage with the UTC during the next update cycle for UTR #25, when its associated data table needs to be checked for the 9.0 repertoire additions, and advocate that some further documentation be made explicitly for those 24 mappings. 
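
As a rough illustration of the editor scenario quoted above, the following sketch maps a typed capital letter to its double-struck form. It assumes that U+1D538 is MATHEMATICAL DOUBLE-STRUCK CAPITAL A, so that each capital sits at a fixed offset from it, and it takes the seven double-struck entries from the attached list as the exceptions. The names `double_struck` and `DOUBLE_STRUCK_EXCEPTIONS` are hypothetical; this is not an official table.

```python
# Reserved code points in the double-struck capital range and the Letterlike
# Symbols characters unified with them (the seven double-struck entries from
# the attached list of 24 unifications).
DOUBLE_STRUCK_EXCEPTIONS = {
    0x1D53A: 0x2102,  # C
    0x1D53F: 0x210D,  # H
    0x1D545: 0x2115,  # N
    0x1D547: 0x2119,  # P
    0x1D548: 0x211A,  # Q
    0x1D549: 0x211D,  # R
    0x1D551: 0x2124,  # Z
}

# Assumed anchor: MATHEMATICAL DOUBLE-STRUCK CAPITAL A.
DOUBLE_STRUCK_CAPITAL_A = 0x1D538


def double_struck(letter):
    """Return the double-struck form of an ASCII capital, else the input unchanged."""
    if len(letter) != 1 or not "A" <= letter <= "Z":
        return letter
    candidate = DOUBLE_STRUCK_CAPITAL_A + ord(letter) - ord("A")
    # Fall back to the Letterlike Symbols character when the slot is reserved.
    return chr(DOUBLE_STRUCK_EXCEPTIONS.get(candidate, candidate))


for key in "CD":
    print(key, "->", double_struck(key))
```

Lowercase letters, digits, and brackets are deliberately left out; extending the table to them would go beyond the 24 unifications discussed here.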
BTW, all 24 *are* already present in MathClassEx-14.txt:

http://www.unicode.org/Public/math/revision-14/MathClassEx-14.txt

as commented-out entry lines. So an even faster way to get a centralized (if not "official") list is to take MathClassEx-14.txt and

% grep '#1D' MathClassEx-14.txt | grep reserved > maplistout.txt

See also attached.

As for starting down the road of suggesting additional equivalences, e.g. for double-struck parentheses, that is certainly something somebody could do, and might be interesting content to add to UTR #25 -- but it goes beyond the formal unification issue for the 24 mathematical alphabet letters already encoded in the 2100 block.

--Ken

-------------- next part --------------
1D455 ; 210E # planck constant
1D49D ; 212C # script capital b
1D4A0 ; 2130 # script capital e
1D4A1 ; 2131 # script capital f
1D4A3 ; 210B # script capital h
1D4A4 ; 2110 # script capital i
1D4A7 ; 2112 # script capital l
1D4A8 ; 2133 # script capital m
1D4AD ; 211B # script capital r
1D4BA ; 212F # script small e
1D4BC ; 210A # script small g
1D4C4 ; 2134 # script small o
1D506 ; 212D # black-letter capital c
1D50B ; 210C # black-letter capital h
1D50C ; 2111 # black-letter capital i
1D515 ; 211C # black-letter capital r
1D51D ; 2128 # black-letter capital z
1D53A ; 2102 # double-struck capital c
1D53F ; 210D # double-struck capital h
1D545 ; 2115 # double-struck capital n
1D547 ; 2119 # double-struck capital p
1D548 ; 211A # double-struck capital q
1D549 ; 211D # double-struck capital r
1D551 ; 2124 # double-struck capital z
-------------- next part --------------
#1D455=210E;N;;;;;ITALIC SMALL H
#1D49D=212C;A;;Bscr;ISOMSCR;;SCRIPT CAPITAL B
#1D4A0=2130;A;;Escr;ISOMSCR;;SCRIPT CAPITAL E
#1D4A1=2131;A;;Fscr;ISOMSCR;;SCRIPT CAPITAL F
#1D4A3=210B;A;;Hscr;ISOMSCR;;SCRIPT CAPITAL H
#1D4A4=2110;A;;Iscr;ISOMSCR;;SCRIPT CAPITAL I
#1D4A7=2112;A;;Lscr;ISOMSCR;;SCRIPT CAPITAL L
#1D4A8=2133;A;;Mscr;ISOMSCR;;SCRIPT CAPITAL M
#1D4AD=211B;A;;Rscr;ISOMSCR;;SCRIPT CAPITAL R
#1D4BA=212F;A;;escr;ISOMSCR;;SCRIPT SMALL E
#1D4BC=210A;A;;gscr;ISOMSCR;;SCRIPT SMALL G
#1D4C4=2134;A;;oscr;ISOMSCR;;SCRIPT SMALL O
#1D506=212D;A;;Cfr;ISOMFRK;;FRAKTUR CAPITAL C
#1D50B=210C;A;;Hfr;ISOMFRK;;FRAKTUR CAPITAL H
#1D50C=2111;A;;Ifr;ISOMFRK;;FRAKTUR CAPITAL I
#1D515=211C;A;;Rfr;ISOMFRK;;FRAKTUR CAPITAL R
#1D51D=2128;A;;Zfr;ISOMFRK;;FRAKTUR CAPITAL Z
#1D53A=2102;A;;Copf;ISOMOPF;;DOUBLE-STRUCK CAPITAL C
#1D53F=210D;A;;Hopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL H
#1D545=2115;A;;Nopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL N
#1D547=2119;A;;Popf;ISOMOPF;;DOUBLE-STRUCK CAPITAL P
#1D548=211A;A;;Qopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL Q
#1D549=211D;A;;Ropf;ISOMOPF;;DOUBLE-STRUCK CAPITAL R
#1D551=2124;A;;Zopf;ISOMOPF;;DOUBLE-STRUCK CAPITAL Z

From verdy_p at wanadoo.fr Fri Mar 11 20:35:30 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 12 Mar 2016 03:35:30 +0100
Subject: Easter island inscriptions
Message-ID:

What is the encoding status of this script, found on inscriptions of Easter Island?

http://www.jps.auckland.ac.nz/document?wid=115

From charupdate at orange.fr Sat Mar 12 18:29:02 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 13 Mar 2016 01:29:02 +0100 (CET)
Subject: NamesList.txt as data source
In-Reply-To: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net>
References: <20160310151409.665a7a7059d7ee80bb4d670165c8327d.825e1df1cb.wbe@email03.secureserver.net>
Message-ID: <1736223179.14515.1457828942560.JavaMail.www@wwinf1f18>

On Thu, 10 Mar 2016 15:14:09 -0700, Doug Ewell wrote:

> Ken Whistler wrote:
>
> > NamesList.txt should *not* be data mined.
>
> And yet it was the only Unicode data file utilized by MSKLC.
>
> There are many possible reasons for this approach, which we will
> probably never know.

Sadly it is too late to ask Michael Kaplan the question.
To add one more answer in his place: I never doubted that NamesList.txt was the best choice for MSKLC, which parses the file for code points and character names to generate a human readable display and output as defined by Asmus Freytag on Thu, 10 Mar 2016 18:13:21 -0800. This would have been similarly achieved by parsing UnicodeData.txt. However the main difference between using NamesList vs. UnicodeData in the MSKLC as I see it, is the cultural benefit for the end-user. Consistently, the Names List is shipped in the root directory of MSKLC, beside a copy of the EULA, and then copied to a safe location at %User%\AppData\Local\MSKLC (where I recently updated it to some 8.0.0 version of its French translation), so that the user can view it?and even alter it without disturbing the tool. It?s sort of a pocket version of the Code Charts? textual information, thus likely to satisfy both the (human) keyboard editor and the creator (software). Extrapolating from my case, I believe that the >2 million downloads of MSKLC [1] surely contributed to some extent to spread the knowledge about Unicode, and to give people the desire to learn more?because indeed, Ken Whistler warned on Thu, 10 Mar 2016 13:40:47 -0800, and the Code Charts Disclaimer clearly states that they ?do not provide all the information needed to fully support individual scripts using the Unicode Standard.? And they can?t even. On Thu, 10 Mar 2016 18:13:21 -0800, Asmus Freytag wrote: > The goal getting a complete and machine-readable description of > character behavior is illusory. Marcel [1] Kaplan, M. S. (2013, October 4). The story of MSKLC | Sorting it all Out, v2! 
Retrieved August 18, 2015, from http://www.siao2.com/2013/10/04/10454264.aspx From charupdate at orange.fr Sat Mar 12 18:35:23 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 01:35:23 +0100 (CET) Subject: Easter island inscriptions In-Reply-To: References: Message-ID: <2023910531.14519.1457829323896.JavaMail.www@wwinf1f18> On Sat, 12 Mar 2016 03:35:30 +0100, Philippe Verdy wrote: > What is the encoding status of this script, found on inscriptions of Easter > Island ? > > http://www.jps.auckland.ac.nz/document?wid=115 It is in the pipeline and has already a codespace in project, but I guess that there must also be some more _actual_ scripts not yet encoded. Kind regards, Marcel From charupdate at orange.fr Sat Mar 12 19:42:27 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 02:42:27 +0100 (CET) Subject: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX? Message-ID: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> AFAICS the iconic representation of U+202F NNBSP for use on keyboards and in keyboard documentation is not yet encoded. In the 2010-08-27[1] and 2012-09-07[2] proposals to encode symbols for use on on-screen keyboards and in documentation, *U+2432 was aimed for this symbol. Actual usage probably relies on formatting, PUA, or icons; the latter epecially for on-screen keyboards as I imagine them, because local applications usually have icon libraries and thus must not rely on plain text only. Why mapping invisible characters to plain text symbols for local use should be any easier than mapping them to the already standardized icons, is out of reach of my understanding. Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0? 
[For the representative glyph, the question is about giving it the same width as that of U+237D by lengthening the shoulders, or reducing the overall width to the same extent as the width of the box. In the first case it might be called SHOULDERED NARROW OPEN BOX, in the second case rather NARROW SHOULDERED OPEN BOX. ISO/IEC?9995-7 standardized the first glyph.] My personal opinion is quite in favor of a symbol to represent the narrow no-break space. By contrast, though I?m using part of the keyboard symbols, I?m not likely to utilize the whole mass of symbols (partly of little obvious use) introduced by ISO/IEC?9995-7:2009 and its 2012 amendment, because for most of them whether I can?t see any reality behind, or I?feel they are way too confusing for end-users, or they are better replaced with more concrete representations?e.g. a letter with hook instead of a symbol for hook applicator, as for dead keys I generally prefer real letters instead of isolated diacritics whatever they are represented with. Let alone that we never can have usable keyboards with all those deadkeys on them, so that we have to rely on compose sequences, that can be documented in natural language and are far more mnemonic, e.g. ?compose?}? for palatal hook; ?compose?{? for retroflex hook; ?compose?]? for hook above; ?compose?[? for a hook as on U+0187..U+0188. A model of practicity are Keyman keyboard layouts, that may use ASCII characters only, to enter whatever letters and diacritics. So I believe that if the NNBSP symbol hadn?t been buried in a bunch of other late ISO/IEC?9995-7 symbols, it would now be a part of Unicode. BTW U+202F NNBSP had been encoded three years before the release of ISO/IEC?9995-7:2002. 
Best regards, Marcel [1] http://www.unicode.org/L2/L2010/10351-n3897-jtc1sc35n1579.pdf [2] http://www.unicode.org/L2/L2012/12302-wg1-%209995-7-n4317.pdf From gwalla at gmail.com Sat Mar 12 19:52:47 2016 From: gwalla at gmail.com (Garth Wallace) Date: Sat, 12 Mar 2016 17:52:47 -0800 Subject: Easter island inscriptions In-Reply-To: References: Message-ID: On Fri, Mar 11, 2016 at 6:35 PM, Philippe Verdy wrote: > What is the encoding status of this script, found on inscriptions of Easter > Island ? > > http://www.jps.auckland.ac.nz/document?wid=115 It's called Rongorongo, and according to the Roadmap about 40 columns have been provisionally set aside in the SMP but no proposal has been submitted yet. From charupdate at orange.fr Sun Mar 13 00:13:37 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 07:13:37 +0100 (CET) Subject: Easter island inscriptions In-Reply-To: References: Message-ID: <1077012845.181.1457849617600.JavaMail.www@wwinf1m08> On Sat, 12 Mar 2016 17:52:47 -0800, Garth Wallace wrote: > On Fri, Mar 11, 2016 at 6:35 PM, Philippe Verdy wrote: > > What is the encoding status of this script, found on inscriptions of Easter > > Island ? > > > > http://www.jps.auckland.ac.nz/document?wid=115 > > It's called Rongorongo, and according to the Roadmap about 40 columns > have been provisionally set aside in the SMP but no proposal has been > submitted yet. Though no _formal_ proposal can be found, the draft proposal from Michael Everson [1][2] has been ready to feed in for a long time. From that and the reserved code space I concluded that it is "in the pipeline", but perhaps I was somewhat too optimistic. Marcel [1] http://www.unicode.org/L2/L1999/rongorongo.pdf [2] http://www.evertype.com/standards/iso10646/pdf/rongorongo.pdf From jsbien at mimuw.edu.pl Sun Mar 13 00:55:24 2016 From: jsbien at mimuw.edu.pl (Janusz S.
=?utf-8?Q?Bie=C5=84?=) Date: Sun, 13 Mar 2016 07:55:24 +0100 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <56E1E9DF.9060405@att.net> (Ken Whistler's message of "Thu, 10 Mar 2016 13:40:47 -0800") References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> Message-ID: <864mcaeujn.fsf_-_@mimuw.edu.pl> On Thu, Mar 10 2016 at 22:40 CET, kenwhistler at att.net writes: > The *reason* that NamesList.txt exists at all is to drive the tool, > unibook, that formats the full Unicode code charts for posting. [...] On Fri, Mar 11 2016 at 3:13 CET, asmusf at ix.netcom.com writes: > On 3/10/2016 5:49 PM, "J. S. Choi" wrote: >> One thing about NamesList.txt is that, as far as I have been able to >> tell, it?s the only machine-readable, parseable source of those >> annotations and cross-references. [...] > This is a different issue. The nameslist.txt is a reasonable source > for driving other formatting programs than just Unibook. Exactly. A student of mine wrote a font sampling program producing output in a Unibook-like form. For this purpose he wrote also a converter from NamesList format to XML: https://github.com/ppablo28/fntsample_ucd_comments https://github.com/ppablo28/ucd_xml_parser I use the XML version of NamesList to provide my own comments to characters (work in progress): https://bitbucket.org/jsbien/parkosz-font/downloads/Parkosz1907draft.pdf Other examples of NamesList.txt use are http://www.fileformat.info/info/unicode/ https://codepoints.net/ Although not exactly the formatting programs, in my opinion they constitute also a valid use. > In fact, the possibility of reuse in this context probably among the > unstated rationales for making the information and syntax available in > the first place. I understand there is no intention to make an official XML version of the file as it would require changes in Unibook? [...] 
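As a rough illustration of the kind of conversion mentioned above, here is a minimal Python sketch that reads NamesList-style entries into a structure that could then be serialized as XML or JSON. It assumes the line shapes documented for the file (a code point and a name separated by a tab; annotation lines starting with a tab followed by a marker such as `=`, `*`, `x`, `%`, `:` or `#`). The function name `parse_nameslist`, the marker-to-label mapping and the sample text are made up for this example; it is not the converter linked above, and the annotations it extracts are informative only.

```python
import re

# Marker characters that may follow the leading tab on an annotation line.
# Assumption: this mapping covers only the most common markers.
ANNOTATION_KINDS = {
    "=": "alias",
    "%": "formal_alias",
    "*": "comment",
    "x": "cross_reference",
    ":": "canonical_decomposition",
    "#": "compatibility_decomposition",
}

# A character entry: hexadecimal code point, a tab, then the name.
ENTRY_RE = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")


def parse_nameslist(text):
    """Return a list of dicts, one per character entry found in `text`."""
    entries = []
    current = None
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            current = {"cp": match.group(1), "name": match.group(2), "annotations": []}
            entries.append(current)
        elif line.startswith("\t") and current is not None and len(line) > 2:
            kind = ANNOTATION_KINDS.get(line[1])
            if kind is not None:
                current["annotations"].append((kind, line[2:].strip()))
        elif line.startswith("@"):
            # Block headers, subheads and notices end the current entry.
            current = None
    return entries


# Sample text shaped like NamesList.txt. Only the alias echoes the U+018D
# example raised in this thread; the other lines are invented.
SAMPLE = (
    "@\t\tHypothetical subhead\n"
    "018D\tLATIN SMALL LETTER TURNED DELTA\n"
    "\t= reversed Polish-hook o\n"
    "\t* an illustrative comment line\n"
    "\tx (greek small letter delta - 03B4)\n"
)

for entry in parse_nameslist(SAMPLE):
    print(entry["cp"], entry["name"], entry["annotations"])
```

Keeping each annotation as a (kind, text) pair, rather than interpreting it further, reflects that these lines read more like structured comments than like normative data.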
>> What are these other primary sources that maintain these other >> annotation data; are they publicly available? If the name list is the >> only place where these sources? data have been published, then, for >> better or for worse, the name list is all that is available for much >> information on many code points? usage. > See my first through third paragraph. You wrote: [...] > There are explanations about character use that are only maintained in > the PDF of the core specification, where this information is packaged > in a way that can be understood by a human reader, but is not amenable > to be extracted by machine. > > While the annotations, comments, cross references etc. in Namelist.txt > appear, formally, to be machine extractable, the way they are created > and managed make them just as much "human-accessible" only as the core > specification. I'm afraid it's not clear for me. Let's take an example. Sometime ago I inquired about a controversial alias for U+018D: http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html Can I really find anything about "reversed Polish-hook o" in the core specification which is not a literal copy of the information from NamesList.txt? Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. 
Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From eric.muller at efele.net Sun Mar 13 10:14:33 2016 From: eric.muller at efele.net (Eric Muller) Date: Sun, 13 Mar 2016 08:14:33 -0700 Subject: Emoji Feminism - The New York Times Message-ID: <56E583D9.2070708@efele.net> http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0 From bortzmeyer at nic.fr Sun Mar 13 10:31:56 2016 From: bortzmeyer at nic.fr (Stephane Bortzmeyer) Date: Sun, 13 Mar 2016 16:31:56 +0100 Subject: Emoji Feminism - The New York Times In-Reply-To: <56E583D9.2070708@efele.net> References: <56E583D9.2070708@efele.net> Message-ID: <20160313153156.GA4440@nic.fr> On Sun, Mar 13, 2016 at 08:14:33AM -0700, Eric Muller wrote a message of 1 lines which said: > http://www.nytimes.com/2016/03/13/opinion/sunday/emoji-feminism.html?_r=0 Funny (I love penguins), but the New York Times should read UTR #51, section 2.1 http://www.unicode.org/reports/tr51/tr51-3.html#Gender From charupdate at orange.fr Sun Mar 13 12:13:28 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sun, 13 Mar 2016 18:13:28 +0100 (CET) Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <864mcaeujn.fsf_-_@mimuw.edu.pl> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <864mcaeujn.fsf_-_@mimuw.edu.pl> Message-ID: <109301831.6849.1457889208287.JavaMail.www@wwinf1d36> On Sun, 13 Mar 2016 07:55:24 +0100, Janusz S. Bień wrote: > For this purpose he wrote also a converter from NamesList format to XML That goes straight in the direction I suggested last year as a beta feedback item[1], but I never thought that it could be so simple. > I understand there is no intention to make an official XML version of the file as it would require changes in Unibook?
The difference however between homemade databases and official ones is that the latter raise much higher expectations. Asmus Freytag outlined in this thread, as well as in his comments on my feedback, that *no* "complete" UCD version, regardless of how complete it effectively might be, can ever meet the assumptions people inevitably would make about it. Further, experience shows that the information actually provided is already more than most people are able to mentally process. E.g. most online character information providers do not display the formal aliases, so that in the best case some aware users add that information using the comment facility. I won't cite any: these are free tools and platforms that should not be criticized. When we imagine a hypothetical UCD containing detailed information about usage in any existing language, not only Polish but also Czech, Romanian, Portuguese, Vietnamese, Devanagari, Tirhuta, to cite just a few, the result would be a mass of data of which I'm not sure that it would repay the cost of collecting it, nor that it would really be useful. For the NamesList, the TXT format is superior to XML at least in that it keeps one from forgetting that NamesList.txt is the source of the Code Charts. No less, no more. Marcel [1] http://www.unicode.org/review/pri297/feedback.html Date/Time: Sat May 2 07:10:09 CDT 2015 Opt Subject: PRI #297: UnicodeXData.txt Date/Time: Wed May 6 08:03:04 CDT 2015 Opt Subject: PRI #297: feedback on XML files From c933103 at gmail.com Sun Mar 13 13:39:46 2016 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 14 Mar 2016 02:39:46 +0800 Subject: Proposed Unicode 10.0 emoji U+1F961 Takeout Box In-Reply-To: References: Message-ID: Its sample glyph and emoji description say oyster pail, which according to Wikipedia seems to be a mostly American thing.
Would it be better to create emoji for other takeout boxes like the Chinese rice box (not the American-style one), the Japanese bento, and the pizza box, or alternatively to provide a selector for the emoji to change it to a different style? From doug at ewellic.org Sun Mar 13 14:03:20 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 13 Mar 2016 13:03:20 -0600 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: References: Message-ID: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> My point is that of J.S. Choi and Janusz Bień: the problem with declaring NamesList off-limits is that it does contain information that is either: • not available in any other UCD file, or • available, but only in comments (like the MAS mappings), which aren't supposed to be parsed either. Ken wrote: > [ .. ] NamesList.txt is itself the result of a complicated merge > of code point, name, and decomposition mapping information from > UnicodeData.txt, of listings of standardized variation sequences from > StandardizedVariants.txt, and then a very long list of annotational > material, including names list subhead material, etc., maintained in > other sources. But sometimes an implementer really does need a piece of information that exists only in those "other sources." When that happens, sometimes the only choices are to resort to NamesList or to create one's own data file, as Ken did by parsing the comment lines from the math file. Both of these are equally distasteful when trying to be conformant. -- Doug Ewell | http://ewellic.org | Thornton, CO From doug at ewellic.org Sun Mar 13 16:24:55 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 13 Mar 2016 15:24:55 -0600 Subject: Proposed Unicode 10.0 emoji U+1F961 Takeout Box Message-ID: <07573F85A1A945ACA6C37EE3973B58E0@DougEwell> gfb hjjhjh wrote: > or alternatively provide > selector for the emoji to change it to different style?
http://www.unicode.org/reports/tr52/ -- Doug Ewell | http://ewellic.org | Thornton, CO From charupdate at orange.fr Sun Mar 13 21:14:05 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 14 Mar 2016 03:14:05 +0100 (CET) Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> Message-ID: <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote: > My point is that of J.S. Choi and Janusz Bień: the problem with > declaring NamesList off-limits is that it does contain information that > is either: > > • not available in any other UCD file, or > • available, but only in comments (like the MAS mappings), which aren't > supposed to be parsed either. > > Ken wrote: > > > [ .. ] NamesList.txt is itself the result of a complicated merge > > of code point, name, and decomposition mapping information from > > UnicodeData.txt, of listings of standardized variation sequences from > > StandardizedVariants.txt, and then a very long list of annotational > > material, including names list subhead material, etc., maintained in > > other sources. > > But sometimes an implementer really does need a piece of information > that exists only in those "other sources." When that happens, sometimes > the only choices are to resort to NamesList or to create one's own data > file, as Ken did by parsing the comment lines from the math file. Both > of these are equally distasteful when trying to be conformant. If so, then extending the XML UCD with all the information that is currently missing from it while available in the Code Charts and NamesList.txt ends up being a good idea. But it still remains that such a step would exponentially increase the amount of data, because items that were not meant to be systematically provided would have to be.
Further, I see that once this is completed, other requirements could call for tackling the same job on the core specs. The point would be to know whether in Unicode implementation and i18n, those needs are frequent. E.g. the last Apostrophe thread showed that full automation is sometimes impossible anyway. Marcel From asmus-inc at ix.netcom.com Sun Mar 13 22:32:03 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 13 Mar 2016 20:32:03 -0700 Subject: annotations In-Reply-To: <864mcaeujn.fsf_-_@mimuw.edu.pl> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <864mcaeujn.fsf_-_@mimuw.edu.pl> Message-ID: <56E630B3.5000405@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 14 02:23:18 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 14 Mar 2016 08:23:18 +0100 Subject: annotations (was: NamesList.txt as data source) In-Reply-To: <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> <789396192.13264.1457921645967.JavaMail.www@wwinf1k33> Message-ID: Is the term "exponentially" really appropriate? The NamesList file is not so large, and the growth would remain linear. Anyway, this file (current CSV format or XML format) does not need to be part of the core UCD files; it can be in a separate download for people needing it. One benefit I would see is that this conversion to XML using an automated tool could ensure that it is properly formatted. But I believe that Unibook is already parsing it to produce consistent code charts, so its format is already checked, and this advantage is not really significant. But the main benefit would be that the file could be edited and updated using standard tools.
XML is not the only choice available: JSON today is simpler to parse and easier for humans to read (and even edit), and it can contain indentation whitespace (outside quoted strings) that won't be considered part of the data (unlike XML, where it "pollutes" the DOM with extra text nodes). In fact I believe that the old CSV formats used in the original UCD could be deprecated in favor of JSON (the old formats could be automatically generated for applications that want them). This could unify all formats with a single parser in all tools. Files in older CSV or tabulated formats would be in a separate derived collection. Then users would choose which format they prefer (legacy, now derived; JSON; or XML if people really want it). The advantage of XML, however, is stability across later updates that may need to insert additional data or annotations (with JSON or CSV/tabulated formats, the number of columns is fixed, and every column must be filled with at least an empty value, even if it is not significant). Note that legacy formats also have comments after hash signs, but many comments found at the end of data lines also have some parsable meaning, so they are structured (and may be followed by an extra hash sign for a real comment). The advantage of the existing CSV/tabulated formats is that they are extremely easy to import into a spreadsheet for easier use by a human (I won't request the UTC to provide these files in XLS/XLSX or ODS format...). But JSON and XML could be imported as well, provided that each data file remains structured as a 2D grid without substructures within cells (otherwise you need to provide an explicit schema).
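To make the format comparison above a little more concrete, the following Python sketch turns lines of a semicolon-delimited UCD-style file into JSON objects, keeping the trailing hash comment as a separate field and expanding the `XXXX..YYYY` range notation that files such as Scripts.txt use in their first field. The helper name `record_from_line`, the field names passed to it, and the output keys are assumptions made for illustration, not an official schema.

```python
import json


def record_from_line(line, field_names):
    """Split one 'a ; b # comment' line into a dict, or None if no data."""
    data, _, comment = line.partition("#")
    if not data.strip():
        return None  # blank line or comment-only line
    values = [value.strip() for value in data.split(";")]
    record = dict(zip(field_names, values))
    # A first field such as 0041..005A denotes a code point range;
    # a single code point is treated as a range of length one.
    first, sep, last = values[0].partition("..")
    record["first"] = int(first, 16)
    record["last"] = int(last, 16) if sep else record["first"]
    if comment.strip():
        record["comment"] = comment.strip()
    return record


# Sample lines in the style of Scripts.txt.
SAMPLE_LINES = [
    "# comment-only line",
    "0041..005A ; Latin # L& [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA ; Latin # Lo FEMININE ORDINAL INDICATOR",
]

parsed = (record_from_line(line, ["code", "script"]) for line in SAMPLE_LINES)
records = [record for record in parsed if record is not None]
print(json.dumps(records, indent=2))
```

Whether the trailing comment deserves a field of its own depends on the file; the sketch keeps it verbatim rather than guessing at any structure inside it.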
But note that some columns is frequently structured: those containing the code point key is frequently specifying a code range using an additional separator; as well those whose value is an ordered list of code points, using space separator and possibly a leading subtag (such as decomposition data): in XML you would translate them into separate subelements or into additional attributes, and in JSON, you'll need to structure these structured cells using subarrays. So the data is *already* not strictly 2D (converting them to a pure 2D format, for relational use, would require adding additional key or referencing "ID" columns and those converted files would be much less easier to read/edit by humans, in *any* format: CSV/tabular, JSON or XML). Other candidate formats also include Turtle (generally derived from OWL, but replacing the XML envelope format by a tabulated "2.5D" format which is much easier than XML to read/edit and much more compact than XML-based formats and easier to parse)... 2016-03-14 3:14 GMT+01:00 Marcel Schneider : > On Sun, 13 Mar 2016 13:03:20 -0600, Doug Ewell wrote: > > > My point is that of J.S. Choi and Janusz Bie?: the problem with > > declaring NamesList off-limits is that it does contain information that > > is either: > > > > ? not available in any other UCD file, or > > ? available, but only in comments (like the MAS mappings), which aren't > > supposed to be parsed either. > > > > Ken wrote: > > > > > [ .. ] NamesList.txt is itself the result of a complicated merge > > > of code point, name, and decomposition mapping information from > > > UnicodeData.txt, of listings of standardized variation sequences from > > > StandardizedVariants.txt, and then a very long list of annotational > > > material, including names list subhead material, etc., maintained in > > > other sources. > > > > But sometimes an implementer really does need a piece of information > > that exists only in those "other sources." 
When that happens, sometimes > > the only choices are to resort to NamesList or to create one's own data > > file, as Ken did by parsing the comment lines from the math file. Both > > of these are equally distasteful when trying to be conformant. > > > If so, then extending the XML UCD with all the information that is > actually missing in it while available in the Code Charts and > NamesList.txt, ends up being a good idea. But it still remains that such a > step would exponentially increase the amount of data, because items that > were not meant to be systematically provided, must be. > > Further I see that once this is completed, other requirements could need > to tackle the same job on the core specs. > > The point would be to know whether in Unicode implementation and i18n, > those needs are frequent. E.g. the last Apostrophe thread showed that full > automatization is sometimes impossible anyway. > > Marcel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 14 11:19:35 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 14 Mar 2016 09:19:35 -0700 Subject: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX? In-Reply-To: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> References: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> Message-ID: <56E6E497.1000902@att.net> U+23FF is already assigned to OBSERVER EYE SYMBOL, which is already under ballot for 10646 (and approved by the UTC). http://www.unicode.org/alloc/Pipeline.html Please always first check that page before suggesting code points for prospective new characters. --Ken On 3/12/2016 5:42 PM, Marcel Schneider wrote: > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0? 
> > From kenwhistler at att.net Mon Mar 14 12:01:46 2016 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 14 Mar 2016 10:01:46 -0700 Subject: annotations In-Reply-To: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> References: <0FD539A45B144C25BBFADDC2EAB795E2@DougEwell> Message-ID: <56E6EE7A.9010109@att.net> On 3/13/2016 12:03 PM, Doug Ewell wrote: > My point is that of J.S. Choi and Janusz Bie?: the problem with > declaring NamesList off-limits is that it does contain information > that is either: > > ? not available in any other UCD file, or > ? available, but only in comments (like the MAS mappings), which aren't > supposed to be parsed either. NamesList.txt is not "off-limits". The information in it is there because it is useful for publication in the Unicode code charts, to help with the identification and interpretation of the characters in the standard. And because NamesList.txt itself is published as part of the UCD, nobody is going to stop you (or anybody else) from parsing information out of it. The trick is this: the status of annotational data in NamesList.txt is different than that of normative data like the code points, names, formal name aliases, decomposition mappings, and standardized variation sequences. Annotations are -- well, annotational -- and there are no guarantees about their completeness or stability, and so on. They emerge from a kind of ongoing rugby scrum between the UTC members, national body comments on 10646 amendments, public suggestions via feedback and email lists, and the ability of editors to accommodate reasonable suggestions that might help the readability and usefulness of the names list without larding it up to heavily with extraneous information that would make it *harder* to use. People who parse NamesList.txt for data almost inevitably and immediately end up expecting it to do things it does not (and cannot reasonably) do. See this thread right here for pertinent examples. 
*That* is the problem I see, because it then tends to lead to frustrated clamoring for NamesList.txt to be "fixed" to do things and carry information that it wasn't (and isn't) designed to do. --Ken From doug at ewellic.org Mon Mar 14 13:22:14 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 14 Mar 2016 11:22:14 -0700 Subject: annotations Message-ID: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Ken Whistler wrote: > The trick is this: the status of annotational data in NamesList.txt > is different than that of normative data like the code points, names, > formal name aliases, decomposition mappings, and standardized > variation sequences. I get that. I am FAR more comfortable with that type of guideline: • the data isn't normative (at least not all of it) • the format isn't set in stone • don't ask for additions or changes • caveat emptor than with any sort of blanket statement about "don't parse this file." I hereby promise to use NamesList.txt responsibly and with all of the above conditions in mind. Hopefully others will too. -- Doug Ewell | http://ewellic.org | Thornton, CO From jsbien at mimuw.edu.pl Mon Mar 14 13:33:24 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 14 Mar 2016 19:33:24 +0100 Subject: annotations In-Reply-To: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Message-ID: <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> Quote - Doug Ewell (Mon, 14 Mar 2016, 19:22:14): > Ken Whistler wrote: > >> The trick is this: the status of annotational data in NamesList.txt >> is different than that of normative data like the code points, names, >> formal name aliases, decomposition mappings, and standardized >> variation sequences. > > I get that. I am FAR more comfortable with that type of guideline: > > •
the data isn't normative (at least not all of it) > • the format isn't set in stone > • don't ask for additions or changes What about reporting possible mistakes? Regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From doug at ewellic.org Mon Mar 14 14:30:59 2016 From: doug at ewellic.org (Doug Ewell) Date: Mon, 14 Mar 2016 12:30:59 -0700 Subject: annotations Message-ID: <20160314123059.665a7a7059d7ee80bb4d670165c8327d.5035fe7d06.wbe@email03.secureserver.net> Janusz S. Bień wrote: >> • don't ask for additions or changes > > What about reporting possible mistakes? I'd assume that egregious, demonstrable errors, such as misspelled character names or incorrect individual code points, could be reported, and anything beyond that probably should not. -- Doug Ewell | http://ewellic.org | Thornton, CO From asmus-inc at ix.netcom.com Mon Mar 14 16:05:05 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 14 Mar 2016 14:05:05 -0700 Subject: annotations In-Reply-To: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> Message-ID: <56E72781.3070807@ix.netcom.com> On 3/14/2016 11:22 AM, Doug Ewell wrote: > Ken Whistler wrote: > >> The trick is this: the status of annotational data in NamesList.txt >> is different than that of normative data like the code points, names, >> formal name aliases, decomposition mappings, and standardized >> variation sequences. > I get that. I am FAR more comfortable with that type of guideline: > > • the data isn't normative (at least not all of it) > • the format isn't set in stone > •
don't ask for additions or changes Additions and changes to annotations are considered all the time. There's just no implication that these must satisfy some arbitrary criteria of completeness and internal consistency. They are added when the editorial committee feels that the benefit outweighs the cost (bloat & clutter). The nature of all of these is more akin to comments - except that they are not presented using a comment syntax (and the xrefs look structured, instead of "see also code point XXXX"). Totally a perception issue. > ? caveat emptor Always! > > than with any sort of blanket statement about "don't parse this file." > > I hereby promise to use NamesList.txt responsibly and with all of the > above conditions in mind. Hopefully others will too. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > From asmus-inc at ix.netcom.com Mon Mar 14 16:05:27 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 14 Mar 2016 14:05:27 -0700 Subject: annotations In-Reply-To: <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> References: <20160314112214.665a7a7059d7ee80bb4d670165c8327d.17de0586a6.wbe@email03.secureserver.net> <20160314193324.20153zqsmwf1inec@mail.mimuw.edu.pl> Message-ID: <56E72797.90005@ix.netcom.com> On 3/14/2016 11:33 AM, Janusz S. Bien wrote: > Quote/Cytat - Doug Ewell (pon, 14 mar 2016, 19:22:14): > >> Ken Whistler wrote: >> >>> The trick is this: the status of annotational data in NamesList.txt >>> is different than that of normative data like the code points, names, >>> formal name aliases, decomposition mappings, and standardized >>> variation sequences. >> >> I get that. I am FAR more comfortable with that type of guideline: >> >> ? the data isn't normative (at least not all of it) >> ? the format isn't set in stone >> ? don't ask for additions or changes > > What about reporting possible mistakes? 
see my reply to Doug > > Regards > > Janusz > From doug at ewellic.org Tue Mar 15 09:42:30 2016 From: doug at ewellic.org (Doug Ewell) Date: Tue, 15 Mar 2016 07:42:30 -0700 Subject: annotations Message-ID: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> Asmus Freytag wrote: >> ? don't ask for additions or changes > > Additions and changes to annotations are considered all the time. Well, yes. I meant additions and changes to the scope of the file. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Tue Mar 15 10:34:09 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Tue, 15 Mar 2016 08:34:09 -0700 Subject: annotations In-Reply-To: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> References: <20160315074229.665a7a7059d7ee80bb4d670165c8327d.330f638e47.wbe@email03.secureserver.net> Message-ID: <56E82B71.6030008@ix.netcom.com> An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Tue Mar 15 16:21:51 2016 From: andrewcwest at gmail.com (Andrew West) Date: Tue, 15 Mar 2016 21:21:51 +0000 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> Message-ID: On 15 March 2016 at 19:48, K.C.Saff wrote: > > I often see numbers roll over at 100, displayed on a new board, so even just > the full set of two digit forms adds a lot of utility for go games. This > seems to be a standard practice at Wikipedia ( > https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol#Game_4 ), Sensei's > Library and a lot of books that I've worked through. That's certainly true, although it is not hard to find examples which go over 100 (http://www.babelstone.co.uk/Ludus/Weiqi/FamousGames_279.jpg), and even the AlphaGo vs Lee Sedol Wikipedia page shows one game diagram that goes into the 200s. 
> Completing both sets > up to 99, adding "00", and including the most common markers (triangle, > square, etc.) seems like a good, useful compromise. Possibly. I certainly have very little expectation that a proposal to complete both sets to 999 (or even 399) would have any chance of success. I am currently working on a proposal for the triangle and square go markers, and am still considering the best approach to the circled numbers. Any feedback would be most welcome. http://www.babelstone.co.uk/Unicode/GoNotation.pdf Andrew From ori at avtalion.name Tue Mar 15 17:11:06 2016 From: ori at avtalion.name (Ori Avtalion) Date: Wed, 16 Mar 2016 00:11:06 +0200 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> Message-ID: I have received a response from Barbara Beeton, along with an approval to post it here. I have redacted the intro and outro where she admits it's "certainly not a real answer", but IMO it's still useful for documentation. Response from Barbara Beeton: (Date: Tue, 15 Mar 2016 16:49:14 -0400) On Thu, 10 Mar 2016, Ori Avtalion wrote: I'm trying to find an answer to the question in the subject line. I posted it to the Unicode mailing list [1], and was suggested to contact you, as you are one of the authors who proposed the symbols. [1] http://unicode.org/pipermail/unicode/2016-March/003412.html I can find no use of dots in common Go notations of games. What is the origin of the dots on the Go markers and what are they used for? I have researched the records of the STIX project and find the following. All the "regular" sources of symbols were recorded in a "master table" that has been kept up to date, but there have been few additions since about 2007. 
A somewhat earlier version, dated October 2006, can be found here: http://www.ams.org/STIX/bnb/stix-tbl.ascii-2006-10-20 Since this is simply a huge, column aligned ascii table, a layout guide is provided, which lists sources and other information including when codes were added: http://www.ams.org/STIX/bnb/stix-tbl.layout-2006-05-15 For the code range in question -- U+2686 - U+2689 -- the date of addition was 2000/02/01; in the same group are the six die faces, U+2680 - U+2685. As you can see, no sources are listed. Since there were also other, "irregular" sources, for which records exist only on paper, I also dug through those files. (Which is why it has taken so long to answer.) The only reference I can find is a document submitted to WG2 that includes that range: ISO/IEC JTC1/SC2/WG2 N2336 2001-04-02 The only mention of the range consists of a grid for 2680-26FF, blanked out except for the 10 symbols, and a page listing them in the form appropriate for inclusion in the Unicode charts; the content of that page is identical to what is in the chart for the 26xx range of Unicode 8.0 except for two comments (for 2680 and 2687). There may be an earlier document in the WG2 archives, probably dated in late 1999 or pre-February 1, 2000, that has more information, but I don't have a copy. The fact that die faces and (purported) go symbols were added at the same time may be helpful. What I surmise happened is that someone requested that these symbols be added to a submission-in-progress; since the collection of math symbols was rather diverse, a few more wouldn't be noticed, but it's unfortunate that nobody seems to have kept records. Perhaps someone who was active in the UTC at the time may have a memory; all I can attest to is that the request did *not* originate with the STIX project. From davidj_faulks at yahoo.ca Tue Mar 15 22:14:15 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Wed, 16 Mar 2016 03:14:15 +0000 (UTC) Subject: Variations and Unifications ? 
References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com>

As part of my investigations into astrological symbols, I'm beginning to wonder whether glyph variations justify separately encoding symbols that I would previously have considered the same as, or unifiable with, symbols already in Unicode.

For example, the semisquare aspect is usually shown with a glyph that is identical to ∠ (U+2220 ANGLE). However, sometimes it looks like <, or like ∟ (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint?

The parallel aspect, similarly, sometimes looks like ∥ (U+2225 PARALLEL TO), but is often shown as // or ⫽ (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a typographical kludge since astrological fonts often show it this way.

There is also contra-parallel, which is sometimes shown like ∦ (U+2226 NOT PARALLEL TO), but has variant glyphs with slanted lines (and the crossbar is often horizontal).

The "part of fortune" is sometimes a circled ×, or sometimes a circled +.

Would it be better to have dedicated characters than to assume unifications in these cases?

From frederic.grosshans at gmail.com Wed Mar 16 08:35:54 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Wed, 16 Mar 2016 14:35:54 +0100
Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689
In-Reply-To:
References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp>
Message-ID: <56E9613A.4030605@gmail.com>

On 15/03/2016 22:21, Andrew West wrote:
>
> Possibly. I certainly have very little expectation that a proposal to
> complete both sets to 999 (or even 399) would have any chance of
> success.
And then, there are also the historical example of ideographic numbers used for the same purpose in historic texts (like here http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ or here http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod ). The above has been found with a quick google search, and I have no idea whether these symbols were used in the running text or not. Fr?d?ric From asmus-inc at ix.netcom.com Wed Mar 16 12:34:54 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 16 Mar 2016 10:34:54 -0700 Subject: Variations and Unifications ? In-Reply-To: <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> Message-ID: <56E9993E.1090902@ix.netcom.com> An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Wed Mar 16 19:45:26 2016 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 17 Mar 2016 00:45:26 +0000 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: <56E9613A.4030605@gmail.com> References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: Hi Fr?d?ric, The historic use of ideographic numbers for marking Go moves are discussed in the latest draft of my document: http://www.babelstone.co.uk/Unicode/GoNotation.pdf Andrew On 16 March 2016 at 13:35, Fr?d?ric Grosshans wrote: > Le 15/03/2016 22:21, Andrew West a ?crit : >> >> >> Possibly. I certainly have very little expectation that a proposal to >> complete both sets to 999 (or even 399) would have any chance of >> success. 
> > And then, there are also the historical example of ideographic numbers used > for the same purpose in historic texts (like here > http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ > or here > http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod > ). > > The above has been found with a quick google search, and I have no idea > whether these symbols were used in the running text or not. > > Fr?d?ric > From charupdate at orange.fr Wed Mar 16 20:00:35 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 17 Mar 2016 02:00:35 +0100 (CET) Subject: Proposal for *U+2427 NARROW SHOULDERED OPEN BOX (was: Re: Proposal for *U+23FF SHOULDERED NARROW OPEN BOX?) In-Reply-To: <56E6E497.1000902@att.net> References: <684203200.14565.1457833347759.JavaMail.www@wwinf1f18> <56E6E497.1000902@att.net> Message-ID: <297423993.28545.1458176435992.JavaMail.www@wwinf1e16> On Mon, 14 Mar 2016 09:19:35 -0700, Ken Whistler wrote: > U+23FF is already assigned to OBSERVER EYE SYMBOL, which is > already under ballot for 10646 (and approved by the UTC). > > http://www.unicode.org/alloc/Pipeline.html > > Please always first check that page before suggesting code points > for prospective new characters. > > --Ken > > On 3/12/2016 5:42 PM, Marcel Schneider wrote: > > Now in the block of U+237D SHOULDERED OPEN BOX there is _one_ scalar value left. Would it then be a good idea to propose *U+23FF SHOULDERED NARROW OPEN BOX for v10.0.0? > > Thank you. I remember OBSERVER EYE but didn?t notice its code point and forgot to do a search for ?23[F[F]]? on the Pipeline page. Sorry. Now I see that *U+2427 would be even better as it is both in the block of U+2423 OPEN BOX and in the originally intended block, except that now I dropped the other symbols and stay just with the NNBSP symbol to propose for the next free contiguous scalar value. 
I really hope that such a new or, more accurately, third proposal would be accepted, as the NARROW NO-BREAK SPACE is so important it must have its symbol encoded at some point, similarly to SPACE and NO-BREAK SPACE. About the proposed name, there is to say that first I changed it to the glyph-descriptional one as preferred in Unicode, rather than SYMBOL FOR NARROW NO-BREAK SPACE. And last I made it more analogous to the name of the symbolized character, by inverting ?SHOULDERED? and ?NARROW?. The original proposer cannot simply resume on that ?narrow? basis, being committed to consistency with ISO/IEC?9995-7, so that an individual like I am, might be good to send the proposal? However generally it would be better done by a NB, the more as this belongs to the international keyboard standard. Other countries might be interested that have a multilingual standard layout, and/or a national layout including U+202F. Another scenario would be that the French NB re-proposes a reduced set of additional symbols, which IMHO should comprise at least the NARROW SHOULDERED OPEN BOX, but ideally once it will have completed the revision of most parts of ISO/IEC?9995, including part?7. Best regards, Marcel From verdy_p at wanadoo.fr Thu Mar 17 01:11:33 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Mar 2016 07:11:33 +0100 Subject: Variations and Unifications ? In-Reply-To: <56E9993E.1090902@ix.netcom.com> References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> Message-ID: "Disunification may be an answer?" We should avoid it as well. 
We have other solutions in Unicode - variation selectors (often used for sinograms when their unified shapes must be distinguished in some contexts such as people names or toponyms or trademark names or in other specific contexts), - or combining sequences (including in Arabic or Hebrew where many combining characters are not always represented visually, the same occuring as well in Latin with accents not always presented over capitals), - or sequences of multiple characters (like in Emojis for skin color variants, or sequences for encoding flags), - or other sequences using joiners (e.g. in South Asian scripts). Disunification is only acceptable when - there's a complete disunification of concepts and the "similar" shapes are also different even if one originates from the other (E.g. the Latin slashed o disunifiied from the Latin o, even if there's also the sequence o+combining slash, almost never used as its rendering is too much approximative in most cases) - or there's a clear distinction of semantics and properties (e.g. the Latin AE ligature, which is not appropriately represented by the two separate letters, not even with a "hinting" joiner, and that has specific properties as a plain letter, e.g. with mappings) Before disunifying a character, we should first study the alternative of their representation as sequences. 2016-03-16 18:34 GMT+01:00 Asmus Freytag (t) : > On 3/15/2016 8:14 PM, David Faulks wrote: > > As part of my investigations into astrological symbols, I'm beginning to wonder if glyph variations are justifications for separate encoding of symbols I would have previously considered the same or unifiable with symbols already in Unicode. > > For example, the semisquare aspect is usually shown with a glyph that is identical to ? (U+2220 ANGLE). However, sometimes it looks like <, or like ? (U+221F RIGHT ANGLE). Would this be better encoded as a separate codepoint? > > The parallel aspect, similarily, sometimes looks like ? 
(U+2225 PARALLEL TO), but is often shown as // or ? (U+2AFD DOUBLE SOLIDUS OPERATOR). This is not a typographical kludge since astrological fonts often show it this way. > There is also contra-parallel, which sometime is shown like ? (U+2226 NOT PARALLEL TO), but has varaint glyphs with slated lines (and the crossbar is often horizontal). > > The ?part of fortune? is sometimes a circled ?, or sometimes a circled +. > > Would it be better to have dedicated characters than to assume unifications in these cases? > > > > My take is that for symbols there's always that tension between encoding > the "concept" or encoding the shape. In my view, it is often impossible to > answer the question whether the different angles (for example) are merely > different "shapes" of one and the same "symbol", or whether it isn't the > case that there are different "conventions" (using different symbols for > the same concept). > > Disunification is useful, whenever different concepts require distinct > symbol shapes (even if there are some general similarities). If other > concepts make use of the same shapes interchangeably, it is then up to the > author to fix the convention by selecting one or the other shape. > Conceptually, that is similar to the decimal point: it can be either a > period, or a comma, depending on locale (read: depending on the convention > the author follows). > > Sometimes, concepts use multiple symbol shapes, but all of these shapes > map to the same concept (and other uses are not known). In that case, > unifying the shapes might be acceptable. The selection of shape is then a > matter of the font (and may not always be under the control of the author). > Conceptually, that is similar to the integral sign, which can be slanted or > upright. The choice is one of style. While authors or readers may prefer > one look over the other, the identity of the symbol is not in question, and > there's no impact on transmission of the contents of the text. 
> > Whenever we have the former case, that is, multiple conventional > presentations that are symbols in their own right in other contexts, then > encoding an additional "generic" shape should be avoided. Unicode > explicitly did not encode a generic "decimal point". If the convention that > is used matters, the author is better off being able to select a specific > shape. The results will be more predictable. The downside is that a search > will have to cover all the conventions. Conceptually, that is no different > from having to search for both "color" and "colour". > > The final case is where a convention for depicting a concept uses a symbol > that itself has some variability (for example when representing some other > concepts), such that some of its forms make it less than ideal for the > conventional use intended for the concept in question. Unicode has > historically not always been able to provide a solution. In some of these > cases, plain text (that is, without a fixed font association) may simply > not give the desired answer. If specialized fonts for the convention (e.g. > astrological fonts) do not usually exist or can't be expected, then > disunifying the symbol's shapes may be an answer. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Mar 17 02:20:06 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 17 Mar 2016 00:20:06 -0700 Subject: Variations and Unifications ? In-Reply-To: References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> Message-ID: <56EA5AA6.6040202@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Mar 17 02:47:26 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 17 Mar 2016 08:47:26 +0100 Subject: Variations and Unifications ? 
In-Reply-To: <56EA5AA6.6040202@ix.netcom.com>
References: <1426199709.593202.1458098055262.JavaMail.yahoo.ref@mail.yahoo.com> <1426199709.593202.1458098055262.JavaMail.yahoo@mail.yahoo.com> <56E9993E.1090902@ix.netcom.com> <56EA5AA6.6040202@ix.netcom.com>
Message-ID:

One problem caused by disunification is the added complexity of algorithms that handle text.

I forgot an important case where disunification also occurred: combining sequences are the "normal" encoding, but legacy charsets encoded the precomposed characters separately, and Unicode had to map them for round-trip compatibility. This had a consequence: the creation of additional properties (i.e. for "canonical equivalences") to reconcile the two sets of encodings and allow some form of equivalence.

In fact this is general: each time we disunify a character, we have to add new properties, and possibly update the algorithms to take these properties into account and find some form of equivalence.

So disunification solves one problem but creates others. We have to weigh the benefits and costs of using the disunified characters against those of using the "normal" characters (possibly in sequences). But given the number of cases where we have to support sequences anyway (even if only combining sequences for canonical equivalences), we should really discourage disunifying characters: if it's possible with sequences, don't disunify.

A famous example (based on a legacy decision that was bad in my opinion, as the cost was not considered) was the disunification of Latin/Greek letters for mathematical purposes, only to force a specific style. The alternative representation using sequences (using variation selectors, for example, since the addition of specific modifiers for "styles" like "bold", "italic" or "monospace" was rejected with good reasons) was not really analyzed in terms of benefits and costs, using the algorithms we already have (and that could have been updated).
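[Editorial illustration] The canonical equivalences mentioned in this message can be checked with a minimal Python sketch using the standard `unicodedata` module. The sample characters (é and U+1D400 MATHEMATICAL BOLD CAPITAL A) and the helper name `canonically_equal` are assumptions chosen for illustration; they do not come from the thread.

```python
import unicodedata

# Two encodings of the same text: a precomposed legacy character and the
# "normal" combining sequence that it is canonically equivalent to.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
sequence = "e\u0301"     # e + COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# A disunified mathematical letter: it has only a compatibility mapping to
# the plain letter, so only the NFKC/NFKD forms fold it back to "A".
math_bold_a = "\U0001D400"  # MATHEMATICAL BOLD CAPITAL A
folded = unicodedata.normalize("NFKC", math_bold_a)
```

A raw comparison of `precomposed` and `sequence` fails, which is why any code that matches or searches text has to apply the extra normalization step first; this is one concrete form of the algorithmic cost described above.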
But mathematical symbols are (normally...) not used at all in the same context as plain alphabetic letters (even if there's absolutely no guarantee that they will always be distinguishable from them when they occur in some linguistic text rendered with the same style...).

The naive thinking that disunification will make things simpler is completely wrong, given that an application that ignored all character properties and used only isolated characters would break legitimate rules in many cases, even for rendering purposes. It is in fact simpler to keep the possible sequences that are already encoded (or that could be extended to cover more cases: e.g. add new variation sequences, introduce some new modifiers, not just new combining characters, and so on).

We were strongly told: Unicode encodes characters, not glyphs. This should be remembered (and the argument of the costs caused by disunification of distinct glyphs is also a good one against it).

2016-03-17 8:20 GMT+01:00 Asmus Freytag (t) :

> On 3/16/2016 11:11 PM, Philippe Verdy wrote:
>
> "Disunification may be an answer?" We should avoid it as well.
>
> Disunification is only acceptable when
> - there's a complete disunification of concepts....
>
>
> I think answering this question depends on the understanding of "concept",
> and on understanding what it is that Unicode encodes.
>
> When it comes to *symbols*, which is where the discussion originated,
> it's not immediately obvious what Unicode encodes. For example, I posit
> that Unicode does not encode the "concept" for specific mathematical
> operators, but the individual "symbols" that are used for them.
>
> For example PRIME and DOUBLE PRIME can be used for minutes and seconds
> (both of time and arc) as well as for other purposes. Unicode correctly
> does not encode "MINUTE OF ARC", but the symbol used for that -- leaving it
> up to the notational convention to relate the concept and the symbol.
> > Thus we have a case where multiple concepts match a single symbol. For the > converse, we take the well-known case of COMMA and FULL STOP which can both > be used to separate a decimal fraction. > > Only in those cases where a single concept is associated so exclusively > with a given symbol, do we find the situation that it makes sense to treat > variations in shape of that symbol as the same symbol, but with different > glyphs. > > For some astrological symbols that is the case, but for others it is not. > Therefore, the encoding model for astrological text cannot be uniform. > Where symbols have exclusive association with a concept, the natural > encoding is to encode symbols with an understood set of variant glyphs. > Where concepts are denoted with symbols that are also used otherwise, then > the association of concept to symbol must become a matter of notational > convention and cannot form the basis of encoding: the code elements have to > be on a lower level, and by necessity represent specific symbol shapes. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dzo at bisharat.net Thu Mar 17 11:43:33 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 12:43:33 -0400 Subject: =?UTF-8?Q?Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= Message-ID: <56EADEB5.9000406@bisharat.net> Odd result when copy/pasting text from a PDF: For some reason "ti" in the (English) text of the document at http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf is coded as "?". Looking more closely at the original text, it does appear that the glyph is a "ti" ligature (which afaik is not coded as such in Unicode). Out of curiosity, did a web search on "interna?onal" and got over 11k hits, apparently all PDFs. Anyone have any idea what's going on? Am assuming this is not a deliberate choice by diverse people creating PDFs and wanting "ti" ligatures for stylistic reasons. 
Note the document linked above is current, so this is not (just) an issue with older documents. Don Osborn From olopierpa at gmail.com Thu Mar 17 12:26:56 2016 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 17 Mar 2016 18:26:56 +0100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: <56EADEB5.9000406@bisharat.net> References: <56EADEB5.9000406@bisharat.net> Message-ID: That document displays correctly for me using both the pdf viewer built into chrome and the standalone Acrobat reader v.11. The problem could be in your PDF viewer? What are you viewing the document with? On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: > Odd result when copy/pasting text from a PDF: For some reason "ti" in the > (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". Looking more closely at the original text, it does appear > that the glyph is a "ti" ligature (which afaik is not coded as such in > Unicode). > > Out of curiosity, did a web search on "interna?onal" and got over 11k hits, > apparently all PDFs. > > Anyone have any idea what's going on? Am assuming this is not a deliberate > choice by diverse people creating PDFs and wanting "ti" ligatures for > stylistic reasons. Note the document linked above is current, so this is not > (just) an issue with older documents. > > Don Osborn From leoboiko at namakajiri.net Thu Mar 17 12:37:05 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 17 Mar 2016 14:37:05 -0300 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: The PDF *displays* correctly. But try copying the string 'ti' from the text another application outside of your PDF viewer, and you'll see that the thing that *displays* as 'ti' is *coded* as ?, as Don Osborn said. 
2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi : > That document displays correctly for me using both the pdf viewer > built into chrome and the standalone Acrobat reader v.11. The problem > could be in your PDF viewer? What are you viewing the document with? > > On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: >> Odd result when copy/pasting text from a PDF: For some reason "ti" in the >> (English) text of the document at >> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >> is coded as "?". Looking more closely at the original text, it does appear >> that the glyph is a "ti" ligature (which afaik is not coded as such in >> Unicode). >> >> Out of curiosity, did a web search on "interna?onal" and got over 11k hits, >> apparently all PDFs. >> >> Anyone have any idea what's going on? Am assuming this is not a deliberate >> choice by diverse people creating PDFs and wanting "ti" ligatures for >> stylistic reasons. Note the document linked above is current, so this is not >> (just) an issue with older documents. >> >> Don Osborn > From dzo at bisharat.net Thu Mar 17 12:45:34 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 13:45:34 -0400 Subject: =?UTF-8?Q?Re:_Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: <56EAED3E.1080601@bisharat.net> Thanks Leonardo, that is my initial observation. And it has implications for web searches. And there's more. Apparently this is one of a number of such substitutions, which taken together begin to look like the old pre-Unicode hacks of 8-bit fonts. And I found some of them via web search in a number of Google Books and pages on issuu.com. Evidently some kind of font issue, and not random assignments. From the same document: ff ligature = ? fl ligature = ? ft ligature = ? tt ligature = ? And perhaps others. 
Seems to defeat the intent of Unicode, as these documents and pages will not come up in typical web search on the normal spellings (unless maybe Google is incorporating an algorithm to include results for say "interna?onal" in a search on the term "international"?). Don On 3/17/2016 1:37 PM, Leonardo Boiko wrote: > The PDF *displays* correctly. But try copying the string 'ti' from > the text another application outside of your PDF viewer, and you'll > see that the thing that *displays* as 'ti' is *coded* as ?, as Don > Osborn said. > > > 2016-03-17 14:26 GMT-03:00 Pierpaolo Bernardi : >> That document displays correctly for me using both the pdf viewer >> built into chrome and the standalone Acrobat reader v.11. The problem >> could be in your PDF viewer? What are you viewing the document with? >> >> On Thu, Mar 17, 2016 at 5:43 PM, Don Osborn wrote: >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in the >>> (English) text of the document at >>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>> is coded as "?". Looking more closely at the original text, it does appear >>> that the glyph is a "ti" ligature (which afaik is not coded as such in >>> Unicode). >>> >>> Out of curiosity, did a web search on "interna?onal" and got over 11k hits, >>> apparently all PDFs. >>> >>> Anyone have any idea what's going on? Am assuming this is not a deliberate >>> choice by diverse people creating PDFs and wanting "ti" ligatures for >>> stylistic reasons. Note the document linked above is current, so this is not >>> (just) an issue with older documents. 
>>> >>> Don Osborn From jknappen at web.de Thu Mar 17 12:57:15 2016 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Thu, 17 Mar 2016 18:57:15 +0100 Subject: =?UTF-8?Q?Aw=3A_Joined_=22ti=22_coded_as_=22=C6=9F=22_in_PDF?= In-Reply-To: <56EADEB5.9000406@bisharat.net> References: <56EADEB5.9000406@bisharat.net> Message-ID: An HTML attachment was scrubbed... URL: From olopierpa at gmail.com Thu Mar 17 13:02:19 2016 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Thu, 17 Mar 2016 19:02:19 +0100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko wrote: > The PDF *displays* correctly. But try copying the string 'ti' from > the text another application outside of your PDF viewer, and you'll > see that the thing that *displays* as 'ti' is *coded* as ?, as Don > Osborn said. Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about unicode. It uses the encoding of the fonts used. The ti ligature is a glyph in the font used in that document. Its code has nothing to do with anything unicode. It looks like a pre-unicode hack because unicode says nothing about font technologies, and hence nothing has changed in PDF because of unicode (nor could have, unicode does not mandate how to encode ligatures). From leoboiko at namakajiri.net Thu Mar 17 13:06:22 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 17 Mar 2016 15:06:22 -0300 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: Yeah, I've stumbled upon this a lot in academic Japanese/Chinese texts. I try to copy some Chinese character, only to find out that it's really a string of random ASCII characters. Is there only one of those crap PDF pseudo-encodings? If so, I'll use a conversor next time... 
2016-03-17 14:57 GMT-03:00 "J?rg Knappen" : > I inspected the pdf file, and its font encoding is termed "Identity-H". I > couldn't reveal much about this encoding, but it seems to be a private > encoding of Adobe used especially for Asian fonts. > > --J?rg Knappen > > Gesendet: Donnerstag, 17. M?rz 2016 um 17:43 Uhr > Von: "Don Osborn" > An: unicode at unicode.org > Betreff: Joined "ti" coded as "?" in PDF > Odd result when copy/pasting text from a PDF: For some reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". Looking more closely at the original text, it does > appear that the glyph is a "ti" ligature (which afaik is not coded as > such in Unicode). > > Out of curiosity, did a web search on "interna?onal" and got over 11k > hits, apparently all PDFs. > > Anyone have any idea what's going on? Am assuming this is not a > deliberate choice by diverse people creating PDFs and wanting "ti" > ligatures for stylistic reasons. Note the document linked above is > current, so this is not (just) an issue with older documents. > > Don Osborn From doug at ewellic.org Thu Mar 17 13:11:44 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 17 Mar 2016 11:11:44 -0700 Subject: Joined "ti" coded as =?UTF-8?Q?=22=C6=9F=22=20in=20PDF?= Message-ID: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> Don Osborn wrote: > Odd result when copy/pasting text from a PDF: For some reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". Looking more closely at the original text, it does > appear that the glyph is a "ti" ligature (which afaik is not coded as > such in Unicode). 
When I copy and paste the PDF text in question into BabelPad, I get:

> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By
> invita??on only)

The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use character.

Truncating this character to 16 bits, which is a Bad Thing™, yields U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either Don's clipboard or the editor he pasted it into is not fully Unicode-compliant.

Don's point about using alternative characters to implement ligatures, thereby messing up web searches, remains valid.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From steve at swales.us Thu Mar 17 13:17:12 2016
From: steve at swales.us (Steve Swales)
Date: Thu, 17 Mar 2016 11:17:12 -0700
Subject: Re: Joined "ti" coded as "Ɵ" in PDF
In-Reply-To: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
Message-ID: <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>

Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless.

-steve

> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote:
>
> Don Osborn wrote:
>
>> Odd result when copy/pasting text from a PDF: For some reason "ti" in
>> the (English) text of the document at
>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf
>> is coded as "Ɵ". Looking more closely at the original text, it does
>> appear that the glyph is a "ti" ligature (which afaik is not coded as
>> such in Unicode).
> > When I copy and paste the PDF text in question into BabelPad, I get: > >> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >> invita??on only) > > The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use > character. > > Truncating this character to 16 bits, which is a Bad Thing?, yields > U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either > Don's clipboard or the editor he pasted it into is not fully > Unicode-compliant. > > Don's point about using alternative characters to implement ligatures, > thereby messing up web searches, remains valid. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > From charupdate at orange.fr Thu Mar 17 14:00:54 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 17 Mar 2016 20:00:54 +0100 (CET) Subject: =?UTF-8?Q?Re:_Joined_"ti"_coded_as_"=C6=9F"_in_PDF?= In-Reply-To: References: <56EADEB5.9000406@bisharat.net> Message-ID: <364670549.29314.1458241254719.JavaMail.www@wwinf1p21> On Thu, Mar 17, 2016 at 19:02:19, Pierpaolo Bernardi wrote: > unicode says nothing about font technologies It mentions them a little bit however in the core specifications: http://www.unicode.org/versions/Unicode8.0.0/ch23.pdf#G23126 > unicode does not mandate how to encode ligatures Probably because Unicode specifies that ?it is the task of the rendering system? to select ligature glyphs on the basis of characteristic sequences of characters in the text stream. While having found some of the mentioned oddities in an old PDF file (ffi ligature ending up as Y, ffl ligature as Z), I?m now really puzzled about actual practise. 
Marcel

From verdy_p at wanadoo.fr Thu Mar 17 15:18:35 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 17 Mar 2016 21:18:35 +0100
Subject: Re: Joined "ti" coded as "Ɵ" in PDF
In-Reply-To:
References: <56EADEB5.9000406@bisharat.net>
Message-ID:

2016-03-17 19:02 GMT+01:00 Pierpaolo Bernardi :

> On Thu, Mar 17, 2016 at 6:37 PM, Leonardo Boiko
> wrote:
> > The PDF *displays* correctly. But try copying the string 'ti' from
> > the text another application outside of your PDF viewer, and you'll
> > see that the thing that *displays* as 'ti' is *coded* as Ɵ, as Don
> > Osborn said.
>
> Ah. OK. Anyway this is not a Unicode problem. PDF knows nothing about
> unicode. It uses the encoding of the fonts used.
>

That's correct; however, the PDF specs contain guidelines for naming glyphs in fonts in such a way that the encoding can be deciphered. This is needed, for example, in applications such as PDF forms where user input is expected.

When those PDFs are generated from rich text, the fonts used may be built with TrueType (without any glyph names in them, only mappings of sequences of code points), OpenType, or PostScript. When OpenType fonts contain PostScript glyphs, their names may be completely arbitrary; it does not even matter whether the font used was mapped to Unicode or used a legacy or proprietary encoding.

If you see a "Ɵ" when copy-pasting from the PDF, it's because the font used to produce it did not follow these guidelines, or did not specify any glyph name, in which case a sort of OCR algorithm attempts to decipher the glyph. The "ti" ligature is visually extremely close to "Ɵ", and OCR has great difficulty distinguishing them unless it also uses some dictionary lookups and hints about the script of the surrounding characters to improve the guess.
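[Editorial illustration] Doug Ewell's message earlier in this thread gives a different explanation for the same symptom: the ligature is stored as the private-use code point U+10019F, and something in the copy/paste path keeps only its low 16 bits. The arithmetic can be checked with a minimal Python sketch; the function name `truncate_to_16_bits` is hypothetical, and the sketch only reproduces the arithmetic, not the behaviour of any particular clipboard or editor.

```python
import unicodedata


def truncate_to_16_bits(code_point: int) -> int:
    """Hypothetical helper: keep only the low 16 bits of a code point."""
    return code_point & 0xFFFF


ligature_code_point = 0x10019F  # Plane 16 private-use code point from the thread
truncated = truncate_to_16_bits(ligature_code_point)

# General category "Co" marks a private-use code point.
category = unicodedata.category(chr(ligature_code_point))

# Name of the character that the truncated value happens to land on.
truncated_name = unicodedata.name(chr(truncated))

# A correct UTF-16 encoder represents the code point as a surrogate pair.
utf16_bytes = chr(ligature_code_point).encode("utf-16-be")
```

Under this reading, "Ɵ" appears because 0x10019F and 0x019F share their low 16 bits, so the result depends on the software doing the pasting, which matches the different results reported in this thread for BabelPad, Word, and Preview.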
Note that PDFs (or DjVu files) are not required to contain only text; they could just embed a scanned and compressed bitmap image. (If you want to see how wrong OCR can be, look at how it fails with lots of errors, for example in the decoding projects for Wikibooks, working with scanned bitmaps of old books: OCR is just a helper, and there's still a lot of work to correct what has been guessed and re-encode the correct text. Even if humans are smarter than OCR, this is a lot of work to perform manually: encoding the text of a single scanned old book still takes one or two months for an experienced editor, and there are still many errors for someone else to review later.) Most PDFs were not created with the idea of later decoding their rendered text. In fact they were intended to be read or printed "as is", including their styles, colors, and font decorations everywhere, or text over photos. They were even created to be non-modifiable and then used for archival. Some PDF tools will also strip additional metadata from the PDF, such as the original fonts used; instead, these PDFs will locally embed pseudo-fonts containing sets of glyphs from various fonts (in mixed styles), in random order, sorted by frequency of use in the document, or by order of occurrence in the original text. These embedded fonts are generated on the fly to contain only the glyphs necessary for the document. When those embedded fonts are generated, a compression algorithm drops lots of things from the original font, including its metadata such as the original "PostScript" glyph names. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From dzo at bisharat.net Thu Mar 17 15:44:19 2016 From: dzo at bisharat.net (Don Osborn) Date: Thu, 17 Mar 2016 16:44:19 -0400 Subject: =?UTF-8?Q?Re:_Joined_=22ti=22_coded_as_=22=c6=9f=22_in_PDF?= In-Reply-To: <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> Message-ID: <56EB1723.7030301@bisharat.net> Thanks all for the feedback. Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor. So, when I did a web search on "interna?onal," as previously mentioned, and come up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode compliant conversions by others? A web search on what you came up with - "Interna??onal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless. > > -steve > >> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >> >> Don Osborn wrote: >> >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in >>> the (English) text of the document at >>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>> is coded as "?". Looking more closely at the original text, it does >>> appear that the glyph is a "ti" ligature (which afaik is not coded as >>> such in Unicode). >> When I copy and paste the PDF text in question into BabelPad, I get: >> >>> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >>> invita??on only) >> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >> character. 
>> >> Truncating this character to 16 bits, which is a Bad Thing™, yields >> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either >> Don's clipboard or the editor he pasted it into is not fully >> Unicode-compliant. >> >> Don's point about using alternative characters to implement ligatures, >> thereby messing up web searches, remains valid. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 >> >> > From lang.support at gmail.com Thu Mar 17 18:34:04 2016 From: lang.support at gmail.com (Andrew Cunningham) Date: Fri, 18 Mar 2016 10:34:04 +1100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: <56EB1723.7030301@bisharat.net> References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> Message-ID: There are a few things going on. In the first instance, it may be the font itself that is the source of the problem. My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and code points. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences. I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated. It is worth noting, though, that in OpenType fonts not all glyphs will have mappings in the cmap table. The remedy is to extensively tag the PDF and add ActualText attributes to the tags. But the PDF specs leave it up to the developer to decide what happens if there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF. At least that's my current understanding.
Andrew On 18 Mar 2016 7:47 am, "Don Osborn" wrote: > Thanks all for the feedback. > > Doug, It may well be my clipboard (running Windows 7 on this particular > laptop). Get same results pasting into Word and EmEditor. > > So, when I did a web search on "interna?onal," as previously mentioned, > and come up with a lot of results (mostly PDFs), were those also a > consequence of many not fully Unicode compliant conversions by others? > > A web search on what you came up with - "Interna??onal" - yielded many > more (82k+) results, again mostly PDFs, with terms like "interna onal" > (such as what Steve noted) and "interna nature of, or how Google interprets, the private use character?). > > Searching within the PDF document already mentioned, "international" comes > up with nothing (which is a major fail as far as usability). Searching the > PDF in a Firefox browser window, only "interna?onal" finds the occurrences > of what displays as "international." However after downloading the document > and searching it in Acrobat, only a search for "interna??onal" will find > what displays as "international." > > A separate web search on "E?ects" came up with 300+ results, including > some GoogleBooks which in the texts display "effects" (as far as I > checked). So this is not limited to Adobe? > > J?rg, With regard to "Identity H," a quick search gives the impression > that this encoding has had a fairly wide and not so happy impact, even if > on the surface level it may have facilitated display in a particular style > of font in ways that no one complains about. > > Altogether a mess, from my limited encounter with it. There must have been > a good reason for or saving grace of this solution? > > Don > > On 3/17/2016 2:17 PM, Steve Swales wrote: > >> Yes, it seems like your mileage varies with the PDF >> viewer/interpreter/converter. Text copied from Preview on the Mac replaces >> the ti ligature with a space. 
Certainly not a Unicode problem, per se, but >> an interesting problem nevertheless. >> >> -steve >> >> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >>> >>> Don Osborn wrote: >>> >>> Odd result when copy/pasting text from a PDF: For some reason "ti" in >>>> the (English) text of the document at >>>> >>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>>> is coded as "?". Looking more closely at the original text, it does >>>> appear that the glyph is a "ti" ligature (which afaik is not coded as >>>> such in Unicode). >>>> >>> When I copy and paste the PDF text in question into BabelPad, I get: >>> >>> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >>>> invita??on only) >>>> >>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >>> character. >>> >>> Truncating this character to 16 bits, which is a Bad Thing?, yields >>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either >>> Don's clipboard or the editor he pasted it into is not fully >>> Unicode-compliant. >>> >>> Don's point about using alternative characters to implement ligatures, >>> thereby messing up web searches, remains valid. >>> >>> -- >>> Doug Ewell | http://ewellic.org | Thornton, CO ???? >>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Thu Mar 17 23:18:38 2016 From: gwalla at gmail.com (Garth Wallace) Date: Thu, 17 Mar 2016 21:18:38 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: There's another strategy for dealing with enclosed numbers, which is taken by the font Quivira in its PUA: encoding separate left-half-circle-enclosed and right-half-circle-enclosed digits. This would require 20 characters to cover the double digit range 00?99. 
Enclosed three-digit numbers would require an additional 30 for left, center, and right thirds, though it may be possible to reuse the left and right half-circle-enclosed digits and assume that fonts will provide left half-center third-right half ligatures (Quivira provides "middle parts", though the result is a stadium instead of a true circle). It should be possible to do the same for enclosed ideographic numbers, I think. The problems I can see with this are confusability with the already encoded atomic enclosed numbers, and breaking in vertical text. On Wed, Mar 16, 2016 at 5:45 PM, Andrew West wrote: > Hi Frédéric, > > The historic use of ideographic numbers for marking Go moves is > discussed in the latest draft of my document: > > http://www.babelstone.co.uk/Unicode/GoNotation.pdf > > Andrew > > > On 16 March 2016 at 13:35, Frédéric Grosshans > wrote: >> Le 15/03/2016 22:21, Andrew West a écrit : >>> >>> >>> Possibly. I certainly have very little expectation that a proposal to >>> complete both sets to 999 (or even 399) would have any chance of >>> success. >> >> And then, there are also historical examples of ideographic numbers used >> for the same purpose in historic texts (like here >> http://sns.91ddcc.com/t/54057, here http://pmgs.kongfz.com/item_pic_464349/ >> or here >> http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod >> ). >> >> The above were found with a quick Google search, and I have no idea >> whether these symbols were used in the running text or not.
>> >> Frédéric >> > From d3ck0r at gmail.com Fri Mar 18 01:28:18 2016 From: d3ck0r at gmail.com (J Decker) Date: Thu, 17 Mar 2016 23:28:18 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace wrote: > There's another strategy for dealing with enclosed numbers, which is > taken by the font Quivira in its PUA: encoding separate > left-half-circle-enclosed and right-half-circle-enclosed digits. This > would require 20 characters to cover the double digit range 00–99. > Enclosed three digit numbers would require an additional 30 for left, > center, and right thirds, though it may be possible to reuse the left > and right half circle enclosed digits and assume that fonts will > provide left half-center third-right half ligatures (Quivira provides > "middle parts" though the result is a stadium instead of a true > circle). It should be possible to do the same for enclosed ideographic > numbers, I think. > > The problems I can see with this are confusability with the already > encoded atomic enclosed numbers, and breaking in vertical text. > I suppose that's why things like this happen in applications.... Joined "ti" coded as "Ɵ" in PDF http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0084.html You get an encoding of a series of code points, which results in an array of font glyph-points to render ....
From gwalla at gmail.com Fri Mar 18 02:09:01 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 18 Mar 2016 00:09:01 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On Thu, Mar 17, 2016 at 11:28 PM, J Decker wrote: > On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace wrote: >> There's another strategy for dealing with enclosed numbers, which is >> taken by the font Quivira in its PUA: encoding separate >> left-half-circle-enclosed and right-half-circle-enclosed digits. This >> would require 20 characters to cover the double digit range 00–99. >> Enclosed three digit numbers would require an additional 30 for left, >> center, and right thirds, though it may be possible to reuse the left >> and right half circle enclosed digits and assume that fonts will >> provide left half-center third-right half ligatures (Quivira provides >> "middle parts" though the result is a stadium instead of a true >> circle). It should be possible to do the same for enclosed ideographic >> numbers, I think. >> >> The problems I can see with this are confusability with the already >> encoded atomic enclosed numbers, and breaking in vertical text. >> > > I suppose that's why things like this happen in applications.... > > Joined "ti" coded as "Ɵ" in PDF > > http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0084.html > > you get an encode of a series of codepoints, that results in an array > of font glyph-points to render .... What? I don't see what an apparent ligature matching or OCR glitch in PDFs has to do with this.
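For readers who have not followed the other thread: the PDF glitch referred to here is the one where Doug Ewell reported that the "ti" ligature was stored as U+10019F and that truncating it to 16 bits yields U+019F. The following is only a minimal Python sketch of that arithmetic. The helper names `truncate_to_16_bits` and `utf16_surrogates` are hypothetical, and the sketch illustrates the bit masking only; it does not claim to reproduce what any particular clipboard, editor, or PDF viewer actually does.

```python
import unicodedata


def truncate_to_16_bits(code_point):
    """Keep only the low 16 bits, as a pipeline limited to 16-bit values might (assumption)."""
    return code_point & 0xFFFF


def utf16_surrogates(code_point):
    """Encode a supplementary code point as a UTF-16 surrogate pair (the compliant way)."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


ligature = 0x10019F  # Plane 16 private-use code point reported in the thread
truncated = truncate_to_16_bits(ligature)
high, low = utf16_surrogates(ligature)

# General category "Co" means private use.
print(f"U+{ligature:06X} has category {unicodedata.category(chr(ligature))}")
print(f"Truncated: U+{truncated:04X} {unicodedata.name(chr(truncated))}")
print(f"UTF-16 code units: {high:04X} {low:04X}")
```

A private-use code point has no standard meaning, so even a fully compliant copy and paste would not produce the letters "ti"; the truncation only explains why one particular wrong character shows up instead of another.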
From duerst at it.aoyama.ac.jp Fri Mar 18 02:43:56 2016 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 18 Mar 2016 16:43:56 +0900 Subject: Swapcase for Titlecase characters Message-ID: <56EBB1BC.7040107@it.aoyama.ac.jp> I'm working on extending the case conversion methods for the programming language Ruby from the current ASCII only to cover all of Unicode. Ruby comes with four methods for case conversion. Three of them, upcase, downcase, and capitalize, are quite clear. But we have hit a question for the fourth method, swapcase. What swapcase does is swap upper and lower case, so that e.g. 'Unicode Standard'.swapcase => 'uNICODE sTANDARD' I'm not sure myself where this method is actually used, but it also exists in Python (and maybe Ruby got it from there). Now the question I have is: What to do for titlecase characters? Several possibilities already have been floated: a) Leave as is, because they are neither upper nor lower case. b) Convert to upper (or lower), which may simplify implementation. c) Decompose the character into upper and lower case components, and apply swapcase to these. For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or '?insi') with b), and 'dŽINSI' with c). For another example, '???' would become '???' with a), '????' (or '???') with b), and '????' with c). It looks like Python 3 (3.4.3 in my case) is doing a). My guess is that from a user expectation point of view, c) is best, so I'm tending to go for c). There is no existing data from the Unicode Standard for this, but it seems pretty straightforward. But before I just implement something, I'd appreciate additional input, in particular from users closer to the affected language communities. Regards, Martin.
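The three options above can be tried out with a small sketch. It is written in Python rather than Ruby, since the message mentions Python's existing swapcase, and it is only an illustration under stated assumptions: the function name `swapcase_titlecase` and its `mode` values are hypothetical, titlecase characters are detected by general category Lt, and option c) is approximated by a compatibility decomposition (NFKD) followed by recomposition (NFC), which is one possible reading of "decompose", not an algorithm defined by the Unicode Standard. The test word assumes the Croatian example starts with U+01C5.

```python
import unicodedata


def swapcase_titlecase(text, mode="keep"):
    """Swap case, handling titlecase (Lt) characters according to `mode`.

    mode="keep"      -> option a): leave titlecase characters unchanged
    mode="upper"     -> option b): convert titlecase characters to upper case
    mode="decompose" -> option c): decompose, swap each component, recompose
    """
    out = []
    for ch in text:
        if unicodedata.category(ch) != "Lt":
            out.append(ch.swapcase())  # ordinary characters: built-in behaviour
        elif mode == "keep":
            out.append(ch)
        elif mode == "upper":
            out.append(ch.upper())
        elif mode == "decompose":
            parts = unicodedata.normalize("NFKD", ch)  # e.g. U+01C5 -> D, z, caron
            out.append(unicodedata.normalize("NFC", parts.swapcase()))
        else:
            raise ValueError(f"unknown mode: {mode!r}")
    return "".join(out)


word = "\u01C5insi"  # assumed spelling of the 'jeans' example, with a titlecase digraph
for mode in ("keep", "upper", "decompose"):
    print(mode, swapcase_titlecase(word, mode))
```

For the Greek titlecase characters the decomposition involves the combining ypogegrammeni, which has case mappings of its own, so the result of option c) there depends on details this sketch does not try to settle.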
From verdy_p at wanadoo.fr Fri Mar 18 03:08:50 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 18 Mar 2016 09:08:50 +0100 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: That's a smart idea... Note that you could encode the middle digits so that their enclosure at top and bottom are by default only horizontal (no arcs of circle) when shown in isolation, and the left and right parts are just connecting by default horizontally to the top and bottom position of the middle digits. Allowing arbitrary number of characters. In order to create a real circle, you could use a joiner control to given a hint the renderer that he can create a ligature (possibly reducing the size of digits, or changing the dimension and shape of the connected segments so that they'll draw a circle instead of a "cartouche" rounded at start and end. You could even set the enclosing as a combining character around existing digits (even if those digits are not symbols by themselves, the combining character has this property, an idea similar to the arrow combining characters at top or bottom for mathematics notations), so that the content of the "circle" or "cartouche". The enclosure could also be something else than a circle (or arcs of circle): it could be a rectangle, hintable with joiners (like with circles) to create an enclosing square, or a rounded rectangle (hintable to create a rounded square). The enclosure shapes could be white or black, or could be drawn with double strokes. This is in fact similar to the combining low line or top line which are joining by default. However using a joiner between them instructs not really to join the top/bottom line (which is already the expected behavior for these low/top lines) but to create a ligature between the base characters in the middle. 
Then to create double enclosure, just "stack" several combining characters (in the order from inside to outside: the combining characters for enclosures should have the same high value for their combining class so that their relative order is kept, or could have combining class 0). The issues with line breaking (if you can use these combining around all characters, inclusing spaces, can be solved using unbreakable characters. Note that this addition would create a disunification with existing enclosed characters which are already ligatured into a single symbol (they won't be canonically equivalent, using only the decomposition properties), but this can be solved by adding another property ("ligature decomposition"), and mapping the existing enclosed characters to their "ligature decomposition" using normal base characters, the new combining characters for enclosure and the joining control between them. those mappings can be in a new properties file (which could then be useful for collation so that the "enclosed 79" symbol would collate like "79"). Advantage: with these, you can now enclose various numbers (not just natural integers) or abbreviations (e.g. chemical Symbols like "Au" for gold), or astrological symbols, or arbitrary words (using them to enclose full sentences would not be very practicle, but their use to enclose a person name such as the name of a Egyptian king "Ramses" is possible, even outside the context of Egyptian hieroglyphs)... It could be used to enclose a temperature such as "10?C", or a section heading number "1.1". And this is much less limited than the (very quirky) use of CSS or styles (in rich text or HTML) to add surrounding "borders" as the shapes are less restricted (in CSS you can create rounded borders). 
Some new shapes are possible such as diagonal left and right sides, or mixing a rounded left side and a square right side (though in this case it would be hard to use joiners and expect a ligature to be created for the enclosing shape (for example expect a triangular enclosure created by the ligature of two diagonal sides and horizontal top/bottom for characters in the middle, because this would absolutely require resizing all characters in the middle to preserve a consistent line height; but this is possible for pairs of base characters inside the enclosure). Note : the enclosing ligature "joiner" control is not the same as the one for joining base characters, as the intent is to join the enclosing shape fragments (possibly by reducing the size and repositioning the all characters in the middle), as characters in the middle are not ligatured themselves (if you enclose "AE" in such shapes created with combining characters, it should not produce a "AE" letter in the final enclosing shape. 2016-03-18 5:18 GMT+01:00 Garth Wallace : > There's another strategy for dealing with enclosed numbers, which is > taken by the font Quivira in its PUA: encoding separate > left-half-circle-enclosed and right-half-circle-enclosed digits. This > would require 20 characters to cover the double digit range 00?99. > Enclosed three digit numbers would require an additional 30 for left, > center, and right thirds, though it may be possible to reuse the left > and right half circle enclosed digits and assume that fonts will > provide left half-center third-right half ligatures (Quivira provides > "middle parts" though the result is a stadium instead of a true > circle). It should be possible to do the same for enclosed ideographic > numbers, I think. > > The problems I can see with this are confusability with the already > encoded atomic enclosed numbers, and breaking in vertical text. 
> > On Wed, Mar 16, 2016 at 5:45 PM, Andrew West > wrote: > > Hi Fr?d?ric, > > > > The historic use of ideographic numbers for marking Go moves are > > discussed in the latest draft of my document: > > > > http://www.babelstone.co.uk/Unicode/GoNotation.pdf > > > > Andrew > > > > > > On 16 March 2016 at 13:35, Fr?d?ric Grosshans > > wrote: > >> Le 15/03/2016 22:21, Andrew West a ?crit : > >>> > >>> > >>> Possibly. I certainly have very little expectation that a proposal to > >>> complete both sets to 999 (or even 399) would have any chance of > >>> success. > >> > >> And then, there are also the historical example of ideographic numbers > used > >> for the same purpose in historic texts (like here > >> http://sns.91ddcc.com/t/54057, here > http://pmgs.kongfz.com/item_pic_464349/ > >> or here > >> > http://www.weibo.com/p/1001593905063666976890?from=page_100106_profile&wvr=6&mod=wenzhangmod > >> ). > >> > >> The above has been found with a quick google search, and I have no idea > >> whether these symbols were used in the running text or not. > >> > >> Fr?d?ric > >> > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 18 11:59:48 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 18 Mar 2016 17:59:48 +0100 Subject: Meteorological symbols for cloud conditions (on maps or elsewhere) Message-ID: See https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg I see these symbols for noting cloud types (here cirrus and altocumulus, one drawn in diagonal for middle altitude, another drawn horizontally for high altitudes). 
Note that the symbols may vary: see Altocumulus for example as found in French Wikipedia (note sure if it's accurate) which is different from the symbol found in the sampled notation on a map https://fr.wikipedia.org/wiki/Altocumulus Also other symbols on the similar page in English Wikipedia, are used to describe some cloud characteristics: https://en.wikipedia.org/wiki/Altocumulus_cloud Is there a well defined collection of these symbols, and are they in the encoding pipe ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 18 12:09:39 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 18 Mar 2016 18:09:39 +0100 Subject: Meteorological symbols for cloud conditions (on maps or elsewhere) In-Reply-To: References: Message-ID: Some other resources (outside Wikipedia): - Kean University: http://www.kean.edu/~fosborne/resources/ex10g.htm - Documented by the NOAA in US (but I don't find the complete reference) - These symbols seem to be supported by an "international standard", but I don't know which one exactly. - Documented with other symbols (rain, ice, snow, thunder...) in Canada for flight planning https://flightplanning.navcanada.ca/cgi-bin/CreePage.pl?Langue=anglais&NoSession=NS_Inconnu&Page=wxsymbols&TypeDoc=wxsymb - http://www.visualdictionaryonline.com/earth/meteorology/international-weather-symbols/clouds.php 2016-03-18 17:59 GMT+01:00 Philippe Verdy : > See > https://fr.wikipedia.org/wiki/Carte_m%C3%A9t%C3%A9orologique#/media/File:Station_model_fr.svg > > I see these symbols for noting cloud types (here cirrus and altocumulus, > one drawn in diagonal for middle altitude, another drawn horizontally for > high altitudes). 
> > Note that the symbols may vary: see Altocumulus for example as found in > French Wikipedia (note sure if it's accurate) which is different from the > symbol found in the sampled notation on a map > > https://fr.wikipedia.org/wiki/Altocumulus > > Also other symbols on the similar page in English Wikipedia, are used to > describe some cloud characteristics: > > https://en.wikipedia.org/wiki/Altocumulus_cloud > > Is there a well defined collection of these symbols, and are they in the > encoding pipe ? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Fri Mar 18 13:11:52 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 18 Mar 2016 11:11:52 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On Fri, Mar 18, 2016 at 1:08 AM, Philippe Verdy wrote: > That's a smart idea... Note that you could encode the middle digits so that > their enclosure at top and bottom are by default only horizontal (no arcs of > circle) when shown in isolation, and the left and right parts are just > connecting by default horizontally to the top and bottom position of the > middle digits. Allowing arbitrary number of characters. In order to create a > real circle, you could use a joiner control to given a hint the renderer > that he can create a ligature (possibly reducing the size of digits, or > changing the dimension and shape of the connected segments so that they'll > draw a circle instead of a "cartouche" rounded at start and end. Since left-right pairs and left-middle-right triples are intended to be used together, ZWJs would be redundant. I'm not sure about extending it to an arbitrary number of enclosed digits. It seems like that would require special support from rendering. 
Supporting only a well-defined set of combinations would work with just OpenType ligature lookup tables (and wouldn't even necessarily require ligatures in all cases). > You could even set the enclosing as a combining character around existing > digits (even if those digits are not symbols by themselves, the combining > character has this property, an idea similar to the arrow combining > characters at top or bottom for mathematics notations), so that the content > of the "circle" or "cartouche". > > The enclosure could also be something else than a circle (or arcs of > circle): it could be a rectangle, hintable with joiners (like with circles) > to create an enclosing square, or a rounded rectangle (hintable to create a > rounded square). I thought combining characters would not be suitable for things like white text on black. > The enclosure shapes could be white or black, or could be drawn with double > strokes. This is in fact similar to the combining low line or top line which > are joining by default. > > However using a joiner between them instructs not really to join the > top/bottom line (which is already the expected behavior for these low/top > lines) but to create a ligature between the base characters in the middle. > Then to create double enclosure, just "stack" several combining characters > (in the order from inside to outside: the combining characters for > enclosures should have the same high value for their combining class so that > their relative order is kept, or could have combining class 0). Double enclosure? I'm not sure what the purpose of that would be. This is getting into styling territory, I think. > The issues with line breaking (if you can use these combining around all > characters, inclusing spaces, can be solved using unbreakable characters. Line breaking isn't really a problem that I can see with the Quivira model. 
If they're given the usual line breaking properties for symbols, the Unicode line breaking algorithm would prevent a break between halves. East Asian vertical text is another story. In a font that just uses kerning to join halves (as Quivira does) you'd end up with the left half on top of the right in vertical text. I'm not sure how ligatures are handled in vertical text. From verdy_p at wanadoo.fr Fri Mar 18 13:48:45 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 18 Mar 2016 19:48:45 +0100 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: 2016-03-18 19:11 GMT+01:00 Garth Wallace : > > The issues with line breaking (if you can use these combining around all > > characters, including spaces, can be solved using unbreakable characters. > > Line breaking isn't really a problem that I can see with the Quivira > model. If they're given the usual line breaking properties for > symbols, the Unicode line breaking algorithm would prevent a break > between halves. East Asian vertical text is another story. In a font > that just uses kerning to join halves (as Quivira does) you'd end up > with the left half on top of the right in vertical text. I'm not sure > how ligatures are handled in vertical text. > East Asian vertical presentation does not just stack the elements on top of each other; very frequently it rotates them (including Latin/Greek/Cyrillic letters). So this is not really a new complication. The numbers, however, are used for noting or commenting on a strategy, or the placement order during a game. For game notation purposes, though, rotation plays a significant role, notably if those two-part symbols are joined in a circle or disc: it can make the difference between several distinct sets of stones, or it could be used in a 4-player go variant (where black vs.
white is not sufficient to distinguish the players). In reality the stones would have 4 colours (stones are not really numbered, they are all the same for the same player, or there's some specially marked type of stone for each player in addition to their normal set) or sets would have some symbol or dot on top of them. There are also go variants using stones that take a territory and block the position but that cannot be taken (both players can use them, but the territory taken is not counted for any player). These stones can also be placed randomly over the board at the start of the game to complicate it, or there's a limited set of blocking stones for each player, who can choose when to play them instead of standard stones. Those blocking stones are visually distinct, but identical for the two players that have them at the start of the game. Although the classic rules of go are extremely simple, this game has a lot of variants. In fact many players who don't know the exact classic rules are inventing their own variant. From asmus-inc at ix.netcom.com Fri Mar 18 13:58:43 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 18 Mar 2016 11:58:43 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: <56EC4FE3.70704@ix.netcom.com> An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Mar 18 14:33:20 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 18 Mar 2016 20:33:20 +0100 (CET) Subject: Swapcase for Titlecase characters In-Reply-To: <56EBB1BC.7040107@it.aoyama.ac.jp> References: <56EBB1BC.7040107@it.aoyama.ac.jp> Message-ID: <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> On Fri, Mar 18, 2016, 08:43:56, Martin J.
Dürst wrote: > I'm working on extending the case conversion methods for the programming > language Ruby from the current ASCII only to cover all of Unicode. > > Ruby comes with four methods for case conversion. Three of them, upcase, > downcase, and capitalize, are quite clear. But we have hit a question > for the fourth method, swapcase. > > What swapcase does is swap upper and lower case, so that e.g. > > 'Unicode Standard'.swapcase => 'uNICODE sTANDARD' > > I'm not sure myself where this method is actually used, but it also > exists in Python (and maybe Ruby got it from there). > > > Now the question I have is: What to do for titlecase characters? Several > possibilities already have been floated: > > a) Leave as is, because they are neither upper nor lower case. > > b) Convert to upper (or lower), which may simplify implementation. > > c) Decompose the character into upper and lower case components, and > apply swapcase to these. > > > For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or > 'ǆinsi') with b), and 'dŽINSI' with c). For another example, '???' would > become '???' with a), '????' (or '???') with b), and '????' with c). > > It looks like Python 3 (3.4.3 in my case) is doing a). My guess is that > from a user expectation point of view, c) is best, so I'm tending to go > for c). There is no existing data from the Unicode Standard for this, > but it seems pretty straightforward. > > But before I just implement something, I'd appreciate additional input, > in particular from users closer to the affected language communities. As far as I can tell from my limited experience, the swapcase method is used only to convert "inverted titlecase" to titlecase. I call "inverted titlecase" the state of text produced by keyboard input while the caps lock toggle is accidentally on, and those words are "inversely capitalized" where the user pressed the shift modifier. Therefore such examples would be most useful.
Having said that, I know that this never occurs on many keyboards of English-speaking users who remapped that key to perform another action such as backspace, compose, or kana lock. Living myself in a country where the caps lock toggle is indispensable, I may be considered part of the target user communities, though unfortunately I don't speak Croatian or Greek. Looking at your examples, I would add a case that typically occurs for swapcase to be applied: "???" (cited [erroneously] as a result of option b) that is to be converted to "???", and "?INSI", that is to become "?insi". As for decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, as it's unnecessary and users won't expect it. I hope that helps. Kind regards, Marcel From gwalla at gmail.com Fri Mar 18 14:55:45 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 18 Mar 2016 12:55:45 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy wrote: > 2016-03-18 19:11 GMT+01:00 Garth Wallace : >> >> > The issues with line breaking (if you can use these combining around all >> > characters, including spaces) can be solved using unbreakable >> > characters. >> >> Line breaking isn't really a problem that I can see with the Quivira >> model. If they're given the usual line breaking properties for >> symbols, the Unicode line breaking algorithm would prevent a break >> between halves. East Asian vertical text is another story. In a font >> that just uses kerning to join halves (as Quivira does) you'd end up >> with the left half on top of the right in vertical text. I'm not sure >> how ligatures are handled in vertical text.
> > > East Asian vertical presentation does not just stack the elements on top of > each other, very frequently they rotate them (including Latin/Greek/Cyrillic > letters) So this is not really a new complication. True. I suppose if the half-enclosed digits were defined as halfwidth, it would work. It makes intuitive sense too, if a complete numbered circle is assumed to fill an ideographic cell. I'm not sure if rotation of the numbers would be desired, though. > The numbers however are used for noting or commenting a strategy, or the > placement order during a party. > > However for game notations purpose, rotation plays a significant role > (notably if those two part symbols are joined in a circle or disc: it can > make the difference between several distinct sets of stones, or it could be > used in a 4-players go variant (where black vs. white is not sufficient to > distinguish the players). In reality the stones would have 4 colours (stones > are not really numbered, > they are all the same for the same player, or there's some special marked > type of stone for each player in addition to their normal set) or sets would > have some symbol or dot on top of them. Rotation is definitely not salient in standard go kifu like it is in fairy chess notation. Go variants for more than 2 players are uncommon enough that I don't think any sort of standardized notation exists. > There are also go variants using stones that take a territory and block the > position but that cnanot be taken (both players can use them, but the > territory taken is not counted for any player. > These stones can also be placed randomly at start of the party over the > board to complicate the game, or there's a limited set of blocking stones > for each player that an choose when to play them instead of standard stones. > Those blocking stones are visually distinct, but identical for the two > players that have them at start of the party. Do you have any links? I'm interested in game design. 
> Although the classic rules of go are extremely simple, this game has a lot > of variants. In fact many players that don't know the exact classic rules > are inventing their own variant. These are generally one-off inventions (or commercial products) so I don't think there's much need to consider their hypothetical variations on notation. From asmus-inc at ix.netcom.com Fri Mar 18 15:11:54 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 18 Mar 2016 13:11:54 -0700 Subject: Swapcase for Titlecase characters In-Reply-To: <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> References: <56EBB1BC.7040107@it.aoyama.ac.jp> <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> Message-ID: <56EC610A.4080702@ix.netcom.com> An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Fri Mar 18 15:14:53 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 18 Mar 2016 13:14:53 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: <56EC61BD.6010809@ix.netcom.com> An HTML attachment was scrubbed... URL: From mark at macchiato.com Fri Mar 18 15:23:20 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 19 Mar 2016 04:23:20 +0800 Subject: Swapcase for Titlecase characters In-Reply-To: <56EC610A.4080702@ix.netcom.com> References: <56EBB1BC.7040107@it.aoyama.ac.jp> <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> <56EC610A.4080702@ix.netcom.com> Message-ID: The 'swapcase' just sounds bizarre. What on earth is it for? My inclination would be to just do the simplest possible implementation that has the expected results for the 1:1 case pairs, and whatever falls out from the algorithm for the others. 
Mark On Sat, Mar 19, 2016 at 4:11 AM, Asmus Freytag (t) wrote: > On 3/18/2016 12:33 PM, Marcel Schneider wrote: > > As about decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, as it?s unnecessary and users won?t expect it. > > > That was my intuition as well, but based on a different line of argument. > If you add a feature to match behavior somewhere else, it rarely pays to > make that perform "better", because it just means it's now different and no > longer matches. > > The exception is a feature for which you can establish unambiguously that > there is a metric of correctness or a widely (universally?) shared > expectation by users as to the ideal behavior. In that case, being > compatible with a broken feature (or a random implementation of one) may in > fact be counter productive. > > The mere fact that you needed to ask here made me think that this would be > unlikely to be one of those exceptions: because in that case, you would > have easily be able to tap into a consensus that tells you what "better" > means. (And it the feature would probably have been more widely > implemented). > > This one is pretty bizarre on the face of it, but I like Marcel's > suggestion as to its putative purpose. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 18 16:19:19 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 18 Mar 2016 22:19:19 +0100 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: <56EC4FE3.70704@ix.netcom.com> References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> <56EC4FE3.70704@ix.netcom.com> Message-ID: Sequences were introduced long before. I know that they add their own complications everywhere, but they are already part of existing algorithms. 
If sequences (not just combining sequences) were not there, there would be many more characters encoded in the database and everything would be encoded like sinograms (mostly one character per composite glyph). 2016-03-18 19:58 GMT+01:00 Asmus Freytag (t) : > On 3/18/2016 11:11 AM, Garth Wallace wrote: > > The enclosure could also be something else than a circle (or arcs of > circle): it could be a rectangle, hintable with joiners (like with circles) > to create an enclosing square, or a rounded rectangle (hintable to create a > rounded square). > > I thought combining characters would not be suitable for things like > white text on black. > > > Philippe seems to have an appetite for combining sequences that's not > shared by the UTC. > > A./ > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Fri Mar 18 18:49:33 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 18 Mar 2016 16:49:33 -0700 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On Thu, Mar 17, 2016 at 9:18 PM, Garth Wallace wrote: > There's another strategy for dealing with enclosed numbers, which is > taken by the font Quivira in its PUA: encoding separate > left-half-circle-enclosed and right-half-circle-enclosed digits. This > would require 20 characters to cover the double digit range 00–99. > Enclosed three digit numbers would require an additional 30 for left, > center, and right thirds, though it may be possible to reuse the left > and right half circle enclosed digits and assume that fonts will > provide left half-center third-right half ligatures (Quivira provides > "middle parts" though the result is a stadium instead of a true > circle). It should be possible to do the same for enclosed ideographic > numbers, I think.
> > The problems I can see with this are confusability with the already > encoded atomic enclosed numbers, and breaking in vertical text. Correction: the 2-digit pairs would require 19 characters. There would be no need for a left half circle enclosed digit one, since the enclosed numbers 10?19 are already encoded. This would only leave enclosed 20 as a potential confusable. There would also be no need for a left third digit zero, saving one code point if the thirds are not unified with the halves, so there would be 29 thirds. And just to clarify, there would have to be separate half cirlced and negative half circled digits. So that would be 96 characters altogether, or 58 if left and right third-circles are unified with their half-circle equivalents. Not counting ideographic numbers. This may not work very well for ideographic numbers though. In the examples, they appear to be written vertically within their circles (AFAICT none of the moves in those diagrams are numbered 100 or above, although some are hard to read). From duerst at it.aoyama.ac.jp Sat Mar 19 01:05:59 2016 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Sat, 19 Mar 2016 15:05:59 +0900 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: <56ECEC47.8020008@it.aoyama.ac.jp> On 2016/03/19 04:55, Garth Wallace wrote: > On Fri, Mar 18, 2016 at 11:48 AM, Philippe Verdy wrote: >> 2016-03-18 19:11 GMT+01:00 Garth Wallace : > Rotation is definitely not salient in standard go kifu like it is in > fairy chess notation. Go variants for more than 2 players are uncommon > enough that I don't think any sort of standardized notation exists. The most frequent way to play Go with more than two players is to play in two teams, the players in each team taking turns when it's time for their team to play. 
But there's no need for any special notation for this case. Regards, Martin. From andrewcwest at gmail.com Sat Mar 19 06:09:49 2016 From: andrewcwest at gmail.com (Andrew West) Date: Sat, 19 Mar 2016 11:09:49 +0000 Subject: Purpose of and rationale behind Go Markers U+2686 to U+2689 In-Reply-To: References: <56E0A932.8010909@att.net> <56E11BA9.4030703@it.aoyama.ac.jp> <56E9613A.4030605@gmail.com> Message-ID: On 18 March 2016 at 23:49, Garth Wallace wrote: > > Correction: the 2-digit pairs would require 19 characters. There would > be no need for a left half circle enclosed digit one, since the > enclosed numbers 10–19 are already encoded. This would only leave > enclosed 20 as a potential confusable. There would also be no need for > a left third digit zero, saving one code point if the thirds are not > unified with the halves, so there would be 29 thirds. > > And just to clarify, there would have to be separate half circled and > negative half circled digits. So that would be 96 characters > altogether, or 58 if left and right third-circles are unified with > their half-circle equivalents. Not counting ideographic numbers. Thanks for your suggestion, I have added two new options to my draft proposal, one based on your suggestion (60 characters: 10 left, 10 middle and 10 right for normal and negative circles) and one more verdyesque (four enclosing circle format characters). To be honest, I don't think the UTC will go for either of these options, but I doubt they will be keen to accept any of the suggested options. > This may not work very well for ideographic numbers though. In the > examples, they appear to be written vertically within their circles > (AFAICT none of the moves in those diagrams are numbered 100 or above, > although some are hard to read). I have now added an example with circled ideographic numbers greater than 100. See Fig.
13 in http://www.babelstone.co.uk/Unicode/GoNotation.pdf In this example, numbers greater than 100 are written in two columns within the circle, with hundreds on the right. Andrew From duerst at it.aoyama.ac.jp Sat Mar 19 06:54:51 2016 From: duerst at it.aoyama.ac.jp (Martin J. Dürst) Date: Sat, 19 Mar 2016 20:54:51 +0900 Subject: Swapcase for Titlecase characters In-Reply-To: <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> References: <56EBB1BC.7040107@it.aoyama.ac.jp> <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> Message-ID: <56ED3E0B.7030207@it.aoyama.ac.jp> Thanks everybody for the feedback. On 2016/03/19 04:33, Marcel Schneider wrote: > On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: >> b) Convert to upper (or lower), which may simplify implementation. >> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or >> 'ǆinsi') with b), and 'dŽINSI' with c). For another example, '???' would >> become '???' with a), '????' (or '???') with b), and '????' with c). > Looking at your examples, I would add a case that typically occurs for swapcase to be applied: > "???" (cited [erroneously] as a result of option b) that is to be converted to "???", and "?INSI", that is to become "?insi". First, what do you mean by "erroneously"? Second, did I get this right that your additional case (let's call it d)) would cycle through the three options where available: lower -> title -> upper -> lower. > As for decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, > as it's unnecessary and users won't expect it. Why do you say "users won't expect it"? For those users not aware of the encoding internals, I'd indeed guess that's what users would expect, at least in the Croatian case. For Greek, it may be different; it depends on the extent to which the iota is seen as a letter vs. seen as a mark. Regards, Martin.
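The three options a), b) and c) floated in this thread can be compared side by side with a small Python sketch. This is only an illustration under stated assumptions, not what Ruby or any other implementation actually does: the function names are hypothetical, titlecase characters are identified by general category Lt, option b) is shown in its "upper" flavour only, and option c) takes the components from the one-step decomposition field of UnicodeData as exposed by `unicodedata.decomposition`.

```python
import unicodedata


def _is_titlecase(ch):
    # Assumption: "titlecase character" means general category Lt.
    return unicodedata.category(ch) == "Lt"


def _components(ch):
    # Parse the decomposition field (e.g. "<compat> 0044 017E"),
    # dropping a tag such as "<compat>" if one is present.
    fields = unicodedata.decomposition(ch).split()
    parts = "".join(chr(int(f, 16)) for f in fields if not f.startswith("<"))
    return parts or ch  # fall back to the character itself if nothing is listed


def swapcase_keep(text):
    """Option a): leave titlecase characters untouched."""
    return "".join(ch if _is_titlecase(ch) else ch.swapcase() for ch in text)


def swapcase_upper(text):
    """Option b), upper flavour: map titlecase characters to upper case."""
    return "".join(ch.upper() if _is_titlecase(ch) else ch.swapcase() for ch in text)


def swapcase_decompose(text):
    """Option c): split a titlecase character into components, then swap each."""
    return "".join(
        _components(ch).swapcase() if _is_titlecase(ch) else ch.swapcase()
        for ch in text
    )
```

For the word spelled U+01C5 followed by "insi", the three functions should give U+01C5 + "INSI", U+01C4 + "INSI", and "d" + U+017D + "INSI" respectively, which matches the Latin example discussed above. The Greek case depends on how the combining ypogegrammeni is cased, so it is deliberately left out of this sketch.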
From doug at ewellic.org Sat Mar 19 11:40:06 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 19 Mar 2016 10:40:06 -0600 Subject: Swapcase for Titlecase characters Message-ID: Martin J. Dürst wrote: > Now the question I have is: What to do for titlecase characters? > [ ... ] > For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or > 'ǆinsi') with b), and 'dŽINSI' with c). For the Latin letters at least, my 0.02 cents' worth (you read that right) is that they are probably so infrequently used that option (b) would be just fine. As one anecdote (which is even less like "data" than two anecdotes), I could not find any of the characters ? ? ? ? ? ? ? ? ? ? ? or their hex equivalents in any of the CLDR keyboard definitions. I'd imagine that users just type the two characters separately, and that consequently most data in the real world is like that. -- Doug Ewell | http://ewellic.org | Thornton, CO From charupdate at orange.fr Sat Mar 19 11:40:43 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 19 Mar 2016 17:40:43 +0100 (CET) Subject: Swapcase for Titlecase characters In-Reply-To: <56ED3E0B.7030207@it.aoyama.ac.jp> References: <56EBB1BC.7040107@it.aoyama.ac.jp> <2117274786.12214.1458329600190.JavaMail.www@wwinf2214> <56ED3E0B.7030207@it.aoyama.ac.jp> Message-ID: <295680346.8104.1458405643417.JavaMail.www@wwinf1g23> On Sat Mar 19, 2016 12:54:51, Martin J. Dürst wrote: > On 2016/03/19 04:33, Marcel Schneider wrote: > > On Fri, Mar 18, 2016, 08:43:56, Martin J. Dürst wrote: > > >> b) Convert to upper (or lower), which may simplify implementation. > > >> For example, 'ǅinsi' (jeans) would become 'ǅINSI' with a), 'ǄINSI' (or > >> 'ǆinsi') with b), and 'dŽINSI' with c). For another example, '???' would > >> become '???' with a), '????' (or '???') with b), and '????' with c). > > > Looking at your examples, I would add a case that typically occurs for swapcase to be applied: > > > "???"
(cited [erroneously] as a result of option b) that is to be converted to "???", and "?INSI", that is to become "?insi". > > First, what do you mean with "erroneously"? The intent of that bracketed word was just to account for the fact that when "???" is converted to lower case as assumed in option "b-lower", it becomes "???", while "???" is a typical candidate for swapcase, thus I could reutilize it "as is" to illustrate the fourth case. > > Second, did I get this right that your additional case (let's call it > d)) would cycle through the three options where available: > lower -> title -> upper -> lower. I'm afraid that swapcase as I saw it is not a roundtrip method, therefore I had some awkward moments today when I thought about how to implement it. As far as I could see, there are two pairs: I: lowercase → titlecase (needed to correct the initials where the user pressed the shift modifier) II: uppercase → lowercase (needed to correct the body of the words input while caps lock was on) That typically matches what happens when caps lock is accidentally on and the user writes normally, on a keyboard that includes digraphs and uses the SGCaps feature for them, like this:
Modifier; None; Shift
CapsLock off; Lower; Title
CapsLock on; Upper; Lower
Correcting keyboard input done with the wrong caps lock state is the only situation I can see where swapcase is needed and thus is likely to be used. This is why the swapcase method is implemented in word processors, as a part of an optional autocorrect feature that neutralizes the effect of starting a sentence normally while caps lock is on: After completing the input of an uppercase word with an initial lowercase letter, the word is automatically swapcased and caps lock is turned off.
However now that I tested it with the digraph of the examples (input through the composer of the keyboard layout), it doesn't work at all in one word processor, while in another one it works but uppercases the initial lowercase digraph instead of titlecasing it. [That may be considered effects of "streamlined" implementations that drop the less frequent cases.] I don't believe that it would be useful to make swapcase a roundtrip method, and anyway it would be weird because of the letters with three case forms. The case conversion cycle you draw above usually applies to words (and doesn't work correctly in either of the two tested word processors when an initial ǆ digraph is present), while most letters have identical values for Titlecase_Mapping and Uppercase_Mapping, and usually there is no means to flag them with "Titlecase_State". This might be one more reason why current implementations of swapcase don't match the expected behavior for digraphs. > > > As about decomposing digraphs and ypogegrammeni to apply swapcase: That probably would be doing no good, > > as it's unnecessary and users won't expect it. > > Why do you say "users won't expect it"? For those users not aware of the > encoding internals, I'd indeed guess that's what users would expect, at > least in the Croatian case. That depends on what is the expected result. If the swapcase method is to correct inverted casing, users wouldn't like to see the digraphs decomposed, all the less so because in the considered languages, the ǆ digraph is a part of the alphabet between "D" and "Đ", so that users are really aware. > For Greek, it may be different; it depends > on the extent to which the iota is seen as a letter vs. seen as a mark. Here again the user inputs a precomposed letter, with iota subscript because he just wants a capitalized word, not an uppercase one.
And here again the autocorrect doesn't work in one word processor, while in the other one it applies uppercasing with uppercase iota adscript (while the rest of the word is lowercase) instead of capitalization, with lowercase iota adscript or iota subscript, depending on conventions and preferences. Let's take that as proof of how hard it is to implement swapcase with digraph support. I can't better conclude this reply than with Asmus Freytag's words on Fri, 1st Jan 2016 12:09:13 -0800: [1] > Unicode aims to be expressive enough to model all plain text. That means, it inherits the non-reducible complexity of text. Even the insight that the complexity is non-reducible would be a big step forward. Regards, Marcel [1] Re: Unicode in the Curriculum? from Asmus Freytag (t) on 2016-01-01. http://www.unicode.org/mail-arch/unicode-ml/y2016-m01/0001.html From dzo at bisharat.net Sat Mar 19 12:52:30 2016 From: dzo at bisharat.net (Don Osborn) Date: Sat, 19 Mar 2016 13:52:30 -0400 Subject: Re: Joined "ti" coded as "Ɵ" in PDF In-Reply-To: References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> Message-ID: <56ED91DE.5080700@bisharat.net> Thanks Andrew, Looking at the issue of ToUnicode mapping you mention, why in the 1-many mapping of ligatures (for fonts that have them) do the "many" not simply consist of the characters ligated? Maybe that's too simple (my understanding of the process is clearly inadequate). The "string of random ASCII characters" (per Leonardo) used in the Identity H system for hanzi raises other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point?
The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I'm wondering, and concerned, about how the font & mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable by normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If the latter, the software at the very least should show a caveat about such mappings when generating PDFs. Maybe it's unrealistic to expect a simple implication of Unicode in PDFs (a topic we've discussed before but which I admit to not fully grasping). I recall once having some wild results copy/pasting from an N'Ko PDF, and ending up having to obtain the .docx original to get text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts from PDFs, I'd come to expect predictability when dealing with most Latin text. Don On 3/17/2016 7:34 PM, Andrew Cunningham wrote: > > There are a few things going on. > > In the first instance, it may be the font itself that is the source of > the problem. > > My understanding is that PDF files contain a sequence of glyphs. A PDF > file will contain a ToUnicode mapping between glyphs and codepoints. > This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping > provides support for ligatures and variation sequences. > > I assume it uses the data in the font's cmap table. If the ligature > isn't mapped then you will have problems. I guess the problem could > be either the font or the font subsetting and embedding performed when > the PDF is generated. > > Although, it is worth noting that in opentype fonts not all glyphs > will have mappings in the cmap file.
> > The remedy, is to extensively tag the PDF and add ActualText > attributes to the tags. > > But the PDF specs leave it up to the developer to decide what happens > in there is both a visible text layer and ActualText. So even in an > ideal PDF, tesults will vary from software to software when copying > text or searching a PDF. > > At least thatsmy current understanding. > > Andrew > > On 18 Mar 2016 7:47 am, "Don Osborn" > wrote: > > Thanks all for the feedback. > > Doug, It may well be my clipboard (running Windows 7 on this > particular laptop). Get same results pasting into Word and EmEditor. > > So, when I did a web search on "interna?onal," as previously > mentioned, and come up with a lot of results (mostly PDFs), were > those also a consequence of many not fully Unicode compliant > conversions by others? > > A web search on what you came up with - "Interna??onal" - yielded > many more (82k+) results, again mostly PDFs, with terms like > "interna onal" (such as what Steve noted) and "interna perhaps others (given the nature of, or how Google interprets, the > private use character?). > > Searching within the PDF document already mentioned, > "international" comes up with nothing (which is a major fail as > far as usability). Searching the PDF in a Firefox browser window, > only "interna?onal" finds the occurrences of what displays as > "international." However after downloading the document and > searching it in Acrobat, only a search for "interna??onal" will > find what displays as "international." > > A separate web search on "E?ects" came up with 300+ results, > including some GoogleBooks which in the texts display "effects" > (as far as I checked). So this is not limited to Adobe? > > J?rg, With regard to "Identity H," a quick search gives the > impression that this encoding has had a fairly wide and not so > happy impact, even if on the surface level it may have facilitated > display in a particular style of font in ways that no one > complains about. 
> > Altogether a mess, from my limited encounter with it. There must > have been a good reason for or saving grace of this solution? > > Don > > On 3/17/2016 2:17 PM, Steve Swales wrote: > > Yes, it seems like your mileage varies with the PDF > viewer/interpreter/converter. Text copied from Preview on the > Mac replaces the ti ligature with a space. Certainly not a > Unicode problem, per se, but an interesting problem nevertheless. > > -steve > > On Mar 17, 2016, at 11:11 AM, Doug Ewell > wrote: > > Don Osborn wrote: > > Odd result when copy/pasting text from a PDF: For some > reason "ti" in > the (English) text of the document at > http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf > is coded as "?". Looking more closely at the original > text, it does > appear that the glyph is a "ti" ligature (which afaik > is not coded as > such in Unicode). > > When I copy and paste the PDF text in question into > BabelPad, I get: > > Interna??onal Order and the Distribu??on of Iden??ty > in 1950 (By > invita??on only) > > The "ti" ligatures are implemented as U+10019F, a Plane 16 > private-use > character. > > Truncating this character to 16 bits, which is a Bad > Thing?, yields > U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it > looks like either > Don's clipboard or the editor he pasted it into is not fully > Unicode-compliant. > > Don's point about using alternative characters to > implement ligatures, > thereby messing up web searches, remains valid. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jcb+unicode at inf.ed.ac.uk Sat Mar 19 17:30:11 2016 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Sat, 19 Mar 2016 22:30:11 +0000 (GMT) Subject: Joined "ti" coded as =?UTF-8?Q?=22=C6=9F=22?= in PDF References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> Message-ID: On 2016-03-19, Don Osborn wrote: > The details may or may not be relevant to the list topic, but as a user > of documents in PDF format, I fail to see the benefit of such obscure > mappings. And as a creator of PDFs ("save as") looking at others' PDFs Aren't you just being bitten by history? PDF derives from PostScript, which is not a language for representing plain text with typesetting information, but a language for type(and-graphic-)setting tout court. There's a lot of history of fonts using arbitrary codepoints; the idea that the underlying strings giving rise to the displayed graphics should also be a good plain text representation of the information is relatively novel. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From lang.support at gmail.com Sat Mar 19 18:06:29 2016 From: lang.support at gmail.com (Andrew Cunningham) Date: Sun, 20 Mar 2016 10:06:29 +1100 Subject: =?UTF-8?B?UmU6IEpvaW5lZCAidGkiIGNvZGVkIGFzICLGnyIgaW4gUERG?= In-Reply-To: <56ED91DE.5080700@bisharat.net> References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> Message-ID: Hi Don, Latin is fine if you keep to simple well made fonts and avoid using more sophisticated typographic features available in some fonts. Dumb it down typographically and it works fine. 
PDF, despite all the current rhetoric coming from PDF software developers, is a preprint format, not an archival format.

The PDF format is less than ideal. But it is widely used, often in ways the format was never really created for.

There are alternatives that preserve the text, but for various reasons they have never really taken off (compared to PDF).

Andrew

On Sunday, 20 March 2016, Don Osborn wrote:
> Thanks Andrew,
>
> Looking at the issue of ToUnicode mapping you mention, why in the 1-many mapping of ligatures (for fonts that have them) do the "many" not simply consist of the characters ligated? Maybe that's too simple (my understanding of the process is clearly inadequate).
>
> The "string of random ASCII characters" (per Leonardo) used in the Identity H system for hanzi raises other questions: (1) How are the ASCII characters interpreted as a 1-many sequence representing a hanzi rather than just a series of 1-1 mappings of themselves? (2) Why not just use the Unicode code point?
>
> The details may or may not be relevant to the list topic, but as a user of documents in PDF format, I fail to see the benefit of such obscure mappings. And as a creator of PDFs ("save as") looking at others' PDFs I've just encountered with these mappings, I'm wondering how the font & mapping results turned out as they did. It is certain that the creators of the documents didn't intend results that would not be searchable by normal text, but it seems possible that a particular font choice with these ligatures unwittingly produced these results. If the latter, the software at the very least should show a caveat about such mappings when generating PDFs.
>
> Maybe it's unrealistic to expect a simple implementation of Unicode in PDFs (a topic we've discussed before but which I admit to not fully grasping). Recalling I once had some wild results copy/pasting from an N'Ko PDF, and ended up having to obtain the .docx original to obtain text for insertion in a blog posting. But while it's not surprising to encounter issues with complex non-Latin scripts from PDFs, I'd gotten to expect predictability when dealing with most Latin text.
>
> Don
>
> On 3/17/2016 7:34 PM, Andrew Cunningham wrote:
> > There are a few things going on.
> >
> > In the first instance, it may be the font itself that is the source of the problem.
> >
> > My understanding is that PDF files contain a sequence of glyphs. A PDF file will contain a ToUnicode mapping between glyphs and codepoints. This is either a 1-1 mapping or a 1-many mapping. The 1-many mapping provides support for ligatures and variation sequences.
> >
> > I assume it uses the data in the font's cmap table. If the ligature isn't mapped then you will have problems. I guess the problem could be either the font or the font subsetting and embedding performed when the PDF is generated.
> >
> > Although, it is worth noting that in OpenType fonts not all glyphs will have mappings in the cmap table.
> >
> > The remedy is to extensively tag the PDF and add ActualText attributes to the tags.
> >
> > But the PDF specs leave it up to the developer to decide what happens when there is both a visible text layer and ActualText. So even in an ideal PDF, results will vary from software to software when copying text or searching a PDF.
> >
> > At least that's my current understanding.
> >
> > Andrew
> >
> > On 18 Mar 2016 7:47 am, "Don Osborn" wrote:
>>
>> Thanks all for the feedback.
>>
>> Doug, It may well be my clipboard (running Windows 7 on this particular laptop). Get same results pasting into Word and EmEditor.
>>
>> So, when I did a web search on "internaƟonal," as previously mentioned, and came up with a lot of results (mostly PDFs), were those also a consequence of many not fully Unicode-compliant conversions by others?
>> >> A web search on what you came up with - "Interna??onal" - yielded many more (82k+) results, again mostly PDFs, with terms like "interna onal" (such as what Steve noted) and "interna> >> Searching within the PDF document already mentioned, "international" comes up with nothing (which is a major fail as far as usability). Searching the PDF in a Firefox browser window, only "interna?onal" finds the occurrences of what displays as "international." However after downloading the document and searching it in Acrobat, only a search for "interna??onal" will find what displays as "international." >> >> A separate web search on "E?ects" came up with 300+ results, including some GoogleBooks which in the texts display "effects" (as far as I checked). So this is not limited to Adobe? >> >> J?rg, With regard to "Identity H," a quick search gives the impression that this encoding has had a fairly wide and not so happy impact, even if on the surface level it may have facilitated display in a particular style of font in ways that no one complains about. >> >> Altogether a mess, from my limited encounter with it. There must have been a good reason for or saving grace of this solution? >> >> Don >> >> On 3/17/2016 2:17 PM, Steve Swales wrote: >>> >>> Yes, it seems like your mileage varies with the PDF viewer/interpreter/converter. Text copied from Preview on the Mac replaces the ti ligature with a space. Certainly not a Unicode problem, per se, but an interesting problem nevertheless. >>> >>> -steve >>> >>>> On Mar 17, 2016, at 11:11 AM, Doug Ewell wrote: >>>> >>>> Don Osborn wrote: >>>> >>>>> Odd result when copy/pasting text from a PDF: For some reason "ti" in >>>>> the (English) text of the document at >>>>> http://web.isanet.org/Web/Conferences/Atlanta%202016/Atlanta%202016%20-%20Full%20Program.pdf >>>>> is coded as "?". Looking more closely at the original text, it does >>>>> appear that the glyph is a "ti" ligature (which afaik is not coded as >>>>> such in Unicode). 
>>>> >>>> When I copy and paste the PDF text in question into BabelPad, I get: >>>> >>>>> Interna??onal Order and the Distribu??on of Iden??ty in 1950 (By >>>>> invita??on only) >>>> >>>> The "ti" ligatures are implemented as U+10019F, a Plane 16 private-use >>>> character. >>>> >>>> Truncating this character to 16 bits, which is a Bad Thing?, yields >>>> U+019F LATIN CAPITAL LETTER O WITH MIDDLE TILDE. So it looks like either >>>> Don's clipboard or the editor he pasted it into is not fully >>>> Unicode-compliant. >>>> >>>> Don's point about using alternative characters to implement ligatures, >>>> thereby messing up web searches, remains valid. >>>> >>>> -- >>>> Doug Ewell | http://ewellic.org | Thornton, CO ???? >>>> >>>> >>> >> > > -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Sun Mar 20 02:11:22 2016 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sun, 20 Mar 2016 08:11:22 +0100 Subject: Joined "ti" coded as =?utf-8?b?IsafIg==?= in PDF In-Reply-To: References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> Message-ID: <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> Quote/Cytat - Andrew Cunningham (Sun 20 Mar 2016 12:06:29 AM CET): > Hi Don, > > Latin is fine if you keep to simple well made fonts and avoid using more > sophisticated typographic features available in some fonts. > > Dumb it down typographically and it works fine. PDF, despite all the > current rhetoric coming from PDF software developers, is a preprint format. > Not an archival format. What about PDF/A, ISO 19005-1:2005 Document Management ? Electronic document file format for long term preservation? Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From lang.support at gmail.com Sun Mar 20 03:57:38 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sun, 20 Mar 2016 19:57:38 +1100
Subject: Joined "ti" coded as "Ɵ" in PDF
In-Reply-To: <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
Message-ID:

Janusz,

It is all smoke and mirrors.

For English ... you have to choose the right font. Simple, no advanced features ... disable advanced typographic features in the application if you can. Ensure the cmap table in the font is sufficiently comprehensive ...

The issues Don raises still exist in PDF/A. You would need to make fundamental changes to the PDF spec for it to work for any language.

For other languages, especially those in complex scripts, the situation is more dire ... especially when glyphs have been reordered.

The accepted workaround is ActualText. But you don't necessarily need ActualText. It depends on the font and the language. The rub is that it is left to implementors to decide if and when the ActualText is used.

All aspects of the document ecosystem need to be looked at, including which tools can use ActualText instead of the visible text layer.

The PDF/UA spec is probably closer to the mark than the PDF/A spec.

But since most archives have no control over PDF production, authors' or publishers' font selection, tools used, etc., working with PDFs can be fairly hit and miss. For languages written in complex scripts, it's usually a miss rather than a hit.

I rarely see ActualText in PDF files, even in those that need it.

Andrew

On Sunday, 20 March 2016, Janusz S.
Bien wrote: > Quote/Cytat - Andrew Cunningham (Sun 20 Mar 2016 12:06:29 AM CET): > >> Hi Don, >> >> Latin is fine if you keep to simple well made fonts and avoid using more >> sophisticated typographic features available in some fonts. >> >> Dumb it down typographically and it works fine. PDF, despite all the >> current rhetoric coming from PDF software developers, is a preprint format. >> Not an archival format. > > What about PDF/A, ISO 19005-1:2005 Document Management ? Electronic document file format for long term preservation? > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > > -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From otto.stolz at uni-konstanz.de Sun Mar 20 11:03:17 2016 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sun, 20 Mar 2016 17:03:17 +0100 Subject: Swapcase for Titlecase characters In-Reply-To: References: Message-ID: <56EEC9C5.2030104@uni-konstanz.de> Hello, Am 19.03.2016 um 17:40 schrieb Doug Ewell: > As one anecdote (which is even less like "data" than two anecdotes), I > could not find any of the characters ? ? ? ? ? ? ? ? ? ? ? or their hex > equivalents in any of the CLDR keyboard definitions. I'd imagine that > users just type the two characters separately, and that consequently > most data in the real world is like that. For ?IJ?, cf. . Regards, Otto From doug at ewellic.org Sun Mar 20 13:09:44 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 20 Mar 2016 12:09:44 -0600 Subject: Swapcase for Titlecase characters In-Reply-To: References: Message-ID: Otto Stolz wrote: >> [ ... 
] I'd imagine that users just type the two characters
>> [IJ or ij] separately, and that consequently most data in the real
>> world is like that.
>
> For "IJ",
> cf. .

I can't make Edge or Acrobat Reader DC jump to the bookmark (suggestions off-list, please), but I guess Otto referred to this passage, which ends with the point I was trying to make:

> Another pair of characters, U+0133 LATIN SMALL LIGATURE IJ and its
> uppercase version, was provided to support the digraph "ij" in Dutch,
> often termed a "ligature" in discussions of Dutch orthography. When
> adding intercharacter spacing for line justification, the "ij" is kept
> as a unit, and the space between the i and j does not increase. In
> titlecasing, both the i and the j are uppercased, as in the word
> "IJsselmeer." Using a single code point might simplify software
> support for such features; however, because a vast amount of Dutch
> data is encoded without this digraph character, under most
> circumstances one will encounter an <i, j> sequence.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From asmus-inc at ix.netcom.com Sun Mar 20 14:24:13 2016
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 20 Mar 2016 12:24:13 -0700
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
Message-ID: <56EEF8DD.2090808@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From tom at bluesky.org Sun Mar 20 14:52:09 2016
From: tom at bluesky.org (Tom Gewecke)
Date: Sun, 20 Mar 2016 12:52:09 -0700
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <56EEF8DD.2090808@ix.netcom.com>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com>
Message-ID: <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>

> On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) wrote:
>
> Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document.

My understanding is that PDF/A-1a is supposed to be searchable.

From verdy_p at wanadoo.fr Mon Mar 21 03:40:15 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 21 Mar 2016 09:40:15 +0100
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
Message-ID:

Are those PDFs supposed to be searchable internally? For archival purposes, PDFs are stored in their final form, and search is performed by building a database of descriptive metadata. Whenever one wants formal details, one has to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents that were never digital plain text to begin with: they were printed or typewritten, and frequently include graphics, handwritten signatures, stamped seals...).

Being able to search plain text inside a PDF is not the main objective (and not the priority). Archival, however, is a top priority (and there is no money to finance digitization and no staff available to redo this old work; if needed, other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that have fallen into the public domain).

PDF/A-1a is meant only for creating new documents from an original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken when there is a need to include handwritten or supplementary elements (signatures, seals...) whose source is not the original electronic document but the printed paper on which the annotations were made. It is this paper document, not the electronic document, that is the official final source (we've got some important legal papers whose originals bear other marks, including traces of beer or coffee, or are partly burnt; the paper itself has several alterations, but it is the original "as is", and for legal purposes the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, which are not meant to represent originals accurately).

2016-03-20 20:52 GMT+01:00 Tom Gewecke :
>
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) wrote:
> >
> > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document.
>
> My understanding is that PDF/A-1a is supposed to be searchable.
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Mon Mar 21 11:45:42 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 21 Mar 2016 10:45:42 -0600
Subject: Swapcase for Titlecase characters
Message-ID:

I wrote:

> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters ? ? ? ? ? ? ? ? ? ? ? or their
> hex equivalents in any of the CLDR keyboard definitions. I'd imagine
> that users just type the two characters separately, and that
> consequently most data in the real world is like that.

Some off-list messages have helped to remind me that in the context of titlecase and swapcase, I should not have included Ĳ and ĳ (U+0132 and U+0133) in that list. There is clearly no question about how swapcase should handle those. Sorry for the distraction.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From davidj_faulks at yahoo.ca Fri Mar 25 08:21:50 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Fri, 25 Mar 2016 13:21:50 +0000 (UTC)
Subject: Some advice would be appreciated
References: <699857364.627901.1458912110693.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <699857364.627901.1458912110693.JavaMail.yahoo@mail.yahoo.com>

In putting together things for yet another submission of Astrology symbols, I've been putting aside symbols which seem very rare, or are used by only one or two astrology sites (even if they are more widely known). However, I've also come across symbols which are used by, or originate with, one popular astrology program.
For example, in the common Dutch astrology program Astrolab, I have some samples below:

http://www.zwarte-maan.nl/horoscooppaginas/horoscoop-jpg-scan/jiddukrishnamurti-prog-tabel.jpg
http://www.zwarte-maan.nl/horoscooppaginas/horoscoop-jpg-scan/femkehalsemaprog2007.jpg
http://www.zwarte-maan.nl/horoscooppaginas/horoscoop-jpg-scan/femkehalsemaprog2009.jpg
http://www.zwarte-maan.nl/horoscooppaginas/horoscoop-jpg-scan/sylviamillecam-prog1999.jpg
http://www.zwarte-maan.nl/horoscooppaginas/horoscoop-jpg-scan/bernardhaitink-translijst.jpg

It is pretty common to see versions of the Lunar Node symbols, ☊ U+260A ASCENDING NODE and ☋ U+260B DESCENDING NODE, with "T" inside them (to indicate a "True" calculated location), but Astrolab is the only program I've seen that uses both T and non-T versions of ☊ at the same time.

The "White right-opening crescent on top of a cross" (or white ⚸) symbol is called Priapus, and many astrology programs claim to be able to use it, but it seems that only users of Astrolab do so (and post charts and listings) on a regular basis, for I can find no non-Astrolab samples (except for http://www.astrologyweekly.com/learn-astrology/astrology-glyphs.php). Some Swiss/German astrology programs use a different symbol for Priapus, but I have only seen it in chartwheels, not tables or listings.

For the point known as "Black Moon Lilith" (⚸), Astrolab uses a Black Crescent (like many other astrology programs). This could be treated as a glyph variant, but for the "True" version, a Black Crescent with a bar is used, which would need separate encoding. This crescent with bar seems restricted to Astrolab and a much rarer Dutch astrology program, AstroScoop (which uses reversed crescents, however).

In Russia, a "White Moon Selena" is somewhat popular; the symbol usually looks like a reversed white ⚸. However, it turns out there are at least 3 ways to calculate this point.
The popular ZET astrology program provides separate symbols for them, and I have seen rare charts (with listings) that have 2 of them at the same time.

https://i.ytimg.com/vi/rsjaogbE6f0/maxresdefault.jpg

Also, ZET uses a special symbol for "True" Black Moon Lilith, which looks like a black diamond on top of a cross, or a filled-in Pallas:

http://astrogemma.ru/images/stories/151.gif
http://astropro.ru/img/2ox2kjqsg/jtbfs8iiz6.jpg
http://astropro.ru/img/5chcyvn6jo/rbd3api34.jpg

These symbols are used only in ZET...well, mostly:

http://www.az-planet.ru/download/file.php?id=1347&sid=c8806ef143247a4804220a8aed1d43ad

So, I'm wondering whether this is good enough to justify encoding separate symbols for these cases.

David

From jsbien at mimuw.edu.pl Sat Mar 26 04:10:24 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Sat, 26 Mar 2016 10:10:24 +0100
Subject: NamesList.txt as data source
In-Reply-To: <56E1E9DF.9060405@att.net> (Ken Whistler's message of "Thu, 10 Mar 2016 13:40:47 -0800")
References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net>
Message-ID: <86mvply58f.fsf@mimuw.edu.pl>

On Thu, Mar 10 2016 at 22:40 CET, kenwhistler at att.net writes:

[...]

> The *reason* that NamesList.txt exists at all is to drive the tool, unibook,
> that formats the full Unicode code charts for posting. It is only
> posted in the Unicode Character Database at all as a matter of
> convenience, to give people access to a text only version of the
> names list that appears in the fully formatted pdf versions of the
> code charts that contain all the representative glyphs.
>
> NamesList.txt should *not* be data mined.

I've just noticed that NamesList.txt is in a sense data mined by the Unicode Consortium itself. I mean the "Unicode Utilities: Character Properties", which e.g.
for LATIN SMALL LETTER P WITH FLOURISH (http://unicode.org/cldr/utility/character.jsp?a=A753) display in particular subhead: Medievalist addition Am I right that this information is available only in NamesList.txt? In my opinion this is important information and should be officially available for character data mining engines. Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From doug at ewellic.org Sat Mar 26 22:00:31 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 26 Mar 2016 21:00:31 -0600 Subject: NamesList.txt as data source In-Reply-To: References: Message-ID: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> Janusz Bie? wrote: > Am I right that this information is available only in NamesList.txt? It probably comes from what Ken referred to as "a very long list of annotational material, including names list subhead material, etc., maintained in other sources." If you don't have access to those "other sources," then as far as I can tell, yes, it's available only in NamesList.txt. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From asmus-inc at ix.netcom.com Sat Mar 26 23:38:42 2016 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 26 Mar 2016 21:38:42 -0700 Subject: NamesList.txt as data source In-Reply-To: <86mvply58f.fsf@mimuw.edu.pl> References: <20160310134917.665a7a7059d7ee80bb4d670165c8327d.557a86dfa6.wbe@email03.secureserver.net> <56E1E9DF.9060405@att.net> <86mvply58f.fsf@mimuw.edu.pl> Message-ID: <56F763D2.6060606@ix.netcom.com> An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Mar 27 13:38:53 2016 From: doug at ewellic.org (Doug Ewell) Date: Sun, 27 Mar 2016 12:38:53 -0600 Subject: NamesList.txt as data source Message-ID: Asmus Freytag wrote: > Nobody disputes that subheaders are informative. 
However, subheaders > do not define a character property. Janusz was making a point that the CLDR data sometimes treats them as such, or at least as a kind of supplementary property. > There are several good reasons: > > 1. They do not "classify" characters in a uniform way: For some ranges > they give the purpose for which the character was encoded (as in your > example), for others, they give the type of character (vowel, > consonant), and in some cases they are free of information > ("Miscellaneous addition"). > > 2. Even where they give the purpose for which the character was > encoded, they do not necessarily attest that the characters in that > range are never used for other purposes. > > 3. The information is purely editorial, and as such, changed by the > editors as needed, not assigned as result of a vote in the Unicode > Technical Committee. > > 4. They appear to be more "formal" than they are, just because they > are presented with semantic markup in the input file to the code chart > layout tool; with the file being a rather structured file, only > because it describes a tabular presentation of data. However, see > points (1) through (3) on why this superficial appearance of formality > is misleading. It seems that the main concern about using NamesList.txt to obtain information beyond what is available in other UCD sources is that people might treat that additional information as normative and immutable, when it is not. It is understood that UTC members draw important distinctions between normative and informative material, and between material that is immutable and that which may change over time. For many purposes, these distinctions are crucial. However, there are uses for Unicode character data that do not depend on these distinctions. Often it is simply not a problem if, say, CAT FACE WITH WRY SMILE acquires a new informative cross-reference in one Unicode release, and that cross-reference suddenly changes or disappears in the next release. 
My suggestion to assuage these fears is for UTC to add warnings to the file header (right below "This file is semi-automatically derived...") or to NamesList.html, or both, basically stating that any information in NamesList.txt beyond that which can be found in other UCD files is informative and subject to change without notice. Then the burden, if such it is, will be on users to heed these warnings.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From doug at ewellic.org Sun Mar 27 13:57:48 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 27 Mar 2016 12:57:48 -0600
Subject: NamesList.txt as data source
Message-ID:

I do understand there are some folks, particularly in the media, who don't understand the difference between normative and informative, and treat any information from (or submitted to) the Unicode Consortium as gospel and dictum.

IMHO the explicit health warnings I suggested would be an improvement in this regard over the current unwritten approach of "here's the data, but don't use it unless you're printing charts."

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From verdy_p at wanadoo.fr Sun Mar 27 16:04:55 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 27 Mar 2016 23:04:55 +0200
Subject: NamesList.txt as data source
In-Reply-To:
References:
Message-ID:

On 27 March 2016 at 20:47, "Doug Ewell" wrote:
>
> Asmus Freytag wrote:
>
>> Nobody disputes that subheaders are informative. However, subheaders
>> do not define a character property.
>
> Janusz was making a point that the CLDR data sometimes treats them as such, or at least as a kind of supplementary property.

I'm very curious about where CLDR data depends on these subheaders or other annotations in NamesList.txt...

Subheaders may be used, at most, as named anchors splitting a normative block into several subparts (sometimes with several parts under the same heading), but these subblocks are not normative, notably because they are not correlated with other subblocks in additional blocks. And there is not even any guarantee that characters in these subblocks share some basic property, not even a script type or a general category. These are just anchors for speaking about subblocks, and they relate to the discussions that occurred before these characters were encoded.

If new characters are added later, these existing subblocks won't be sufficient; the new characters will be added in any convenient range available or in a new block. If needed, even these subblocks may be subdivided and thus renamed. None of them are stable.

For CLDR algorithms and data, these headings are not necessary and not used. Instead, character ranges or sets are used, specifying the characters directly, or one or more of their properties in combination, but not this one.

I just hope that there's no algorithm depending on them and treating them as properties (for example in regular expressions with a custom property). If an algorithm must be created, it should define its own named subsets to define its own properties (many UAX algorithms do that constantly, e.g. for text breakers, Bidi, or text transforms).

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Mon Mar 28 06:59:42 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 28 Mar 2016 13:59:42 +0200
Subject: NamesList.txt as data source
In-Reply-To:
References:
Message-ID:

> I'm very curious about where CLDR data depends on these subheaders or other annotations in NamesList.txt

You're right. CLDR data doesn't.
I think there is a misunderstanding because of the online utilities which have been, for convenience, hosted on the same server as the CLDR survey tool. So one sees "cldr" in the following URL, but that doesn't mean a particular association with CLDR.

Example:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{sc=grek}

This just filters characters to those with script = Greek.

The listing has both the block name and the NamesList subhead label in listing characters. One can also use the subhead labels in filtering, e.g.

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Archaic%20letters}

But subheads are *not* Unicode Character Properties. And repeating the caveats expressed earlier, the NamesList data is designed for chart production, not as a reliable source of machine-readable data. While it may be in some cases useful to look at, the subheads are not designed to be a consistent source of data. For example, one couldn't use them effectively to find non-modern-use characters, because different terms are used for that, and the groupings mix in other characters. For example:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=/(?i)historic|archaic|obsolete/}

Other examples: the NamesList data doesn't include all the case mappings, nor all the normative name aliases. It also lists the decomposition mapping, not the canonical and/or compatibility decompositions (which are *not* the same). And so on.

One needs to use the UCD instead of trying to dig this information out of the NamesList.txt file, because such information will be wrong and incomplete.

Mark

On Sun, Mar 27, 2016 at 11:04 PM, Philippe Verdy wrote:
>
> On 27 March 2016 at 20:47, "Doug Ewell" wrote:
> >
> > Asmus Freytag wrote:
> >
> >> Nobody disputes that subheaders are informative. However, subheaders
> >> do not define a character property.
> >
> > Janusz was making a point that the CLDR data sometimes treats them as such, or at least as a kind of supplementary property.
>
> I'm very curious about where CLDR data depends on these subheaders or other annotations in NamesList.txt...
>
> Subheaders may be used, at most, as named anchors splitting a normative block into several subparts (sometimes with several parts under the same heading), but these subblocks are not normative, notably because they are not correlated with other subblocks in additional blocks. And there is not even any guarantee that characters in these subblocks share some basic property, not even a script type or a general category. These are just anchors for speaking about subblocks, and they relate to the discussions that occurred before these characters were encoded.
>
> If new characters are added later, these existing subblocks won't be sufficient; the new characters will be added in any convenient range available or in a new block. If needed, even these subblocks may be subdivided and thus renamed. None of them are stable.
>
> For CLDR algorithms and data, these headings are not necessary and not used. Instead, character ranges or sets are used, specifying the characters directly, or one or more of their properties in combination, but not this one.
>
> I just hope that there's no algorithm depending on them and treating them as properties (for example in regular expressions with a custom property). If an algorithm must be created, it should define its own named subsets to define its own properties (many UAX algorithms do that constantly, e.g. for text breakers, Bidi, or text transforms).
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Mon Mar 28 13:18:25 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 28 Mar 2016 11:18:25 -0700
Subject: NamesList.txt as data source
Message-ID: <20160328111825.665a7a7059d7ee80bb4d670165c8327d.6f74680d0b.wbe@email03.secureserver.net>

Mark Davis wrote:

> I think there is a misunderstanding because of the online utilities
> which have been, for convenience, hosted with the same server as the
> CLDR survey tool. So one sees "cldr" in the following URL, but that
> doesn't mean a particular association with CLDR.

Yes, that was my fault.

> But subheads are *not* Unicode Character Properties. And repeating the
> caveats expressed earlier, the Nameslist data is designed for chart
> production, not as a reliable source of machine-readable data. While
> it may be in some cases useful to look at, the subheads are not
> designed to be a consistent source of data. For example, one couldn't
> use them effectively to find non-modern-use characters, because
> different terms are used for that, and the groupings mix in other
> characters.

I don't recall anyone asking for that.

> Other examples: the NamesList data doesn't include all the case
> mappings, nor all the normative name aliases.

Nor that.

> One needs to use the UCD instead of trying to dig this information out
> of the NamesList.txt file - because such information will be wrong and
> incomplete.

I don't recall anyone suggesting to use data from NamesList in preference to other UCD files. The issue is when NamesList is the only source.

To circle back to the original topic, I suggested using NamesList data to find the cross-references from holes in the Mathematical Alphanumeric Symbols to existing BMP characters, in preference to using (a) comments in the (b) non-UCD MathClass* files. Both (a) and (b) prevent this scenario from being a matter of "use the UCD."
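
As an illustration of the cross-reference lookup described above, here is a minimal Python sketch. It assumes the NamesList.txt layout in which an entry line is "code point, tab, name" and a cross-reference is a tab-indented line starting with "x"; the embedded SAMPLE text and the function name reserved_cross_references are made up for this example, and the real file should be checked before relying on any of it.

```python
import re

# Illustrative excerpt in the NamesList.txt layout (an assumption for this
# sketch, not a copy of the real file): entry lines are "CODE<TAB>NAME",
# cross-references are tab-indented lines starting with "x".
SAMPLE = (
    "1D454\tMATHEMATICAL ITALIC SMALL G\n"
    "\t# <font> 0067 latin small letter g\n"
    "1D455\t<reserved>\n"
    "\tx (planck constant - 210E)\n"
    "1D456\tMATHEMATICAL ITALIC SMALL I\n"
    "\t# <font> 0069 latin small letter i\n"
)

ENTRY = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")
XREF = re.compile(r"^\tx .*?([0-9A-F]{4,6})\)?\s*$")


def reserved_cross_references(text):
    """Map each <reserved> code point to the code points it cross-references."""
    result = {}
    current = None  # code point of the reserved entry being read, if any
    for line in text.splitlines():
        entry = ENTRY.match(line)
        if entry:
            is_reserved = entry.group(2) == "<reserved>"
            current = int(entry.group(1), 16) if is_reserved else None
            continue
        xref = XREF.match(line)
        if xref and current is not None:
            result.setdefault(current, []).append(int(xref.group(1), 16))
    return result


holes = reserved_cross_references(SAMPLE)
for hole, targets in holes.items():
    print(f"U+{hole:04X} -> " + ", ".join(f"U+{t:04X}" for t in targets))
```

Only cross-references attached to reserved code points are collected, since those are the holes in question; annotations on assigned characters are ignored.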
Sorry to keep dragging this out, but I think there are still some misunderstandings and mischaracterizations surrounding the expectations of stability, formality, comprehensiveness, etc. of this data and its availability in other places.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From asmusf at ix.netcom.com Mon Mar 28 20:32:03 2016
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 28 Mar 2016 18:32:03 -0700
Subject: NamesList.txt as data source
In-Reply-To:
References:
Message-ID: <56F9DB13.5000304@ix.netcom.com>

On 3/28/2016 4:59 AM, Mark Davis ☕️ wrote:
> The listing has both the block name and the Nameslist subhead label in
> listing characters. One can also use the subhead labels in filtering, eg
>
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{subhead=Archaic%20letters}
>
>
> But subheads are /not/ Unicode Character Properties.

Effectively, the utilities support "searching" the code charts. But the way it is syntactically expressed (using the \p operator) makes it look like a "property".

Now, if the utilities were able to search the core spec (and all UAXs) and look up under what headers (in what sections) the character is described or discussed, that would make it clearer that this is a search.

It would also be beyond nifty.

A./
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Mon Mar 28 22:30:37 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 28 Mar 2016 21:30:37 -0600
Subject: NamesList.txt as data source
In-Reply-To: <56F9DB13.5000304@ix.netcom.com>
References: <56F9DB13.5000304@ix.netcom.com>
Message-ID:

Asmus Freytag wrote:

> Now, if the utilities were able to search the core spec (and all UAXs)
> and look up under what headers (in what sections) the character is
> described or discussed, that would make it clearer that this is a
> search.
>
> It would also be beyond nifty.

It would indeed! But again, I don't think anyone has asked for that, or expects that.
--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From jsbien at mimuw.edu.pl Mon Mar 28 23:40:02 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Tue, 29 Mar 2016 06:40:02 +0200
Subject: NamesList.txt as data source
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell>
Message-ID: <86h9fp3nj1.fsf@mimuw.edu.pl>

On Mon, Mar 28 2016 at 13:59 CEST, mark at macchiato.com writes:

[...]

> But subheads are not Unicode Character Properties.

As Doug already said, nobody claims this.

> And repeating the caveats expressed earlier,

There have been a lot of repetitions in this thread...

> the Nameslist data is designed for chart production, not as a reliable
> source of machine-readable data.

I guess you understand "machine-readable data" (and consequently "data mining") in a specific, very narrow way.

> While it may be in some cases useful to look at, the subheads are not
> designed to be a consistent source of data.

Can we agree that NamesList is a reliable source of machine-readable data about the Unicode *charts*?

On Sun, Mar 27 2016 at 6:38 CEST, asmus-inc at ix.netcom.com writes:

[...]

> 3 The information is purely editorial, and as such, changed by the
> editors as needed, not assigned as result of a vote in the Unicode
> Technical Committee.

Changes are not a problem if properly documented, but this is another topic.

Let's now be more specific:

On Sun, Mar 27 2016 at 5:00 CEST, doug at ewellic.org writes:
> Janusz Bień wrote:
>
>> Am I right that this information is available only in NamesList.txt?
>
> It probably comes from what Ken referred to as "a very long list of
> annotational material, including names list subhead material, etc.,
> maintained in other sources."
>
> If you don't have access to those "other sources,"

See below.

> then as far as I
> can tell, yes, it's available only in NamesList.txt.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>
>

On Sun, Mar 27 2016 at 6:38 CEST, asmus-inc at ix.netcom.com writes:
> On 3/26/2016 2:10 AM, Janusz S. "Bień" wrote:

[...]

> I've just noticed that NamesList.txt is in a sense data mined by the
> Unicode consortium itself. I mean the "Unicode Utilities: Character
> Properties", which e.g. for LATIN SMALL LETTER P WITH FLOURISH
> (http://unicode.org/cldr/utility/character.jsp?a=A753) display in
> particular
>
> subhead: Medievalist addition

[...]

>
> If you seriously wanted to present "all that is known about a
> character" you would need to excerpt all mentions of it in the core
> specification, as well as (potentially) any additional details
> presented in the version of the proposal document that was approved by
> the UTC as part of encoding the character.

Exactly. The essential information for LATIN SMALL LETTER P WITH FLOURISH is that in Medieval manuscripts it is used for "pro" or "por". This information is available only in

http://www.unicode.org/L2/L2006/06027-n3027-medieval.pdf

Is this a static and permanent link? What is the copyright status of the document? For example: Can it be redistributed and replicated on other sites? Can it be quoted literally in a Wikipedia entry?

In general, what can be done to make access to such information easier?

Best regards

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmusf at ix.netcom.com Tue Mar 29 00:15:01 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Mon, 28 Mar 2016 22:15:01 -0700
Subject: NamesList.txt as data source
In-Reply-To: <86h9fp3nj1.fsf@mimuw.edu.pl>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl>
Message-ID: <56FA0F55.5070704@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From jsbien at mimuw.edu.pl Tue Mar 29 02:16:35 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Tue, 29 Mar 2016 09:16:35 +0200
Subject: NamesList.txt as data source
In-Reply-To: <56FA0F55.5070704@ix.netcom.com> (Asmus Freytag's message of "Mon, 28 Mar 2016 22:15:01 -0700")
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com>
Message-ID: <86y49121po.fsf@mimuw.edu.pl>

On Tue, Mar 29 2016 at 7:15 CEST, asmusf at ix.netcom.com writes:
> On 3/28/2016 9:40 PM, Janusz S. "Bień" wrote:

[...]

> The terms of use (ostensibly for the entire site) are defined here:
>
> http://www.unicode.org/copyright.html
>
> The document archive has not been designated with anything more
> restrictive, more specific or even explicit, but the documents
> themselves do not carry copyrights. As far as the Consortium is
> concerned, it requires the submitters to follow this policy
>
> http://www.unicode.org/policies/ipr_policy.html
>
> which gives the Consortium the rights to distribute submissions for
> any purpose.
>
> For example:
>
> Can it be redistributed and replicated on other sites?
>
> The Consortium places restrictions on the use of material on "pay
> sites".
>
> Can it be quoted
> literally in a Wikipedia entry?
>
> Do you see anything that would restrict you, other than not having any
> written policy that explicitly covers the Wikipedia?

The document I refer to is an ISO/IEC document. As far as I know, ISO is quite crazy about copyright. Does the Unicode Consortium policy apply to this document? If so, then on which principle? An explicit agreement with ISO?

> In general, what can be done to make access to such information easier?
>
> Over time, some of the information should move from the proposals to
> the text of the core specification and / or into a technical report.
> (For the mathematical characters, there exists a UTR that covers more
> details than the core specification, but for completeness, the core
> specification still contains some higher level stuff).
>
> This process can be user-driven or user-assisted, by people
> identifying gaps and either proposing text for the core specification
> or writing a Unicode Technical Note or proposing a UTR to cover the
> information.
>
> A UTN or UTR may be appropriate vehicles to collect information about
> a particular field of application (e.g. medievalist use).

A UTN (or UTR) seems a very good long-term solution, but I wonder how many Unicode users are aware of UTNs. Personally I tend to forget about them :-)

What about a simpler and more technical approach, like a character index with links to the relevant proposals? Doesn't such a thing already exist for internal use?

Best regards

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From asmusf at ix.netcom.com Tue Mar 29 02:54:04 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 29 Mar 2016 00:54:04 -0700
Subject: NamesList.txt as data source
In-Reply-To: <86y49121po.fsf@mimuw.edu.pl>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86y49121po.fsf@mimuw.edu.pl>
Message-ID: <56FA349C.7090108@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From jsbien at mimuw.edu.pl Tue Mar 29 03:13:10 2016
From: jsbien at mimuw.edu.pl (Janusz S.
Bień)
Date: Tue, 29 Mar 2016 10:13:10 +0200
Subject: NamesList.txt as data source
In-Reply-To: <56FA349C.7090108@ix.netcom.com> (Asmus Freytag's message of "Tue, 29 Mar 2016 00:54:04 -0700")
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86y49121po.fsf@mimuw.edu.pl> <56FA349C.7090108@ix.netcom.com>
Message-ID: <86twjp1z3d.fsf@mimuw.edu.pl>

On Tue, Mar 29 2016 at 9:54 CEST, asmusf at ix.netcom.com writes:
> On 3/29/2016 12:16 AM, Janusz S. "Bień" wrote:
>
> The document I refer to is an ISO/IEC document. As far as I know, ISO is
> quite crazy about copyright. Does the Unicode Consortium policy apply to
> this document? If so, then on which principle? An explicit agreement
> with ISO?
>
> This document looks like it's a simultaneous submission
>
> The document has an "L2" number, that is, an index in the UTC document
> register and the "Action" says "For consideration by... and UTC".
>
> As such it is a submission to UTC and covered by all UTC policies.
>
> A./

Thanks for the clarification.

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From andrewcwest at gmail.com Tue Mar 29 03:40:07 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Tue, 29 Mar 2016 09:40:07 +0100
Subject: NamesList.txt as data source
In-Reply-To: <56FA0F55.5070704@ix.netcom.com>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com>
Message-ID:

On 29 March 2016 at 06:15, Asmus Freytag (c) wrote:
>
> What is the copyright status of the
> document?
>
> The terms of use (ostensibly for the entire site) are defined here:
>
> http://www.unicode.org/copyright.html

That refers to the Unicode Standard and data files and other pages produced and published by the Unicode Consortium. It does not and cannot refer to documents submitted to the Unicode Consortium by external entities or individuals.

> The document archive has not been designated with anything more restrictive, more specific or even explicit, but the documents themselves do not carry copyrights. As far as the Consortium is concerned, it requires the submitters to follow this policy

All documents submitted to WG2 and to L2 by individuals are copyright of the author(s) of the document. Documents do not need to carry a copyright notice to have copyright, and submitting the documents to Unicode Consortium and/or ISO does not affect the copyright status of documents.

> http://www.unicode.org/policies/ipr_policy.html
>
> which gives the Consortium the rights to distribute submissions for any purpose.

A non-exclusive right.

> Can it be redistributed and replicated on other sites?

Ask the individual authors of the particular documents you want to redistribute.

> Can it be quoted literally in a Wikipedia entry?

Within the normal Wikipedia rules for quoting copyrighted material.

Andrew

From jsbien at mimuw.edu.pl Tue Mar 29 10:19:04 2016
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Tue, 29 Mar 2016 17:19:04 +0200
Subject: NamesList.txt as data source
In-Reply-To: (Andrew West's message of "Tue, 29 Mar 2016 09:40:07 +0100")
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com>
Message-ID: <86poud1fdj.fsf@mimuw.edu.pl>

On Tue, Mar 29 2016 at 10:40 CEST, andrewcwest at gmail.com writes:
> On 29 March 2016 at 06:15, Asmus Freytag (c) wrote:
>>
>> What is the copyright status of the
>> document?
>>

[...]
> All documents submitted to WG2 and to L2 by individuals are copyright
> of the author(s) of the document. Documents do not need to carry a
> copyright notice to have copyright, and submitting the documents to
> Unicode Consortium and/or ISO does not affect the copyright status of
> documents.
>
>> http://www.unicode.org/policies/ipr_policy.html

Do you happen to know an analogous link for the ISO submissions? I was unable to find one quickly.

>>
>> which gives the Consortium the rights to distribute submissions for any purpose.
>
> A non-exclusive right.
>
>> Can it be redistributed and replicated on other sites?
>
> Ask the individual authors of the particular documents you want to
> redistribute.

OK

>
>> Can it be quoted literally in a Wikipedia entry?
>
> Within the normal Wikipedia rules for quoting copyrighted material.

OK

Best regards

Janusz

--
,
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From andrewcwest at gmail.com Tue Mar 29 11:15:15 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Tue, 29 Mar 2016 17:15:15 +0100
Subject: NamesList.txt as data source
In-Reply-To: <86poud1fdj.fsf@mimuw.edu.pl>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86poud1fdj.fsf@mimuw.edu.pl>
Message-ID:

On 29 March 2016 at 16:19, Janusz S. Bień wrote:
>
> > All documents submitted to WG2 and to L2 by individuals are copyright
> > of the author(s) of the document. Documents do not need to carry a
> > copyright notice to have copyright, and submitting the documents to
> > Unicode Consortium and/or ISO does not affect the copyright status of
> > documents.
> >
> >> http://www.unicode.org/policies/ipr_policy.html
>
> Do you happen to know an analogous link for the ISO submissions?
I was
> unable to find one quickly.

ISO/IEC Directives Part 1 (6th ed., 2015)

Section 2.13:

In ISO and IEC, there is an understanding that original material contributed to become a part of an ISO, IEC or ISO/IEC publication can be copied and distributed within the ISO and/or IEC systems (as relevant) as part of the consensus building process, this being without prejudice to the rights of the original copyright owner to exploit the original text elsewhere. Where material is already subject to copyright, the right should be granted to ISO and/or IEC to reproduce and circulate the material. This is frequently done without recourse to a written agreement, or at most to a simple written statement of acceptance. Where contributors wish a formal signed agreement concerning copyright of any submissions they make to ISO and/or IEC, such requests must be addressed to ISO Central Secretariat or the IEC Central Office, respectively.

Andrew

From kenwhistler at att.net Tue Mar 29 13:24:14 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 29 Mar 2016 11:24:14 -0700
Subject: Character Index (was: Re: NamesList.txt as data source)
In-Reply-To: <86y49121po.fsf@mimuw.edu.pl>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86y49121po.fsf@mimuw.edu.pl>
Message-ID: <56FAC84E.9080705@att.net>

On 3/29/2016 12:16 AM, Janusz S. Bień wrote:
> What about a simpler and more technical approach, like a character index
> with links to the relevant proposals? Doesn't such a thing already exist
> for internal use?

No, and it is exceedingly *non*-trivial to produce such an index. There are now thousands of documents, extending over 27 years of history (and actually more when you go back to earlier work on 10646). Much of the early half of that document trail is paper only, in material that most of the participants have long ago mulched.
The status of what a "character" even is can change during the development of proposals, as they morph over time. This is also exceedingly non-trivial in some cases, where argumentation about cases of unification and/or disunification of different source attestations might proceed over an extended period. That makes it pretty difficult to just willy-nilly produce a magical character index that points to exactly the right place.

In recent years we have had some individuals who have tracked the specific documents associated with repertoire new to particular releases much more thoroughly than in prior years -- but truth to tell, the *majority* of people involved in maintenance of the Unicode Standard and ISO/IEC 10646 care little about the details of that history. Instead, they are basically focused on whatever happens to be the next thing to argue about. It is all about shinies -- not about piecing together dusty old artifacts. ;-)

--Ken

From asmusf at ix.netcom.com Tue Mar 29 13:56:34 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Tue, 29 Mar 2016 11:56:34 -0700
Subject: Character Index
In-Reply-To: <56FAC84E.9080705@att.net>
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86y49121po.fsf@mimuw.edu.pl> <56FAC84E.9080705@att.net>
Message-ID: <56FACFE2.5020702@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From kent.karlsson14 at telia.com Tue Mar 29 18:14:59 2016
From: kent.karlsson14 at telia.com (Kent Karlsson)
Date: Wed, 30 Mar 2016 00:14:59 +0100
Subject: Swapcase for Titlecase characters
In-Reply-To:
Message-ID:

On 2016-03-19 17:40, "Doug Ewell" wrote:

> As one anecdote (which is even less like "data" than two anecdotes), I
> could not find any of the characters ? ? ? ? ? ? ? ? ? ? ? or their hex

(You missed the DZ "ligature" (which aren't really ligatures).)

As mentioned, for the Ĳ ĳ
here (which sometimes ARE shown as ligatures, mostly in signage), there is no "titlecase" variant for these (and thus no problem for "swapcase"). For casing they behave just like ? ? and ? ?.

While we are off-topic for this thread... (but still on-topic for this list):

I still think ĳ should have the "soft-dotted" property (and that that property is finally implemented properly in various systems...).

> equivalents in any of the CLDR keyboard definitions.

I've heard that old typewriters used to have a key for Ĳ ĳ. Maybe it should be reintroduced for Dutch computer keyboards, as well as used (for Dutch) in autocorrects (IJ -> Ĳ, ij -> ĳ) or spell correctors (looking at the whole word rather than just two letters, and then not restricted to Dutch per se, but certain Dutch names regardless of the language for the surrounding text). That, in turn, would probably be a better approach than trying to have some special handling of the sequence "ij" in case mapping (for Dutch alone).

/Kent K

> I'd imagine that
> users just type the two characters separately, and that consequently
> most data in the real world is like that.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From mark at macchiato.com Wed Mar 30 11:59:58 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 30 Mar 2016 18:59:58 +0200
Subject: UTC makes the Colbert show
Message-ID:

Fredrik passed this on:

https://www.youtube.com/watch?v=CfZE56E0Uts ; skip ahead to 1:30.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jsbien at mimuw.edu.pl Wed Mar 30 12:18:42 2016
From: jsbien at mimuw.edu.pl (Janusz S.
Bien)
Date: Wed, 30 Mar 2016 19:18:42 +0200
Subject: NamesList.txt as data source
In-Reply-To:
References: <5E5E396485A9426EBF716F6D141C9F70@DougEwell> <86h9fp3nj1.fsf@mimuw.edu.pl> <56FA0F55.5070704@ix.netcom.com> <86poud1fdj.fsf@mimuw.edu.pl>
Message-ID: <20160330191842.16243amvjjtrqkqa@mail.mimuw.edu.pl>

Quote - Andrew West (Tue 29 Mar 2016 06:15:15 PM CEST):

> On 29 March 2016 at 16:19, Janusz S. Bień wrote:
>>
>> > All documents submitted to WG2 and to L2 by individuals are copyright
>> > of the author(s) of the document. Documents do not need to carry a
>> > copyright notice to have copyright, and submitting the documents to
>> > Unicode Consortium and/or ISO does not affect the copyright status of
>> > documents.
>> >
>> >> http://www.unicode.org/policies/ipr_policy.html
>>
>> Do you happen to know an analogous link for the ISO submissions? I was
>> unable to find one quickly.
>
> ISO/IEC Directives Part 1 (6th ed., 2015)
>
> Section 2.13:
>
>
> In ISO and IEC, there is an understanding that original material
> contributed to become a part of an ISO,
> IEC or ISO/IEC publication can be copied and distributed within the
> ISO and/or IEC systems (as relevant)
> as part of the consensus building process, this being without
> prejudice to the rights of the original
> copyright owner to exploit the original text elsewhere. Where material
> is already subject to copyright,
> the right should be granted to ISO and/or IEC to reproduce and
> circulate the material. This is frequently
> done without recourse to a written agreement, or at most to a simple
> written statement of acceptance.
> Where contributors wish a formal signed agreement concerning copyright
> of any submissions they
> make to ISO and/or IEC, such requests must be addressed to ISO Central
> Secretariat or the IEC Central
> Office, respectively.
>
>
> Andrew

Thanks again!

Janusz

--
Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bień
- University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From mark at macchiato.com Wed Mar 30 12:54:37 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 30 Mar 2016 19:54:37 +0200
Subject: UTC makes the Colbert show
In-Reply-To:
References:
Message-ID:

On Wed, Mar 30, 2016 at 7:42 PM, Jennifer 8. Lee wrote:

> I thought his "elf exposing self in park" was an amazing (and accurate)
> facial expression.
>

Right! How does he make his cheeks do that!?!?

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Wed Mar 30 12:57:49 2016
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Wed, 30 Mar 2016 17:57:49 +0000
Subject: UTC makes the Colbert show
In-Reply-To:
References:
Message-ID:

He has suggestions for process improvements as well?

From: Unicore [mailto:unicore-bounces at unicode.org] On Behalf Of Jennifer 8. Lee
Sent: Wednesday, March 30, 2016 10:37 AM
To: Mark Davis ☕️
Cc: UTC ; Unicode Public
Subject: Re: UTC makes the Colbert show

He cites you by title!

On Wednesday, March 30, 2016, Mark Davis ☕️ wrote:
Fredrik passed this on:
https://www.youtube.com/watch?v=CfZE56E0Uts ; skip ahead to 1:30.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Wed Mar 30 13:24:10 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 30 Mar 2016 11:24:10 -0700
Subject: UTC makes the Colbert show
Message-ID: <20160330112410.665a7a7059d7ee80bb4d670165c8327d.6b41c00e29.wbe@email03.secureserver.net>

> Fredrik passed this on:
> https://www.youtube.com/watch?v=CfZE56E0Uts ; skip ahead to 1:30.

This is great! Now all of America knows what Unicode is really all about. ??

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From verdy_p at wanadoo.fr Wed Mar 30 13:57:35 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 30 Mar 2016 20:57:35 +0200
Subject: UTC makes the Colbert show
In-Reply-To: <20160330112410.665a7a7059d7ee80bb4d670165c8327d.6b41c00e29.wbe@email03.secureserver.net>
References: <20160330112410.665a7a7059d7ee80bb4d670165c8327d.6b41c00e29.wbe@email03.secureserver.net>
Message-ID:

2016-03-30 20:24 GMT+02:00 Doug Ewell :

> > Fredrik passed this on:
> > https://www.youtube.com/watch?v=CfZE56E0Uts ; skip ahead to 1:30.
>
> This is great! Now all of America knows what Unicode is really all
> about.
>

All of America, really? Do they all watch the same TV show on the same channel? There are probably many more people who do not watch TV at all but watch video channels on the Internet (where there is a plethora of videos on many other topics, treated seriously or not).

Maybe they've heard about Unicode (but most often very superficially). Their contact with it (for example with emojis) is a selection panel on their smartphone, and they absolutely don't care about the encoding or any standard; they'll use these directly "as is" (and don't really know whether the result is received correctly, the way they intended). They don't even know whether a Unicode encoding is really used to transport their messages.

Most PC users have never touched the browser settings for the "default encoding" of pages; they simply don't know how to choose (if they select something incorrect and this causes them problems, they'll just reset the browser to the default settings defined by other people).

Unicode is absolutely not their problem (if there's a problem they will first blame the manufacturer of their device, or maybe the maker of the software). Look at the support forums about those issues: most of the time they do not understand the technical details, and just ask for a one-click solution (to reset the preferences that were incorrectly set).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Mar 30 15:12:27 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 30 Mar 2016 22:12:27 +0200 (CEST)
Subject: Support for Latin ligature IJ (was another thread)
In-Reply-To:
References:
Message-ID: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>

On Wed, 30 Mar 2016 00:14:59 +0100, Kent Karlsson wrote [in the thread ‘Re: Swapcase for Titlecase characters’]:

[…]

> I still think ĳ should have the "soft-dotted" property (and that
> that property is finally implemented properly in various systems...).

[Refers to:
Re: Case for letters j and J with acute from Kent Karlsson on 2016-02-09
http://www.unicode.org/mail-arch/unicode-ml/y2016-m02/0044.html]

For ‘ĳ’ that may be unambiguous, but for ‘i’ there is a need of locale-dependent tailoring, as for Lithuanian it should be hard-dotted.

> I've heard that old typewriters used to have a key for Ĳ ĳ.

I've read it on Wikipedia, though I've been unable to find any image of such a key on the internet. This one is Dutch but has none:

https://www.bing.com/images/search?q=typewriter+dutch&view=detailv2&id=5473CA1D2B05879CE21B98CD9F729EE838A49E69&selectedindex=31&ccid=wLABJru4&simid=608029570327776271&thid=OIP.Mc0b00126bbb87be9b1d849df9b11a201o0&mode=overlay&first=1

These machines have lowercase ĳ only, while the uppercase position is given to the florin sign:

https://img1.etsystatic.com/062/0/5543707/il_570xN.794019731_fiyd.jpg
http://www.tiptopvintage.co.uk/wp-content/uploads/2015/05/Brown-Vendex-Typewriter-7.jpg

> Maybe it should be reintroduced for Dutch computer keyboards,

I am in favor. To achieve this, it would be sufficient to have an ISO/IEC 9995-3 compliant keyboard layout for the Netherlands and one for Belgium, as there are already one for Canada and one for Germany (given that ‘Ĳ’, ‘ĳ’ are included on T3).

And to complete the job, all of these could be added to CLDR.

> as well as used
> (for Dutch) in autocorrects (IJ -> Ĳ, ij -> ĳ) or spell correctors
> (looking at the whole word rather than just two letters, and then
> not restricted to Dutch per se, but certain Dutch names regardless
> of the language for the surrounding text).

It's urgent to spell the names correctly, notably because there are insufficient equivalence classes in search engines. Correctly spelled ‘Ĳsselmeer’ vs misspelled ‘IJsselmeer’ yields different numbers of results:

Bing Search: 2 850 000 vs 886 000
Google Search: 343 000 vs 345 000

while DuckDuckGo, Startpage and Yahoo do not state the number of results (which in any case is mainly theoretical, since only the top 500 are currently displayable).

> That, in turn, would
> probably be a better approach than trying to have some special
> handling of the sequence "ij" in case mapping (for Dutch alone).

In current understanding there seems to be uncertainty about whether the ‘Ĳ’ ligatures are to be used or are deprecated. The mere fact that they are compatibility decomposable is cited[1] along with rule D21 to justify separate encoding as ‘IJ’. TUS indeed seems to support that point of view when it declares Dutch as supported by the Latin-1 Supplement. One page below, the ‘Ĳ’ ligatures are discussed as compatibility characters, which does not imply deprecation. And indeed, their replacement by two-letter sequences is pointed out as a mere matter of fact.

While atomic typing of ‘ij’ seems to be a relic from the ISO/IEC 646 era, I'm puzzled not to find any related autocorrect in the word processor when Dutch is on (no instances found in MSO1043.acl of 2010), whereas French ‘œ’ is supported in the French ACL.

As for special case mapping for ‘ij’, its implementation is increasing, but yes, it remains a workaround that won't be needed any longer as soon as people switch to ISO/IEC 9995-3 keyboard layouts. In the era of globalization, there is pretty much no other choice.
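
The two behaviours discussed above (the compatibility decomposition of the ligature, and the case-mapping workaround for the two-letter sequence) can be sketched in Python using only the standard unicodedata module. The helper dutch_capitalize is a hypothetical illustration of such a Dutch tailoring, written for this example; it is not taken from CLDR or any library.

```python
import unicodedata

IJ_UPPER = "\u0132"  # LATIN CAPITAL LIGATURE IJ
IJ_LOWER = "\u0133"  # LATIN SMALL LIGATURE IJ

# The ligature has only a compatibility decomposition, so NFC keeps it
# intact while NFKC replaces it with the two-letter sequence.
print(unicodedata.decomposition(IJ_LOWER))
print(unicodedata.normalize("NFC", IJ_LOWER) == IJ_LOWER)
print(unicodedata.normalize("NFKC", IJ_LOWER) == "ij")


def dutch_capitalize(word):
    """Capitalize the first letter, treating a leading "ij" as one letter.

    Hypothetical helper: a sketch of the special handling of the sequence
    "ij" that is needed when the ligature characters are not used.
    """
    if word[:2].lower() == "ij":
        return "IJ" + word[2:]
    return word[:1].upper() + word[1:]


# Default (untailored) capitalization only touches the first code point.
print("ijsselmeer".capitalize())
print(dutch_capitalize("ijsselmeer"))
# With the single-character ligature, default case mapping is sufficient.
print((IJ_LOWER + "sselmeer").capitalize())
```

With the ligature, the default case mapping already does the right thing, which is the point made above; the tailored helper is only needed for text typed as two separate letters.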
Hopefully,

Marcel

[1] https://en.wikipedia.org/wiki/IJ_(digraph)#cite_note-15

From verdy_p at wanadoo.fr Wed Mar 30 16:19:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 30 Mar 2016 23:19:17 +0200
Subject: Support for Latin ligature IJ (was another thread)
In-Reply-To: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>
Message-ID:

In my opinion, the Dutch Ĳ/ĳ "ligature" is not really a ligature and should be treated exactly like Æ/æ or Œ/œ, as a plain single letter.

The use of IJ/ij (encoded as separate letters) is actually an orthographic fault that a ligature will not help resolve.

Thankfully, the decomposition of the "Ĳ" or "ĳ" letter into separate letters is only a compatibility decomposition; the two encodings are not canonically equivalent.

In such a case, the "ĳ" letter is soft-dotted also in Dutch, and the two dots disappear when it has diacritics above.

For Lithuanian, the "ĳ" letter is not soft-dotted but effectively hard-dotted (meaning also that it is really a ligature, and that the single letter should not be used at all, but encoded as i+j with a possible joiner...). In such a case, using the single letter "Ĳ/ĳ", meant only for Dutch, is also an orthographic fault. But this also means that when you add diacritics in Lithuanian, you'll need to encode explicit dots (like in Turkish) to keep these dots!

From verdy_p at wanadoo.fr Wed Mar 30 16:42:20 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 30 Mar 2016 23:42:20 +0200
Subject: Support for Latin ligature IJ (was another thread)
In-Reply-To:
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>
Message-ID:

Note that the single letter "ĳ" in Dutch is often indistinguishable from "ÿ", which is also commonly found as a convenient substitute in many old documents encoded not with Unicode but with ISO 8859-1. This has a caveat: capitalization would produce "Y" (in ISO 8859-1), possibly followed by a combining diaeresis (in Unicode-encoded documents), instead of "IJ" (more correct but not perfect) or the "Ĳ" letter (best choice).

The use of "ÿ" in Dutch should also be considered an orthographic fault, and it should be corrected into "ĳ" (to solve the capitalization problem), but there are occurrences of "ÿ" in Dutch which are correct (notably in borrowed French toponyms such as "L’Haÿ-les-Roses").

There may be similar examples in Belgium with French toponyms, but I suspect that those Belgian-French toponyms have their own Dutch "officialized" variant, which would be preferable without borrowing the Belgian-French orthography, so that they will not need "ÿ" and will likely use "ĳ" instead. This means that the autocorrection of "ÿ" from possible Belgian-French toponyms into "ĳ" will also be correct for Dutch-Belgian toponyms; it may also be correct for French-French toponyms like "L’Haÿ-les-Roses", transformed into "L’Haĳ-les-Roses" in Belgian-Dutch, or "L’HAĲ-LES-ROSES" if capitalized, if autocorrected this way. It would however be incorrect to replace the "ĳ" (or Ĳ) letter there by the two letters "ij" (or "IJ") without the orthographic ligature...

Out of curiosity, I looked into the Dutch Wikipedia to see how they write "L’Haÿ-les-Roses": they don't transform the French "ÿ" into some Dutch "ĳ" (and they don't have any other "officialized" Dutch orthography).

For this reason, the autocorrection of the "ÿ" letter into the "ĳ" letter in Dutch is disabled by default (even if it would be needed when looking into old documents encoded with ISO 8859-1).

The situation is more complex for the autocorrection of the "ij" digram (extremely frequent in old documents encoded with ISO 8859-1) into the plain "ĳ" letter, which seems to be active in various word processors (but which causes problems with borrowed non-Dutch names).

From public at khwilliamson.com Wed Mar 30 19:17:17 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 30 Mar 2016 18:17:17 -0600
Subject: UTC makes the Colbert show
In-Reply-To:
References:
Message-ID: <56FC6C8D.1010606@khwilliamson.com>

On 03/30/2016 11:54 AM, Mark Davis ☕️ wrote:
>
> On Wed, Mar 30, 2016 at 7:42 PM, Jennifer 8. Lee wrote:
>
>     I thought his "elf exposing self in park" was an amazing (and
>     accurate) facial expression.
>
> Right! How does he make his cheeks do that!?!?

Botox?
From charupdate at orange.fr Wed Mar 30 23:04:49 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 31 Mar 2016 06:04:49 +0200 (CEST)
Subject: Support for Latin ligature IJ
In-Reply-To:
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>
Message-ID: <1284038073.92.1459397089210.JavaMail.www@wwinf1g10>

On Wed, 30 Mar 2016 23:42:20 +0200, Philippe Verdy wrote:

> Note that the single letter "ĳ" in Dutch is often indistinguishable from "ÿ", which is also commonly found as a convenient substitute in many old documents encoded not with Unicode but with ISO 8859-1. This has a caveat: capitalization would produce "Y" (in ISO 8859-1), possibly followed by a combining diaeresis (in Unicode-encoded documents), instead of "IJ" (more correct but not perfect) or the "Ĳ" letter (best choice).

Almost regularly, the uppercase ‘Ĳ’ was also represented as a ‘Y’ in Dutch pre-computer text and signage.

Sad to say, since it excludes three French characters (Œ, œ, Ÿ) and lacks four Finnish ones, Latin-1 was not what could have been called a Western European charset, even though the euro sign could not be anticipated.

> The use of "ÿ" in Dutch should also be considered an orthographic fault, and it should be corrected into "ĳ" (to solve the capitalization problem), but there are occurrences of "ÿ" in Dutch which are correct (notably in borrowed French toponyms such as "L’Haÿ-les-Roses").
>
> There may be similar examples in Belgium with French toponyms, but I suspect that those Belgian-French toponyms have their own Dutch "officialized" variant, which would be preferable without borrowing the Belgian-French orthography, so that they will not need "ÿ" and will likely use "ĳ" instead. This means that the autocorrection of "ÿ" from possible Belgian-French toponyms into "ĳ" will also be correct for Dutch-Belgian toponyms; it may also be correct for French-French toponyms like "L’Haÿ-les-Roses", transformed into "L’Haĳ-les-Roses" in Belgian-Dutch, or "L’HAĲ-LES-ROSES" if capitalized, if autocorrected this way. It would however be incorrect to replace the "ĳ" (or Ĳ) letter there by the two letters "ij" (or "IJ") without the orthographic ligature...
>
> Out of curiosity, I looked into the Dutch Wikipedia to see how they write "L’Haÿ-les-Roses": they don't transform the French "ÿ" into some Dutch "ĳ" (and they don't have any other "officialized" Dutch orthography).
>
> For this reason, the autocorrection of the "ÿ" letter into the "ĳ" letter in Dutch is disabled by default (even if it would be needed when looking into old documents encoded with ISO 8859-1).
>
> The situation is more complex for the autocorrection of the "ij" digram (extremely frequent in old documents encoded with ISO 8859-1) into the plain "ĳ" letter, which seems to be active in various word processors (but which causes problems with borrowed non-Dutch names).

Yet another example of how autocorrection-based workarounds, designed to keep outdated keyboard layouts in use, risk running into a mess.

> 2016-03-30 23:19 GMT+02:00 Philippe Verdy :
>
> > In my opinion, the Dutch Ĳ/ĳ "ligature" is not really a ligature and should be treated exactly like Æ/æ or Œ/œ, as a plain single letter.

I fully agree that these are all plain letters. Consistently, Unicode encoded them all as such: LATIN CAPITAL LETTER I J, LATIN CAPITAL LETTER A E, LATIN CAPITAL LETTER O E. The misleading ‘LIGATURE’ names were enforced by ISO, and subsequently partially corrected by Unicode at the request of the mainly concerned NB. ‘Ĳ’ too is considered a letter in Dutch.

In French, the administrative POV is that ‘Œ’ and ‘OE’ are equivalent, and that has been agreed to by a representative of the linguistic authority. The point is that (1) one cannot ask people to use letters that are not on their keyboard, (2) one cannot ask software providers to add them in the layout driver while they aren’t printed on keycaps, and (3) one cannot ask manufacturers to add them on the keyboard as long as that is not specified by any official standard. But all that shall now change.

Same problem (presumably) on Dutch keyboards, and here again things should soon be improved, when the future revised ISO/IEC 9995 includes a compose key, at least on Right Alt + Space. Such a gateway can be added without altering the space bar, which is the one key that does not need to be engraved, and behind it, all characters of the current script can be added without sticking anything more on the keycaps.

> > The use of IJ/ij (encoded as separate letters) is actually an orthographic fault that a ligature will not help resolve.

As for the actual meaning of ‘ligature’, see above, but you are completely right.

> > Thankfully, the decomposition of the "Ĳ" or "ĳ" letter into separate letters is only a compatibility decomposition; the two encodings are not canonically equivalent.

That will help improve the cited Wikipedia article. Correcting documentation is actually a precondition for users to dare type U+0132/U+0133.

> > In such a case, the "ĳ" letter is soft-dotted also in Dutch, and the two dots disappear when it has diacritics above.
> >
> > For Lithuanian, the "ĳ" letter is not soft-dotted but effectively hard-dotted (meaning also that it is really a ligature, and that the single letter should not be used at all, but encoded as i+j with a possible joiner...). In such a case, using the single letter "Ĳ/ĳ", meant only for Dutch, is also an orthographic fault. But this also means that when you add diacritics in Lithuanian, you'll need to encode explicit dots (like in Turkish) to keep these dots!

The oopsie is that in some implementations, this way you get two stacked dots plus the other diacritic…
We can only hope that this is now fixed.

Marcel

From duerst at it.aoyama.ac.jp Thu Mar 31 00:51:55 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Thu, 31 Mar 2016 14:51:55 +0900
Subject: Support for Latin ligature IJ (was another thread)
In-Reply-To:
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15>
Message-ID: <56FCBAFB.6030403@it.aoyama.ac.jp>

On 2016/03/31 06:42, Philippe Verdy wrote:

> The use of "ÿ" in Dutch should also be considered as an orthographic fault,
> and it should be corrected into "ĳ" (to solve the capitalization problem),
> but there are occurrences in Dutch of "ÿ" which are correct (notably in
> borrowed French toponyms such as "L’Haÿ-les-Roses")
>
> There may be similar examples in Belgium with French toponyms, but I
> suspect that those Belgian-French toponyms have their own Dutch
> "officialized" variant which would be preferable without borrowing the
> Belgian-French orthography,

I'm not too familiar with the local Belgian customs for place names, but in general, correspondences will not be that simple. There may be cases with exactly the same spelling (but different pronunciation), cases with simple spelling differences, cases with different words (same or different meanings), and so on.

> so that they will not need "ÿ", and they will
> likely use "ĳ" instead, meaning that the autocorrection of "ÿ" from
> possible Belgian-French toponyms into "ĳ" will also be correct for
> Dutch-Belgian toponyms ; it may also be correct for French-French toponyms
> like "L’Haÿ-les-Roses" transformed into "L’Haĳ-les-Roses" in Belgian-Dutch,
> or "L’HAĲ-LES-ROSES" if capitalized, if autocorrected this way; it would
> however be incorrect to replace there the "ĳ" (or Ĳ) letter by the two
> letters "ij" (or "IJ") without the orthographic ligature...
I'm not an expert in French or Dutch pronunciation or orthography, but as far as I understand, transforming "L’Haÿ-les-Roses" to "L’Haĳ-les-Roses" would be wrong because it would lead to a wrong pronunciation; if anything, "L’Hĳ-les-Roses" would be closer.

> By curiosity, I looked into the Dutch Wikipedia to see how they wrote
> "L’Haÿ-les-Roses"
> and they don't transform the French "ÿ" into some Dutch "ĳ" (and they don't
> have any other "officialized" Dutch orthography.

With Unicode, there's less and less of a need to "officialize" such spellings, even though of course whether to do so or not will continue to depend on other factors such as culture and official policy.

> For this reason, the autocorrection of the "ÿ" letter into the "ĳ" letter
> in Dutch is disabled by default (even if it would be needed to look into
> old documents encoded with ISO8859-1).
>
> The situation is more complex for the autocorrection of the "ij" digram
> (extremely frequent in old documents encoded with ISO8859-1) into the plain
> "ĳ" letter, which seems to be active in various wordprocessors (but which
> causes problems with borrowed non-Dutch names).

Such problems these days can be solved by using context-sensitive corrections, either with something close to regular expressions (detecting typical Dutch spellings) or dictionaries.

Regards, Martin.

From verdy_p at wanadoo.fr Thu Mar 31 09:40:37 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 31 Mar 2016 16:40:37 +0200
Subject: Support for Latin ligature IJ
In-Reply-To: <1284038073.92.1459397089210.JavaMail.www@wwinf1g10>
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15> <1284038073.92.1459397089210.JavaMail.www@wwinf1g10>
Message-ID:

2016-03-31 6:04 GMT+02:00 Marcel Schneider :

> On Wed, 30 Mar 2016 23:42:20 +0200, Philippe Verdy wrote:
> > > In such a case, the "ĳ" letter is soft-dotted also in Dutch and the
> > > two dots disappear when it has diacritics above.
> > >
> > > For Lithuanian, the "ĳ" letter is not soft-dotted, but effectively
> > > hard-dotted (meaning also that it is really a ligature, and that the
> > > single letter should not be used at all, but encoded as i+j with a possible
> > > joiner...). In such a case, using the single letter "Ĳ/ĳ" meant only for
> > > Dutch is also an orthographic fault. But this also means that when you add
> > > diacritics in Lithuanian, you'll need to encode explicit dots (like in
> > > Turkish) to keep these dots !
>
> The oopsie is that in some implementations, this way you get two stacked
> dots plus the other diacritic…
> We can only hope that this is now fixed.

True, but the combining diacritic cannot be the standard dot above (because the dots would combine vertically, or because the removal of the implicit dots by the addition of a single combining dot above would just leave one dot centered somewhere between the two parts of the letter).

Maybe in this case it should be the diaeresis (so "soft-dotted" could also apply to the implicit diaeresis...). Well, semantically this is not strictly a diaeresis but two dots above, side by side, one over each part of the letter. But for that specific letter it is not so stupid after all to consider that these two horizontal dots are the same as a diaeresis.

So let's say we want to add an acute accent above the Lithuanian "ĳ": we would encode "ĳ"+"combining diaeresis"+"combining acute accent" to explicitly encode the two dots and avoid their removal from the soft-dotted "ĳ" caused by the acute accent.

Hmmm... not perfect semantically, but this could work... provided that fonts correctly interpret "ĳ"+"combining diaeresis" as meaning that they must just preserve the existing dots over the isolated "ĳ", instead of dropping them and placing the dots of the diaeresis at a random position over the undotted "ĳ". That is, the renderings of "ĳ" and of "ĳ"+"combining diaeresis" are indistinguishable even if they are not canonically equivalent (exactly like in Turkish for the renderings of isolated "i" and of "i"+"combining dot above", which should also be indistinguishable even if they are not canonically equivalent).

There's a caveat: this creates two confusable encodings for the isolated "ĳ" (with or without the combining dots). But this is also true for "i" (with or without the combining dot), or for the isolated "j" letter in a few other Turkic/Altaic languages.

- One way to avoid the confusion is in fact to use distinct placements of the dots (over the Dutch/Lithuanian "ĳ" letter or over the Turkish "i" letter) if there's no other diacritic above, and for fonts to use the standard placement of these dots (same as the isolated letter) **only** if there's another combining diacritic above.

- Otherwise, the alternate placement could use larger dots, or dots slightly shifted horizontally, if they are explicitly encoded where they should not be encoded at all over the isolated letter.

Philippe.

From verdy_p at wanadoo.fr Thu Mar 31 09:57:01 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 31 Mar 2016 16:57:01 +0200
Subject: Support for Latin ligature IJ (was another thread)
In-Reply-To: <56FCBAFB.6030403@it.aoyama.ac.jp>
References: <1891339733.31853.1459368747841.JavaMail.www@wwinf1p15> <56FCBAFB.6030403@it.aoyama.ac.jp>
Message-ID:

2016-03-31 7:51 GMT+02:00 Martin J. Dürst :

> I'm not an expert in French or Dutch pronunciation or orthography, but as
> far as I understand, transforming "L’Haÿ-les-Roses" to "L’Haĳ-les-Roses"
> would be wrong because it would lead to a wrong pronunciation; if anything,
> "L’Hĳ-les-Roses" would be closer.

No. The actual pronunciation would be closer to "L'a-i-lé-Roz'" (note the "H" is mute just like the final "es").
In Russian Wikipedia, it is currently transliterated to Cyrillic by ignoring the diaeresis, which is fundamental (as if it were written "Hay" or "Haie" in French), i.e. "aÿ" is represented as a single Cyrillic vowel. French pronounces the two vowels "a" and "y" distinctly because of the diaeresis: this is the standard role of the diaeresis in French, to separate letters, without any diphthong, or with just a very light diphthong between them in fast speech which still preserves the two vowels.

Obviously the Russian transliteration is wrong (but it has been borrowed automatically from Russian Wikipedia into OpenStreetMap). I don't think there's any attested usage in Russian with this faulty pronunciation, except when Russians read the Russian Wikipedia writing this bad name in Cyrillic.

Note that Russian Wikipedia is full of very strange (faulty or incoherent) transliterations of foreign terms, invented by some self-proclaimed "experts" (sometimes they use an actual translation, sometimes a transliteration). E.g. the clearly faulty transliteration of "Seine-Saint-Denis" renders the first two words (clearly distinct phonetically in French) with the same 3 Cyrillic letters, a strange approximation of the actual phonetics, and not with the actual Russian translations of "Seine" and "Saint". For compound names based on them, sometimes the Russian translation on Wikipedia is used, sometimes the transliteration, without any consistency; this choice is completely arbitrary.
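Two of the encoding points raised earlier in this thread can be checked with a short Python sketch using only the standard unicodedata module: that U+0132/U+0133 have a compatibility decomposition but no canonical one, and that the Latin-1 substitute "ÿ" uppercases to a character outside ISO 8859-1. The variable names are arbitrary, and the sketch says nothing about rendering or the soft-dotted behavior, which depend on fonts and shaping engines.

```python
import unicodedata

ij = "\u0133"  # LATIN SMALL LIGATURE IJ

# Canonical normalization leaves U+0133 untouched; only the
# compatibility forms fold it into the two letters "i" + "j".
print(unicodedata.normalize("NFD", ij) == ij)     # True
print(unicodedata.normalize("NFKD", ij) == "ij")  # True

# Default case mapping keeps the single letter: U+0133 -> U+0132.
print(ij.upper() == "\u0132")                     # True

# The Latin-1 substitute U+00FF (y with diaeresis) uppercases to U+0178,
# which decomposes canonically to "Y" + COMBINING DIAERESIS ...
y_upper = "\u00ff".upper()
print(y_upper == "\u0178")                                 # True
print(unicodedata.normalize("NFD", y_upper) == "Y\u0308")  # True

# ... and U+0178 cannot be encoded in ISO 8859-1 at all.
try:
    y_upper.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)  # False
```

This is why a search or comparison that applies only canonical normalization treats "Ĳsselmeer" and "IJsselmeer" as different strings, while one that applies NFKC/NFKD folds them together.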