From doug at ewellic.org Tue Sep 1 11:37:03 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Sep 2015 09:37:03 -0700
Subject: Dark beer emoji
Message-ID: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>

Document L2/15-211, "Letter in support of dark beer emoji", is a request submitted by Cuauhtémoc Moctezuma, a Mexican brewery.

The letter refers to a petition with more than 22,000 signatures supporting such an emoji, and may have at least some commercial motivation ("We want the dark beer to be part of peoples conversations").

As an alternative to this proposal that may provide more flexibility, I propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B CLINKING BEER MUGS.

This could be done by establishing a normative correlation between the Fitzpatrick scale and the Standard Reference Method (SRM), Lovibond, and/or European Brewery Convention (EBC) beer color scales.

This mechanism would allow the entire spectrum of beer styles to be depicted, instead of dividing beers arbitrarily into "light" and "dark," in the same way (and for the same reason) that Unicode already supports a variety of skin tones.

For example, a Budweiser or similar lager could be represented as 🍺🏻 <1F37A, 1F3FB>, while a Newcastle Brown Ale might be 🍺🏽 <1F37A, 1F3FD>. U+1F3FF could denote imperial stout or Baltic porter. There might be a need to encode an additional "Type 0" color modifier to extend the "light" end of the scale, such as for non-alcoholic brews, or for Coors Light.

U+1F37B could be used to denote two beers of the same style, but for beers of different colors, the mechanism described in UTR #51, Section 2.2.1 ("Multi-Person Groupings"), involving ZWJ, could be utilized. So a toast between drinkers of the two beers above could be encoded as 🍺🏻‍🍺🏽 <1F37A, 1F3FB, 200D, 1F37A, 1F3FD>.
Longer sequences would also be possible, such as for beer samplers offered in some pubs and restaurants.

I have no idea whether my proposal is more or less serious, or more or less likely to be adopted, than the original.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From Shawn.Steele at microsoft.com Tue Sep 1 12:29:56 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 1 Sep 2015 17:29:56 +0000
Subject: Dark beer emoji
In-Reply-To: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID:

Ugh, should've encoded that Martian green skin-tone. Then we'd've been prepared for St. Patty's Day beers.

From asmus-inc at ix.netcom.com Tue Sep 1 12:36:11 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 1 Sep 2015 10:36:11 -0700
Subject: Dark beer emoji
In-Reply-To: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID: <55E5E20B.2090908@ix.netcom.com>

An HTML attachment was scrubbed...

From everson at evertype.com Tue Sep 1 12:40:13 2015
From: everson at evertype.com (Michael Everson)
Date: Tue, 1 Sep 2015 18:40:13 +0100
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID:

On 1 Sep 2015, at 18:29, Shawn Steele wrote:
>
> Ugh, should've encoded that Martian green skin-tone. Then we'd've been prepared for St. Patty's Day beers.

Recte: St.
Paddy’s Day

Michael Everson * http://www.evertype.com/

From Shawn.Steele at microsoft.com Tue Sep 1 12:50:56 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 1 Sep 2015 17:50:56 +0000
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID:

Thanks

From doug at ewellic.org Tue Sep 1 13:13:13 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Sep 2015 11:13:13 -0700
Subject: Dark beer emoji
Message-ID: <20150901111313.665a7a7059d7ee80bb4d670165c8327d.701569f03c.wbe@email03.secureserver.net>

Asmus Freytag (t) wrote:

> Well, you didn't consider that each style of beer may be served in a
> different style glass. :)

Yay, emoji modifier chaining:

U+1F37A BEER MUG
U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
U+1Fxxx EMOJI MODIFIER WEIZEN GLASS

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From public at khwilliamson.com Tue Sep 1 13:37:26 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 1 Sep 2015 12:37:26 -0600
Subject: Dark beer emoji
In-Reply-To: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID: <55E5F066.3020502@khwilliamson.com>

On 09/01/2015 10:37 AM, Doug Ewell wrote:
> I have no idea whether my proposal is more or less serious, or more or
> less likely to be adopted, than the original.
When I read this, I wondered if it was April 1 instead of September 1.

From Shawn.Steele at microsoft.com Tue Sep 1 13:44:00 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 1 Sep 2015 18:44:00 +0000
Subject: Dark beer emoji
In-Reply-To: <55E5F066.3020502@khwilliamson.com>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <55E5F066.3020502@khwilliamson.com>
Message-ID:

It's my birthday, so I knew it wasn't April. :) It'd be a fun font easter egg though...

From doug at ewellic.org Tue Sep 1 13:53:09 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 01 Sep 2015 11:53:09 -0700
Subject: Dark beer emoji
Message-ID: <20150901115309.665a7a7059d7ee80bb4d670165c8327d.5a8578f302.wbe@email03.secureserver.net>

Karl Williamson wrote:

> When I read this, I wondered if it was April 1 instead of September 1.

The opportunity wouldn't have lasted another seven months.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
From richard.wordingham at ntlworld.com Tue Sep 1 15:42:45 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 1 Sep 2015 21:42:45 +0100
Subject: Dark beer emoji
In-Reply-To: <20150901111313.665a7a7059d7ee80bb4d670165c8327d.701569f03c.wbe@email03.secureserver.net>
References: <20150901111313.665a7a7059d7ee80bb4d670165c8327d.701569f03c.wbe@email03.secureserver.net>
Message-ID: <20150901214245.52957528@JRWUBU2>

On Tue, 01 Sep 2015 11:13:13 -0700 "Doug Ewell" wrote:

> Asmus Freytag (t) wrote:
>
> > Well, you didn't consider that each style of beer may be served in a
> > different style glass. :)
>
> Yay, emoji modifier chaining:
>
> U+1F37A BEER MUG
> U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
> U+1Fxxx EMOJI MODIFIER WEIZEN GLASS

How is that to be equated to ? Or is some rendering difference to be expected?

Richard.

From Shawn.Steele at microsoft.com Tue Sep 1 15:56:06 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Tue, 1 Sep 2015 20:56:06 +0000
Subject: Dark beer emoji
In-Reply-To: <20150901214245.52957528@JRWUBU2>
References: <20150901111313.665a7a7059d7ee80bb4d670165c8327d.701569f03c.wbe@email03.secureserver.net> <20150901214245.52957528@JRWUBU2>
Message-ID:

In one version the beer is inside the glass, in the other, the beer is outside the glass.
From steve at swales.us Tue Sep 1 16:01:21 2015
From: steve at swales.us (Steve Swales)
Date: Tue, 1 Sep 2015 14:01:21 -0700
Subject: Dark beer emoji
In-Reply-To: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID: <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>

Personally, I love this idea, and would like to claim first authorship. Here’s a snippet from the email I sent to my old colleagues at Apple back on April 15th (not the 1st):

> Hi, Apple iOS/Keyboard/Design//I18n folks,
>
> Just wanted to say, nice work on the new Emoji keyboard design and expanded repertoire. I desperately wish the skin tone modifiers would work on the beer emoji, however. Need my porter and stout. Maybe next update? For old times' sake?

-steve

From verdy_p at wanadoo.fr Wed Sep 2 03:12:24 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 2 Sep 2015 10:12:24 +0200
Subject: Dark beer emoji
In-Reply-To: <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
Message-ID:

Now it's time to create variations for coffee cups, ice creams, more cakes, various forms of burgers, roasted meats, sausages, chickens/turkeys, and eggs, breads... We've put our finger into an infinitely deep hole of images. The initial emojis were used to express essential feelings in interpersonal communication; now we see attempts to use them to sell various branded products which are not even intercultural. Do we need them in plain text, when their representation will be too poor to show their uniqueness? Images transported separately are better.
Otherwise we'll use text to give real names, brands and product descriptions and characteristics. I don't like the proliferation of emojis for products that will fall out of use or that are too heavily protected and not for general sale. I don't like exclusive claims of authorship that come with those proposals.

From charupdate at orange.fr Wed Sep 2 12:20:14 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 2 Sep 2015 19:20:14 +0200 (CEST)
Subject: Dark beer emoji
Message-ID: <916520143.14433.1441214414817.JavaMail.www@wwinf1h08>

On 01 Sep 2015 at 19:40, Shawn Steele wrote:

> Ugh, should've encoded that Martian green skin-tone. Then we'd've been prepared for St. Patty's Day beers.

On 19 Aug 2015 at 22:18, Mark E. Shoulson wrote:

> And is there an emoji for GRAIN OF SALT? (Actually, that could almost
> be useful... or even just a geometric CUBE...)
I see that you mock rather often and not only in one circumstance. Had I been aware, I wouldn’t have got started the way I did. Sorry. I hope that these apologies of yours have reached Mrs Haneys’ mailbox. :)

Cheers,
Marcel

From charupdate at orange.fr Wed Sep 2 12:30:34 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 2 Sep 2015 19:30:34 +0200 (CEST)
Subject: Effectiveness of locale support (was: Re: Custom source samples)
Message-ID: <750143234.14655.1441215034410.JavaMail.www@wwinf1h08>

I don’t want to pull interminable threads, and I even thought of leaving the List, thinking I had nothing else to contribute. But finally I’m pleased to stay tuned and would like to draw your attention to a topic I brought in when I committed myself to dig up some full answer to why people are prevented from taking full control over their keyboard layout. And as it’s about locale support, this old-new issue even meets the core of Unicode, and I’m hopeful that it would make a good thread.

I’ve formally promised to definitively stop criticizing other people’s work on the Unicode Mailing List. So I’ve worked hard to turn this into a constructive comment.

As we know, and as the two cited blog posts (which I won’t cite again...) have reminded us, French-speaking users in Québec are not fully granted the means of writing their language, as the keyboard layout preferred by the OEMs and their OS supplier (and allegedly by the local population, but that’s untrue; they just aren’t given the choice) does not allow one to write French. The most conspicuous defect is that the French letter Œœ is missing. These two blog posts are seemingly just the tip of the iceberg of that criticism of other people’s work that must be current practice among Apple’s competitors when the matter is what keyboard to offer in Québec.
The funny side is that they do worse, not better (while they should), thus missing precisely what is commonly supposed to be the condition of any criticism. So *if* they want to insist on selling that keyboard they’re selling, then they *at least* have to add Œœ on AltGr+Oo, and Ææ on AltGr+Aa. [They must have been told this quite a number of times. Voilà once more, in case they’re monitoring this Mailing List.]

About the alternative so-called French traditional layout that ships with Windows for use in Canada, it must be said that to make it at least Latin-1, one should re-add the superscript two that seems to have been replaced with the at sign (while superscript one and three are there), and the masculine ordinal indicator that seems to have been replaced with the micro sign (while the feminine ordinal indicator is there). And to make it Latin-9 and definitely Unicode, one should add the Œœ ligature, e.g. on the ?? key, which is empty on AltGr at this time.

I wonder whether they noticed the criticism of locale keyboard support flowing in at Microsoft that is mirrored in this blog post:
http://www.siao2.com/2005/01/01/345222.aspx

IMHO one cannot do such a bad job AND bully the Canadian Multilingual Standard keyboard at the same time; I’m sure everybody agrees. (Please see my next e-mail. To avoid sending one overly long e-mail, I’ve split the material in two.)

Nevertheless, any such utterances are very useful to decipher in order to learn about the inner thoughts that ultimately determine what companies do or don’t do, regardless of the companies’ size. It’s like French etnographer Germaine Tillion said in an interview: One must *understand* what oppresses you. And she related this to her personal interpretation of the verb “to exist” (based on its Latin etymology). This reminds me that French people in Canada are a minority.
Actually, Québec is likely to be overrun by the steamroller of uniformization and big business that is eager to shape the market to make it fit into its business strategy and its stock flow management, by removing key #102, the Applications key, and actually the Right Control key. Too long a space bar, poor ergonomics (with Right Alt too far to the right). And by dropping support for the Canadian Multilingual keyboard.

Would Microsoft, Hewlett-Packard, and the other manufacturers please grant Québec full support, and help it to fully exist?

Thanks,
Marcel

From charupdate at orange.fr Wed Sep 2 12:45:47 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 2 Sep 2015 19:45:47 +0200 (CEST)
Subject: Effectiveness of locale support
Message-ID: <128861395.14966.1441215947475.JavaMail.www@wwinf1h08>

In my previous e-mail I’ve... a typo. Please read *ethnographer* with two Greek h's, not one.

And I've started proving that the Canadian Multilingual Standard keyboard is far better than its competitors. No wonder: The only reason it was created was to make something better suited to writing French (and many other languages) in Canada, and particularly in Québec. And the reason it was standardized was that the users who were asked for their opinion clearly preferred the new keyboard over the existing ones (even if at the beginning it wasn’t multilingual, and was restricted by the gaps in Latin-1).

One could even make it a rule: Usually one is likely to consider that a national government and standards body are in a far better place to learn about, and cater for, user preferences than anybody else.

Denying people the means to write their language correctly and to use their preferred keyboard is illegal discrimination.
If there isn’t any legal provision prohibiting this discrimination in Québec, that’s probably because Canada is not a part of the European Union, I infer from what Richard wrote on 28 Aug 2015 at 00:09:

> I may have scared them into
> silence by noting that people changing code because of one particular
> *new* sentence in Section 23.2, namely:

> > P2S4: Note in particular that the word joiner is ignored for word
> > segmentation.

> are at risk (but see below) of putting themselves in breach of the UK's
> 'Equality Act 2010'; more generally, they may be in breach of
> transpositions of the EU Racial Equality Directive (2000/43/EC). You
> don't need to have racialist intentions to be in breach.

http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0222.html

To better see what is going on, I invite you to take a glance at some details:

The keyboard symbols that puzzle strangers to the point that they may ask “where is the right Control key”, are on the keycaps because they are in ISO 9995-7 and allow for a-linguality instead of bilingual overload. However, as they remain less than self-evident, keycaps are about to be set back to text.

I wrote that the Canadian Standard keyboard is a genuine ISO 9995 implementation. That’s true for the original standard. Unfortunately, this was altered when it was implemented on Windows. This issue is about the group selector, which should be Shift+AltGr, not right Ctrl, and should be remanent. The ISO standard only allows for THREE levels per group; hence, again, no Shift+AltGr level. That fully ISO 9995 conformant keyboards are restricted to three levels per group is an accessibility issue: No user must be forced to type his language with more than *two* fingers. (Many people, including me, got started when learning about this fact, as this is also a counter-productive limitation, but I’m not discussing an ISO standard here.)
Furthermore, the ISO keyboard standard 9995 always considered that all characters for national use must fit into Group 1, while Group 2 (and above) is to contain supplemental characters for all other supported languages. Like it or not, this principle is deeply embedded in ISO 9995. Now we may ask “Why isn’t the Œœ in there?” Because Œœ had been excluded from ISO 8859-1 on the word of French representatives (who didn’t really represent France but only one manufacturer, as for the most vocal of the two), and regardless of the Canadian representative asking for its inclusion because Œœ is *necessary* in French.

Standards need to be read carefully prior to making statements on what is missing in a given implementation. And sometimes you even have to investigate. [To tell it in people’s words following the above blog post: CSA didn’t do like MS is supposed to have done...]

That’s how Ian James altered the keyboard’s ergonomics, given that many dead keys, all to the right, are now to be pressed along with Right Control! It’s as if he were too tired to add some code conferring the specified behavior to the Right Alt key. I dimly suggest that there could be a relation to what is discussed in another blog post; I would say that for being disliked, the standard has been implemented carelessly:
http://www.siao2.com/2008/10/23/9013000.aspx

Never let other people make an OS implementation of your standard! That’s why we need to access the C sources of Windows keyboard drivers. And that’s why we need to get our drivers compiled from C sources (as opposed to KLC sources). In the scope of Unicode implementation, feeding KLC files into KbdUTool is not too bad as a method, as this allows for chained dead keys, and for ligatures under a five-unit length ceiling even when missing defines are added in kbd.h. But this way, locale support is suboptimal, because Windows’ potential is not fully available.
Best regards,
Marcel

From mike.mcglothlin at gmail.com Wed Sep 2 12:59:18 2015
From: mike.mcglothlin at gmail.com (Michael McGlothlin)
Date: Wed, 2 Sep 2015 12:59:18 -0500
Subject: Dark beer emoji
In-Reply-To: <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
Message-ID:

It should be applied to all emoji. Could be fun with the poo one.

Thanks,
Michael McGlothlin
Sent from my iPhone

From charupdate at orange.fr Wed Sep 2 14:13:20 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 2 Sep 2015 21:13:20 +0200 (CEST)
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
Message-ID: <666369174.13853.1441221200621.JavaMail.www@wwinf2233>

On 02 Sep 2015 at 20:07, Michael McGlothlin wrote:

> It should be applied to all emoji. Could be fun with the poo one.
>
> >> On Sep 1, 2015, at 9:37 AM, Doug Ewell wrote:
> >>
> >> As an alternative to this proposal that may provide more flexibility, I
> >> propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
> >> U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
> >> CLINKING BEER MUGS.

With U+1F35E BREAD it will be particularly useful to denote completeness. Whole bread vs white bread -- and all crumb tone levels between.

Note: This isn't a mockery. I've thought at this when Asmus mentioned emoji for milk and bread:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m08/0017.html

Thank you Doug and all who responded.

Marcel

From gwalla at gmail.com Wed Sep 2 15:56:42 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 2 Sep 2015 13:56:42 -0700
Subject: Dark beer emoji
In-Reply-To: <666369174.13853.1441221200621.JavaMail.www@wwinf2233>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233>
Message-ID:

On Wed, Sep 2, 2015 at 12:13 PM, Marcel Schneider wrote:
> On 02 Sep 2015 at 20:07, Michael McGlothlin wrote:
>
>> It should be applied to all emoji. Could be fun with the poo one.
>>
>> >> On Sep 1, 2015, at 9:37 AM, Doug Ewell wrote:
>> >>
>> >> As an alternative to this proposal that may provide more flexibility, I
>> >> propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
>> >> U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
>> >> CLINKING BEER MUGS.
>
> With U+1F35E BREAD it will be particularly useful to denote completeness.
> Whole bread vs white bread -- and all crumb tone levels between.

TYPE 1-2: White bread
TYPE 3: Potato bread
TYPE 4: Whole wheat
TYPE 5: Multigrain
TYPE 6: Pumpernickel

But why stop there?
They could also be applied to U+1F382 BIRTHDAY CAKE:

TYPE 1-2: Angel food
TYPE 3: Carrot cake
TYPE 4: German's chocolate
TYPE 5: Red velvet
TYPE 6: Devil's food

> Note: This isn't a mockery. I thought of this when Asmus mentioned emoji
> for milk and bread:

How would the skin tone modifiers affect milk, I wonder? There's
chocolate milk, sure, but shades? Would one of them be strawberry milk?

From gwalla at gmail.com Wed Sep 2 16:00:25 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 2 Sep 2015 14:00:25 -0700
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us>
Message-ID:

On Wed, Sep 2, 2015 at 10:59 AM, Michael McGlothlin wrote:
> It should be applied to all emoji. Could be fun with the poo one.

Who was it who proposed a set of Bristol stool scale modifiers for U+1F4A9?

From doug at ewellic.org Wed Sep 2 16:26:07 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 02 Sep 2015 14:26:07 -0700
Subject: Dark beer emoji
Message-ID: <20150902142607.665a7a7059d7ee80bb4d670165c8327d.3f31c9e5b7.wbe@email03.secureserver.net>

Garth Wallace wrote:

> TYPE 1-2: White bread
> TYPE 3: Potato bread
> TYPE 4: Whole wheat
> TYPE 5: Multigrain
> TYPE 6: Pumpernickel

While trying to construct a rejoinder involving soft drinks (variously
"soda" or "pop"), I discovered that Unicode has no such emoji.

This is an outrage, of course. I can't believe Unicode even calls itself
a coded character set without an emoji for soft drinks.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From andrewcwest at gmail.com Wed Sep 2 17:37:31 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 2 Sep 2015 23:37:31 +0100
Subject: Dark beer emoji
In-Reply-To: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID:

On 1 September 2015 at 17:37, Doug Ewell wrote:
>
> As an alternative to this proposal that may provide more flexibility, I
> propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
> U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
> CLINKING BEER MUGS.
>
> This could be done by establishing a normative correlation between the
> Fitzpatrick scale and the Standard Reference Method (SRM), Lovibond,
> and/or European Brewery Convention (EBC) beer color scales.
>
> This mechanism would allow the entire spectrum of beer styles to be
> depicted, instead of dividing beers arbitrarily into "light" and "dark,"
> in the same way (and for the same reason) that Unicode already supports
> a variety of skin tones.
>
> For example, a Budweiser or similar lager could be represented as
> 🍺🏻 <1F37A, 1F3FB>, while a Newcastle Brown Ale might be 🍺🏽
> <1F37A, 1F3FD>. U+1F3FF could denote imperial stout or Baltic porter.
> There might be a need to encode an additional "Type 0" color modifier to
> extend the "light" end of the scale, such as for non-alcoholic brews, or
> for Coors Light.

Yet more blatant anti-ginger discrimination. Yet another reason to
encode a ginger emoji modifier at the earliest opportunity (see
https://www.change.org/p/apple-redheads-should-have-emoji-too), which
could then be applied to U+1F37A BEER MUG in order to depict ginger beer.

Andrew

From doug at ewellic.org Wed Sep 2 17:45:07 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 02 Sep 2015 15:45:07 -0700
Subject: Dark beer emoji
Message-ID: <20150902154507.665a7a7059d7ee80bb4d670165c8327d.08be4b2b0f.wbe@email03.secureserver.net>

Andrew West wrote:

> Yet more blatant anti-ginger discrimination. Yet another reason to
> encode a ginger emoji modifier at the earliest opportunity (see
> https://www.change.org/p/apple-redheads-should-have-emoji-too), which
> could then be applied to U+1F37A BEER MUG in order to depict ginger
> beer.

Quote from the change.org page:

"We can hardly believe that our petition to get ginger emoji has
received over 10,000 signatures! There's still work to be done but one
step closer to getting a redhead emoji!"

That's the perception.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From olopierpa at gmail.com Wed Sep 2 18:09:24 2015
From: olopierpa at gmail.com (Pierpaolo Bernardi)
Date: Thu, 3 Sep 2015 01:09:24 +0200
Subject: Dark beer emoji
In-Reply-To: <20150902154507.665a7a7059d7ee80bb4d670165c8327d.08be4b2b0f.wbe@email03.secureserver.net>
References: <20150902154507.665a7a7059d7ee80bb4d670165c8327d.08be4b2b0f.wbe@email03.secureserver.net>
Message-ID:

A warm beer expresses a very different concept from a cold beer. I
propose a range of temperature modifiers.

On Thu, Sep 3, 2015 at 12:45 AM, Doug Ewell wrote:
> Andrew West wrote:
>
>> Yet more blatant anti-ginger discrimination. Yet another reason to
>> encode a ginger emoji modifier at the earliest opportunity (see
>> https://www.change.org/p/apple-redheads-should-have-emoji-too), which
>> could then be applied to U+1F37A BEER MUG in order to depict ginger
>> beer.
>
> Quote from the change.org page:
>
> "We can hardly believe that our petition to get ginger emoji has
> received over 10,000 signatures! There's still work to be done but one
> step closer to getting a redhead emoji!"
>
> That's the perception.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>

From A.Schappo at lboro.ac.uk Thu Sep 3 03:22:33 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Thu, 3 Sep 2015 08:22:33 +0000
Subject: Decomposition/Compatibility Mapping Issue
Message-ID:

So ... I was looking at
http://unicode.org/cldr/utility/regex.jsp?a=%5Cp%7Bscript%3DHan%7D&b=?
and getting a cool looking Modified Regex Pattern. The last range in it,
U+2F800-U+2FA1D, is the CJK Compatibility Ideographs Supplement.

[Modified Regex Pattern: a character class of Han ranges; the characters
themselves were lost in archiving]

So ... then ... I decided to copy/paste the above Modified Regex Pattern
into Richard Ishida's Uniview http://r12a.github.io/uniview/

So ... I then noticed that the character U+2F800 was listed as 丽 U+4E3D
[CJK Unified Ideographs]

Thus the decomposition/compatibility mapping U+4E3D was being substituted
for the original U+2F800.

I was using Safari on OS X Yosemite. I repeated the above with Chrome and
Firefox and there was no problem; no substitution occurred. Thus it
appears to be a copy/paste problem with Safari or with code used by
Safari.

I could so easily have missed this problem. I wonder if there are similar
decomposition/compatibility mapping issues.

André Schappo

From charupdate at orange.fr Thu Sep 3 04:09:23 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Sep 2015 11:09:23 +0200 (CEST)
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233>
Message-ID: <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>

On Wed, 2 Sep 2015 14:00:25 -0700, Garth Wallace wrote:

> > With U+1F35E BREAD it will be particularly useful to denote completeness.
> > Whole bread vs white bread -- and all crumb tone levels between.
>
> TYPE 1-2: White bread
> TYPE 3: Potato bread
> TYPE 4: Whole wheat
> TYPE 5: Multigrain
> TYPE 6: Pumpernickel
>
> But why stop there? They could also be applied to U+1F382 BIRTHDAY CAKE:
>
> TYPE 1-2: Angel food
> TYPE 3: Carrot cake
> TYPE 4: German's chocolate
> TYPE 5: Red velvet
> TYPE 6: Devil's food
>
> > Note: This isn't a mockery. I thought of this when Asmus mentioned emoji
> > for milk and bread:
>
> How would the skin tone modifiers affect milk, I wonder? There's
> chocolate milk, sure, but shades?

While TYPE 3 or TYPE 4, when applied to a future MILK emoji, could
primarily denote "coffee with milk", I'd prefer that it denote WHOLE
SUGAR MILK. There are two reasons for that. First, nutritionists
unanimously warn us that a mix of coffee and milk is harmful. Second,
the mineral content of whole sugar makes it a balanced food, in contrast
with depleted sugar (whether refined or not, with or without caramel).

> Would one of them be strawberry milk?

I would like that.

TYPE 5: Strawberry milk

Asmus already told us that there'll be no soy beans emoji, for lack of
iconicity. However, could there be a generic BEANS emoji along with the
upcoming MILK emoji? A two-emoji sequence MILK, BEANS or BEANS, MILK
would then denote tonyu.

Marcel

From asmus-inc at ix.netcom.com Thu Sep 3 04:45:26 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 3 Sep 2015 02:45:26 -0700
Subject: Dark beer emoji
In-Reply-To: <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233> <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>
Message-ID: <55E816B6.8000308@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Thu Sep 3 04:48:45 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Sep 2015 11:48:45 +0200 (CEST)
Subject: Dark beer emoji
In-Reply-To: <20150902142607.665a7a7059d7ee80bb4d670165c8327d.3f31c9e5b7.wbe@email03.secureserver.net>
References: <20150902142607.665a7a7059d7ee80bb4d670165c8327d.3f31c9e5b7.wbe@email03.secureserver.net>
Message-ID: <2086574028.5787.1441273725652.JavaMail.www@wwinf1h23>

On Wed, 02 Sep 2015 14:26:07 -0700, Doug Ewell wrote:

> Garth Wallace wrote:
>
> > TYPE 1-2: White bread
> > TYPE 3: Potato bread
> > TYPE 4: Whole wheat
> > TYPE 5: Multigrain
> > TYPE 6: Pumpernickel
>
> While trying to construct a rejoinder involving soft drinks (variously
> "soda" or "pop"), I discovered that Unicode has no such emoji.
>
> This is an outrage, of course. I can't believe Unicode even calls itself
> a coded character set without an emoji for soft drinks.

Sorry to contradict. So far, not encoding soft drink emoji is IMHO a
wise decision, in accordance with governmental and public health
authorities' action against soft drink consumption (along with that
against alcoholic beverage consumption). The issue with soft drinks is
that they're made of water and depleted sugar instead of water and
complete sugar (see my previous e-mail), while in many cases the sugar's
colour tone has strictly no impact on the beverage's final colour. By
the way, the whole sugar's taste may pleasingly complete the overall
aroma.

So not having encoded any soft drink emoji before a number of other food
and beverage emoji are encoded does not detract from Unicode being a
coded character set.

On the other hand, I believe that encoding a glass that is not meant to
contain alcoholic beverages, say, a soft drink glass, which is almost
the same as a milk glass, could be a very useful proposal. We would then
have a new GLASS OF MILK emoji which, being polysemic, could denote
lemon soda when yellow, and so on across the colour spectrum.
And of course, the same will be used for FRUIT JUICE! Most probably when
preceded by a fruit emoji like DURIAN, to depict a DURIAN SMOOTHIE.

Marcel

From charupdate at orange.fr Thu Sep 3 05:01:25 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Sep 2015 12:01:25 +0200 (CEST)
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net>
Message-ID: <1084893606.6075.1441274485337.JavaMail.www@wwinf1h23>

On Wed, 2 Sep 2015 23:37:31 +0100, Andrew West wrote:

> On 1 September 2015 at 17:37, Doug Ewell wrote:
> >
> > As an alternative to this proposal that may provide more flexibility, I
> > propose adapting the Fitzpatrick skin-tone modifiers from U+1F3FB to
> > U+1F3FF to be valid for use following U+1F37A BEER MUG or U+1F37B
> > CLINKING BEER MUGS.
> >
> > This could be done by establishing a normative correlation between the
> > Fitzpatrick scale and the Standard Reference Method (SRM), Lovibond,
> > and/or European Brewery Convention (EBC) beer color scales.
> >
> > This mechanism would allow the entire spectrum of beer styles to be
> > depicted, instead of dividing beers arbitrarily into "light" and "dark,"
> > in the same way (and for the same reason) that Unicode already supports
> > a variety of skin tones.
> >
> > For example, a Budweiser or similar lager could be represented as
> > 🍺🏻 <1F37A, 1F3FB>, while a Newcastle Brown Ale might be 🍺🏽
> > <1F37A, 1F3FD>. U+1F3FF could denote imperial stout or Baltic porter.
> > There might be a need to encode an additional "Type 0" color modifier to
> > extend the "light" end of the scale, such as for non-alcoholic brews, or
> > for Coors Light.
>
> Yet more blatant anti-ginger discrimination.
Yet another reason to
> encode a ginger emoji modifier at the earliest opportunity (see
> https://www.change.org/p/apple-redheads-should-have-emoji-too), which
> could then be applied to U+1F37A BEER MUG in order to depict ginger
> beer.

Given all the colours to be encoded as a modifier, could there be a way
to encode a MULTICOLOUR EMOJI MODIFIER? It could consist of two hex
digits for the code range, and three hex digits for the colour (based on
the HTML three-digit hex codes for colours). This would solve at once
all colour requirements in a reasonable resolution (as I believe that
defining emoji tones with six digits is uselessly precise).

Marcel

From charupdate at orange.fr Thu Sep 3 06:20:15 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Sep 2015 13:20:15 +0200 (CEST)
Subject: Dark beer emoji
In-Reply-To: <55E816B6.8000308@ix.netcom.com>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233> <655390662.4885.1441271363853.JavaMail.www@wwinf1h23> <55E816B6.8000308@ix.netcom.com>
Message-ID: <304482719.8028.1441279215723.JavaMail.www@wwinf1j19>

On Thu, 3 Sep 2015 02:45:26 -0700, Asmus Freytag (t) wrote:

> On 9/3/2015 2:09 AM, Marcel Schneider wrote:
> > Asmus already told us that there'll be no soy beans emoji, by lack of
> > iconicity. However, could there be a generic BEANS emoji
>
> A coffee bean has a very recognizable shape, and I personally would
> consider it of sufficient "iconicity" to be considered, although, not
> uncommonly the representations seem to involve more than a single
> coffee bean...

IMHO *three* beans per emoji may be a suitable compromise between
iconicity and the idea of a plural.

So will we have a COFFEE BEANS emoji along with a generic BEANS emoji?
That is, supposing that a coffee bean has so peculiar a shape that it
cannot represent any other species. Both would be very useful, each in
its own domain.

Thanks,

Marcel

From wjgo_10009 at btinternet.com Thu Sep 3 07:00:56 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Thu, 3 Sep 2015 13:00:56 +0100 (BST)
Subject: Dark beer emoji
In-Reply-To: <1084893606.6075.1441274485337.JavaMail.www@wwinf1h23>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <1084893606.6075.1441274485337.JavaMail.www@wwinf1h23>
Message-ID: <27446542.29776.1441281656629.JavaMail.defaultUser@defaultHost>

Marcel Schneider wrote as follows:

> Given all the colours to be encoded as a modifier, could there be a way
> to encode a MULTICOLOUR EMOJI MODIFIER? It could consist of two hex
> digits for the code range, and three hex digits for the colour (based
> on the HTML three-digit hex codes for colours). This would solve at
> once all colour requirements in a reasonable resolution (as I believe
> that defining emoji tones with six digits is uselessly precise).

May I refer you to the following thread please?

Tag characters and in-line graphics (from Tag characters)

The first post in the thread is as follows.

http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html

There is within that post a suggested format for specifying a custom
colour for a local palette using tag characters. That is for use within
an in-line graphic, yet it could be adapted to applying one or more
colours to the glyph of an individual character.

With the in-line graphic there would be a particular base character used
for the in-line graphic. For colourizing an individual character, the
character itself would be the base character.

I would prefer base 10 numbers to specify colour components, as used in
many graphics programs.

192R224G64B2s means store as local palette colour 2 the colour (R=192,
G=224, B=64)

For each glyph with more than one colour used within the glyph, The
Unicode Standard would need to state the palette colour number for each
part of the glyph.

William Overington

3 September 2015

From ken.shirriff at gmail.com Thu Sep 3 09:27:41 2015
From: ken.shirriff at gmail.com (Ken Shirriff)
Date: Thu, 3 Sep 2015 07:27:41 -0700
Subject: Upcoming proposal for Bitcoin sign
Message-ID:

I'm putting together a proposal for the Bitcoin sign to be added to
Unicode, so I wanted to check here if people have any
comments/concerns/objections.

I'm aware of the previous rejected proposal L2/11-130 and I address the
issues from its rejection. In particular, my proposal includes many
examples of the symbol in running text. I also checked with bitcoin.org
that they have no trademark on the logo.

Please let me know of any other potential issues.

Ken

From daniel.buenzli at erratique.ch Thu Sep 3 10:06:01 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 3 Sep 2015 16:06:01 +0100
Subject: Technical or encoding sub mailing list ?
Message-ID: <8B7636C94F06431C85F9DF6C8C67C020@erratique.ch>

Hello,

Since I implement parts of the Unicode standard I'm interested in
keeping in touch with discussions about the standard and its evolution
from a technical point of view.

I'm however not interested in the encoding point of view and all the
discussions of whichever pet symbol or concept random people from the
internet want to assign an integer to.

With respect to these interests the amount of noise and off-topic
threads I get from this list is considerable and I'm considering
unsubscribing.

Before I do so I would like to ask the moderators of this mailing list
if they would consider creating either a more technically focused
mailing list for implementers or, alternatively, forking off encoding
discussions to a dedicated mailing list.

Thanks,

Daniel

From Shawn.Steele at microsoft.com Thu Sep 3 11:15:19 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 3 Sep 2015 16:15:19 +0000
Subject: Dark beer emoji
In-Reply-To: <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233> <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>
Message-ID:

If we have a bunch of ingredients emoji, then do yeast + grain + hops
emoji combine into beer emoji?

From rick at unicode.org Thu Sep 3 11:32:42 2015
From: rick at unicode.org (Rick McGowan)
Date: Thu, 03 Sep 2015 09:32:42 -0700
Subject: The proposed update LDML specification for CLDR Release 28 now available for review
Message-ID: <55E8762A.5010101@unicode.org>

A proposed update to the LDML specification (UTS #35) will be available
for review as of Monday, September 7 at 06:00 GMT. The open review
period closes on Monday, September 14 at 06:00 GMT. (This is a short
review period, because CLDR 28 is scheduled for release in the week of
September 16.)

The proposed update will be at
http://unicode.org/reports/tr35/proposed.html

To report bugs in the specification, please use
http://unicode.org/cldr/trac/newticket

From doug at ewellic.org Thu Sep 3 11:41:39 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 03 Sep 2015 09:41:39 -0700
Subject: Technical or encoding sub mailing list ?
Message-ID: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>

Daniel Bünzli wrote:

> Since I implement parts of the Unicode standard I'm interested in
> keeping in touch with discussions about the standard and its evolution
> from a technical point of view.
>
> I'm however not interested in the encoding point of view and all the
> discussions of whichever pet symbol or concept random people from the
> internet want to assign an integer to.
>
> With respect to these interests the amount of noise and off-topic
> threads I get from this list is considerable and I'm considering
> unsubscribing.
>
> Before I do so I would like to ask the moderators of this mailing list
> if they would consider creating either a more technically focused
> mailing list for implementers or, alternatively, forking off encoding
> discussions to a dedicated mailing list.

Well, that's not elitist or anything.

Many of us are also implementers of the Unicode Standard, have been on
the Unicode list for a long time (17 years in my case), and hardly think
of ourselves as "random people from the internet."

--
Doug Ewell | http://ewellic.org | Thornton, CO

From daniel.buenzli at erratique.ch Thu Sep 3 11:59:59 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 3 Sep 2015 17:59:59 +0100
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
References: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
Message-ID:

Le jeudi, 3 septembre 2015 à
17:41, Doug Ewell a écrit :
> Well, that's not elitist or anything.
>
> Many of us are also implementers of the Unicode Standard, have been on
> the Unicode list for a long time (17 years in my case), and hardly think
> of ourselves as "random people from the internet."

If that can reassure you I do consider myself a random person from the
internet on this list. It just turns out that random persons from the
internet do have different interests, hence my request.

Daniel

From doug at ewellic.org Thu Sep 3 12:33:32 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 03 Sep 2015 10:33:32 -0700
Subject: Technical or encoding sub mailing list ?
Message-ID: <20150903103332.665a7a7059d7ee80bb4d670165c8327d.78816c6054.wbe@email03.secureserver.net>

For 75 USD per year, or about 73 CHF, you can join the Unicode
Consortium as an individual member, and thereby have full access to the
Unicore list and other internal technical discussion lists.

There are discounts on membership rates if you pay for 3 or more years
at a time.

http://www.unicode.org/consortium/levels.html

--
Doug Ewell | http://ewellic.org | Thornton, CO

-------- Original Message --------
Subject: Re: Technical or encoding sub mailing list ?
From: Daniel Bünzli
Date: Thu, September 03, 2015 10:59 am
To: Doug Ewell
Cc: Unicode Mailing List

Le jeudi, 3 septembre 2015 à 17:41, Doug Ewell a écrit :
> Well, that's not elitist or anything.
>
> Many of us are also implementers of the Unicode Standard, have been on
> the Unicode list for a long time (17 years in my case), and hardly think
> of ourselves as "random people from the internet."

If that can reassure you I do consider myself a random person from the
internet on this list. It just turns out that random persons from the
internet do have different interests, hence my request.

Daniel

From unicode at maxtruxa.com Thu Sep 3 13:40:35 2015
From: unicode at maxtruxa.com (Max Truxa)
Date: Thu, 3 Sep 2015 20:40:35 +0200
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <55e89284.b2b2320a.183f4.fffffe12SMTPIN_ADDED_MISSING@mx.google.com>
References: <20150903103332.665a7a7059d7ee80bb4d670165c8327d.78816c6054.wbe@email03.secureserver.net> <55e89284.b2b2320a.183f4.fffffe12SMTPIN_ADDED_MISSING@mx.google.com>
Message-ID:

On Sep 3, 2015 5:11 PM, "Daniel Bünzli" wrote:
>
> Hello,
>
> Since I implement parts of the Unicode standard I'm interested in
> keeping in touch with discussions about the standard and its evolution
> from a technical point of view.
>
> I'm however not interested in the encoding point of view and all the
> discussions of whichever pet symbol or concept random people from the
> internet want to assign an integer to.
>
> With respect to these interests the amount of noise and off-topic
> threads I get from this list is considerable and I'm considering
> unsubscribing.
>
> Before I do so I would like to ask the moderators of this mailing list
> if they would consider creating either a more technically focused
> mailing list for implementers or, alternatively, forking off encoding
> discussions to a dedicated mailing list.
>
> Thanks,
>
> Daniel
>
>

I feel you. Even though I probably would have worded such a request a
little less offensively (or at least in a way people are less likely to
take offense at).

Personally I find many non-technical discussions very interesting to
read, but the effort required to parse all that information to find
something that is actually technically relevant can be quite huge at
times.

On Sep 3, 2015 7:39 PM, "Doug Ewell" wrote:
>
> For 75 USD per year, or about 73 CHF, you can join the Unicode
> Consortium as an individual member, and thereby have full access to the
> Unicore list and other internal technical discussion lists.
>
> There are discounts on membership rates if you pay for 3 or more years
> at a time.
>
> http://www.unicode.org/consortium/levels.html
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>

I didn't know it was that simple to get access to the Unicore list.
Thank you very much!

Best regards,
Max Truxa

From daniel.buenzli at erratique.ch Thu Sep 3 13:42:37 2015
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Thu, 3 Sep 2015 19:42:37 +0100
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <20150903103332.665a7a7059d7ee80bb4d670165c8327d.78816c6054.wbe@email03.secureserver.net>
References: <20150903103332.665a7a7059d7ee80bb4d670165c8327d.78816c6054.wbe@email03.secureserver.net>
Message-ID: <6F47D1886B614A0282BDA84FAF76445A@erratique.ch>

Le jeudi, 3 septembre 2015 à 18:33, Doug Ewell a écrit :
> For 75 USD per year, or about 73 CHF, you can join the Unicode
> Consortium as an individual member, and thereby have full access to the
> Unicore list and other internal technical discussion lists.

Well that sounds elitist...

Joking aside, I still think that most of the time a good distinction can
be made between the standard and the encoding process. The latter is a
much more political procedure, toward which, as an implementer, I prefer
to remain neutral (and which I am not interested in following).

Best,

Daniel

From doug at ewellic.org Thu Sep 3 13:53:12 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 03 Sep 2015 11:53:12 -0700
Subject: Technical or encoding sub mailing list ?
Message-ID: <20150903115312.665a7a7059d7ee80bb4d670165c8327d.4f20bad301.wbe@email03.secureserver.net>

Daniel Bünzli wrote:

>> For 75 USD per year, or about 73 CHF, you can join the Unicode
>> Consortium as an individual member, and thereby have full access to
>> the Unicore list and other internal technical discussion lists.
>
> Well that sounds elitist...

FWIW, I'm not a member due to the cost.

> Joke apart, I still think that most of the time a good distinction can
> be made between the standard and the encoding process. The latter
> being a much more political procedure to which as an implementer I
> prefer to remain neutral to (and am not interested in following).

Most Internet mailing lists contain threads that may not be of interest
to every subscriber. The Delete button is your friend.

Waiting to see if Sarasvati decides to weigh in on the proposal to split
the list.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From charupdate at orange.fr Thu Sep 3 14:30:34 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 3 Sep 2015 21:30:34 +0200 (CEST)
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
References: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
Message-ID: <1737048593.21455.1441308634210.JavaMail.www@wwinf1f21>

On Thu, 03 Sep 2015 09:41:39 -0700, Doug Ewell wrote:
>
> Daniel Bünzli wrote:
>
> > Since I implement parts of the Unicode standard I'm interested in
> > keeping in touch with discussions about the standard and its evolution
> > from a technical point of view.
> >
> > I'm however not interested in the encoding point of view and all the
> > discussions of whichever pet symbol or concept random people from the
> > internet want to assign an integer to.
> >
> > With respect to these interests the amount of noise and off-topic
> > threads I get from this list is considerable and I'm considering
> > unsubscribing.
> >
> > Before I do so I would like to ask the moderators of this mailing list
> > if they would consider creating either a more technically focused
> > mailing list for implementers or, alternatively, forking off encoding
> > discussions to a dedicated mailing list.
>
> Well, that's not elitist or anything.
>
> Many of us are also implementers of the Unicode Standard, have been on
> the Unicode list for a long time (17 years in my case), and hardly think
> of ourselves as "random people from the internet."

I believe that Daniel is rather targeting people like me, who am new on
the List and have (unfortunately) never been a Unicode staff member.
Nevertheless, I don't believe that anybody's subscription to this List
is "random".

To meet Daniel's request, the "technical" threads Daniel is likely to be
interested in might be given a "(TECHNICAL)" attribute in the Subject at
some point of the thread, so that it will be easy to filter them and
follow them back in the Archive.

I hope that helps...

Marcel
("implementing" Unicode on a keyboard layout hopefully designed for a
national standard)

From asmus-inc at ix.netcom.com Thu Sep 3 14:41:42 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 3 Sep 2015 12:41:42 -0700
Subject: Dark beer emoji
In-Reply-To:
References: <20150901093703.665a7a7059d7ee80bb4d670165c8327d.70846b3fe6.wbe@email03.secureserver.net> <61BE6BAF-A38E-4D20-BF27-57F81F1AB531@swales.us> <666369174.13853.1441221200621.JavaMail.www@wwinf2233> <655390662.4885.1441271363853.JavaMail.www@wwinf1h23>
Message-ID: <55E8A276.8020304@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From asmus-inc at ix.netcom.com Thu Sep 3 14:54:30 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 3 Sep 2015 12:54:30 -0700
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
References: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net>
Message-ID: <55E8A576.3000409@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Fri Sep 4 10:06:24 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 04 Sep 2015 08:06:24 -0700
Subject: Another attempt at plain language
Message-ID: <20150904080624.665a7a7059d7ee80bb4d670165c8327d.43d7553337.wbe@email03.secureserver.net>

Mark Davis wrote:

> However, if it ends up not being added as a BCP47 variant, one could
> file a ticket for consideration as a BCP47 locale variant. The syntax
> would be a bit different, eg en-u-va-plain vs en-plain.

To clarify, this would be an extension-U subtag in accordance with RFC
6067.

I'm confused how "plain English" (or German or what have you) represents
any sort of aspect of a locale. Is "special variant" that open-ended?

--
Doug Ewell | http://ewellic.org | Thornton, CO

From doug at ewellic.org Fri Sep 4 10:10:52 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 04 Sep 2015 08:10:52 -0700
Subject: Please disregard my post to the wrong list
Message-ID: <20150904081052.665a7a7059d7ee80bb4d670165c8327d.7b7ff9941c.wbe@email03.secureserver.net>

From chris.fynn at gmail.com Fri Sep 4 12:22:33 2015
From: chris.fynn at gmail.com (Christopher Fynn)
Date: Fri, 4 Sep 2015 22:52:33 +0530
Subject: "Unicode of Death"
In-Reply-To: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID:

On 28 May 2015 at 20:23, Doug Ewell wrote:
....

> "Every character you use has a unicode value which tells your phone what
> to display. One of the unicode values is actually never-ending and so
> when the phone tries to read it it goes into an infinite loop which
> crashes it."
>
> I've read TUS Chapter 4 and UTR #23 and I still can't find the
> "never-ending" Unicode property.
>
> Perhaps astonishingly to some, the string displays fine on all my
> Windows devices. Not all apps get the directionality right, but no
> crashes.
> Well isn't Apple's street address "Infinite Loop"? -------------- next part -------------- An HTML attachment was scrubbed... URL: From chris.fynn at gmail.com Fri Sep 4 12:31:09 2015 From: chris.fynn at gmail.com (Christopher Fynn) Date: Fri, 4 Sep 2015 23:01:09 +0530 Subject: "Unicode of Death" In-Reply-To: References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net> Message-ID: Perhaps there should be a "tounge in cheek" emoji to indicate this On 30 May 2015 at 04:50, Andrew Cunningham wrote: > Geez Philippe, > > It was tounge in cheek. > > A. > > > On Saturday, 30 May 2015, Philippe Verdy wrote: > > > > 2015-05-28 23:36 GMT+02:00 Andrew Cunningham : > >> > >> Not the first time unicode crashes things. There was the google chrome > bug on osx that crashed the tab for any syriac text. > > > > "Unicode crashes things"? Unicode has nothing to do in those crashes > caused by bugs in applications that make incorrect assumptions (in fact not > even related to characters themselves but to the supposed behavior of the > layout engine. Programmers and designers for example VERY frequently forget > the constraints for RTL languages and make incorrect assumptions about left > and right sides when sizing objects, or they don't expect that the cursor > will advance backward and forget that some measurements can be negative: if > they use this negative value to compute the size of a bitmap redering > surface, they'll get out of memory, unchecked null pointers returned, then > they will crash assuming the buffer was effectively allocated. > > These are the same kind of bugs as with the too common buffer overruns > with unchecked assumtions: the code is kept because "it works as is" in > their limited immediate tests. 
> > Producing full coverage tests is a difficult and lengthy task, that > programmers not always have the time to do, when they are urged to produce > a workable solution for some clients and then given no time to improve the > code before the same code is distributed to a wider range of clients. > > Commercial staffs do that frequently, they can't even read the technical > limitations even when they are documented by programmers... in addition the > commercial staff like selling softwares that will cause customers to ask > for support... that will be billed ! After that, programmers are > overwhelmed by bug reports and support requests, and have even less time to > design other thigs that they are working on and still have to produce. QA > tools may help programmers in this case by providing statistics about the > effective costs of producing new software with better quality, and the cost > of supporting it when it contains too many bugs: commercial teams like > those statistics because they can convert them to costs, commercial > margins, and billing rates. (When such QA tools are not used, programmers > will rapidly leave the place, they are fed up by the growing pressure to do > always more in the same time, with also a growing number of "urgent" > support requests.). > > Those that say "Unicode crashes things" do the same thing: they make > broad unchecked assumptions about how things are really made or how things > are actually working. > > > > -- > Andrew Cunningham > Project Manager, Research and Development > (Social and Digital Inclusion) > Public Libraries and Community Engagement > State Library of Victoria > 328 Swanston Street > Melbourne VIC 3000 > Australia > > Ph: +61-3-8664-7430 > Mobile: 0459 806 589 > Email: acunningham at slv.vic.gov.au > lang.support at gmail.com > > http://www.openroad.net.au/ > http://www.mylanguage.gov.au/ > http://www.slv.vic.gov.au/ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Fri Sep 4 13:11:09 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 4 Sep 2015 20:11:09 +0200 (CEST) Subject: "Unicode of Death" Message-ID: <627754852.23518.1441390269789.JavaMail.www@wwinf1e22> On Fri, 4 Sep 2015 23:01:09 +0530, Christopher Fynn wrote: [...] >> On Saturday, 30 May 2015, Philippe Verdy wrote: >> >>> 2015-05-28 23:36 GMT+02:00 Andrew Cunningham : >>>> >>>> Not the first time unicode crashes things. There was the google chrome bug on osx that crashed the tab for any syriac text. >>> >>> "Unicode crashes things"? Unicode has nothing to do in those crashes caused by bugs in applications that make incorrect assumptions (in fact not even related to characters themselves but to the supposed behavior of the layout engine. Programmers and designers for example VERY frequently forget the constraints for RTL languages and make incorrect assumptions about left and right sides when sizing objects, or they don't expect that the cursor will advance backward and forget that some measurements can be negative: if they use this negative value to compute the size of a bitmap redering surface, they'll get out of memory, unchecked null pointers returned, then they will crash assuming the buffer was effectively allocated. >>> These are the same kind of bugs as with the too common buffer overruns with unchecked assumtions: the code is kept because "it works as is" in their limited immediate tests. >>> Producing full coverage tests is a difficult and lengthy task, that programmers not always have the time to do, when they are urged to produce a workable solution for some clients and then given no time to improve the code before the same code is distributed to a wider range of clients. >>> Commercial staffs do that frequently, they can't even read the technical limitations even when they are documented by programmers... in addition the commercial staff like selling softwares that will cause customers to ask for support... 
>>> that will be billed ! After that, programmers are overwhelmed by bug reports and support requests, and have even less time to design other thigs that they are working on and still have to produce. QA tools may help programmers in this case by providing statistics about the effective costs of producing new software with better quality, and the cost of supporting it when it contains too many bugs: commercial teams like those statistics because they can convert them to costs, commercial margins, and billing rates. (When such QA tools are not used, programmers will rapidly leave the place, they are fed up by the growing pressure to do always more in the same time, with also a growing number of "urgent" support requests.).
>>> Those that say "Unicode crashes things" do the same thing: they make broad unchecked assumptions about how things are really made or how things are actually working.

Voilà, a very large part of the answer to my various questions. I've joined up too late...

>>> commercial staff like selling softwares that will cause customers to ask for support... that will be billed !

That was my suspicion when I faced so many problems. So there's nothing more to await?

Thanks Philippe!

Marcel

From charupdate at orange.fr Sat Sep 5 02:35:15 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 5 Sep 2015 09:35:15 +0200 (CEST)
Subject: "Unicode of Death"
In-Reply-To:
References: <20150528075342.665a7a7059d7ee80bb4d670165c8327d.f8c9f482c0.wbe@email03.secureserver.net>
Message-ID: <1731639002.1272.1441438515747.JavaMail.www@wwinf1g33>

On Fri, 4 Sep 2015 23:01:09 +0530, Christopher Fynn wrote:

> Perhaps there should be a "tounge in cheek" emoji to indicate this

I didn't notice the joke. Did you mean "tongue in cheek"? (I've checked there are two spellings.) You may feel free to laugh. I too did at Shawn's and Asmus'
joke :-D
http://unicode.org/mail-arch/unicode-ml/y2015-m09/0042.html
[but not, of course, when I read about people having their devices crashing] :-(

Thanks!

Marcel

From mark at macchiato.com Sat Sep 5 09:14:31 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 5 Sep 2015 16:14:31 +0200
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To:
References:
Message-ID:

At one point, the proposal states:

Another alternative is ฿ THAI CURRENCY SYMBOL BAHT. This has the advantage of already being in Unicode and somewhat resembling the Bitcoin sign. A major disadvantage is this symbol is already in use as a currency symbol for a different currency, so using it to represent Bitcoin will lead to confusion. The Baht and the Bitcoin sign are two different symbols for two different currencies.

Currency symbols are quite often used for very different currencies, with very different values. The $, for example, is used for currencies all over the world, including many not called 'dollar'. I'd suggest that you amend your proposal to address why the case of Bitcoin and Baht is different from the case of Dollar and Peso (and other currencies using $).

Mark

*-- Il meglio è l'inimico del bene --*

On Thu, Sep 3, 2015 at 4:27 PM, Ken Shirriff wrote:

> I'm putting together a proposal for the Bitcoin sign to be added to
> Unicode, so I wanted to check here if people have any
> comments/concerns/objections.
>
> I'm aware of the previous rejected proposal L2/11-130 and I address the
> issues from its rejection. In particular, my
> proposal includes many examples of the symbol in running text. I also
> checked with bitcoin.org that they have no trademark on the logo.
>
> Please let me know of any other potential issues.
>
> Ken
>
From ken.shirriff at gmail.com Sat Sep 5 10:24:44 2015
From: ken.shirriff at gmail.com (Ken Shirriff)
Date: Sat, 5 Sep 2015 08:24:44 -0700
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To:
References:
Message-ID:

Thanks for your comment, Mark. I've rewritten the baht section. Let me know if this addresses your concerns.

Another alternative is ฿ THAI CURRENCY SYMBOL BAHT. The bitcoin sign and baht symbol are two unrelated symbols that have some visual similarity. They are not variants of the same symbol, unlike single-bar and double-bar dollar signs. Some websites use the baht symbol to represent bitcoins due to the lack of the bitcoin symbol in Unicode. However, this is considered by some to be "hijacking" and "stealing" of the baht symbol. [footnote] While the same symbol can be used for two currencies (e.g. $ for dollars and pesos), reusing the baht symbol for bitcoin is not a good solution when two different symbols currently exist.

Footnote:

Some Bitcoin enthusiasts want to hijack the symbol for Thailand's currency, Tech in Asia.
https://www.techinasia.com/bitcoin-enthusiasts-steal-symbol-thailands-currency/
To ฿ or not to ฿: Bitcoin debates stealing Thai baht's identity.
http://bangkok.coconuts.co/2014/04/22/bh-or-not-b-bitcoin-movement-debates-stealing-thai-bahts-identity

Ken

On Sat, Sep 5, 2015 at 7:14 AM, Mark Davis ☕️ wrote:

> At one point, the proposal states:
>
> Another alternative is ฿ THAI CURRENCY SYMBOL BAHT. This has the advantage
> of already being in Unicode and somewhat resembling the Bitcoin sign. A
> major disadvantage is this symbol is already in use as a currency symbol
> for a different currency, so using it to represent Bitcoin will lead to
> confusion.The Baht and the Bitcoin sign are two different symbols for two
> different currencies.
>
>
> Currency symbols are quite often used for very different currencies, with
> very different values.
> The $, for example, is used for currencies all over
> the world, including many not called 'dollar'. I'd suggest that you amend
> your proposal to address why the case of Bitcoin and Baht are different
> than the case of Dollar and Peso (and other currencies using $).
>
>
> Mark
>
> *-- Il meglio è l'inimico del bene --*
>
> On Thu, Sep 3, 2015 at 4:27 PM, Ken Shirriff
> wrote:
>
>> I'm putting together a proposal for the Bitcoin sign to be added to
>> Unicode, so I wanted to check here if people have any
>> comments/concerns/objections.
>>
>> I'm aware of the previous rejected proposal L2/11-130 and I address the
>> issues from its rejection. In particular, my
>> proposal includes many examples of the symbol in running text. I also
>> checked with bitcoin.org that they have no trademark on the logo.
>>
>> Please let me know of any other potential issues.
>>
>> Ken
>>
>
>

From duerst at it.aoyama.ac.jp Mon Sep 7 00:10:45 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 7 Sep 2015 14:10:45 +0900
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To:
References:
Message-ID: <55ED1C55.7060702@it.aoyama.ac.jp>

Hello Ken,

You write "The bitcoin sign and baht symbol are two unrelated symbols that have some visual similarity.", but don't really give any supporting information for that claim.

For example, searching for images of bitcoin and baht symbols shows that the Bitcoin sign usually has two vertical bars, which however show only above and below the B, whereas the baht sign usually has one bar going through the B.

But first, this distinction is not always maintained. Second, I extremely strongly doubt that people are making the distinction in handwriting. The 'baht form' of the symbol is much easier to write by hand than the 'bitcoin form', and so most people in handwriting will use the former even for bitcoins.
Just try to correctly write the four little strokes of the 'bitcoin form', and you will understand easily. Regards, Martin. On 2015/09/06 00:24, Ken Shirriff wrote: > Thanks for your comment, Mark. I've rewritten the baht section. Let me know > if this addresses your concerns. > > > Another alternative is ? THAI CURRENCY SYMBOL BAHT. The bitcoin sign and > baht symbol are two unrelated symbols that have some visual similarity. > They are not variants of the same symbol, unlike single-bar and double-bar > dollar signs. Some websites use the baht symbol to represent bitcoins due > to the lack of the bitcoin symbol in Unicode. However, this is considered > by some to be ?hijacking? and ?stealing? of the bhat symbol. [footnote] > While the same symbol can be used for two currencies (e.g. $ for dollars > and pesos), reusing the baht symbol for bitcoin is not a good solution when > two different symbols currently exist. > > Footnote: > > Some Bitcoin enthusiasts want to hijack the symbol for Thailand?s currency, > Tech in Asia. > https://www.techinasia.com/bitcoin-enthusiasts-steal-symbol-thailands-currency/ > To ? or not to ?: Bitcoin debates stealing Thai baht's identity. > http://bangkok.coconuts.co/2014/04/22/bh-or-not-b-bitcoin-movement-debates-stealing-thai-bahts-identity > > > Ken > > On Sat, Sep 5, 2015 at 7:14 AM, Mark Davis ?? wrote: > >> At one point, the proposal states: >> >> Another alternative is ? THAI CURRENCY SYMBOL BAHT. This has the advantage >> of already being in Unicode and somewhat resembling the Bitcoin sign. A >> major disadvantage is this symbol is already in use as a currency symbol >> for a different currency, so using it to represent Bitcoin will lead to >> confusion.The Baht and the Bitcoin sign are two different symbols for two >> different currencies. >> >> >> Currency symbols are quite often used for very different currencies, with >> very different values. 
The $, for example, is used for currencies all over >> the world, including many not called 'dollar'. I'd suggest that you amend >> your proposal to address why the case of Bitcoin and Baht are different >> than the case of Dollar and Peso (and other currencies using $). >> >> >> Mark >> >> *? Il meglio ? l?inimico del bene ?* >> >> On Thu, Sep 3, 2015 at 4:27 PM, Ken Shirriff >> wrote: >> >>> I'm putting together a proposal for the Bitcoin sign to be added to >>> Unicode, so I wanted to check here if people have any >>> comments/concerns/objections. >>> >>> I'm aware of the previous rejected proposal L2/11-130 >>> and I address the >>> issues from its rejection >>> . In particular, my >>> proposal includes many examples of the symbol in running text. I also >>> checked with bitcoin.org that they have no trademark on the logo. >>> >>> Please let me know of any other potential issues. >>> >>> Ken >>> >> >> > From richard.wordingham at ntlworld.com Mon Sep 7 01:23:21 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 7 Sep 2015 07:23:21 +0100 Subject: String Ranges in Unicode Sets In-Reply-To: <55E8762A.5010101@unicode.org> References: <55E8762A.5010101@unicode.org> Message-ID: <20150907072321.48321560@JRWUBU2> On Thu, 03 Sep 2015 09:32:42 -0700 Rick McGowan wrote: > A proposed update to the LDML specification (UTS #35) will be > available for review as of Monday, September 7 at 06:00 GMT. The open > review period closes on Monday, September 14 at 06:00 GMT. (This is a > short review period, because CLDR 28 is scheduled for release in the > week of September 16.) > > The proposed update will be at > http://unicode.org/reports/tr35/proposed.html > > To report bugs in the specification, please use > http://unicode.org/cldr/trac/newticket > Have the implications of adding string ranges to Unicode sets been considered? I'm mentioning them on the list because their impact goes beyond locales, and I haven't worked out their implications myself. 
By my reading, adding string ranges will initially make regular expression engines that don't use ICU non-compliant with Level 1 of UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and intersection'. I don't imagine the extra work of set operations on Unicode sets containing string ranges will be popular. It may be worst for the minority of regular expression engines that use the regularity of regular expressions.

I note that the safety feature of requiring the start and end points to have the same length has been removed from their design. String ranges seem particularly vulnerable to the ill-effects of unpredictable normalisation.

Richard.

From asmus-inc at ix.netcom.com Mon Sep 7 01:24:51 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 6 Sep 2015 23:24:51 -0700
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To: <55ED1C55.7060702@it.aoyama.ac.jp>
References: <55ED1C55.7060702@it.aoyama.ac.jp>
Message-ID: <55ED2DB3.5040009@ix.netcom.com>

An HTML attachment was scrubbed...

From mark at macchiato.com Mon Sep 7 09:54:16 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 7 Sep 2015 16:54:16 +0200
Subject: String Ranges in Unicode Sets
In-Reply-To: <20150907072321.48321560@JRWUBU2>
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2>
Message-ID:

Thanks for the feedback.

> By my reading, adding string ranges will initially make regular
> expression engines that don't use ICU non-compliant with Level 1 of
> UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and

I don't see where you are getting that. UTS 35 isn't referenced by UTS 18 except for some examples of possible extensions in 1.2.3 Other Properties, and locale id syntax in level 3. I may be missing something, however. Can you tell me where #18 is referencing UnicodeSet?
> I don't imagine the extra work of set operations

String ranges need not be implemented internally (and I don't think the CLDR committee would expect them to be, in general). They are simply a way of expressing the *string format* of a UnicodeSet in a more compact fashion. (And UnicodeSets themselves can have a variety of different implementations, in any event).

> String ranges seem particularly vulnerable to the ill-effects of unpredictable

UnicodeSets are low level constructs, as are their string representations. Like all strings, the string format of a UnicodeSet may change if it is normalized. That is nothing new.

- The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390 code points.
- Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and contain 841 code points.

You really don't want to normalize the string format of UnicodeSets. Or if you suspect that those string formats might be normalized, then just use escaped format \x{...} for anything that might change under normalization.

===

Note that while it is fine to bring up topics for discussion here (or, better yet, on the "cldr-users at unicode.org" list), anything that requires a change will have to be filed as a CLDR ticket. Richard, I'm sure you know this, and also raised this topic here because of the relation to UTS18, so this is a reminder for others.

Mark

*-- Il meglio è l'inimico del bene --*

On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Thu, 03 Sep 2015 09:32:42 -0700
> Rick McGowan wrote:
>
> > A proposed update to the LDML specification (UTS #35) will be
> > available for review as of Monday, September 7 at 06:00 GMT. The open
> > review period closes on Monday, September 14 at 06:00 GMT.
(This is a > > short review period, because CLDR 28 is scheduled for release in the > > week of September 16.) > > > > The proposed update will be at > > http://unicode.org/reports/tr35/proposed.html > > > > To report bugs in the specification, please use > > http://unicode.org/cldr/trac/newticket > > > > Have the implications of adding string ranges to Unicode sets been > considered? I'm mentioning them on the list because their impact goes > beyond locales, and I haven't worked out their implications myself. > > By my reading, adding string ranges will initially make regular > expression engines that don't use ICU non-compliant with Level 1 of > UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction and > intersection'. I don't imagine the extra work of set operations on > Unicode sets containing string ranges will be popular. It may be worst > for the minority of regular expression engines that use the regularity > of regular expressions. > > I note that the safety feature of requiring the start and end points > to have the same length has been removed from their design. String > ranges seem particularly vulnerable to the ill-effects of unpredictable > normalisation. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Sep 7 09:11:12 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 7 Sep 2015 15:11:12 +0100 (BST) Subject: A song in Esperanto Message-ID: <32856325.49589.1441635072798.JavaMail.defaultUser@defaultHost> A song in Esperanto I have written a song in Esperanto and published it on the web. http://www.users.globalnet.co.uk/~ngo/song1023.htm The publication process was interesting and I applied information that I found in the following Unicode code chart. Latin Extended-A http://www.unicode.org/charts/PDF/U0100.pdf I used the following two characters from that code chart. 
U+011D LATIN SMALL LETTER G WITH CIRCUMFLEX
U+015D LATIN SMALL LETTER S WITH CIRCUMFLEX

I wrote the HTML code directly into WordPad and saved as a Text Document from WordPad.

I encoded the two accented characters each by using an ampersand followed by a U+0023 NUMBER SIGN character followed by an x followed by a four-digit hexadecimal code point followed by a semicolon.

I have also published some other songs on the web. There is an index page as follows.

http://www.users.globalnet.co.uk/~ngo/song0001.htm

Two of the songs are a result of topics on this mailing list. They are on the following pages.

http://www.users.globalnet.co.uk/~ngo/song1018.htm
http://www.users.globalnet.co.uk/~ngo/song1021.htm

There is also the following which is about colour fonts.

http://www.users.globalnet.co.uk/~ngo/une_chanson.pdf

William Overington

7 September 2015

From ken.shirriff at gmail.com Mon Sep 7 11:26:59 2015
From: ken.shirriff at gmail.com (Ken Shirriff)
Date: Mon, 7 Sep 2015 09:26:59 -0700
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To: <55ED1C55.7060702@it.aoyama.ac.jp>
References: <55ED1C55.7060702@it.aoyama.ac.jp>
Message-ID:

On Sun, Sep 6, 2015 at 10:10 PM, Martin J. Dürst wrote:

> Hello Ken,
>
> You write "The bitcoin sign and baht symbol are two unrelated symbols that
> have some visual similarity.", but don't really give any supporting
> information for that claim.
>

Thanks for your comments, Martin. Asmus Freytag gave a detailed response, but I'd like to add a few things.

The bitcoin sign is unrelated to the baht in origin. The bitcoin sign was first used in an icon replacing the software's "BC" logo with the bitcoin sign logo, showing the roots of the bitcoin sign are the letter B. There's no historical connection to the baht, unlike the multiple uses of $ which are historically related.
The baht sign and the bitcoin sign are viewed as two distinct symbols by most of the Bitcoin community. Evidence for this is the bitcoin.org forum, which implemented a special mechanism to insert the bitcoin sign in text. This was done because the baht sign and bitcoin sign are considered different by the community. If the bitcoin sign were considered interchangeable with ฿, it would have been much easier to just use ฿. Other evidence is the development of special fonts to display the bitcoin sign.

I believe (based on my reading) that the Thai community views the baht sign and the bitcoin sign as two distinct symbols. I have never seen the bitcoin sign used to represent baht (except one case widely viewed as a mistake). As a thought experiment, consider a font that rendered the baht sign with the bitcoin glyph. I expect this would be extremely unpopular in Thailand, showing the bitcoin sign is not just a glyph variant of the baht sign. I linked to a couple articles from Thailand criticizing use of ฿ as "stealing" the baht sign, but use of the bitcoin sign is not viewed as a problem, showing that the bitcoin sign is not viewed in Thailand as a variant of the baht sign.

Visually, the bitcoin sign and baht sign are distinct. The bitcoin sign is almost invariably represented with two vertical bars, which are not visible through the center of the B. This is how it is described on the bitcoin wiki. The baht sign is almost invariably represented with one vertical bar, which is visible through the B. (I couldn't find any official definition of the baht sign.) This is a different situation from the dollar sign, where single-bar and double-bar forms are interchangeable. A font can't provide a single glyph that will be satisfactory for both baht and bitcoin signs.

To summarize, the bitcoin community and the Thai community both view the bitcoin sign and baht sign as two separate symbols. They shouldn't be unified.
Ken > For example, searching for images of bitcoin and bath symbols shows that > the Bitcoin usually has two vertical bars, which however show only above > and below the B, whereas the baht sign usually has one bar going through > the B. > > But first, this distinction is not always maintained. Second, I extremely > strongly doubt that people are making the distinction in handwriting. The > 'bath form' of the symbol is much easier to write by hand that the 'bitcoin > form', and so most people in handwriting will use the former even for > bitcoins. Just try to correctly write the four little strokes of the > 'bitcoin form', and you will understand easily. > > Regards, Martin. > > > On 2015/09/06 00:24, Ken Shirriff wrote: > >> Thanks for your comment, Mark. I've rewritten the baht section. Let me >> know >> if this addresses your concerns. >> >> >> Another alternative is ? THAI CURRENCY SYMBOL BAHT. The bitcoin sign and >> baht symbol are two unrelated symbols that have some visual similarity. >> They are not variants of the same symbol, unlike single-bar and double-bar >> dollar signs. Some websites use the baht symbol to represent bitcoins due >> to the lack of the bitcoin symbol in Unicode. However, this is considered >> by some to be ?hijacking? and ?stealing? of the bhat symbol. [footnote] >> While the same symbol can be used for two currencies (e.g. $ for dollars >> and pesos), reusing the baht symbol for bitcoin is not a good solution >> when >> two different symbols currently exist. >> >> Footnote: >> >> Some Bitcoin enthusiasts want to hijack the symbol for Thailand?s >> currency, >> Tech in Asia. >> >> https://www.techinasia.com/bitcoin-enthusiasts-steal-symbol-thailands-currency/ >> To ? or not to ?: Bitcoin debates stealing Thai baht's identity. >> >> http://bangkok.coconuts.co/2014/04/22/bh-or-not-b-bitcoin-movement-debates-stealing-thai-bahts-identity >> >> >> Ken >> >> On Sat, Sep 5, 2015 at 7:14 AM, Mark Davis ?? 
wrote: >> >> At one point, the proposal states: >>> >>> Another alternative is ? THAI CURRENCY SYMBOL BAHT. This has the >>> advantage >>> of already being in Unicode and somewhat resembling the Bitcoin sign. A >>> major disadvantage is this symbol is already in use as a currency symbol >>> for a different currency, so using it to represent Bitcoin will lead to >>> confusion.The Baht and the Bitcoin sign are two different symbols for two >>> different currencies. >>> >>> >>> Currency symbols are quite often used for very different currencies, with >>> very different values. The $, for example, is used for currencies all >>> over >>> the world, including many not called 'dollar'. I'd suggest that you amend >>> your proposal to address why the case of Bitcoin and Baht are different >>> than the case of Dollar and Peso (and other currencies using $). >>> >>> >>> Mark >>> >>> *? Il meglio ? l?inimico del bene ?* >>> >>> On Thu, Sep 3, 2015 at 4:27 PM, Ken Shirriff >>> wrote: >>> >>> I'm putting together a proposal for the Bitcoin sign to be added to >>>> Unicode, so I wanted to check here if people have any >>>> comments/concerns/objections. >>>> >>>> I'm aware of the previous rejected proposal L2/11-130 >>>> and I address the >>>> issues from its rejection >>>> . In particular, my >>>> proposal includes many examples of the symbol in running text. I also >>>> checked with bitcoin.org that they have no trademark on the logo. >>>> >>>> Please let me know of any other potential issues. >>>> >>>> Ken >>>> >>>> >>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Sep 7 12:27:55 2015 From: everson at evertype.com (Michael Everson) Date: Mon, 7 Sep 2015 19:27:55 +0200 Subject: Upcoming proposal for Bitcoin sign In-Reply-To: References: <55ED1C55.7060702@it.aoyama.ac.jp> Message-ID: <844DFF91-522F-4D55-BDCB-53267BBC640C@evertype.com> Just want to say, I don?t think this one is a runner right now. 
I spent many months recently working with people associated with Bitcoin and they could not decide what they wanted to do.

Michael Everson * http://www.evertype.com/

From unicode at mva.name Mon Sep 7 12:49:03 2015
From: unicode at mva.name (Vadim A. Misbakh-Soloviov)
Date: Mon, 07 Sep 2015 23:49:03 +0600
Subject: [RFC] Discussion about chances of some characters to be added in Unicode
Message-ID: <1866954.ovXpklIGud@hp>

Hello there!

First of all, I'm sorry in advance if the tone of my message is not suitable for this mailing list.
Next, I'd like to discuss the chances of some characters being added to Unicode at all.
I am most interested in:

1) Full-height right and left isosceles triangles, positioned at the edges of the glyph space (so that, when concatenated with a space character whose background is the same color as the triangle's foreground, the result looks seamless [ref: the attached triangles_demo; there are font rendering artifacts in it anyway, but I hope I have described the idea clearly]).
ref: the symbols at both edges of the attached "pwl" picture

2) "Forking" character (not the math one, but the VCS one).
ref: in the middle of the attached "pwl" picture.

3) "Pause" (media) character (there are already characters for "play/pause" and "play" in Unicode, but not for "pause"). There are "cheats" like using two vertical bars instead, but that usually looks very ugly.

4) "Power" (like on power buttons on electronic devices)

And, actually, imho, it would also be nice to have all of the symbols from the picture in Unicode.

P.S. I'd also ask about some more symbols which are "missing" in everyday life and substituted with glyphicons on the web (but, you know, it is impossible to use glyphicons in CLI/console applications), like: "cart", "exit", "barcode" (ideally, including also "qr" and "datamatrix" ones), and more, and more, but let's start with the ones I mentioned first.
P.P.S.: It would also be nice, I think, to have "icon" symbols for the major OS brands (at least Windows, MacOS, Linux, FreeBSD), to stop them (the first two) from using private-use code points for that.

--
Best regards,
mva
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pwl.png
Type: image/png
Size: 2611 bytes
-------------- next part --------------
A non-text attachment was scrubbed...
Name: triangles_demo.png
Type: image/png
Size: 3640 bytes

From srl at icu-project.org Mon Sep 7 13:18:12 2015
From: srl at icu-project.org (Steven R. Loomis)
Date: Mon, 7 Sep 2015 11:18:12 -0700
Subject: [RFC] Discussion about chances of some characters to be added in Unicode
In-Reply-To: <1866954.ovXpklIGud@hp>
References: <1866954.ovXpklIGud@hp>
Message-ID:

Hello!
The power symbol was already accepted, see
http://unicode.org/alloc/Pipeline.html

Steven

From richard.wordingham at ntlworld.com Mon Sep 7 14:46:06 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 7 Sep 2015 20:46:06 +0100
Subject: String Ranges in Unicode Sets
In-Reply-To:
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2>
Message-ID: <20150907204606.799fa7c0@JRWUBU2>

On Mon, 7 Sep 2015 16:54:16 +0200
Mark Davis ☕️ wrote:

> On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

>> By my reading, adding string ranges will initially make regular expression engines that don't use ICU non-compliant with Level 1 of UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
>> and

> I don't see where you are getting that. UTS 35 isn't referenced by UTS 18 except for some examples of possible extensions in 1.2.3 Other Properties, and locale id syntax in level 3.
> I may be missing something, however. Can you tell me where #18 is referencing UnicodeSet?

In http://unicode.org/mail-arch/unicode-ml/y2014-m05/0052.html , you stated that the Unicode sets referred to in UTS#18 RL1.3 are the Unicode sets defined in UTS #35. We are now waiting for you to add the reference under Action 141-A76 - 'Make changes in UTS #18 based on general feedback in L2/14-277' (http://www.unicode.org/L2/L2014/14277-pubrev-ovrflw.html). I presume no change has been made yet because there are no *urgent* changes for UTS #18.

> String ranges need not be implemented internally (and I don't think the CLDR committee would expect them to be, in general). They are simply a way of expressing the *string format* of a UnicodeSet in a more compact fashion. (And UnicodeSets themselves can have a variety of different implementations, in any event).

[\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a very compact way of expressing a lot of strings. You wouldn't decompose that into a list of strings.

>> String ranges seem particularly vulnerable to the ill-effects of
>> unpredictable

> UnicodeSets are low level constructs, as are their string representations. Like all strings, the string format of a UnicodeSet may change if it is normalized. That is nothing new.

> - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390 code points.
> - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and contain 841 code points.

At least this gives the same range whether normalised to NFC or to NFD. Using NFD, the preferred normalisation for regular expressions semi-respecting canonical equivalence, [{x?}-{?}] would not include the 2-character string "xa", as both bounds would decompose to two characters.
Using NFC, the preferred normalisation for LDML (and for XML, I think), this would be a contraction for [{x?}-{x?}], and would include the 2-character string "xa". If the two strings had to have the same length, [{x?}-{?}] would be flagged as erroneous if interpreted in NFC, and with any luck, similar errors that were not detected would then also be corrected. It's not perfect, but il meglio è l'inimico del bene.

> You really don't want to normalize the string format of UnicodeSets. Or if you suspect that those string formats might be normalized, then just use escaped format \x{...} for anything that might change under normalization.

It would probably be sensible to issue a warning if the specification of a string bound had more than one canonical equivalent.

I'm thinking of accidents. While an XML processor must not be Unicode compliant, I thought most regular expression engine environments were allowed to be Unicode compliant.

TUS 8.0 Chapter 3 C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."

> Note that while it is fine to bring up topics for discussion here (or, better yet, on the "cldr-users at unicode.org" list),

As this impacts regular expressions in general, I think this is the better list for the impact on Unicode sets outside CLDR.

> anything that requires a change will have to be filed as a CLDR ticket. Richard, I'm sure you know this, and also raised this topic here because of the relation to UTS18, so this is a reminder for others.

Exactly.

Richard.

From richard.wordingham at ntlworld.com Mon Sep 7 16:43:01 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 7 Sep 2015 22:43:01 +0100
Subject: Upcoming proposal for Bitcoin sign
In-Reply-To:
References: <55ED1C55.7060702@it.aoyama.ac.jp>
Message-ID: <20150907224301.10bc5aaf@JRWUBU2>

On Mon, 7 Sep 2015 09:26:59 -0700
Ken Shirriff wrote:

> The bitcoin sign is unrelated to the baht in origin.
> The bitcoin sign was first used in an icon replacing the software's "BC" logo with the bitcoin sign logo, showing the roots of the bitcoin sign are the letter B. There's no historical connection to the baht, unlike the multiple uses of $ which are historically related.

The bitcoin sign and the baht sign are very closely related. Both are a combination of 'B' and the vertical strokes of the dollar symbol. Indeed, if you look at the first picture at http://www.goabroad.com/articles/study-abroad/thai-cuisine-the-spicy-truth , you can see a plain 'B' on the left and in the middle what looks like a B with two strokes below. A lot of handwritten baht signs end with a rightward flourish from the centre.

It would seem that the preferred visible currency sign in Thailand is actually the two-character string ".-"! In a lot of cases, there's either no indicator of currency, or the word is written out in full.

Perhaps a saving argument is the two forms of the pound sign - U+00A3 POUND SIGN and U+20A4 LIRA SIGN. Proper blue five pound notes had the two-barred form U+20A4 (which is how I learnt to write the pound sign); as the notes became greener, their lesser value was indicated by the use of the one-barred form U+00A3. The code chart notes that the preferred form for the lira is POUND SIGN, and I can tell you that my preferred form for the pound sterling is the so-called LIRA SIGN.

Richard.

From asmus-inc at ix.netcom.com Mon Sep 7 17:11:44 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Mon, 7 Sep 2015 15:11:44 -0700
Subject: String Ranges in Unicode Sets
In-Reply-To: <20150907072321.48321560@JRWUBU2>
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2>
Message-ID: <55EE0BA0.9020105@ix.netcom.com>

An HTML attachment was scrubbed...
From mark at macchiato.com Tue Sep 8 02:14:44 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 8 Sep 2015 09:14:44 +0200
Subject: String Ranges in Unicode Sets
In-Reply-To: <20150907204606.799fa7c0@JRWUBU2>
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2> <20150907204606.799fa7c0@JRWUBU2>
Message-ID:

Mark

*Il meglio è l'inimico del bene*

On Mon, Sep 7, 2015 at 9:46 PM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:

> On Mon, 7 Sep 2015 16:54:16 +0200
> Mark Davis ☕️ wrote:
>
> > On Mon, Sep 7, 2015 at 8:23 AM, Richard Wordingham <richard.wordingham at ntlworld.com> wrote:
>
> >> By my reading, adding string ranges will initially make regular expression engines that don't use ICU non-compliant with Level 1 of UTS#18 Unicode Regular Expressions, in particular RL1.3 'subtraction
> >> and
>
> > I don't see where you are getting that. UTS 35 isn't referenced by UTS 18 except for some examples of possible extensions in 1.2.3 Other Properties, and locale id syntax in level 3. I may be missing something, however. Can you tell me where #18 is referencing UnicodeSet?
>
> In http://unicode.org/mail-arch/unicode-ml/y2014-m05/0052.html , you stated that the Unicode sets referred to in UTS#18 RL1.3 are the Unicode sets defined in UTS #35. We are now waiting for you to add the reference under Action 141-A76 - 'Make changes in UTS #18 based on general feedback in L2/14-277' (http://www.unicode.org/L2/L2014/14277-pubrev-ovrflw.html).

Good point. I tend to think that any new syntax would need to be approached carefully, and might only be mentioned as optional at first. But you'll get a chance for public review once you see them.

> I presume no change has been made yet because there are no *urgent* changes for UTS #18.

Right, it was backed up behind Unicode 8.0.
> > String ranges need not be implemented internally (and I don't think the CLDR committee would expect them to be, in general). They are simply a way of expressing the *string format* of a UnicodeSet in a more compact fashion. (And UnicodeSets themselves can have a variety of different implementations, in any event).
>
> [\x{0000 0000 0000 0000} - \x{DFFFF DFFFF DFFFF DFFFF}] is a very compact way of expressing a lot of strings. You wouldn't decompose that into a list of strings.

Clearly there will be various memory/performance issues that would need to be taken into account. Not every implementation will be designed to handle extreme cases, and may simply not allow the creation of such a set. Not every string can be parsed by a BigDecimal system, etc. Not every regex expression can be used (without DoS) on common implementations, and so on.

> >> String ranges seem particularly vulnerable to the ill-effects of
> >> unpredictable
>
> > UnicodeSets are low level constructs, as are their string representations. Like all strings, the string format of a UnicodeSet may change if it is normalized. That is nothing new.
>
> > - The string format "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+2126 OHM SIGN) represents a UnicodeSet that contains 8,390 code points.
> > - Under NFC it would change to "[a-Ω]" (that is, U+0061 LATIN SMALL LETTER A through U+03A9 GREEK CAPITAL LETTER OMEGA), and contain 841 code points.
>
> At least this gives the same range whether normalised to NFC or to NFD. Using NFD, the preferred normalisation for regular expressions semi-respecting canonical equivalence, [{x?}-{?}] would not include the 2-character string "xa", as both bounds would decompose to two characters. Using NFC, the preferred normalisation for LDML (and for XML, I think), this would be a contraction for [{x?}-{x?}], and would include the 2-character string "xa".
> If the two strings had to have the same length, [{x?}-{?}] would be flagged as erroneous if interpreted in NFC,

If you look at the text in http://unicode.org/repos/cldr/trunk/specs/ldml/tr35.html#Lists_of_Code_Points, there was already a restriction on the lengths.

> and with any luck, similar errors that were not detected would then also be corrected. It's not perfect, but

I think that would just give people a false sense of security. Normalizing the string format of a UnicodeSet (or regex) can change what the set matches, pretty dramatically, and is to be avoided (or as I said, one should use escaped strings where it can't be avoided).

> il meglio è l'inimico del bene.

LOL

> > You really don't want to normalize the string format of UnicodeSets. Or if you suspect that those string formats might be normalized, then just use escaped format \x{...} for anything that might change under normalization.
>
> It would probably be sensible to issue a warning if the specification of a string bound had more than one canonical equivalent.

Issuing a warning works in a UI. Not necessarily so well in production code...

> I'm thinking of accidents. While an XML processor must not be Unicode compliant, I thought most regular expression engine environments were allowed to be Unicode compliant.
>
> TUS 8.0 Chapter 3 C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."

A compiler will take source code containing String x="?"; and compile it to a certain binary. If that same source code is NFD'd, the compiler will produce a different result.

Do you really think that such a compiler is not compliant with Unicode? If so, then we should add some more clarifications around C6.
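
The two normalization effects discussed in this message can be checked with a short sketch. It assumes, purely for illustration, that the garbled character in the string literal above was "é" (the archive does not preserve it); `code_points` and `range_size` are hypothetical helpers, and `range_size` handles only the "[x-y]" shape quoted earlier in the message.

```python
import unicodedata


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Assumption: "é" stands in for the character lost from the archived literal.
literal_nfc = "\u00e9"                                   # one code point
literal_nfd = unicodedata.normalize("NFD", literal_nfc)  # 'e' + U+0301

# Canonically equivalent, yet different code point sequences, so a compiler
# that copies the literal verbatim emits different data for each form.
same_code_points = literal_nfc == literal_nfd
same_after_nfc = unicodedata.normalize("NFC", literal_nfd) == literal_nfc


def range_size(set_text):
    """Size of a single-range set written as "[x-y]" (toy parser, this shape only)."""
    low, high = set_text[1], set_text[3]
    return ord(high) - ord(low) + 1


ohm_set = "[a-\u2126]"                              # a .. OHM SIGN
omega_set = unicodedata.normalize("NFC", ohm_set)   # a .. GREEK CAPITAL LETTER OMEGA

print(code_points(literal_nfc), code_points(literal_nfd))
print(range_size(ohm_set), range_size(omega_set))
```

Normalizing the set's string format silently swaps the upper bound, which is why the advice above is to use escapes for anything that might change under normalization.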
> > Note that while it is fine to bring up topics for discussion here (or, better yet, on the "cldr-users at unicode.org" list),
>
> As this impacts regular expressions in general, I think this is the better list for the impact on Unicode sets outside CLDR.
>
> > anything that requires a change will have to be filed as a CLDR ticket. Richard, I'm sure you know this, and also raised this topic here because of the relation to UTS18, so this is a reminder for others.
>
> Exactly.
>
> Richard.

From asmus-inc at ix.netcom.com Tue Sep 8 02:53:47 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 8 Sep 2015 00:53:47 -0700
Subject: String Ranges in Unicode Sets
In-Reply-To:
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2> <20150907204606.799fa7c0@JRWUBU2>
Message-ID: <55EE940B.2060103@ix.netcom.com>

An HTML attachment was scrubbed...

From mark at macchiato.com Tue Sep 8 06:46:48 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 8 Sep 2015 13:46:48 +0200
Subject: String Ranges in Unicode Sets
In-Reply-To: <55EE940B.2060103@ix.netcom.com>
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2> <20150907204606.799fa7c0@JRWUBU2> <55EE940B.2060103@ix.netcom.com>
Message-ID:

On Tue, Sep 8, 2015 at 9:53 AM, Asmus Freytag (t) wrote:

> it is implied the String Range formulation is a compact form.
>
> Can you prove that it doesn't create any set of strings that can't be specified in other ways (other than full enumeration of the strings?).

It is simply a compact string representation, and is defined semantically by what it expands to. Just like character ranges, [a-z], etc. Of course, the underlying implementation *could* differ, but that doesn't affect the semantics.

> What about set operations on sets with string ranges?
Again, the range notation is just a formatting issue. Anything you can do with [{ax}-{bz}?] you can also do with [{ax}{ay}{az}{bx}{by}{bz}?], and vice versa, since the former is defined to be equivalent to the latter. These are just string representations of the same *logical* underlying implementation.

> Can they be expressed (other than working them out and writing down the full enumeration of the resulting set)?

I'm not quite sure what you mean. That's like asking, "Can [a-z] be expressed, other than by writing out the full enumeration [a b c d e ... z]?". Well, yes. You could represent [a-z] in many ways: [\p{ASCII}&\p{Ll}], for example. Or [\u0061 \u0062 ...]. Or....

But I'm probably misunderstanding what you are trying to say.

Mark

*Il meglio è l'inimico del bene*

From frederic.grosshans at gmail.com Tue Sep 8 07:08:26 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 8 Sep 2015 14:08:26 +0200
Subject: [RFC] Discussion about chances of some characters to be added in Unicode
In-Reply-To:
References: <1866954.ovXpklIGud@hp>
Message-ID: <55EECFBA.1060308@gmail.com>

On 07/09/2015 20:18, Steven R. Loomis wrote:
> Hello!
> The power symbol was already accepted, see
> http://unicode.org/alloc/Pipeline.html

And the proposal for the power symbol(s) is here: http://www.unicode.org/L2/L2014/14009r-power-symbol.pdf .

Frédéric

From frederic.grosshans at gmail.com Tue Sep 8 07:43:12 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Tue, 8 Sep 2015 14:43:12 +0200
Subject: [RFC] Discussion about chances of some characters to be added in Unicode
In-Reply-To: <1866954.ovXpklIGud@hp>
References: <1866954.ovXpklIGud@hp>
Message-ID: <55EED7E0.6060804@gmail.com>

On 07/09/2015 19:49, Vadim A. Misbakh-Soloviov wrote:
> Hello there!
>
> First of all, I'm sorry in advance if my message's tone is not suitable for this mailing list.
> Next, I'd like to discuss the chances of some characters being added to Unicode at all.
> Most of all I am interested in:
>
> 1) Full-height right and left isosceles triangles, positioned at the edges of the glyph space (so that, when concatenated with a space symbol whose background is the same color as the triangle's foreground, it looks seamless [ref: triangles_demo attachment; there are font rendering artefacts anyway, but I hope I have described the idea clearly]).
> ref: symbols on both edges of the attached "pwl" picture

Your description looks more like a glyph specification than a (more semantic) character description. I suspect that ⏴ U+23F4 BLACK MEDIUM LEFT-POINTING TRIANGLE and ⏵ U+23F5 BLACK MEDIUM RIGHT-POINTING TRIANGLE, introduced in Unicode 7.0 as interface symbols (does anyone remember which proposal it was?), are what you are looking for.

> 2) "Forking" character (not the math one, but the VCS one).
> ref: in the middle of the attached "pwl" picture.

This one seems legit to me, but the "external link sign" seemed legit to me and was rejected (see http://unicode.org/alloc/nonapprovals.html ).

> 3) "Pause" (media) character (there are ones for "play/pause" and "play" in Unicode already, but not for "pause"). There are "cheats" like using two vertical bars instead, but usually it looks very ugly.

You are looking for ⏸ U+23F8 DOUBLE VERTICAL BAR (alternate name: pause), introduced in Unicode 7.0 for that specific purpose (I don't remember the proposal).

> 4) "Power" (like on power buttons on electronic devices)

As said by Steven, this one is already in the pipeline, even if not accepted yet.

> And, actually, imho, it would also be nice to have all of the symbols from the picture in Unicode.

Things like 🔒 U+1F512 LOCK?

>
> P.S.
> I'd also ask about some more symbols which are "missing" in everyday life and substituted with glyphicons on the web (but, you know, it is impossible to use glyphicons in CLI/console applications),

That is not a Unicode problem, it is an interface problem, arguably a bug in CLI/console development.

> like: "cart", "exit", "barcode"

The shopping cart is currently under consideration (see http://www.unicode.org/L2/L2015/15195r2-emoji-add-tranche6.pdf, as U+1F6D2).

> (ideally, including also "qr" and "datamatrix" ones), and more, and more, but let's initially talk about the ones I listed first.
>
> P.P.S.: It would also be nice, I think, to have "icon" symbols for the major OS brands (at least Windows, MacOS, Linux, FreeBSD), to stop them (the first two) from using private-use code points for that.

That's a big No since 1999: these symbols are logos, and excluded from Unicode. No one wants to deal with the legal nightmares of doing so.

Frédéric

From wjgo_10009 at btinternet.com Tue Sep 8 09:05:03 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 8 Sep 2015 15:05:03 +0100 (BST)
Subject: Technical or encoding sub mailing list ?
In-Reply-To: <55E8A576.3000409@ix.netcom.com>
References: <20150903094139.665a7a7059d7ee80bb4d670165c8327d.3fabbe5441.wbe@email03.secureserver.net> <55E8A576.3000409@ix.netcom.com>
Message-ID: <23571788.47434.1441721103109.JavaMail.defaultUser@defaultHost>

Asmus Freytag wrote as follows:

> There is a small set of people who like to hi-jack the list for their personal agendas, even after being told that the audience on the list has no interest. Some compound the issue by letting loose an inordinate number of posts in a short time, or don't know how to write anything short of a novella.

I wonder if I may comment please.

In the following post

http://www.unicode.org/mail-arch/unicode-ml/y2014-m12/0032.html

Asmus wrote as follows:

quote

...

Unicode has matured to the point of being the only game in town.
end quote

So there is a balance between the ways of regarding posts by an enthusiastic individual who is seeking to make progress with his or her research and who is seeking advice and constructive helpful comments on what he or she is suggesting should be encoded. As if in a research common room and floating ideas to experts in a variety of specialties, such as encoding, linguistics and software programming, seeking opinions, while each participant is sat enjoying a hot beverage, be it tea, coffee, hot chocolate or peppermint tea.

> A bit of occasional "water-cooler" style banter, on the other hand, while off-topic and distracting, is also amusing and diverting. It's the social-media part of Unicode and goes back to before "social media" was a term.

Yes, indeed. Fine.

> I would agree that the former at times feels abusive, but the latter is tradition.

Well, that the former feeling is felt is unfortunate.

For myself, that is not my intention. I am seeking to make progress with my research. I want to submit a proposal to encode one character into regular Unicode so that it can be used with the base character followed by a sequence of tag characters method that was recently invented for encoding flags: a method that can have application for various purposes, including in-line graphics encoded in a plain text document.

Yet discussion of my ideas in this mailing list is not allowed at present and maybe it never will be allowed. This makes it difficult for me to have discussions prior to submitting a proposal document.

May I mention that if anyone is interested in viewing my latest research there are four transcripts available at the following place?

http://www.users.globalnet.co.uk/~ngo/locsetag.htm

William Overington

8 September 2015
From doug at ewellic.org Tue Sep 8 10:19:03 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 08 Sep 2015 08:19:03 -0700
Subject: String Ranges in Unicode Sets
Message-ID: <20150908081903.665a7a7059d7ee80bb4d670165c8327d.295ea8ba4b.wbe@email03.secureserver.net>

Mark Davis ☕️ wrote:

>> TUS 8.0 Chapter 3 C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
>
> A compiler will take source code containing String x="?"; and compile it to a certain binary. If that same source code is NFD'd, the compiler will produce a different result.
>
> Do you really think that such compiler is not compliant to Unicode? If so, then we should add some more clarifications around C6.

I agree. The word "interpretations" in C6 can't have been intended to include the interpretation of code points qua code points. That would make a great many internal processes impossible.

I think of C6 as meaning that spell-checkers, for example, should not treat José (NFC, four code points) and José (NFD, five code points) as separate entries.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From richard.wordingham at ntlworld.com Tue Sep 8 16:41:08 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Sep 2015 22:41:08 +0100
Subject: String Ranges in Unicode Sets
In-Reply-To: <20150908081903.665a7a7059d7ee80bb4d670165c8327d.295ea8ba4b.wbe@email03.secureserver.net>
References: <20150908081903.665a7a7059d7ee80bb4d670165c8327d.295ea8ba4b.wbe@email03.secureserver.net>
Message-ID: <20150908224108.07969c71@JRWUBU2>

On Tue, 08 Sep 2015 08:19:03 -0700
"Doug Ewell" wrote:

> Mark Davis ☕️ wrote:
>
> >> TUS 8.0 Chapter 3 C6: "A process shall not assume that the interpretations of two canonical-equivalent character sequences are distinct."
> >
> > A compiler will take source code containing String x="?"; and compile it to a certain binary.
> > If that same source code is NFD'd, the compiler will produce a different result.
> >
> > Do you really think that such compiler is not compliant to Unicode? If so, then we should add some more clarifications around C6.

It's not me who put mens rea into the conformance requirements. If a compiler does no more than check strings for validity, then it may simply naively copy the sequence of scalar values without being non-compliant, so long as the *intent* is not to preserve differences.

For example, if a process changes strings to preferred canonically equivalent strings, but treats characters with ccc=9 as though they had ccc=0, it probably is in breach. On the other hand, if it treated characters with ccc=9 as though they had ccc=300 (not a possible value of ccc), it is compliant.

I think it is quite possible to have two identical pieces of code of which one is compliant and the other is non-compliant. It all depends on the code's motive, which I can only think refers to the motives of the intelligent entity that caused the code to be as it is.

> I agree. The word "interpretations" in C6 can't have been intended to include the interpretation of code points qua code points. That would make a great many internal processes impossible.

I would make it even more extreme by saying that the intent is that the rule apply to encoded text, as opposed to mere strings of code units. The problem is that some procedures allow a character to represent itself even where that is not consistent because the data will be seen as text. For example, it is my opinion that combining marks and control characters only belong in the representation of Unicode sets when they are part of a non-defective string element.

> I think of C6 as meaning that spell-checkers, for example, should not treat José (NFC, four code points) and José (NFD, five code points) as separate entries.

C6 does not prohibit spell-checkers from neglecting to normalise.
The authors of the code of a spell-checker could take the view that the database writers should have included all canonically equivalent forms. Practically, that allows a spell-checker to enforce normalisation.

There's another, subtle feature for spell checkers. By any reading, C6 does not require a spell-checker to realise that 'find' might be spelt with U+FB01 LATIN SMALL LIGATURE FI. Applying NFKC or NFKD to the Thai word for 'water' would be wrong, for that converts to , which is wrong and looks quite different. Moreover, U+FB01 is not an acceptable alternative to in Turkish.

Richard.

From richard.wordingham at ntlworld.com Tue Sep 8 17:01:35 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 8 Sep 2015 23:01:35 +0100
Subject: String Ranges in Unicode Sets
In-Reply-To:
References: <55E8762A.5010101@unicode.org> <20150907072321.48321560@JRWUBU2> <20150907204606.799fa7c0@JRWUBU2> <55EE940B.2060103@ix.netcom.com>
Message-ID: <20150908230135.314377eb@JRWUBU2>

On Tue, 8 Sep 2015 13:46:48 +0200
Mark Davis ☕️ wrote:

> On Tue, Sep 8, 2015 at 9:53 AM, Asmus Freytag (t) wrote:

> > What about set operations on sets with string ranges?

> Again, the range notation is just a formatting issue. Anything you can do with [{ax}-{bz}?] you can also do with [{ax}{ay}{az}{bx}{by}{bz}?], and vice versa, since the former is defined to be equivalent to the latter. These are just string representations of the same *logical* underlying implementation.

> > Can they be expressed (other than working them out and writing down the full enumeration of the resulting set)?

> I'm not quite sure what you mean. That's like asking, "Can [a-z] be expressed, other than by writing out the full enumeration [a b c d e ... z]?". Well, yes. You could represent [a-z] in many ways: [\p{ASCII}&\p{Ll}], for example. Or [\u0061 \u0062 ...]. Or....

> But I'm probably misunderstanding what you are trying to say.
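
The equivalence quoted above, where [{ax}-{bz}] stands for {ax}{ay}{az}{bx}{by}{bz}, can be sketched as a position-by-position expansion. This is only an illustration under stated assumptions: `expand_string_range` is a hypothetical helper, not an ICU or CLDR API, and it handles only bounds of equal length, which is the case shown in the thread.

```python
from itertools import product


def expand_string_range(first, last):
    """Expand a string range such as {ax}-{bz} into the strings it denotes.

    Assumption: both bounds have the same length, and each position ranges
    independently from the character in `first` to the character in `last`.
    """
    if len(first) != len(last):
        raise ValueError("this sketch only handles bounds of equal length")
    per_position = []
    for low, high in zip(first, last):
        if ord(low) > ord(high):
            raise ValueError(f"empty range at {low!r}-{high!r}")
        per_position.append([chr(cp) for cp in range(ord(low), ord(high) + 1)])
    # Cartesian product of the per-position character ranges.
    return ["".join(chars) for chars in product(*per_position)]


print(expand_string_range("ax", "bz"))
```

Because the range is defined by what it expands to, ordinary set operations on the expansion give the semantics of set operations on ranges, however an implementation chooses to store them.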
I think Asmus is asking if there is a more compact representation of the result of a string operation than just listing all the string elements. The answer would then be yes. Just as [a-z]~~[e-s] can be written (and represented internally) as [a-dt-z], so [{aa}-{zz}]-[{ee}-{ss}] can be written (and represented internally) as the union of four non-overlapping string ranges [{aa}-{dz} {ea}-{sd} {et}-{sz} {ta}-{zz}].

Fortunately, unions of string ranges of the same length commute, which is not necessarily the case for Unicode sets. (It is possible that [[a][{ab}]] might preferentially match "a" while [[{ab}][a]] preferentially matched "ab".)

Richard.

From petercon at microsoft.com Thu Sep 10 13:04:33 2015
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 10 Sep 2015 18:04:33 +0000
Subject: [somewhat off topic] straw poll
Message-ID:

I was having an offline discussion with someone regarding certain topics that may show up on this list on occasion, and the question came up of what evidence we might have of sentiment on the list. So, I thought I'd conduct a simple straw poll - respond if you feel inclined.

The questions are framed around this hypothetical scenario: Suppose I were to post a message to the list describing some experiment I did, creating a Web page containing (say) some Latin characters - not obscure, just-added-in-Unicode-8 characters, but ones that have been in the standard for some time; that my process for creating the file was to use (say) Notepad and entering HTML numeric character references; and that my findings were that it worked.

Q1: Would you find that to be an interesting post that makes your participation in the list more useful, or would you find it a noisy distraction that reduces the value you get from participating in the list?

Q2: If I were to send messages along that line on a regular basis, would that add value to your participation in the list, or reduce it?
Q3: If 50 people (still a small portion of the list membership) were to send messages along that line on a regular basis, would that add value to your participation in the list, or reduce it?

Peter

From frederic.grosshans at gmail.com Thu Sep 10 13:21:25 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Thu, 10 Sep 2015 18:21:25 +0000
Subject: [somewhat off topic] straw poll
In-Reply-To:
References:
Message-ID:

Q1: neutral
Q2: annoying
Q3: reducing value of the list for me

From asmus-inc at ix.netcom.com Thu Sep 10 13:23:25 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 10 Sep 2015 11:23:25 -0700
Subject: [somewhat off topic] straw poll
In-Reply-To:
References:
Message-ID: <55F1CA9D.5090200@ix.netcom.com>

An HTML attachment was scrubbed...

From Shawn.Steele at microsoft.com Thu Sep 10 13:29:45 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 10 Sep 2015 18:29:45 +0000
Subject: [somewhat off topic] straw poll
In-Reply-To:
References:
Message-ID:

Q1 I ignore threads that aren't of interest (outlook even has a handy "ignore thread" button - though lists like this tend to break it)

Q2 If they get too annoying and don't have useful content, then I make a rule to send that person's mail to the trashcan. I include their name in the body to catch replies as well.

Q3 If there were too many of those folks, then I'd have more rules.

From petercon at microsoft.com Thu Sep 10 13:33:57 2015
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 10 Sep 2015 18:33:57 +0000
Subject: [somewhat off topic] straw poll
In-Reply-To: <55F1CA9D.5090200@ix.netcom.com>
References: <55F1CA9D.5090200@ix.netcom.com>
Message-ID:

Asmus, this came out of a friendly conversation meant to understand what kinds of topics do or don't seem interesting to people, and how people might react. There was real interest in getting some indication of list sentiment. I certainly don't mean to cause offense, or get too off topic. But I won't push this if it's felt to be that - I am certainly willing to follow the sentiments of list members on this and on whether any other topics are appropriate.
Peter

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Asmus Freytag (t)
Sent: Thursday, September 10, 2015 11:23 AM
To: unicode at unicode.org
Subject: Re: [somewhat off topic] straw poll

On 9/10/2015 11:04 AM, Peter Constable wrote:
> I was having an offline discussion with someone regarding certain topics that may show up on this list on occasion, and the question came up of what evidence we might have of sentiment on the list. So, I thought I'd conduct a simple straw poll - respond if you feel inclined.

This whole exercise strikes me as off topic. :)

A./

From KalvesmakiJ at doaks.org Thu Sep 10 13:44:47 2015
From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel)
Date: Thu, 10 Sep 2015 18:44:47 +0000
Subject: [somewhat off topic] straw poll
In-Reply-To:
References: <55F1CA9D.5090200@ix.netcom.com>
Message-ID:

Dear Peter,

This is the sort of inquiry that would be more efficiently conducted as a poll independent of the listserv, say with Google forms, to get a broader, more representative response from list members, many of whom wish neither to post nor to read individual responses on the listserv.

jk

From richard.wordingham at ntlworld.com Thu Sep 10 14:49:06 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 10 Sep 2015 20:49:06 +0100
Subject: [somewhat off topic] straw poll
In-Reply-To:
References:
Message-ID: <20150910204906.067d1fe0@JRWUBU2>

On Thu, 10 Sep 2015 18:04:33 +0000
Peter Constable wrote:

> The questions are framed around this hypothetical scenario: Suppose I
> were to post a message to the list describing some experiment I did,
> creating a Web page containing (say) some Latin characters - not
> obscure, just-added-in-Unicode-8 characters, but ones that have been
> in the standard for some time; that my process for creating the file
> was to use (say) Notepad and entering HTML numeric character
> references; and that my findings were that it worked.
>
> Q1: Would you find that to be an interesting post that makes
> your participation in the list more useful, or would you find it a
> noisy distraction that reduces the value you get from participating
> in the list?

Q1. It would tell me nothing I didn't know. That is because the usual expectation is now that Unicode works, so failures are of greater interest. Of course, news that significant hold-outs against Unicode had seen the light would also be useful.

On the other hand, some people might respond with useful alternative tricks for arbitrary text entry - keyboards that will take hex input as an exceptional case (e.g. m17n Unicode for BMP on many Linux systems), alt/x in Word, a reminder of the existence of MSKLC and Tavultesoft Keyman for making one's own keyboards and so on, and it could become a useful thread for some lurkers.

> Q2: If I were to send messages along that line on a regular basis,
> would that add value to your participation in the list, or reduce it?

If nothing useful happened, it would probably reduce, but a hint of the week thread would be tolerable and excused by the thought that it may be helping some people. And then I would probably learn something useful, and my attitude would become more favourable.

> Q3: If 50 people (still a small portion of the list membership) were
> to send messages along that line on a regular basis, would that add
> value to your participation in the list, or reduce it?

They'd probably soon run out of new, useful or interesting things to say.

So, what has become of Sarasvati? She hasn't scolded list participants for a long time.

Richard.
From charupdate at orange.fr Fri Sep 11 02:10:33 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 11 Sep 2015 09:10:33 +0200 (CEST)
Subject: [somewhat off topic] straw poll
Message-ID: <858151477.2818.1441955433203.JavaMail.www@wwinf1f13>

On 10 Sep 2015 at 20:30, Asmus Freytag (t) wrote:

> On 9/10/2015 11:04 AM, Peter Constable wrote:
>> I was having an offline discussion with someone regarding certain topics that may show up on this list on occasion, and the question came up of what evidence we might have of sentiment on the list. So, I thought I'd conduct a simple straw poll - respond if you feel inclined.
> This whole exercise strikes me as off topic. :)
>
> A./
>
>> The questions are framed around this hypothetical scenario: Suppose I were to post a message to the list describing some experiment I did, creating a Web page containing (say) some Latin characters - not obscure, just-added-in-Unicode-8 characters, but ones that have been in the standard for some time; that my process for creating the file was to use (say) Notepad and entering HTML numeric character references; and that my findings were that it worked.
>> Q1: Would you find that to be an interesting post that makes your participation in the list more useful, or would you find it a noisy distraction that reduces the value you get from participating in the list?
>> Q2: If I were to send messages along that line on a regular basis, would that add value to your participation in the list, or reduce it?
>> Q3: If 50 people (still a small portion of the list membership) were to send messages along that line on a regular basis, would that add value to your participation in the list, or reduce it?
I'm not about to fill up the frightening number of metadiscussions that arose since I've been mailing to the List, but after having posted all my main concerns and thanked for the answers, I see myself faced with the need for some kind of debrief, since an influential subscriber started using the strawmen technique to gather testimonies against another subscriber. I can't find another explanation for puffing up the issue by asking for statements about *fifty* persons sharing basic experiences on Unicode use, while AFAIK there have never been more than two, William Overington and myself, of whom *only one* is left. Talking about a multitude of people is a totally unrealistic scenario, as Richard Wordingham outlined, because the stuff then inevitably runs out very soon:

http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0079.html

Making any decisions based upon opinions gathered by this technique results in using an unfair methodology.

I've been stating that only *one* person is left, and I'm happy to add my response, which doesn't fit any of the three artificially built-up questions, but rather the one that tacitly underlies each of them: I've been glad to learn how William Overington is using HTML character hex codes. IIRW, it's even in the wake that I've added the &#x sequence in Shift on Numpad 0 when KanaLock is on, and the semicolon on + (while hex digits, and U+ and 0x, are on my numpad since a longer time). That's what I'll use when creating my next web page, as professionals are said to use text editors to achieve this (and I already did for charupdate.info; except that now it'll be Notepad++ instead of Notepad that Peter Constable cites again).
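For readers unfamiliar with the notation mentioned above, an HTML hexadecimal numeric character reference is the sequence `&#x`, the code point in hex digits, and a closing semicolon. The short Python sketch below builds such references and decodes them again with the standard library's `html.unescape`. The helper name `to_hex_ncr` and the sample letter are chosen here purely for illustration and do not come from the thread.

```python
import html


def to_hex_ncr(text: str) -> str:
    """Write every non-ASCII character as an HTML hex character reference."""
    return "".join(
        ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};"
        for ch in text
    )


sample = "ĉ"  # U+0109 LATIN SMALL LETTER C WITH CIRCUMFLEX
encoded = to_hex_ncr(sample)
print(encoded)                 # &#x109;
print(html.unescape(encoded))  # ĉ
```

The sketch leaves ASCII untouched, so a literal ampersand in the input would still need separate escaping before the text is placed in a page.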
To conclude, I wonder how Microsoft - which should ship a whole bunch of ultimately completed Unicode keyboard layouts with Windows since Unicode is thriving - I wonder how Microsoft justify their cynicism about seeing people discovering each one for himself what MSFT should have hurried to serve on a tray to all users, provided that Windows is the productivity worktool it claims to be. Well, basically this List is not the right spot to place that criticism. This is why I've to thank William and Peter for having brought up the occasion, each one in his way. I confess that I prefer William's. By far.

Best wishes,

Marcel

From wjgo_10009 at btinternet.com Fri Sep 11 08:06:49 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 11 Sep 2015 14:06:49 +0100 (BST)
Subject: [somewhat off topic] straw poll
In-Reply-To: <20150910204906.067d1fe0@JRWUBU2>
References: <20150910204906.067d1fe0@JRWUBU2>
Message-ID: <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost>

Richard Wordingham wrote:

> So, what has become of ...

I hope that that does not start again. It is unfair dealing.

Please look at the way that I was treated by a person or persons unknown.

http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0208.html

I do not understand why the request for a moratorium was not made either to Unicode Inc. or to one or more of the people named on the following web page.

http://www.unicode.org/consortium/directors.html

I do not know why the moratorium was imposed by a person or persons unknown.

I have been put back onto moderated post status as a result of the moratorium being imposed and I am still on it.

Although it is called a moratorium there was no indication of whether or how the moratorium would be removed.

I have simply had to try to make progress in other ways than by posting to the Unicode list.
I am hoping to send a document to the Unicode Technical Committee about encoding one character into Unicode so as to enable my invention to become implemented.

The moratorium prevents discussion in the mailing list prior to submission.

I am hoping that the moratorium will be removed, yet it is something that I cannot apply for as I do not know where to apply!

William Overington

11 September 2015

!

From wjgo_10009 at btinternet.com Fri Sep 11 09:13:15 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 11 Sep 2015 15:13:15 +0100 (BST)
Subject: [somewhat off topic] straw poll
In-Reply-To: <858151477.2818.1441955433203.JavaMail.www@wwinf1f13>
References: <858151477.2818.1441955433203.JavaMail.www@wwinf1f13>
Message-ID: <31826964.42623.1441980795598.JavaMail.defaultUser@defaultHost>

I am grateful to Marcel for his comments.

I received some email responses to my post entitled

A song in Esperanto

http://www.unicode.org/mail-arch/unicode-ml/y2015-m09/0056.html

Only one of the email responses was in any way whatsoever critical of me posting that post.

I responded and there was a continuing exchange of emails for a short while.

I wrote two emails, the other person wrote three emails.

There is the issue of netiquette so I feel that there is little more that I can add.

However, I am entirely happy and indeed would be pleased for the other person who participated in the email exchange to publish to this mailing list a full, unedited transcript of the five emails if that person is willing to do so.

I feel that in the discussion in this present thread it is important to remember the scope that the rules for the mailing list state.

William Overington

11 September 2015
From eik at iki.fi Fri Sep 11 10:14:04 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Fri, 11 Sep 2015 18:14:04 +0300
Subject: VS: [somewhat off topic] straw poll
In-Reply-To: <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost>
References: <20150910204906.067d1fe0@JRWUBU2> <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost>
Message-ID: <000301d0eca4$7fd91830$7f8b4890$@fi>

I, for one, don't see any reason to lift the moratorium on that particular worn-out topic.

Sincerely,

Erkki I. Kolehmainen

From webalorixa at gmail.com Fri Sep 11 10:45:01 2015
From: webalorixa at gmail.com (Luis de la Orden)
Date: Fri, 11 Sep 2015 16:45:01 +0100
Subject: [somewhat off topic] straw poll
In-Reply-To: <31826964.42623.1441980795598.JavaMail.defaultUser@defaultHost>
References: <858151477.2818.1441955433203.JavaMail.www@wwinf1f13> <31826964.42623.1441980795598.JavaMail.defaultUser@defaultHost>
Message-ID:

Q1: It doesn't matter. These problems are inherent in the format the discussions are constrained to: mailing lists. Mailing lists are like the old postal service in the countryside: the mailmen would lazily get to the house at the crossroads and dump everyone's mail there for the whole community to collect.

Q2: It doesn't matter. Mailing lists are not organised around topics, it is your free and conscious choice of topics that matters. I have no interest in anything other than African languages written in Latin characters, but a mailing list forces us all to receive everyone else's emails. I neither want to receive everyone's emails nor send emails to everyone. I want a little corner where perhaps every other month someone will come by and talk about Yoruba, Igbo, even off-topic religious concepts of Yoruba divination etc., and where I will not hear about anything else until I *decide* to browse around. The biggest confusion going on here is that the model treats this list's topic as "Unicode" whilst the reality is that Unicode is a universe of diverse topics. Another thing is that Unicode makes up 10 - 15% of my main professional interests as a User Experience Architect. For those amongst us that work 70% - 100% of the time with Unicode technology, email selection and deletion makes sense, but for others this is another stream of messages that needs to be managed side by side with what they really work with.
Q3: It doesn't matter. Since now we all know I am just interested in African languages, anyone can only assume that the value I am getting is the number of times these African languages topics appear, minus the number of times all the other topics do, multiplied by the number of times I have to delete something I am not interested in.

And now testing an old functionality of mailing lists:

UNSUBSCRIBE
From wjgo_10009 at btinternet.com Fri Sep 11 10:33:31 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Fri, 11 Sep 2015 16:33:31 +0100 (BST)
Subject: VS: [somewhat off topic] straw poll
In-Reply-To: <000301d0eca4$7fd91830$7f8b4890$@fi>
References: <20150910204906.067d1fe0@JRWUBU2> <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost> <000301d0eca4$7fd91830$7f8b4890$@fi>
Message-ID: <5920554.49990.1441985611875.JavaMail.defaultUser@defaultHost>

Erkki I. Kolehmainen wrote:

> I, for one, don't see any reason to lift the moratorium on that particular worn-out topic.

One reason is that there is the new idea of using the base character followed by a sequence of tag characters technique to represent each localizable sentence. Thus only one new character would need to become encoded into regular Unicode.

Yet there are other reasons too.

One is that the moratorium was not stated as being imposed either by Unicode Inc. or by any of the people named on the following web page.

http://www.unicode.org/consortium/directors.html

If Unicode Inc. chooses to impose a moratorium on discussing this development in information technology then Unicode Inc. should say so officially and post a policy document and not have this unfair imposition of a moratorium by a person or persons unknown.

Also, it is not a worn-out topic. It is a wonderful possibility for the future.
William Overington

11 September 2015

From richard.wordingham at ntlworld.com Fri Sep 11 12:04:04 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 11 Sep 2015 18:04:04 +0100
Subject: [somewhat off topic] straw poll
In-Reply-To: <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost>
References: <20150910204906.067d1fe0@JRWUBU2> <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost>
Message-ID: <20150911180404.378279ed@JRWUBU2>

On Fri, 11 Sep 2015 14:06:49 +0100 (BST)
William_J_G Overington wrote:

> Richard Wordingham wrote:

> > So, what has become of ...

> Please look at the way that I was treated by a person or persons
> unknown.

> http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0208.html

I'd forgotten that posting.

As to who Sarasvati is, try Wikipedia: https://en.wikipedia.org/wiki/Saraswati . Recording her role on the Unicode list would probably count as 'original research'. Sarasvati appears to be in charge of the email lists, though I will admit I'm not sure where to find a statement of this.

I am quite sure that Sarasvati enjoys the confidence of the Unicode Consortium.

Richard.

From frederic.grosshans at gmail.com Fri Sep 11 12:18:34 2015
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Fri, 11 Sep 2015 19:18:34 +0200
Subject: [somewhat off topic] straw poll
In-Reply-To: <20150911180404.378279ed@JRWUBU2>
References: <20150910204906.067d1fe0@JRWUBU2> <26872019.36726.1441976809728.JavaMail.defaultUser@defaultHost> <20150911180404.378279ed@JRWUBU2>
Message-ID: <55F30CEA.1050806@gmail.com>

On 11/09/2015 19:04, Richard Wordingham wrote:
> As to who Sarasvati is, try Wikipedia:
> https://en.wikipedia.org/wiki/Saraswati . Recording her role on the
> Unicode list would probably count as 'original research'

Thanks for this link! I'm ashamed to confess I was unaware of her identity, and I thought she was an employee of some IT company, managing the mailing list.
Frédéric

From doug at ewellic.org Fri Sep 11 12:25:44 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 11 Sep 2015 10:25:44 -0700
Subject: VS: [somewhat off topic] straw poll
Message-ID: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net>

William_J_G Overington wrote:

> If Unicode Inc. chooses to impose a moratorium on discussing this
> development in information technology then Unicode Inc. should say so
> officially and post a policy document and not have this unfair
> imposition of a moratorium by a person or persons unknown.

Finally, something on which William and I can agree.

I absolutely agree that UTC -- the technical committee, not the corporation -- should issue a formal statement expressing its position as to:

1. Generally, whether novel and untested concepts, particularly those for which a sizable body of popular support has not been established, are viewed by UTC as suitable and appropriate candidates for encoding in the Unicode Standard, on the basis of their perceived future usefulness. (I believe this statement has been made already; if so, a reference that can be easily cited would serve the purpose.)

2. Specifically, whether the particular concept that William proposes, to encode entities that are not characters into the Unicode Standard on the basis of their perceived future usefulness, is viewed by UTC as being suitable for and appropriate to the standard.

Whichever position is taken by this statement, pro or con, this list should honor it.

> Also, it is not a worn-out topic. It is a wonderful possibility for
> the future.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
From mark at macchiato.com Fri Sep 11 12:35:58 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 11 Sep 2015 19:35:58 +0200 Subject: VS: [somewhat off topic] straw poll In-Reply-To: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> References: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> Message-ID: I suggest that you create a proposal for the UTC so that it can go on record; I suspect it will get a favorable reception. Mark *? Il meglio ? l?inimico del bene ?* On Fri, Sep 11, 2015 at 7:25 PM, Doug Ewell wrote: > William_J_G Overington > wrote: > > > If Unicode Inc. chooses to impose a moratorium on discussing this > > development in information technology then Unicode Inc. should say so > > officially and post a policy document and not have this unfair > > imposition of a moratorium by a person or persons unknown. > > Finally, something on which William and I can agree. > > I absolutely agree that UTC -- the technical committee, not the > corporation -- should issue a formal statement expressing its position > as to: > > 1. Generally, whether novel and untested concepts, particularly those > for which a sizable body of popular support has not been established, > are viewed by UTC as suitable and appropriate candidates for encoding in > the Unicode Standard, on the basis of their perceived future usefulness. > (I believe this statement has been made already; if so, a reference that > can be easily cited would serve the purpose.) > > 2. Specifically, whether the particular concept that William proposes, > to encode entities that are not characters into the Unicode Standard on > the basis of their perceived future usefulness, is viewed by UTC as > being suitable for and appropriate to the standard. > > Whichever position is taken by this statement, pro or con, this list > should honor it. > > > Also, it is not a worn-out topic. 
It is a wonderful possibility for > > the future. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Fri Sep 11 12:37:42 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 11 Sep 2015 19:37:42 +0200 Subject: VS: [somewhat off topic] straw poll In-Reply-To: References: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> Message-ID: BTW, the only way I see anything from Overington is when a message is quoted by someone else, since I long ago filtered those out of my email inbox. Mark *? Il meglio ? l?inimico del bene ?* On Fri, Sep 11, 2015 at 7:35 PM, Mark Davis ?? wrote: > I suggest that you create a proposal for the UTC so that it can go on > record; I suspect it will get a favorable reception. > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Fri, Sep 11, 2015 at 7:25 PM, Doug Ewell wrote: > >> William_J_G Overington >> wrote: >> >> > If Unicode Inc. chooses to impose a moratorium on discussing this >> > development in information technology then Unicode Inc. should say so >> > officially and post a policy document and not have this unfair >> > imposition of a moratorium by a person or persons unknown. >> >> Finally, something on which William and I can agree. >> >> I absolutely agree that UTC -- the technical committee, not the >> corporation -- should issue a formal statement expressing its position >> as to: >> >> 1. Generally, whether novel and untested concepts, particularly those >> for which a sizable body of popular support has not been established, >> are viewed by UTC as suitable and appropriate candidates for encoding in >> the Unicode Standard, on the basis of their perceived future usefulness. >> (I believe this statement has been made already; if so, a reference that >> can be easily cited would serve the purpose.) >> >> 2. 
Specifically, whether the particular concept that William proposes, >> to encode entities that are not characters into the Unicode Standard on >> the basis of their perceived future usefulness, is viewed by UTC as >> being suitable for and appropriate to the standard. >> >> Whichever position is taken by this statement, pro or con, this list >> should honor it. >> >> > Also, it is not a worn-out topic. It is a wonderful possibility for >> > the future. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 >> >> >> > From root at unicode.org Fri Sep 11 12:51:37 2015 From: root at unicode.org (Sarasvati) Date: Fri, 11 Sep 2015 12:51:37 -0500 Subject: [somewhat off topic] straw poll Message-ID: <201509111751.t8BHpbrs029759@sarasvati.unicode.org> Greetings to all: Mr Wordingham wondered, > So, what has become of Sarasvati? > She hasn't scolded list participants for a long time. Most list participants continue to behave in a civil manner that doesn't require much scolding. Although sometimes individuals may be escorted to the woodshed where I store my lart. Let me take this opportunity to remind everyone to please remain tolerably on-topic, an admittedly wide range. As people stray further into realms of meta-discussion, other subscribers become increasingly annoyed. Mr Overington wondered, > there was no indication of whether or how > the moratorium would be removed. Once a moratorium has been declared here, it will not be lifted. The topic to which Mr Overington refers will never be suitable for discussion here.
Your ever-watchful-even-when-silent, -- Sarasvati From rick at unicode.org Fri Sep 11 13:07:47 2015 From: rick at unicode.org (Rick McGowan) Date: Fri, 11 Sep 2015 11:07:47 -0700 Subject: VS: [somewhat off topic] straw poll In-Reply-To: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> References: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> Message-ID: <55F31873.9030009@unicode.org> Doug, et al -- The primordial statement you're looking for is in TUS, Chapter 1 and has been there forever. See: http://www.unicode.org/versions/Unicode8.0.0/ch01.pdf In section 1.1, page 3: *Note, however, that the Unicode Standard does not encode idiosyncratic, personal, novel, or private-use characters, nor does it encode logos or graphics.* I'm not sure UTC has ever made any specific pronouncement on the topic, but they do sometimes add things to the notice of non-approvals, which can generally be taken as a precedent. http://unicode.org/alloc/nonapprovals.html If there is any such statement from the UTC, Ken Whistler would probably be the one who could put his hand upon it most quickly. :-) R. On 9/11/2015 10:25 AM, Doug Ewell wrote: > I absolutely agree that UTC -- the technical committee, not the > corporation -- should issue a formal statement expressing its position > as to: > > 1. Generally, whether novel and untested concepts, particularly those > for which a sizable body of popular support has not been established, > are viewed by UTC as suitable and appropriate candidates for encoding in > the Unicode Standard, on the basis of their perceived future usefulness. > (I believe this statement has been made already; if so, a reference that > can be easily cited would serve the purpose.) > > 2.
Specifically, whether the particular concept that William proposes, > to encode entities that are not characters into the Unicode Standard on > the basis of their perceived future usefulness, is viewed by UTC as > being suitable for and appropriate to the standard. From doug at ewellic.org Fri Sep 11 13:11:16 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 11 Sep 2015 11:11:16 -0700 Subject: VS: [somewhat off topic] straw poll Message-ID: <20150911111116.665a7a7059d7ee80bb4d670165c8327d.cce00ab0d1.wbe@email03.secureserver.net> Mark Davis ☕️ wrote: > I suggest that you create a proposal for the UTC so that it can go on > record; I suspect it will get a favorable reception. I assume this was not meant for me personally. I have no authority to speak for UTC. The closest I ever got to that was when I got UTN #14 published. I'm serious about this (unlike the beer color modifiers). This statement needs to come officially and formally from UTC, as William suggested, not from randoms like me. -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From petercon at microsoft.com Fri Sep 11 13:26:06 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 11 Sep 2015 18:26:06 +0000 Subject: VS: [somewhat off topic] straw poll In-Reply-To: <20150911111116.665a7a7059d7ee80bb4d670165c8327d.cce00ab0d1.wbe@email03.secureserver.net> References: <20150911111116.665a7a7059d7ee80bb4d670165c8327d.cce00ab0d1.wbe@email03.secureserver.net> Message-ID: UTC can act on documents submitted to it, or to input submitted to it via the contact form (http://www.unicode.org/reporting.html), but will not act in response solely to topics discussed in this list. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, September 11, 2015 11:11 AM To: Mark Davis ☕️
Cc: Unicode Mailing List Subject: RE: VS: [somewhat off topic] straw poll Mark Davis ☕️ wrote: > I suggest that you create a proposal for the UTC so that it can go on > record; I suspect it will get a favorable reception. I assume this was not meant for me personally. I have no authority to speak for UTC. The closest I ever got to that was when I got UTN #14 published. I'm serious about this (unlike the beer color modifiers). This statement needs to come officially and formally from UTC, as William suggested, not from randoms like me. -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From doug at ewellic.org Fri Sep 11 13:34:37 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 11 Sep 2015 11:34:37 -0700 Subject: VS: [somewhat off topic] straw poll Message-ID: <20150911113437.665a7a7059d7ee80bb4d670165c8327d.d978b6f58c.wbe@email03.secureserver.net> Rick McGowan wrote: > In section 1.1, page 3: > > *Note, however, that the Unicode Standard does not encode > idiosyncratic, personal, novel, or private-use characters, nor does it > encode logos or graphics.* Is there a statement anywhere about entities that aren't characters in any sense, other than having an arbitrary glyph assigned to them in a font somewhere? What about encoding things on speculation of future use, without a clear indication of imminent adoption -- the criterion applied to the euro sign, and more recently to emoji? > I'm not sure UTC has ever made any specific pronouncement on the > topic, but they do sometimes add things to the notice of non-approvals, which > can generally be taken as a precedent. Unfortunately for those hoping for a definitive statement, even non-approvals are occasionally overturned; U+1E9E LATIN CAPITAL LETTER SHARP S leaps to mind. Evidently nothing short of a specific pronouncement on this specific topic will suffice. -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
From petercon at microsoft.com Fri Sep 11 13:36:58 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 11 Sep 2015 18:36:58 +0000 Subject: [somewhat off topic] straw poll In-Reply-To: <201509111751.t8BHpbrs029759@sarasvati.unicode.org> References: <201509111751.t8BHpbrs029759@sarasvati.unicode.org> Message-ID: I did not intend to create a disturbance. Nor did I intend to do anything that might possibly be perceived as seeking action from the list administrator. (I mention that since Sarasvati was invoked.) And I certainly was not intending in any way to bring up moratoria that may have been declared on past topics or to suggest moratoria on new topics. (I mention that since somehow a previously-declared moratorium was raised in a reply to my original post.) I was merely seeking an indication of sentiment on the list regarding certain topics. This arose from an off-list discussion with one list member who has on occasion posted on certain topics and who indicated interest in seeing an indication of sentiment from the list. But it seems like my approach may be stirring up trouble and hence was not well-conceived. Hence, I apologize to the list and to any individuals I may have offended by this. Peter From richard.wordingham at ntlworld.com Fri Sep 11 14:46:15 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Fri, 11 Sep 2015 20:46:15 +0100 Subject: [somewhat off topic] straw poll In-Reply-To: References: <201509111751.t8BHpbrs029759@sarasvati.unicode.org> Message-ID: <20150911204615.39bbe997@JRWUBU2> On Fri, 11 Sep 2015 18:36:58 +0000 Peter Constable wrote: > I did not intend to create a disturbance. Nor did I intend to do > anything that might possibly be perceived as seeking action from the > list administrator. (I mention that since Sarasvati was invoked.) > But it seems like my approach may be stirring up trouble and hence was > not well-conceived. That's the trouble with staying civil!
We have to guess when others are angry, and guess wrong. > Hence, I apologize to the list and to any individuals I may have > offended by this. Accepted, but with the observation that you are blameless. Richard. From daniel.buenzli at erratique.ch Fri Sep 11 18:14:10 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Sat, 12 Sep 2015 00:14:10 +0100 Subject: VS: [somewhat off topic] straw poll In-Reply-To: References: <20150911102544.665a7a7059d7ee80bb4d670165c8327d.ddb725c04d.wbe@email03.secureserver.net> Message-ID: On Friday, 11 September 2015 at 18:37, Mark Davis ☕️ wrote: > BTW, the only way I see anything from Overington is when a message is quoted by someone else, since I long ago filtered those out of my email inbox. When I read this message [1] (which I disagree with but that's another issue) I thought you were a moderator on this list. If that is the case then I don't think you should base your moderation on having your own personal filter over the mailing list. If you are not the actual moderator for the list then forget about this message. Whoever the moderator is on this list, I think (s)he is doing a pretty bad job at it. Best, Daniel [1] http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0249.html From textexin at xencraft.com Fri Sep 11 18:26:40 2015 From: textexin at xencraft.com (Tex Texin) Date: Fri, 11 Sep 2015 16:26:40 -0700 Subject: the wheels on the bus Message-ID: <006001d0ece9$504468a0$f0cd39e0$@xencraft.com> Why do so many of the threads on this list seem best described as wheels coming off the bus? (Where is the emoji for that?) It is all too common for a thread to start, its appropriateness questioned, and then meta, policy and legalistic analysis ensue to no real end.
I understand we often enter gray areas for what is appropriate for Unicode to include and for what is of interest to a diverse list ranging from newbies to experts and longstanding members, innovative and pragmatic folks, so we get discussion at all levels and we want to be extremely tolerant. (OK, we means me. Not sure what you all think, but this is how I interpret the policies here and the comments being made). Since we don't want to ban people but we want to improve the quality of the discussions, perhaps we can do the following. Create another list for meta, policy, and topics that are not directly encoding related. If a thread starts here, and a number of voices indicate it is off topic or if the mighty Sarasvati deems so, the discussion gets moved to the "meta list" (by Sarasvati or a UTC delegate). There the idea can evolve, be debated, or die on the vine. At some point, if it becomes a proposal to the UTC or is refined enough, Sarasvati or some delegate ordained by Unicode can bring the idea back to this list. But it should only come back if authorized. Violating that policy is grounds for banishment. By "move" I do not mean deleted from this list. We just need to stipulate further discussion is on the "meta" list. An approach like this gives ideas that are not of obvious interest or relevance to this list a place to go. And yes the decision as to which subjects should be moved over is still gray and Solomon-like, but since the discussion has a home those who want to pursue it can do so, so the practice isn't harmful. And it should reduce the urge for advocates to keep bringing the unwanted subjects up on this list. The other benefit is I, and I am sure many others, wanted to echo Asmus's and others' comments about the poll or other topics being off topic. I didn't respond as me too messages make the problem worse. If an off topic thread is moved over, then even the "so glad it moved" messages can go there.
Or messages of a new type "Please bring this off topic thread from the Unicode list over here..." Ok, I have rolled out a new bus and I know the wheels are coming off. -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, September 11, 2015 11:35 AM To: Rick McGowan Cc: Unicode Mailing List Subject: RE: VS: [somewhat off topic] straw poll Rick McGowan wrote: > In section 1.1, page 3: > > *Note, however, that the Unicode Standard does not encode > idiosyncratic, personal, novel, or private-use characters, nor does it > encode logos or graphics.* Is there a statement anywhere about entities that aren't characters in any sense, other than having an arbitrary glyph assigned to them in a font somewhere? What about encoding things on speculation of future use, without a clear indication of imminent adoption -- the criterion applied to the euro sign, and more recently to emoji? > I'm not sure UTC has ever made any specific pronouncement on the > topic, but they do sometimes add things to the notice of > non-approvals, which can generally be taken as a precedent. Unfortunately for those hoping for a definitive statement, even non-approvals are occasionally overturned; U+1E9E LATIN CAPITAL LETTER SHARP S leaps to mind. Evidently nothing short of a specific pronouncement on this specific topic will suffice. -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From root at unicode.org Fri Sep 11 19:31:01 2015 From: root at unicode.org (Sarasvati) Date: Fri, 11 Sep 2015 19:31:01 -0500 Subject: VS: [somewhat off topic] straw poll Message-ID: <201509120031.t8C0V1Dx017549@sarasvati.unicode.org> Good morning everyone! This topic has probably now received enough attention, and thank you to all who have contributed. Let us please move along to something else.
Everyone should now point their web browsers at the list policies and guidelines to refresh themselves: http://unicode.org/policies/mail_policy.html The main point to remember in the current context is that discussions of mail list policy are out of scope for this list. If you have problems with a subscriber, or with how a topic is unfolding, please write to the staff, not to the list. Moderation on this list has always been very light, and mainly to assure a basic level of civility in the discussions. If you have a problem with that, please consider filing a complaint via the contact form or contacting the offending user privately. http://www.unicode.org/reporting.html Your, -- Sarasvati From otto.stolz at uni-konstanz.de Sat Sep 12 06:21:22 2015 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Sat, 12 Sep 2015 13:21:22 +0200 Subject: [somewhat off topic] straw poll In-Reply-To: References: Message-ID: <55F40AB2.7050303@uni-konstanz.de> On 10 September 2015 at 20:04, Peter Constable wrote: > […] creating a Web page containing (say) some Latin characters > - not obscure, […] to use (say) Notepad and entering HTML > numeric character references; and that my findings were that > it worked. > Q1: Would you find that to be an interesting post […] A1: No, because the scenario given is about a standard technique that every list participant is supposed to be aware of. I'd simply ignore a message of this type. If, however, a message were asking a question on this technique, I'd probably send the author a short reply pointing to the pertinent FAQ entry, or HTML tutorial. > Q2: If I were to send messages along that line on a regular basis, > would that add value to your participation in the list, or reduce it? A2: Neither. If a particular author became notorious for this sort of contribution, I'd start to ignore his messages altogether. If his messages were to develop into a nuisance, I'd add him to the filter rules of my e-mail client.
> Q3: If 50 people (still a small portion of the list membership) > were to send messages along that line on a regular basis, would > that add value to your participation in the list, or reduce it? A3: All of them would not start doing so at the same time, would they? Hence, A2 would apply, on a per-case basis, without much ado. Best wishes, Otto From daniel.buenzli at erratique.ch Tue Sep 15 20:45:27 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 16 Sep 2015 02:45:27 +0100 Subject: Grapheme clusters and east asian width Message-ID: Hello, Is there any guidance on how to combine the information given by grapheme clusters and the east asian width property to do fixed-width layouts in terminal emulators? For example if we have: U+AC01 ( 각 ) HANGUL SYLLABLE GAG This will delimit a single grapheme cluster with east asian width W and hence 2 columns in a tty. However if we have it as the sequence: U+1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK U+1161 ( ᅡ ) HANGUL JUNGSEONG A U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK This will delimit a single grapheme cluster, but if I try to add up their east asian widths (W, N, N), this would result in 4 columns. Does something naïve like looking up only the east asian width of the first scalar value in the grapheme cluster and using 2 columns for it if this is F or W and 1 column otherwise work, or are there counterexamples where this breaks? Or is there anything more clever that can be done? Thanks, Daniel From daniel.buenzli at erratique.ch Wed Sep 16 12:44:38 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 16 Sep 2015 18:44:38 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <55F9A2A3.2060500@bayarea.net> References: <55F9A2A3.2060500@bayarea.net> Message-ID: On Wednesday, 16 September 2015 at 18:10, Edwin Hoogerbeets wrote: > Have you looked into the Unicode Normalization Algorithm?
Since in general a precomposed character cannot always be found, I'll still need to apply the Unicode segmentation algorithm for finding grapheme clusters and I'd rather not add one more layer of processing if I can avoid it. Best, Daniel From richard.wordingham at ntlworld.com Wed Sep 16 14:33:51 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 16 Sep 2015 20:33:51 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: References: Message-ID: <20150916203351.50403e3e@JRWUBU2> On Wed, 16 Sep 2015 02:45:27 +0100 Daniel Bünzli wrote: > This will delimit a single grapheme cluster, but if I try to add up > their east asian widths (W, N, N), this would result in 4 columns. > Does something naïve like looking up only the east asian width of the > first scalar value in the grapheme cluster and use 2 columns for it > if this is F or W and 1 column otherwise work or are there counter > examples where this breaks ? Or is there anything more clever that > can be done ? The silence is a bit worrying, but I can't see why that wouldn't work for normal text in CJK scripts. (Hangul LLLLLVVVVTTTT would probably cause some problems!) Have you addressed the issue of Indic scripts? There are discontiguous grapheme clusters composed of indecomposable code points (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g. U+0BCB TAMIL VOWEL SIGN OO), and whether consonant + virama + consonant is one cell or two may even depend on the font (e.g. Devanagari). How are you handling ligatures between grapheme clusters, e.g. English ? There are Tamil and Tai Tham examples of compulsory ligatures, shri and naa. Looking further ahead, there are characters in the pipeline that should be either Mc or Mn depending on what the base consonant is! You have dealt with grapheme clusters with a width of one cell and a depth of two, haven't you? Actually, there's a good argument for some grapheme clusters occupying cells above and below the line! Richard.
From lyratelle at gmx.de Wed Sep 16 15:27:25 2015 From: lyratelle at gmx.de (Dominikus Dittes Scherkl) Date: Wed, 16 Sep 2015 22:27:25 +0200 Subject: Grapheme clusters and east asian width In-Reply-To: References: Message-ID: <55F9D0AD.1010400@gmx.de> On 16.09.2015 at 03:45, Daniel Bünzli wrote: > Hello, > > Is there any guidance on how to combine the information given by > grapheme clusters and the east asian width property to do fixed-width > layouts in terminal emulators ? > > For example if we have: > > U+AC01 ( 각 ) HANGUL SYLLABLE GAG > > This will delimit a single grapheme cluster with east asian width W > and hence 2 columns in a tty. However if we have it as the sequence: > > U+1100 ( ᄀ ) HANGUL CHOSEONG KIYEOK U+1161 ( ᅡ ) HANGUL JUNGSEONG A > U+11A8 ( ᆨ ) HANGUL JONGSEONG KIYEOK > > > > This will delimit a single grapheme cluster, but if I try to add up > their east asian widths (W, N, N), this would result in 4 columns. > Why add them up? I think every grapheme cluster of hangul syllables would simply have width 2 - that is the concept of CJK characters. -- Dominikus Dittes Scherkl From asmus-inc at ix.netcom.com Wed Sep 16 16:14:11 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Wed, 16 Sep 2015 14:14:11 -0700 Subject: Grapheme clusters and east asian width In-Reply-To: References: Message-ID: <55F9DBA3.20400@ix.netcom.com> An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Sep 16 16:34:17 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 16 Sep 2015 22:34:17 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <55F9D0AD.1010400@gmx.de> References: <55F9D0AD.1010400@gmx.de> Message-ID: <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> On Wednesday, 16 September 2015 at 21:27, Dominikus Dittes Scherkl wrote: > Why add them up? > I think every grapheme cluster of hangul syllables would simply have > width 2 - that is the concept of CJK characters.
I don't personally know how CJK characters behave in general w.r.t. width, that's why I'm asking. I'm just trying to find a simple, best-effort, data-driven algorithm for the problem at hand by using standard properties and possibly without making built-in assumptions about scripts. On Wednesday, 16 September 2015 at 20:33, Richard Wordingham wrote: > Have you addressed the issue of Indic scripts? There are > discontiguous grapheme clusters composed of indecomposable code points > (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g. > U+0BCB TAMIL VOWEL SIGN OO), Not sure I understand what you mean here. > and whether consonant + virama + consonant is one cell or two may even depend on the font (e.g. > Devanagari). Well anything that is related to font metrics is out of scope from the point of view of a tty as I can't get the information. For example it seems that U+1F400 to U+1F579 have an east-asian width of N but will actually occupy two columns in the built-in osx terminal; of course these scalar values are not east asian text per se. > How are you handling ligatures between grapheme clusters, > e.g. English ? Here again I'd need font information for that, I expect the tty not to make ligatures between f and i. Of course the best way would be to be able to hand out a string to the tty for it to measure. But then it already seems impossible to test whether a terminal is able to handle UTF-8 or not? Maybe trying to use that east asian width property was not a good idea to start with. Best, Daniel From daniel.buenzli at erratique.ch Wed Sep 16 16:56:42 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 16 Sep 2015 22:56:42 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <55F9DBA3.20400@ix.netcom.com> References: <55F9DBA3.20400@ix.netcom.com> Message-ID: On Wednesday, 16 September 2015 at
22:14, Asmus Freytag (t) wrote: > "N" doesn't mean "narrow" but "neutral" - that is, the width is given by other consideration. Ah right! Thanks. Narrow is Na. So a refined algorithm would be to actually do the summation in each grapheme cluster as I initially wanted to do with the mapping (F, W -> 2), (Na, H -> 1) (N -> 0) and if I get a 0 fallback on 1 or maybe try to make an educated guess according to the script/block. Best, Daniel From richard.wordingham at ntlworld.com Wed Sep 16 19:19:39 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 17 Sep 2015 01:19:39 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: References: <55F9DBA3.20400@ix.netcom.com> Message-ID: <20150917011939.22861725@JRWUBU2> On Wed, 16 Sep 2015 22:56:42 +0100 Daniel Bünzli wrote: > On Wednesday, 16 September 2015 at 22:14, Asmus Freytag (t) wrote: > > "N" doesn't mean "narrow" but "neutral" - that is, the width is > > given by other consideration. > > Ah right ! Thanks. Narrow is Na. > > So a refined algorithm would be to actually do the summation in each > grapheme cluster as I initially wanted to do with the mapping (F, W > -> 2), (Na, H -> 1) (N -> 0) and if I get a 0 fallback on 1 or maybe > try to make an educated guess according to the script/block. I think you have a problem with U+302E HANGUL SINGLE DOT TONE MARK and U+302F HANGUL DOUBLE DOT TONE MARK, contrary to what I said earlier. They are preposed combining marks with Grapheme_Extend=Yes and EAW=Wide. I'm not sure whether the (legacy & extended) grapheme cluster should occupy 2, 3 or 4 cells. I think 2 cells is wrong, so summation works better, contrary to what I said earlier. Does anyone know how EAW=Wide was derived for these characters? Apparently they were wide even when they were non-spacing marks (gc=Mn), e.g. in Unicode Version 5.0, so I suspect they were not given individual consideration. I suspect they should be EAW=A(mbiguous). Richard.
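[Editor's sketch] The per-cluster summation Daniel describes above, with (F, W -> 2), (Na, H -> 1), (N -> 0) and a fallback of 1 when a cluster sums to 0, might look roughly like the following in Python. This is only an illustration, not anything posted to the list. It assumes the text has already been split into grapheme clusters by some other means, since the standard library has no UAX #29 segmenter. The names `cluster_width` and `text_width` are hypothetical, and treating A (ambiguous) like N is an assumption the thread does not settle.

```python
import unicodedata

# Columns contributed by each East_Asian_Width value, following the mapping
# proposed in the thread. Treating "A" (ambiguous) like "N" is an assumption
# made for this sketch; the thread does not say how A should be handled.
EAW_COLUMNS = {"F": 2, "W": 2, "Na": 1, "H": 1, "N": 0, "A": 0}


def cluster_width(cluster: str) -> int:
    """Estimate the terminal columns used by one grapheme cluster.

    Sums the per-code-point contributions and falls back to 1 column
    when the sum is 0, as suggested in the thread.
    """
    total = sum(EAW_COLUMNS[unicodedata.east_asian_width(ch)] for ch in cluster)
    return total if total > 0 else 1


def text_width(clusters) -> int:
    """Estimate the columns used by a sequence of pre-segmented clusters."""
    return sum(cluster_width(c) for c in clusters)
```

For instance, `text_width(["A", "\uAC01"])` would count one column for the Latin letter and two for the precomposed Hangul syllable. As Richard notes above, wide combining marks such as U+302E make any such estimate debatable, and the result also depends on the Unicode version behind the `unicodedata` module.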
From richard.wordingham at ntlworld.com Wed Sep 16 20:25:47 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 17 Sep 2015 02:25:47 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> Message-ID: <20150917022547.5640ee26@JRWUBU2> On Wed, 16 Sep 2015 22:34:17 +0100 Daniel Bünzli wrote: > On Wednesday, 16 September 2015 at 20:33, Richard Wordingham wrote: > > Have you addressed the issue of Indic scripts? There are > > discontiguous grapheme clusters composed of indecomposable code > > points (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code > > points (e.g. U+0BCB TAMIL VOWEL SIGN OO), > > Not sure I understand what you mean here. In Khmer, a sequence is rendered with glyphs in the order /sign E, KA, sign AA/, and in Tamil a sequence is rendered with the glyphs in the order /sign EE, KA, sign AA/. All the glyphs have non-zero advance width. In both cases splits into two legacy grapheme clusters , but are a single extended grapheme cluster. In Tamil, is in NFC but not in NFD, and splits into > > and whether consonant + virama + consonant is one cell or two may > > even depend on the font (e.g. Devanagari). > > Well anything that is related to font metrics is out of scope from > the point of view of a tty as I can't get the information. You asked, "Is there any guidance on how to combine the information given by grapheme clusters and the east asian width property to do fixed-width layouts in terminal emulators ?". From this, I deduced that you are trying to write a terminal emulator. Are you actually trying to work out how a terminal emulator someone else wrote will position characters? Whether consonant + virama + consonant is one cell or two isn't a question of font metrics. For example, consider the sequence . This is composed of two legacy and extended grapheme clusters, and .
In the 'Lohit Hindi' font, the two consonants are arranged vertically with no other representation of VIRAMA; horizontally, this is a single cell. In the 'gargi' font, one gets two instances of DDA side by side, with VIRAMA visible below the first. Both fonts are fully compliant with Unicode. If the terminal you are working with emulates a VT100, I believe it should be possible to ask it what the current cursor position is. At http://www.ccs.neu.edu/research/gpc/VonaUtils/vona/terminal/VT100_Escape_Codes.html , the query and response are called getcursor DSR and cursor CPR. > For > example it seems that U+1F400 to U+1F579 have an east-asian width of > N but will actually occupy two columns in the built-in osx terminal; > of course these scalar values are not east asian text per se. In so far as the property is useful, they probably should be ea=Wide. > Of course the best way would be to be able to hand out a string to > the tty for it to measure. But then it already seems impossible to > test whether a terminal is able to handle UTF-8 or not? > Maybe trying to use that east asian width property, was not a good > idea to start with. If you're trying to work out what a particular emulator will do, the starting point is its documentation. For many, the useful documentation may turn out to be the source code, which is not always available. However, a successful dialogue with the terminal would avoid these problems. It may even offer a solution to the problems of terminal size and text wrapping behaviour. Richard. 
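[Editor's sketch] The cursor-position dialogue Richard mentions above (a DSR query answered by a CPR report) can be illustrated as follows. The escape sequences are the standard ones: the query is ESC [ 6 n and the reply has the form ESC [ row ; column R. The helper names `parse_cpr` and `columns_advanced` are hypothetical, and the actual terminal I/O (putting the tty into raw mode, writing the query, reading the reply) is deliberately left out, so this only shows how a measured width could be derived from two reports.

```python
import re

# Device Status Report request for the cursor position (ESC [ 6 n).
DSR_CURSOR_QUERY = "\x1b[6n"

# Cursor Position Report sent back by the terminal: ESC [ <row> ; <col> R.
_CPR_PATTERN = re.compile(r"\x1b\[(\d+);(\d+)R")


def parse_cpr(reply: str):
    """Return (row, column) from a CPR reply, or None if it is not one."""
    match = _CPR_PATTERN.search(reply)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def columns_advanced(reply_before: str, reply_after: str):
    """Columns the cursor moved between two CPR replies on the same row.

    Returns None if either reply cannot be parsed or the cursor changed
    rows (for example because the text wrapped).
    """
    before = parse_cpr(reply_before)
    after = parse_cpr(reply_after)
    if before is None or after is None or before[0] != after[0]:
        return None
    return after[1] - before[1]
```

A caller would send `DSR_CURSOR_QUERY`, read the reply, print the string to be measured, query again, and pass both replies to `columns_advanced`. Whether a given emulator answers the query at all is something that has to be checked per terminal, as the thread points out.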
From daniel.buenzli at erratique.ch Thu Sep 17 04:00:29 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Thu, 17 Sep 2015 10:00:29 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <20150917022547.5640ee26@JRWUBU2> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> Message-ID: <8523F8113D4A42CABC613375C5F95639@erratique.ch> On Thursday, 17 September 2015 at 02:25, Richard Wordingham wrote: > Are you actually trying to work out how a terminal emulator someone else wrote will position > characters? Yes. Basically, given a (let's say single-line) UTF-8 string to output to, let's say, an ANSI tty, I'd like to compute its visual extents. > In so far as the property is useful, they probably should be ea=Wide. This seems consistent with what is written in UAX #11 6.4 though. > If you're trying to work out what a particular emulator will do, the > starting point is its documentation. Unfortunately *many* emulators. Thanks, Daniel From richard.wordingham at ntlworld.com Thu Sep 17 07:27:31 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 17 Sep 2015 13:27:31 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <8523F8113D4A42CABC613375C5F95639@erratique.ch> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> Message-ID: <20150917132731.77680f77@JRWUBU2> On Thu, 17 Sep 2015 10:00:29 +0100 Daniel Bünzli wrote: > On Thursday, 17 September 2015 at 02:25, Richard Wordingham wrote: > > If you're trying to work out what a particular emulator will do, the > > starting point is its documentation. > Unfortunately *many* emulators. The best estimator is probably the POSIX function wcswidth(). The terminal emulator might actually use that function to do its layout. Some do.
If you need accuracy, you may have to resort to asking the terminal where the cursor is. Of course the latter might not work if only the general concept of a terminal (perhaps, better, teletype) is being emulated. I wouldn't expect either to work for an application being run from the emacs shell program, which works with 'proportional' fonts, though one might get a pleasant surprise. (For example, emacs *might* *convert* the cursor position to nominal cell widths.) Richard. From eliz at gnu.org Thu Sep 17 09:47:53 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 17 Sep 2015 17:47:53 +0300 Subject: Grapheme clusters and east asian width In-Reply-To: <20150917132731.77680f77@JRWUBU2> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> Message-ID: <83vbb9qezq.fsf@gnu.org> > Date: Thu, 17 Sep 2015 13:27:31 +0100 > From: Richard Wordingham > > The best estimator is probably the POSIX function wcswidth(). Only on glibc-based systems, I'm quite sure. > The > terminal emulator might actually use that function to do its layout. > Some do. If you need accuracy, you may have to resort to asking the > terminal where the cursor is. Of course the latter might not work > if only the general concept of a terminal (perhaps, better, teletype) > is being emulated. I wouldn't expect either to work for an application > being run from the emacs shell program, which works with 'proportional' > fonts, though one might get a pleasant surprise. (For example, emacs > *might* *convert* the cursor position to nominal cell widths.) When Emacs displays on a text terminal, it's up to the terminal to handle the font; Emacs speaks to the terminal in character cell units. When Emacs displays on a graphics terminal, it works in pixels, so cursor position in character units is not useful. 
In any case, where do you think Emacs gets its idea of the width of every character? What other database could it use, that can be relied upon on any of the modern OSes, except the UCD? From daniel.buenzli at erratique.ch Thu Sep 17 10:51:03 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Thu, 17 Sep 2015 16:51:03 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <83vbb9qezq.fsf@gnu.org> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> <83vbb9qezq.fsf@gnu.org> Message-ID: On Thursday, 17 September 2015 at 15:47, Eli Zaretskii wrote: > > Date: Thu, 17 Sep 2015 13:27:31 +0100 > > From: Richard Wordingham > > > > The best estimator is probably the POSIX function wcswidth(). > Only on glibc-based systems, I'm quite sure. Is there a formal definition of the algorithm used ? This [1] is not very helpful. Best, Daniel [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/wcswidth.html From eliz at gnu.org Thu Sep 17 11:24:02 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 17 Sep 2015 19:24:02 +0300 Subject: Grapheme clusters and east asian width In-Reply-To: References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> <83vbb9qezq.fsf@gnu.org> Message-ID: <83oah1qajh.fsf@gnu.org> > Date: Thu, 17 Sep 2015 16:51:03 +0100 > From: Daniel Bünzli > Cc: Richard Wordingham , unicode at unicode.org > > > > Date: Thu, 17 Sep 2015 13:27:31 +0100 > > > From: Richard Wordingham > > > > > > The best estimator is probably the POSIX function wcswidth(). > > Only on glibc-based systems, I'm quite sure. > > Is there a formal definition of the algorithm used ? This [1] is not very helpful. They just use a table of values, AFAIK.
From daniel.buenzli at erratique.ch Thu Sep 17 11:25:34 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Thu, 17 Sep 2015 17:25:34 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <83oah1qajh.fsf@gnu.org> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> <83vbb9qezq.fsf@gnu.org> <83oah1qajh.fsf@gnu.org> Message-ID: <85B20FFE60FC4FBA9EE6E867D19CEB90@erratique.ch> On Thursday, 17 September 2015 at 17:24, Eli Zaretskii wrote: > > Is there a formal definition of the algorithm used ? This [1] is not very helpful. > > They just use a table of values, AFAIK. But is it standardized, or does everyone have their own table? Daniel From eliz at gnu.org Thu Sep 17 11:30:41 2015 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 17 Sep 2015 19:30:41 +0300 Subject: Grapheme clusters and east asian width In-Reply-To: <85B20FFE60FC4FBA9EE6E867D19CEB90@erratique.ch> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> <83vbb9qezq.fsf@gnu.org> <83oah1qajh.fsf@gnu.org> <85B20FFE60FC4FBA9EE6E867D19CEB90@erratique.ch> Message-ID: <83mvwlqa8e.fsf@gnu.org> > Date: Thu, 17 Sep 2015 17:25:34 +0100 > From: Daniel Bünzli > Cc: richard.wordingham at ntlworld.com, unicode at unicode.org > > On Thursday, 17 September 2015 at 17:24, Eli Zaretskii wrote: > > > Is there a formal definition of the algorithm used ? This [1] is not very helpful. > > > > They just use a table of values, AFAIK. > > But is it standardized, or does everyone have their own table? I don't know, but I'm sure you will find out if you look into the glibc sources. They are publicly available.
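For readers following the "table of values" exchange, here is a rough sketch of a table-driven column estimate, assuming Python 3 and the copy of the UCD bundled with its unicodedata module. This is not the glibc table and not what any particular emulator does; the function name estimate_columns and the ambiguous_wide switch are hypothetical, and results depend on the Unicode version the interpreter ships with.

```python
import unicodedata


def estimate_columns(text: str, ambiguous_wide: bool = False) -> int:
    """Estimate terminal columns by summing per-character widths.

    Assumptions of this sketch: nonspacing marks, enclosing marks and
    format characters take 0 columns; East_Asian_Width W and F take 2;
    A (ambiguous) takes 2 only if ambiguous_wide is set; everything else,
    including control characters, is counted as 1.
    """
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # contributes no columns in this estimate
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F") or (eaw == "A" and ambiguous_wide):
            total += 2
        else:
            total += 1
    return total
```

For example, estimate_columns("abc") is 3 and a three-ideograph string such as "日本語" comes out as 6. The estimate is a plain per-character sum, so it can still disagree with what a given terminal actually draws.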
From fantasai.lists at inkedblade.net Thu Sep 17 12:16:36 2015 From: fantasai.lists at inkedblade.net (fantasai) Date: Thu, 17 Sep 2015 13:16:36 -0400 Subject: [CSSWG][css-inline] Updated WD of CSS Inline Layout Message-ID: <55FAF574.2050501@inkedblade.net> The CSS WG has published an updated Working Draft of the CSS Inline Layout Module Level 3 http://www.w3.org/TR/css-inline-3/ This module covers inline vertical alignment and special typographic effects for initial letters, such as drop caps. Changes since the previous WD include: * Addition of initial drafts for 'dominant-baseline' as well as 'vertical-align' and its SVG longhands 'alignment-baseline' and 'baseline-shift'. http://www.w3.org/TR/css-inline-3/#line-height * Addition of the 'initial-letter-wrap' property. http://www.w3.org/TR/css-inline-3/#initial-letter-wrapping * A redesign of the 'initial-letter-align' property. http://www.w3.org/TR/css-inline-3/#aligning-initial-letter * A large variety of fixes, clarifications, and improvements to the initial letter layout model. http://www.w3.org/TR/css-inline-3/#initial-letter-styling We're actively looking for review on all aspects of the draft, and in particular need help with handling non-Western scripts. Please send any comments to the www-style mailing list, , http://lists.w3.org/Archives/Public/www-style/ and please, prefix the subject line with [css-inline] (as I did on this message). 
For the CSS WG, ~fantasai From richard.wordingham at ntlworld.com Thu Sep 17 13:59:04 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 17 Sep 2015 19:59:04 +0100 Subject: Grapheme clusters and east asian width In-Reply-To: <83mvwlqa8e.fsf@gnu.org> References: <55F9D0AD.1010400@gmx.de> <804C0EBEE0E2487B91D1115BEB97922D@erratique.ch> <20150917022547.5640ee26@JRWUBU2> <8523F8113D4A42CABC613375C5F95639@erratique.ch> <20150917132731.77680f77@JRWUBU2> <83vbb9qezq.fsf@gnu.org> <83oah1qajh.fsf@gnu.org> <85B20FFE60FC4FBA9EE6E867D19CEB90@erratique.ch> <83mvwlqa8e.fsf@gnu.org> Message-ID: <20150917195904.51d3cda2@JRWUBU2> On Thu, 17 Sep 2015 19:30:41 +0300 Eli Zaretskii wrote: > > Date: Thu, 17 Sep 2015 17:25:34 +0100 > > From: Daniel Bünzli > > Cc: richard.wordingham at ntlworld.com, unicode at unicode.org > > > > On Thursday, 17 September 2015 at 17:24, Eli Zaretskii wrote: > > > > Is there a formal definition of the algorithm used ? This [1] > > > > is not very helpful. > > > > > > They just use a table of values, AFAIK. > > > > But is it standardized, or does everyone have their own table? > > I don't know, but I'm sure you will find out if you look into the > glibc sources. They are publicly available. Shouldn't that be the locale sources? That then makes sense, for ambiguous width is resolved differently in Eastern and Western traditions. However, the calculation from single character width to string width is quite naïve - they are just added up, at least in some version of glibc! This doesn't work when a spacing mark decomposes into two spacing marks - gets a length of 2, while the canonically equivalent string gets a length of 3! This affects the positioning of text following them in gnome-terminal. Richard. From rwhlk142 at gmail.com Fri Sep 18 19:56:26 2015 From: rwhlk142 at gmail.com (Robert Wheelock) Date: Fri, 18 Sep 2015 20:56:26 -0400 Subject: Choton Alphabet Message-ID: Hello!
Would anybody have a picture with the complete Choton alphabet script?! This conlang also uses a German-based transliteration alphabet ( for /x/ with for , while is for /z/ and stands in for /?/ ...). Initial words in sentences, and proper nouns (at least) get capitalized, like in German. Thank You! Robert Lloyd Wheelock INTERNATIONAL SYMBOLISM RESEARCH INSTITUTE Harmony, ME U.S.A. Augusta, ME U.S.A. From lists+unicode at seantek.com Sun Sep 20 09:48:01 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Sun, 20 Sep 2015 07:48:01 -0700 Subject: Concise term for non-ASCII Unicode characters Message-ID: <55FEC721.7040008@seantek.com> What is the most concise term for characters or code points outside of the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these as "extended characters" or "non-ASCII Unicode" but I do not find those terms precise. We are talking about the code points U+0080 - U+10FFFF. I suppose that this also refers to code points/scalar values that are not formally Unicode characters, such as U+FFFF. Basically, I am looking for a concise term for values that would require multiple UTF-8 octets if encoded in UTF-8 (without referring to UTF-8 encoding specifically). "Non-ASCII" is not precise enough since character sets like Shift-JIS are non-ASCII. Also a citation to a relevant standard (whether Unicode or otherwise) would be helpful. The terms "supplementary character" and "supplementary code point" are defined in the Unicode standard, referring to characters or code points above U+FFFF. I am looking for something like those, but for characters or code points above U+007F.
Thank you, Sean
From petercon at microsoft.com Sun Sep 20 11:52:29 2015 From: petercon at microsoft.com (Peter Constable) Date: Sun, 20 Sep 2015 16:52:29 +0000 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FEC721.7040008@seantek.com> References: <55FEC721.7040008@seantek.com> Message-ID: You already have been using "non-ASCII Unicode", which is about as concise and sufficiently accurate as you'll get. There's no term specifically defined in any standard or conventionally used for this. Peter
From addison at lab126.com Sun Sep 20 12:05:29 2015 From: addison at lab126.com (Phillips, Addison) Date: Sun, 20 Sep 2015 17:05:29 +0000 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> Message-ID: <2d89da431be946d7a3ec5085928e019f@EX13D08UWB002.ant.amazon.com> I agree, although I note that sometimes the additional (redundant) specificity of "non-7-bit-ASCII characters" is needed when talking to people unclear on what "ASCII" means. Addison
From steve at swales.us Sun Sep 20 12:59:52 2015 From: steve at swales.us (Steve Swales) Date: Sun, 20 Sep 2015 10:59:52 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <2d89da431be946d7a3ec5085928e019f@EX13D08UWB002.ant.amazon.com> References: <55FEC721.7040008@seantek.com> <2d89da431be946d7a3ec5085928e019f@EX13D08UWB002.ant.amazon.com> Message-ID: Exactly. I think the reason that non-ASCII feels non-concise is that there is widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is widely confused with Windows-1252). -steve
From petercon at microsoft.com Sun Sep 20 14:24:14 2015 From: petercon at microsoft.com (Peter Constable) Date: Sun, 20 Sep 2015 19:24:14 +0000 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> <2d89da431be946d7a3ec5085928e019f@EX13D08UWB002.ant.amazon.com> Message-ID: Well, if the point is to refer to characters that would require two or more code units in UTF-8, then _accurate_ expressions would be, "Unicode characters beyond the Basic Latin block" or "Unicode characters above U+007F". Peter
From daniel.buenzli at erratique.ch Sun Sep 20 14:57:10 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Sun, 20 Sep 2015 20:57:10 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> <2d89da431be946d7a3ec5085928e019f@EX13D08UWB002.ant.amazon.com> Message-ID: On Sunday, 20 September 2015 at 18:59, Steve Swales wrote: > Exactly. I think the reason that non-ASCII feels non-concise is that there is widespread confusion between ASCII and Latin-1/ISO 8859-1 (which in turn is widely confused with Windows-1252). For this reason I usually use the term US-ASCII, which is the IANA name for the 7-bit-ASCII characters [1]. Someone referring to the non-US-ASCII scalar values of Unicode would make precise sense to me. But then maybe Peter's very last suggestion is actually the most precise you can get. Also if you are talking about UTF-8 I would use the term scalar values rather than "characters" or "code points" since surrogates can't be encoded in UTF-8.
Best, Daniel [1] http://www.iana.org/assignments/character-sets From cph13 at case.edu Sun Sep 20 19:13:01 2015 From: cph13 at case.edu (Clive Hohberger) Date: Sun, 20 Sep 2015 19:13:01 -0500 Subject: Obituary for Adrian Frutiger Message-ID: http://www.nytimes.com/2015/09/20/arts/design/adrian-frutiger-dies-at-87-his-type-designs-show-you-the-way.html -- Clive P. Hohberger, PhD MBA Managing Director Clive Hohberger, LLC +1 847 910 8794 cph13 at case.edu From duerst at it.aoyama.ac.jp Sun Sep 20 19:51:32 2015 From: duerst at it.aoyama.ac.jp (Martin J. Dürst) Date: Mon, 21 Sep 2015 09:51:32 +0900 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FEC721.7040008@seantek.com> References: <55FEC721.7040008@seantek.com> Message-ID: <55FF5494.602@it.aoyama.ac.jp> Hello Sean, On 2015/09/20 23:48, Sean Leonard wrote: > What is the most concise term for characters or code points So we already have two different things we might need a term for. > outside of > the US-ASCII range (U+0000 - U+007F)? Sometimes I have referred to these > as "extended characters" Most of the characters outside the US-ASCII range are perfectly simple and basic characters. I don't think the term 'extended' fits well here. It gives the impression that everything except US-ASCII is somewhat extraordinary, which in this day and age shouldn't be the case anymore. > or "non-ASCII Unicode" but I do not find those > terms precise. We are talking about the code points U+0080 - U+10FFFF. I > suppose that this also refers to code points/scalar values that are not > formally Unicode characters, such as U+FFFF. Again we may need different terms depending on whether these are included or not. > Basically, I am looking for > a concise term for values that would require multiple UTF-8 octets if > encoded in UTF-8 (without referring to UTF-8 encoding specifically).
> "Non-ASCII" is not precise enough since character sets like Shift-JIS > are non-ASCII. Well, the non-ASCII characters in Shift-JIS are also contained in Unicode, so depending on exactly what you want to talk about, Non-ASCII characters may be good enough. > Also a citation to a relevant standard (whether Unicode or otherwise) > would be helpful. > > The terms "supplementary character" and "supplementary code point" are > defined in the Unicode standard, referring to characters or code points > above U+FFFF. I am looking for something like those, but for characters > or code points above U+007F. And then in some cases, you may want to exclude the C0 area (U+0000-001F), or part of it, or some syntactically significant characters (e.g. punctuation) in the remaining part. Anyway, what I wanted to show is that depending on what you need it for, there are so many different variations that it doesn't pay off to create specific short terms for all of them, and the term you use currently may be short enough. Regards, Martin. From lists+unicode at seantek.com Mon Sep 21 03:22:14 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Mon, 21 Sep 2015 01:22:14 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FF5494.602@it.aoyama.ac.jp> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> Message-ID: <55FFBE36.5030104@seantek.com> First of all, thank you all for the responses thus far. On 9/20/2015 5:51 PM, Martin J. Dürst wrote: > Hello Sean, > > On 2015/09/20 23:48, Sean Leonard wrote: >> What is the most concise term for characters or code points > > So we already have two different things we might need a term for. > [...] >> >> The terms "supplementary character" and "supplementary code point" are >> defined in the Unicode standard, referring to characters or code points >> above U+FFFF. I am looking for something like those, but for characters >> or code points above U+007F.
> Anyway, what I wanted to show is that depending on what you need it > for, there are so many different variations that it doesn't pay off to > create specific short terms for all of them, and the term you use > currently may be short enough. Well what I am getting at is that when writing standards documents in various SDOs (or any other computer science text, for that matter), it is helpful to identify these characters/code points. I think we can limit our inquiry to "characters" and "code points". Both of those are well-defined in Unicode (see ). A [Unicode] code point is any value in the range 0 - 0x10FFFF. A [Unicode] character is an abstract character that is actually assigned a [Unicode] scalar value. Therefore the space is Unicode code point > Unicode scalar value > Unicode character. "supplementary" means outside the BMP, i.e., 0x10000 - 0x10FFFF. "BMP" means inside the Basic Multilingual Plane, i.e., 0x0 - 0xFFFF. The problem is that the BMP / supplementary distinction makes sense in a UCS-2 / UTF-16 universe. But for much interchange these days, UTF-8 is the way to go. I wish that "non-ASCII characters" and "non-ASCII code points" (and non-ASCII scalar values) were sufficient for me. Maybe they can be. However, in contexts where ASCII is getting extended or supplemented (e.g., in the DNS or in e-mail), one needs to be really clear that the octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not something else. The expressions "beyond [...] ASCII" or "beyond the ASCII range" (as in, characters beyond ASCII, code points beyond ASCII) have some support in the Unicode Standard; see, e.g., Section 2.5 "ASCII Transparency" paragraph. Additionally as Peter stated, an expression including "Basic Latin block" (e.g., characters beyond the Basic Latin block) could work. FWIW, the term "non-ASCII" is used in e-mail address internationalization ("EAI") in the IETF; its opposite is "all-ASCII" (or simply "ASCII"). (RFCs 6530, 6531, 6532). 
The term also appears in RFC 2047 from November 1996 but there it has the more expansive meaning (i.e., not limited or targeted to Unicode). Sean
From Tony at Jollans.com Mon Sep 21 06:46:48 2015 From: Tony at Jollans.com (Tony Jollans) Date: Mon, 21 Sep 2015 12:46:48 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FFBE36.5030104@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> Message-ID: <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> As an interested outsider may I suggest that the term "ASCII", indeed the concept of ASCII, is only of historical interest and should not be used in any modern context. Computing is riddled with terms, "word" being another in similar vein, that are used to mean something they are not and would be best forgotten. These days, it is pretty sloppy coding that cares how many bytes an encoding of something requires, although there may be many circumstances where legacy support is required. You say that, in some contexts, one needs to be really clear that the octets 0x80 - 0xFF are Unicode. Either something "is" Unicode, or it isn't. Either something uses a recognised encoding, or it doesn't. Using these octets to represent Unicode code points is not ASCII, is not UTF-8, and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC. Whatever it is, say so clearly and explicitly and, if necessary, say why; don't look for some mealy-mouthed expression to avoid so saying. Just my twopenn'orth, and no offence meant, but I can't help thinking you're looking for something that shouldn't exist. Best regards, Tony Jollans
From daniel.buenzli at erratique.ch Mon Sep 21 06:55:04 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Mon, 21 Sep 2015 12:55:04 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FFBE36.5030104@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> Message-ID: <7FCFDF52D20A4BABA6389154520D5A37@erratique.ch> On Monday, 21 September 2015 at 09:22, Sean Leonard wrote: > I think we can limit our inquiry to "characters" and "code points". Both > of those are well-defined in Unicode (see > ). I wouldn't say so. If you actually look at the definitions of character on that page, there are at least four different ones, and if you take the one that has a formal definition attached, i.e. the synonym for abstract character (D7), then an abstract character can actually be represented by a *sequence* of Unicode scalar values. If you are operating in the context of a standard or technical documentation please do use either code points (D9, D10) or scalar values (D76). These notions have precise definitions, which makes for saner discussions and understanding. > I wish that "non-ASCII characters" and "non-ASCII code points" (and > non-ASCII scalar values) were sufficient for me. Maybe they can be.
> However, in contexts where ASCII is getting extended or supplemented > (e.g., in the DNS or in e-mail), one needs to be really clear that the > octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not > something else. So it seems that you want terminology to talk about the *encoding* of Unicode scalar values, rather than scalar values themselves. Then I think you should specifically avoid terminology like "octets of 0x80-0xFF are Unicode", since this doesn't really make sense: there is no Unicode property on octets. You should rather say something like "these octets may belong to the UTF-8 encoding scheme (D95) of Unicode scalar values greater than U+007F". Best, Daniel From doug at ewellic.org Mon Sep 21 10:42:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 21 Sep 2015 08:42:38 -0700 Subject: Concise term for non-ASCII Unicode characters Message-ID: <20150921084238.665a7a7059d7ee80bb4d670165c8327d.3d8fca1ad4.wbe@email03.secureserver.net> Sean Leonard wrote: > Additionally as Peter stated, an expression including "Basic Latin > block" (e.g., characters beyond the Basic Latin block) could work. I was thinking that something like "non–Basic-Latin Unicode" might be useful. It avoids the confusion of referring to ASCII as a range of code points instead of a separate encoding standard. -- Doug Ewell | http://ewellic.org | Thornton, CO 
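For what it's worth, the distinction Daniel draws between scalar values and the octets of their UTF-8 encoding can be sketched in a few lines of Python. This is only an illustration, not anything taken from the standard: the function names are made up for this note, and the ranges are the ones quoted in the thread (scalar values are the code points 0..0x10FFFF minus the surrogates; "beyond Basic Latin" means above U+007F).

```python
def is_scalar_value(cp: int) -> bool:
    """True for Unicode scalar values: code points 0..0x10FFFF minus surrogates."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def is_beyond_basic_latin(cp: int) -> bool:
    """True for scalar values above U+007F (hypothetical helper name)."""
    return is_scalar_value(cp) and cp > 0x7F


def utf8_octets(cp: int) -> bytes:
    """Return the UTF-8 encoded form of one scalar value."""
    if not is_scalar_value(cp):
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    return chr(cp).encode("utf-8")


# Scalar values up to U+007F encode as one octet below 0x80;
# everything above encodes only as octets in 0x80..0xFF.
print(utf8_octets(0x41).hex(" "))   # 41
print(utf8_octets(0x80).hex(" "))   # c2 80
print(utf8_octets(0xE9).hex(" "))   # c3 a9
```

The point of the sketch is that "non-ASCII" is a statement about the scalar value, while "octet with the high bit set" is a statement about one particular encoding of it.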
From richard.wordingham at ntlworld.com Mon Sep 21 13:18:29 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 21 Sep 2015 19:18:29 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> Message-ID: <20150921191829.16a502d3@JRWUBU2> On Mon, 21 Sep 2015 12:46:48 +0100 "Tony Jollans" wrote: > These days, it is pretty sloppy coding that cares how many bytes an > encoding of something requires, although there may be many > circumstances where legacy support is required. Wow! Are you saying that code chopping up arbitrary character sequences for legibility (and editability!) and to avoid buffering issues should generally assume it will be read as UTF-8, and avoid splitting well-formed UTF-8 characters? (If the text is actually Windows-1252, there may be a lot of apparently ill-formed UTF-8 characters/gibberish.) > You say that, in some > contexts, one needs to be really clear that the octets 0x80 - 0xFF > are Unicode. Either something "is" Unicode, or it isn't. Either > something uses a recognised encoding, or it doesn't. Using these > octets to represent Unicode code points is not ASCII, is not UTF-8, > and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC. But most of these octets *are* used to represent non-ASCII scalar values. It's just that they have to operate in combinations for UTF-8. Richard. 
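Richard's remark that most of the octets 0x80 - 0xFF are used, but only in combinations, can be made concrete with a small classifier for UTF-8 code units. This is a sketch under the usual reading of the well-formed UTF-8 byte sequence table in the Unicode Standard; the function name and the category labels are invented for this note.

```python
def classify_utf8_octet(b: int) -> str:
    """Classify a single octet by the role it can play in well-formed UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    if b <= 0x7F:
        return "ascii"            # a complete one-octet sequence
    if b <= 0xBF:
        return "continuation"     # second, third or fourth octet of a sequence
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "unused"           # never appears in well-formed UTF-8
    if b <= 0xDF:
        return "lead-2"           # starts a two-octet sequence
    if b <= 0xEF:
        return "lead-3"           # starts a three-octet sequence
    return "lead-4"               # 0xF0..0xF4 start a four-octet sequence


# U+00E9 is encoded as C3 A9: a lead octet followed by a continuation octet.
print([classify_utf8_octet(b) for b in "\u00e9".encode("utf-8")])
```

Of the 128 octet values with the high bit set, only 13 (0xC0, 0xC1 and 0xF5 - 0xFF) never occur in well-formed UTF-8, which matches the "most of these octets" wording above.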
From Tony at Jollans.com Mon Sep 21 14:54:23 2015 From: Tony at Jollans.com (Tony Jollans) Date: Mon, 21 Sep 2015 20:54:23 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150921191829.16a502d3@JRWUBU2> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> <20150921191829.16a502d3@JRWUBU2> Message-ID: <000801d0f4a7$5643e3f0$02cbabd0$@Jollans.com> Goodness, sorry, no, I didn't mean that at all!!! What I meant was that a recognised encoding should be used consistently, regardless of the number of bytes required, and all encodings of Unicode code points are necessarily potentially multi-byte. Single-byte encodings may save a little bit of space, and may be Windows-1252, or Windows-1253, or one of many other encodings but not, in any sense, Unicode encodings. Windows code pages and their ilk predate Unicode, and I would only ever expect to see them used in environments where legacy support is needed, and would not expect a significant amount of new documentation about them to be written. When it is necessary to describe them, one should do so fully and properly, which is whatever it is, but they really have no meaning in a Unicode context. Nor, as far as I'm aware, do the 0x80 to 0xFF octets have any special meaning in Unicode that would require there to be a recognisable term to describe them. Code that processes arbitrary *character* sequences (for legibility or any other reason) should, surely, work with characters, which may be sequences of code points, each of which may be a sequence of bytes. I can think of no reason for chopping up byte sequences except where they are going to be recombined later, by the reverse treatment, and code, if required, that does so probably has no idea of, and need not have any idea of, meaning, and can only, surely, work with bytes. 
The actual octets are, of course, used in combinations, but not singly in any way that requires them to be described in Unicode terms. Or am I missing something fundamental? Best, Tony 
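The layering Tony describes (characters that may be sequences of code points, each of which may be a sequence of bytes) can be shown with Python's standard unicodedata module. This is a rough sketch only: the helper name is made up, and "character" here loosely means what a reader would perceive as one letter.

```python
import unicodedata


def layers(text: str) -> dict:
    """Show one piece of text at the code point level and at the UTF-8 octet level."""
    return {
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        "utf8_octets": text.encode("utf-8").hex(" "),
    }


# One perceived character written as two code points: e + COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

print(layers(decomposed))  # two code points, three octets: 65 cc 81
print(layers(composed))    # one code point, two octets: c3 a9
```

Both strings display as the same letter, yet they differ in the number of code points and in the number of octets, which is why a term needs to say which layer it is talking about.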
From lists+unicode at seantek.com Mon Sep 21 15:51:42 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Mon, 21 Sep 2015 13:51:42 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FEC721.7040008@seantek.com> References: <55FEC721.7040008@seantek.com> Message-ID: <56006DDE.2020408@seantek.com> Related question as I am researching this: How can I acquire (cheaply or free) the latest and most official copy of US-ASCII, namely, the version that Unicode references? The Unicode Standard 8.0 refers to the following document: ANSI X3.4: American National Standards Institute. Coded character set - 7-bit American national standard code for information interchange. New York: 1986. (ANSI X3.4-1986). (See page 294.) A quick Google search did not yield results. There are public/university library hard copies but they are hundreds of miles away from my location. Sean From verdy_p at wanadoo.fr Mon Sep 21 16:34:19 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Sep 2015 23:34:19 +0200 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <000801d0f4a7$5643e3f0$02cbabd0$@Jollans.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> <20150921191829.16a502d3@JRWUBU2> <000801d0f4a7$5643e3f0$02cbabd0$@Jollans.com> Message-ID: 2015-09-21 21:54 GMT+02:00 Tony Jollans : > The actual octets are, of course, used in combinations, but not singly in > any way that requires them to be described in Unicode terms. Or am I > missing > something fundamental? > The terms you are looking for are defined in the part of the standard that describes the Unicode encoding forms and schemes. If you're speaking at the octet level, the proper term is "8-bit code unit", so look for the definition of "code units", not "code points", "scalar values" or "characters". 
"Character" has another definition in programming languages, but Unicode is not bound normatively to any programming language, and the actual storage or transport size of a character is not part of the standard. You'll need to look into the technical documentation of each programming language, transport protocol or storage device: this is out of scope of the standard itself, each environment describing its own API, library or adapter to interface or convert data correctly with Unicode elements and texts, sometimes with several competing interfaces or converters. On this list we are only focused on standard interchange formats, but that problem has long been solved, notably by Internet standards and RFCs such as MIME, which also has its own definition of "characters", because these standards are not exclusively bound to Unicode but also support other legacy standards. But even in this case these definitions apply at an upper layer only, and the lower layer may use other conversions, including data compression techniques or escaping modes, or could even work with units smaller than octets or even smaller than binary bits, or could multiplex some bits with some complex state representation, for example in modems working with bits spread over a matrix of non-binary states with redundancy and autocorrection. Even the order of bits is not defined in the Unicode standard or in the internal lower layers of an interface (these are not the layers concerned with interchange in a large network; they are specific to each physical or virtual link between specific pairs of hosts, buses/cables, hubs, switches, or routers, and at this level they do not even have to know whether the data actually contains text or which upper-layer encoding forms are used or implied). So let's get back to your focus: you're wondering if there's a term for octets with the high bit set, in the context of texts processed with some standard Unicode algorithms. 
- We have a term for 16-bit code units used in combinations to encode a single code point: these are "surrogates".
- For 8-bit code units, there are at least 3 encodings described: UTF-8, CESU-8 and SCSU. Each one has its own subranges of octet values processed differently. The best way to name these ranges is to look into the standard documentation of these encoding schemes.
And these definitions are independent of those used in other encoding schemes/forms (including those defined by TUS); they do not operate at the same level, and these independent levels should (must?) be blackboxed (their scope is strongly defined and transparent to all other layers of processing, and each layer is replaceable by another competing encoding). Note that initially, even TUS did not define any encoding scheme below the level of code points and their scalar values. There was then no concept of "code units", which were standardized only because a few encoding schemes (UTFs) were integrated in a standard annex, then directly in TUS itself as they became ubiquitous for handling Unicode texts and outweighed all other (older) legacy standards (including Internet standards, which still survive with their mandatory or optional support of legacy standards: UTF-8 proved to be the easiest encoding working with a basic level of compatibility with these older standards). From verdy_p at wanadoo.fr Mon Sep 21 16:47:50 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 21 Sep 2015 23:47:50 +0200 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <56006DDE.2020408@seantek.com> References: <55FEC721.7040008@seantek.com> <56006DDE.2020408@seantek.com> Message-ID: You actually don't need any copy to work with it: U+0000 to U+007F are directly bound to US-ASCII. 
Unicode describes these characters with character properties (and representative glyphs only for the range U+0020..U+007E; the "C0" controls, in U+0000 to U+001F and U+007F, have a pseudo-glyph in charts which may only be usable if you work with them in "visible controls" mode.) If you need ANSI X3.4, it's only about the intended usage of controls, but only a few are prevalent in plain text: TAB, LF, CR, or CR+LF, and FF (NUL and DEL are used as fillers depending on environments or may be used as special escapes or terminators for terminal protocols). Most controls in US-ASCII have their name and most common functions related to console/keyboard/printer protocols and are not intended to be used in text contents. But there are so many competing protocols that even the ANSI X3.4 descriptions are just informative and deprecated: you'll need to look into each protocol. Unicode (and MIME in Internet protocols) attempt to create an equivalence for line termination only (with LF, CR, or CR+LF; Unicode also added NEL, U+0085 in the C1 controls and called NL in EBCDIC, only for compatibility with EBCDIC data converters). -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Mon Sep 21 17:04:16 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 21 Sep 2015 23:04:16 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <000801d0f4a7$5643e3f0$02cbabd0$@Jollans.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <003f01d0f463$32ed5a60$98c80f20$@Jollans.com> <20150921191829.16a502d3@JRWUBU2> <000801d0f4a7$5643e3f0$02cbabd0$@Jollans.com> Message-ID: <20150921230416.579a53c3@JRWUBU2> On Mon, 21 Sep 2015 20:54:23 +0100 "Tony Jollans" wrote: > Windows code pages and their ilk predate Unicode, and I would only > ever expect to see them used in environments where legacy support is > needed, and would not expect a significant amount of new > documentation about them to be written. So at what version did Windows ditch 'ANSI code pages' as the default for users' 'plain text'? > Nor, as > far as I'm aware, do the 0x80 to 0xFF octets have any special meaning > in Unicode that would require there to be a recognisable term to > describe them. Such 8-bit *code units* are unambiguous indicators that 'one code unit = one code point' no longer applies. The 16-bit analogue to ASCII v. non-ASCII in scalar values, namely the BMP v. supplementary planes, has a fair amount of terminology. Indeed, there is a special terminology for the 16-bit analogue of octets with high bit set, the surrogate 'code points'. The analogy breaks down because of the existence of the Latin-1 Supplement block - the number 0xC2 serves a double rôle as U+00C2 LATIN CAPITAL LETTER A WITH CIRCUMFLEX and as a UTF-8 lead byte. > Code that processes arbitrary *character* sequences (for legibility > or any other reason) should, surely, work with characters, which may > be sequences of code points, each of which may be a sequence of > bytes. 
I can think of no reason for chopping up byte sequences except > where they are going to be recombined later, by the reverse > treatment, and code, if required, that does so probably has no idea > of, and need not have any idea of, meaning, and can only, surely, > work with bytes. In the case I have in mind, the catch is that the chopped up sequences are being stored in an intentionally human readable intermediate file. The reason for the file being readable is to allow debugging, and in extreme cases, correction. Now, the application is fairly old, and was created when lines longer than 132 characters caused problems. However, lines many thousands of characters long can still cause problems, and are not amenable to line-by-line differencing. In principle, one might rewrite the presentation part of the package to be aware of Unicode characters (or even grapheme clusters), and that would cause havoc if the text chopped up contained multibyte characters and the reading program assumed that each chunk contained no broken characters. > The actual octets are, of course, used in combinations, but not > singly in any way that requires them to be described in Unicode > terms. Or am I missing something fundamental? I believe the relevant distinction is simply that such octets are associated with Unicode characters. They do not occur in ASCII text. Richard. 
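The chunking problem Richard describes can be sketched in Python as follows. This is not his application's code, only a minimal illustration of cutting a UTF-8 byte string into bounded pieces without splitting a multi-octet sequence; the function name and the 16-octet limit in the usage lines are arbitrary choices for this note. It protects encoded code points only, not grapheme clusters, and it assumes the input really is UTF-8 (Windows-1252 text would defeat it, as noted earlier in the thread).

```python
from typing import List


def split_utf8(data: bytes, max_len: int) -> List[bytes]:
    """Cut UTF-8 data into chunks of at most max_len octets on code point boundaries."""
    if max_len < 4:
        raise ValueError("max_len must be at least 4, the longest UTF-8 sequence")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_len, len(data))
        # If the octet just after the cut is a continuation octet (10xxxxxx),
        # the cut would fall inside a sequence, so move it back.
        while start < end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end == start:
            # Ill-formed input (a long run of continuation octets): cut anyway.
            end = min(start + max_len, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


sample = ("na\u00efve caf\u00e9 \U0001F37A " * 20).encode("utf-8")
pieces = split_utf8(sample, 16)
# Every piece decodes on its own, and the pieces reassemble to the original.
print(all(piece.decode("utf-8") is not None for piece in pieces))
print(b"".join(pieces) == sample)
```

A reader that treats each chunk as independently decodable text is then safe at the code point level, which is the assumption Richard is worried about.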
From petercon at microsoft.com Mon Sep 21 18:50:05 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 21 Sep 2015 23:50:05 +0000 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <56006DDE.2020408@seantek.com> References: <55FEC721.7040008@seantek.com> <56006DDE.2020408@seantek.com> Message-ID: Check here: http://webstore.ansi.org/RecordDetail.aspx?sku=INCITS+4-1986%5bR2012%5d From duerst at it.aoyama.ac.jp Mon Sep 21 18:59:51 2015 From: duerst at it.aoyama.ac.jp (Martin J. Dürst) Date: Tue, 22 Sep 2015 08:59:51 +0900 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150921084238.665a7a7059d7ee80bb4d670165c8327d.3d8fca1ad4.wbe@email03.secureserver.net> References: <20150921084238.665a7a7059d7ee80bb4d670165c8327d.3d8fca1ad4.wbe@email03.secureserver.net> Message-ID: <560099F7.8030509@it.aoyama.ac.jp> Hello Doug, On 2015/09/22 00:42, Doug Ewell wrote: > I was thinking that something like "non–Basic-Latin Unicode" might be Is that non-Basic Latin or not Basic-Latin? > useful. It avoids the confusion of referring to ASCII as a range of code > points instead of a separate encoding standard. 
But as a three-component term with unclear structure, it's confusing by itself. Regards, Martin. From petercon at microsoft.com Mon Sep 21 19:17:28 2015 From: petercon at microsoft.com (Peter Constable) Date: Tue, 22 Sep 2015 00:17:28 +0000 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FFBE36.5030104@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> Message-ID: From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Sean Leonard Sent: Monday, September 21, 2015 1:22 AM > Well what I am getting at is that when writing standards documents in various SDOs (or any other > computer science text, for that matter), it is helpful to identify these characters/code points. [snip] > However, in contexts where ASCII is getting extended or supplemented (e.g., in the DNS or in e-mail), > one needs to be really > clear that the octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), > and not something else. Well, if you are writing standards that "extend ASCII", then you need to be completely clear that what is being discussed is _not ASCII_. In that sense, I agree with Tony Jollans's comments: be clear about what it is that is being discussed, including what coded character set, or what encoding form for what coded character set. > FWIW, the term "non-ASCII" is used in e-mail address internationalization ("EAI") in the IETF; its > opposite is "all-ASCII" (or simply "ASCII"). (RFCs 6530, 6531, 6532). The term also appears in RFC > 2047 from November 1996 but there it has the more expansive meaning (i.e., not limited or > targeted to Unicode). Glancing at the Introduction for RFC 6530, it seems to have clear terminology: " Without the extensions specified in this document, the mailbox name is restricted to a subset of 7-bit ASCII [RFC5321]. Though MIME [RFC2045] enables the transport of non-ASCII data..." Here, "ASCII" means ASCII: 
the 7-bit encoding originally defined as ANSI X3.4. And "non-ASCII data" appears to mean data involving any characters other than those in the ASCII coded character set, or any data represented in any other encoded representation but ASCII. The term "all-ASCII" is used in section 4.2, but it is immediately defined: "In this document, an address is "all-ASCII", or just an "ASCII address", if every character in the address is in the ASCII character repertoire [ASCII]; an address is "non-ASCII", or an "i18n-address", if any character is not in the ASCII character repertoire." So, it seems like they had a similar terminology need to what you describe, and they handled it in a satisfactory, clear way. If what you need to describe is UTF-8 sequences of two or more bytes, then I would be clear that the context is Unicode UTF-8, not ASCII or any other coded character set / encoding form; and I would say, "Unicode UTF-8 code unit sequences of two to four bytes" or "Unicode UTF-8 multi-byte sequences" or something along those lines. If you think it's a serious problem that there isn't one conventional term for "characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded representations" (since different authors could devise different terminology solutions), then I suggest you submit a document to UTC explaining why it's a problem, documenting inconsistent or unclear terminology that's been used in some standards / public specifications, and requesting that Unicode formally define terminology for these concepts. I can't guarantee that UTC will do it, but I can predict with confidence that it _won't_ do anything of that nature if nobody submits such a document. Peter From jsbien at mimuw.edu.pl Mon Sep 21 23:24:10 2015 From: jsbien at mimuw.edu.pl (Janusz S. 
Bien) Date: Tue, 22 Sep 2015 06:24:10 +0200 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <56006DDE.2020408@seantek.com> References: <55FEC721.7040008@seantek.com> <56006DDE.2020408@seantek.com> Message-ID: <20150922062410.13904spfspfjg44q@mail.mimuw.edu.pl> Quote/Cytat - Sean Leonard (Mon 21 Sep 2015 10:51:42 PM CEST): > Related question as I am researching this: > > How can I acquire (cheaply or free) the latest and most official > copy of US-ASCII, namely, the version that Unicode references? [...] I've never seen the ASCII standard, but I think it is (almost?) identical to ISO/IEC 646, which in turn is identical to the freely available ECMA-6: http://www.ecma-international.org/publications/standards/Ecma-006.htm Regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From richard.wordingham at ntlworld.com Tue Sep 22 02:43:36 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 22 Sep 2015 08:43:36 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> Message-ID: <20150922084336.6afc9edc@JRWUBU2> On Sun, 20 Sep 2015 16:52:29 +0000 Peter Constable wrote: > You already have been using "non-ASCII Unicode", which is about as > concise and sufficiently accurate as you'll get. There's no term > specifically defined in any standard or conventionally used for this. As to standards, UTS#18 'Unicode Regular Expressions' Requirement RL1.2 requires the support of the 'property' it calls 'ASCII', which is defined in Section 1.2.1 as the property of being in the range U+0000 to U+007F. This implicitly makes 'not ASCII' a derived property held by all the other codepoints. 
If you fear that your audience will think that Latin-1 characters are ASCII, you'll just have to go for the clumsy 'not 7-bit ASCII' and accept that there isn't an unambiguous way in English of turning that into an adjective or noun. If a term were invented, you'd generally have to explain it, and you would do better just to remind readers what ASCII is. Richard. From verdy_p at wanadoo.fr Tue Sep 22 03:45:28 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 22 Sep 2015 10:45:28 +0200 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150922084336.6afc9edc@JRWUBU2> References: <55FEC721.7040008@seantek.com> <20150922084336.6afc9edc@JRWUBU2> Message-ID: I would not use the "clumsy 7-bit ASCII" because of the long-standing confusion it creates: it could refer to any national version of ISO 646, which reassigns some code positions in the range 0x00 to 0x7F to characters outside the range U+0000 to U+007F while still remaining a 7-bit encoding. So instead of "7-bit ASCII" I highly prefer the term "US-ASCII", to make sure it refers to the encoding of 7-bit code positions effectively mapped to U+0000..U+007F. So for code positions outside 0x00..0x7F, I would call them "not US-ASCII" (none of them are bound to any Unicode "character" or "code point" or "scalar value"; they are just "code positions" or, more precisely, "octet values with their most significant bit set to 1", which is really long: "not US-ASCII" is fine as a shorter term). From lists+unicode at seantek.com Tue Sep 22 04:27:36 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 22 Sep 2015 02:27:36 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150922062410.13904spfspfjg44q@mail.mimuw.edu.pl> References: <55FEC721.7040008@seantek.com> <56006DDE.2020408@seantek.com> <20150922062410.13904spfspfjg44q@mail.mimuw.edu.pl> Message-ID: <56011F08.3070805@seantek.com> On 9/21/2015 9:24 PM, Janusz S. Bien wrote: > Quote/Cytat - Sean Leonard (Mon 21 Sep > 2015 10:51:42 PM CEST): > >> Related question as I am researching this: >> >> How can I acquire (cheaply or free) the latest and most official copy >> of US-ASCII, namely, the version that Unicode references? > > [...] Thanks to all. I was able to locate a copy of ANSI X3.4-1986 (R1997) [hereinafter ASCII]. (See my subsequent e-mail about the term "ASCII".) > > I've never seen the ASCII standard, but I think it is (almost?) > identical to ISO/IEC 646, which in turn is identical to the freely > available ECMA-6: > > http://www.ecma-international.org/publications/standards/Ecma-006.htm Having just read both standards documents in some detail, I can attest that they are not the same. 
However, the practical effect for purposes of Unicode is the same. ECMA-6 (1991) is indeed identical to ISO/IEC 646 (as far as I can tell; hereinafter ECMA-6). ECMA-6 "specifies a 7-bit coded character set with a number of options" (Clause 1.2). Specifically, the following positions are ambiguous or subject to national assignment: 2/3 NUMBER SIGN or POUND SIGN 2/4 DOLLAR SIGN or CURRENCY SIGN 4/0 5/11 5/12 5/13 5/14 6/0 7/11 7/12 7/13 7/14 ECMA-6 specifies an International Reference Version (IRV), which exercises the "options". The IRV fills in the graphic characters consistent with ASCII. However, ECMA-6 sort of leaves the C0 region blank...and the IRV (in Annex A, normative) says "if the C0 set [...] is used, it shall be the C0 set of Standard ECMA-48." Sort of fudging. Anyway, the IRV C0 set / ECMA-48 set is the same as ASCII. Overall, the takeaway is that specifying ISO/IEC 646 / ECMA-6 is not sufficient; you need to include "IRV" as well, or ISO IR No. 6 for the G0 set and ISO IR No. 6 for the C0 set. In contrast, if you say ASCII (ANSI X3.4-1986), all positions are fully defined. Regards, Sean From lists+unicode at seantek.com Tue Sep 22 04:42:13 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 22 Sep 2015 02:42:13 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <56011F08.3070805@seantek.com> References: <55FEC721.7040008@seantek.com> <56006DDE.2020408@seantek.com> <20150922062410.13904spfspfjg44q@mail.mimuw.edu.pl> <56011F08.3070805@seantek.com> Message-ID: <56012275.1040900@seantek.com> On 9/22/2015 2:27 AM, Sean Leonard wrote: > Overall, the takeaway is that specifying ISO/IEC 646 / ECMA-6 is not > sufficient; you need to include "IRV" as well, or ISO IR No. 6 for the > G0 set and ISO IR No. 6 for the C0 set. ...which the Unicode Standard does specify, by stating "IRV" explicitly (Section 2.8, Section 7.1). Hence, there is no Unicode problem. [Correction: it's IR No. 1 for the C0 set.] 
> > In contrast, if you say ASCII (ANSI X3.4-1986), all positions are > fully defined. > > Regards, > > Sean From lists+unicode at seantek.com Tue Sep 22 05:18:46 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 22 Sep 2015 03:18:46 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> <20150922084336.6afc9edc@JRWUBU2> Message-ID: <56012B06.7000608@seantek.com> On 9/22/2015 1:45 AM, Philippe Verdy wrote: > I would not use the "clumsy 7-bit ASCII" due to the confusion created > since long when it could refer to any national version of ISO 646, > which reassign some code positions in the rande 0x00 to 0x07F to other > characters outside the range U+0000 to U+007F, while still remaining > 7-bit encodings. > So insead of "7-bit ASCII" I highly prefer the term "US-ASCII" to make > sure it refers to the encoding of 7-bit code positions effectively to > U+0000..U+007F. > > So for code positions outside 0x00..0x7F, I would call them "not > US-ASCII" (none of them are bound to any Unicode "character" or "code > point" or "scalar value", they are just "code positions" or more > precisely "octet values with their most significant bit set to 1" > which is really long: "not US-ASCII" is fine as a shorter term). Again having just read through ANSI X3.4-1986 (R1997), I would like to clarify some things. The standard itself is titled: American National Standard for Information Systems - Coded Character Sets - 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII) However, Clause 1.1 states: This standard specifies a set of 128 characters (control characters and graphic characters, such as letters, digits, and symbols) with their coded representation. The American National Standard Code for Information Interchange may also be identified by the acronym ASCII (pronounced ask-ee). 
To explicitly designate a particular (perhaps prior) edition of this standard, the last two digits of the year of issue may be appended, as in "ASCII 68" or "ASCII 86". According to the title, "7-Bit ASCII" is proper. However, according to the text, "ASCII" is sufficient. The "7-Bit" part really just emphasizes the fact that it is a 7-bit standard. The eighth bit is outside the scope of the standard (but see clause 2.1.1). (Incidentally, Clause 1.1 is not Y2K compliant! Thus you should '86 that part of ASCII 86...hehe) The term "US-ASCII" (see also RFC 2046 for a lot of discussion) is similarly redundant. After all, it is the *American* *National* Standard Code for Information Interchange. Even if you remove the term "National" (which does not appear in ASCII 68 or ASCII 63), it's still American. However, ASCII 68 (partially reprinted in RFC 20: ) actually permits "the notation ASCII (pronounced as'-key) or USASCII (pronounced you-sas'-key) [...] to mean the code prescribed by the latest issue of the standard". That is probably the genesis of US-ASCII. I wasn't alive at the time so I don't know. My suspicion is that "US-ASCII" was meant to disambiguate ASCII 86 from ASCII 68 (which is referred to as "ASCII" in RFC 821) without referring to the year, and since 68 and 86 are transposed numerals, "US-ASCII" eliminates possible mix-ups. My conclusion here is that "ASCII" is sufficient when talking about the range of (code or character) positions 0 - 127, regardless of how they are encoded, so long as they logically evaluate to the bit combinations of the 7-bit code described in ANSI X3.4-1986. "Basic Latin" also works if you want to avoid the historic reference. But there are many systems in use that are ASCII-based (including the Internet, as RFC 20 is still in force), and the term "ASCII" is peppered throughout the Unicode Standard 8.0 with greater frequency than "Basic Latin" (which is acknowledged to be a synonym for "ASCII" in Sections 5.7 and 6.2). 
Sean From petercon at microsoft.com Tue Sep 22 05:56:06 2015 From: petercon at microsoft.com (Peter Constable) Date: Tue, 22 Sep 2015 10:56:06 +0000 Subject: Concise term for non-ASCII Unicode characters Message-ID: > If a term were invented, you'd generally have to explain it, and you would do better just to remind readers what ASCII is. +1 Peter Sent from Outlook Mail for Windows 10 From: Richard Wordingham Sent: Tuesday, September 22, 2015 12:51 AM To: unicode at unicode.org Subject: Re: Concise term for non-ASCII Unicode characters On Sun, 20 Sep 2015 16:52:29 +0000 Peter Constable wrote: > You already have been using "non-ASCII Unicode", which is about as > concise and sufficiently accurate as you'll get. There's no term > specifically defined in any standard or conventionally used for this. As to standards, UTS#18 'Unicode Regular Expression' Requirement RL1.2 requires the support of the 'property' it calls 'ASCII', which is defined in Section 1.2.1 as the property of being in the range U+0000 to U+007F. This implicitly makes 'not ASCII' a derived property held by all the other codepoints. If you fear that your audience will think that Latin-1 characters are ASCII, you'll just have to go for the clumsy 'not 7-bit ASCII' and accept that there isn't an unambiguous way in English of turning that into an adjective or noun. If a term were invented, you'd generally have to explain it, and you would do better just to remind readers what ASCII is. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Sep 22 10:34:14 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 22 Sep 2015 08:34:14 -0700 Subject: Concise term for non-ASCII Unicode characters Message-ID: <20150922083414.665a7a7059d7ee80bb4d670165c8327d.7f706876b8.wbe@email03.secureserver.net> Martin J. D?rst wrote: >> I was thinking that something like "non?Basic-Latin Unicode" might be > > Is that non-Basic Latin or not Basic-Latin? > >> useful. 
It avoids the confusion of referring to ASCII as a range of >> code points instead of a separate encoding standard. > > But as a three-component term with unclear structure, it's confusing > by itself. That's why I wrote "non Basic Latin." But I realize that not all fonts will show this clearly, and that the distinction is lost in speech anyway. -- Doug Ewell | http://ewellic.org | Thornton, CO From richard.wordingham at ntlworld.com Tue Sep 22 15:03:44 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 22 Sep 2015 21:03:44 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150922083414.665a7a7059d7ee80bb4d670165c8327d.7f706876b8.wbe@email03.secureserver.net> References: <20150922083414.665a7a7059d7ee80bb4d670165c8327d.7f706876b8.wbe@email03.secureserver.net> Message-ID: <20150922210344.5f609c45@JRWUBU2> On Tue, 22 Sep 2015 08:34:14 -0700 "Doug Ewell" wrote: > That's why I wrote "non Basic Latin." > > But I realize that not all fonts will show this clearly, and that the > distinction is lost in speech anyway. I think the difference is actually clearer in speech. Richard. From wjgo_10009 at btinternet.com Sat Sep 26 03:15:58 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 26 Sep 2015 09:15:58 +0100 (BST) Subject: Tirhuta Message-ID: <22679859.6046.1443255358722.JavaMail.defaultUser@defaultHost> A thread has been started in the High-Logic forum asking how to create a Tirhuta (Maithili) Keyboard layout. http://forum.high-logic.com/viewtopic.php?f=16&t=5784 Can anyone on this list help please? The High-Logic forum is free to join. Responses could be to that thread or to this mailing list as thought appropriate. If responses are directly to this mailing list thread then I will try to link to them from the High-Logic forum. William Overington 26 September 2015
From charupdate at orange.fr Mon Sep 28 10:12:06 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 28 Sep 2015 17:12:06 +0200 (CEST) Subject: Tirhuta In-Reply-To: <22679859.6046.1443255358722.JavaMail.defaultUser@defaultHost> References: <22679859.6046.1443255358722.JavaMail.defaultUser@defaultHost> Message-ID: <1017067683.11830.1443453126157.JavaMail.www@wwinf1h27> On 26 Sep 2015 at 17:56, William_J_G Overington wrote: > A thread has been started in the High-Logic forum asking how to create a Tirhuta (Maithili) Keyboard layout. > http://forum.high-logic.com/viewtopic.php?f=16&t=5784 > Can anyone on this list help please? > The High-Logic forum is free to join. > Responses could be to that thread or to this mailing list as thought appropriate. > If responses are directly to this mailing list thread then I will try to link to them from the High-Logic forum. Hi William, I've just found your e-mail, and I'll hurry up to join the Community on the High-Logic Forum, to see if some piece of my experience might be useful. Other subscribers of the Unicode Mailing List are welcome to join or follow up this thread out there. Cross-reports are welcome on this List if the matter stays in the scope of Unicode (which I would suppose, hence the numerous keyboard posts I did here). Personally however I'll keep away from multiplying my messages, which were not all very happy. I'd prepared one just today at noon, that ends up remaining a draft. Thank you for having brought down the news! All the best, Marcel
From lists+unicode at seantek.com Mon Sep 28 16:34:31 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Mon, 28 Sep 2015 14:34:31 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <55FEC721.7040008@seantek.com> References: <55FEC721.7040008@seantek.com> Message-ID: <5609B267.5090906@seantek.com> To follow up on this thread: It appears that ASCII is in fact a defined term in the Unicode glossary, and this term is sufficiently broad. http://unicode.org/glossary/#ASCII ASCII is sufficient to identify the range 0 - 127, whether that is simply a "range", "characters", "code points", or "scalar values". (Since they are all the same in that range 0 - 127.) This leaves open the question of how to define the range that is not 0 - 127, but is 128 onwards. An e-mail will follow on the topic... Sean *** ASCII. (1) The American Standard Code for Information Interchange, a 7-bit coded character set for information interchange. It is the U.S. national variant of ISO/IEC 646 and is formally the U.S. standard ANSI X3.4. It was proposed by ANSI in 1963 and finalized in 1968. (2) The set of 128 Unicode characters from U+0000 to U+007F, including control codes as well as graphic characters. (3) ASCII has been incorrectly used to refer to various 8-bit character encodings that include ASCII characters in the first 128 code points. From dzo at bisharat.net Tue Sep 29 08:23:40 2015 From: dzo at bisharat.net (Don Osborn) Date: Tue, 29 Sep 2015 09:23:40 -0400 Subject: ASCIIfied vs Boko Hausa on international radio websites Message-ID: <560A90DC.8060403@bisharat.net> In the past, there has been mention on this list of use of extended Latin on the web and using charset=utf-8 in page parameters - neither of which is a novelty anymore. The issue now in some cases is policies of organizations and decisions by web content managers whether and how to use extended Latin and utf-8.
For an update with background on these issues in the case of websites of the Hausa services of BBC, CRI, RDW, RFI, and VOA, please see: http://niamey.blogspot.com/2015/09/hausa-on-international-radio-websites.html FWIW I've also floated the Twitter hashtag #???? - along the lines of the #acent?ate campaign that you may have heard about. FYI, two related postings/threads from 2009 (though I haven't duplicated these quick surveys of other BBC & VOA language pages beyond those for Hausa): BBC.co.uk languages - mostly not UTF-8 http://www.unicode.org/mail-arch/unicode-ml/y2009-m04/0066.html VOA- utf-8, lang="en" http://www.unicode.org/mail-arch/unicode-ml/y2009-m04/0103.html Don Osborn From lists+unicode at seantek.com Tue Sep 29 11:20:50 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 29 Sep 2015 09:20:50 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> Message-ID: <560ABA62.3030004@seantek.com> On 9/21/2015 5:17 PM, Peter Constable wrote: > If you think it's a serious problem that there isn't one conventional > term for "characters outside the ASCII repertoire" or "UTF-8 > multi-code-unit encoded representations" (since different authors > could devise different terminology solutions), then I suggest you > submit a document to UTC explaining why it's a problem, documenting > inconsistent or unclear terminology that's been used in some standards > / public specifications, and requesting that Unicode formally define > terminology for these concepts. I can't guarantee that UTC will do it, > but I can predict with confidence that it _won't_ do anything of that > nature if nobody submits such a document. Peter I am of the mind to do just that, then. I have seen different documents, standards, and standards bodies that have invented terminology around this term, and they are not always the same. 
Since these standards depend on Unicode, it would make a lot of sense for Unicode formally to define terminology for these concepts. With the proliferation of UTF-8 (among other things), the boundary between 0x7F and 0x80 is more significant than the boundary between 0xFFFF and 0x10000.

Since this will be my first submission I would appreciate a co-author on this topic. Is anyone willing to help? Thanks in advance. Also, it is not clear if such a document is destined to become a Unicode Technical Report (UTR / PDUTR etc.), or if it should just be an informal write-up. I am guessing this is supposed to be somewhat informal but at the same time it (or the results of it) ought to appear in the UTC Document Search.

The current terminology that I am considering pursuing is "beyond ASCII", in various permutations, such as "beyond the ASCII range", "characters beyond ASCII", "code points beyond ASCII", etc. The term "beyond" implies a certain directionality, and to that extent, implies the Unicode repertoire as well as a Unicode encoding. We have seen on this list the backflips required to clarify "non-ASCII", since things that are not ASCII literally could be a wide range of things.

I think there is some confusion about whether the term "Basic Latin" excludes the C0 control character range. Formally the standard seems clear enough to me that it is coterminous with ASCII, but there is still confusion if you don't pore through the Standard. My thought is that maybe the Blocks.txt data should be modified to say "ASCII (Basic Latin)" instead of just "Basic Latin". (If we "go there", I would appreciate the wisdom of an experienced Unicode co-author. I am not confident touching that just by myself.)
Sean From mark at macchiato.com Tue Sep 29 11:33:36 2015 From: mark at macchiato.com (Mark Davis ☕️) Date: Tue, 29 Sep 2015 18:33:36 +0200 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <560ABA62.3030004@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> Message-ID: I think the term "non-ASCII Unicode" is just fine, and we don't need anything beyond that. It is clearly those Unicode characters that aren't (2) in http://unicode.org/glossary/#ASCII. Mark *Il meglio è l'inimico del bene* On Tue, Sep 29, 2015 at 6:20 PM, Sean Leonard wrote: > On 9/21/2015 5:17 PM, Peter Constable wrote: > >> If you think it's a serious problem that there isn't one conventional >> term for "characters outside the ASCII repertoire" or "UTF-8 >> multi-code-unit encoded representations" (since different authors could >> devise different terminology solutions), then I suggest you submit a >> document to UTC explaining why it's a problem, documenting inconsistent or >> unclear terminology that's been used in some standards / public >> specifications, and requesting that Unicode formally define terminology for >> these concepts. I can't guarantee that UTC will do it, but I can predict >> with confidence that it _won't_ do anything of that nature if nobody >> submits such a document. Peter >> > > I am of the mind to do just that, then. I have seen different documents, > standards, and standards bodies that have invented terminology around this > term, and they are not always the same. Since these standards depend on > Unicode, it would make a lot of sense for Unicode formally to define > terminology for these concepts. With the proliferation of UTF-8 (among > other things), the boundary between 0x7F - 0x80 is more significant than > the boundary between 0xFFFF - 0x10000.
> > Since this will be my first submission I would appreciate a co-author on > this topic. Is anyone willing to help? Thanks in advance. Also, it is not > clear if such a document is destined to become a Unicode Technical Report > (UTR / PDUTR etc.), or if it should just be an informal write-up. I am > guessing this is supposed to be somewhat informal but at the same time it > (or the results of it) ought to appear in the UTC Document Search. > > The current terminology that I am considering pursuing is "beyond ASCII", > in various permutations, such as "beyond the ASCII range", "characters > beyond ASCII", "code points beyond ASCII", etc. The term "beyond" implies a > certain directionality, and to that extent, implies the Unicode repertoire > as well as a Unicode encoding. We have seen on this list the blackflips > required to clarify "non-ASCII", since things that are not ASCII literally > could be a wide range of things. > > I think there is some confusion about whether the term "Basic Latin" > excludes the C0 control character range. Formally the standard seems clear > enough to me that it is co-terminus with ASCII, but there is still > confusion if you don't pore through the Standard. My thought is that maybe > the Blocks.txt data should be modified to say "ASCII (Basic Latin)" instead > of just "Basic Latin". (If we "go there", I would appreciate the wisdom of > an experienced Unicode co-author. I am not confident touching that just by > myself.) > > Sean > -------------- next part -------------- An HTML attachment was scrubbed... 
From daniel.buenzli at erratique.ch Tue Sep 29 11:40:47 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Tue, 29 Sep 2015 17:40:47 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <560ABA62.3030004@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> Message-ID: <37E989CA57904F04B77DF2820E2919DB@erratique.ch> I would say there's already enough terminology in the Unicode world to add more to it. This thread already hinted at enough ways of expressing what you'd like, the simplest one being "scalar values greater than U+001F". This is the clearest you can come up with and anybody who has basic knowledge of the Unicode standard will immediately understand what you are talking about without having to lookup further definitions. Best, Daniel From lists+unicode at seantek.com Tue Sep 29 12:30:59 2015 From: lists+unicode at seantek.com (Sean Leonard) Date: Tue, 29 Sep 2015 10:30:59 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <37E989CA57904F04B77DF2820E2919DB@erratique.ch> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> Message-ID: <560ACAD3.3050606@seantek.com> On 9/29/2015 9:40 AM, Daniel Bünzli wrote: > I would say there's already enough terminology in the Unicode world to add more to it. This thread already hinted at enough ways of expressing what you'd like, the simplest one being "scalar values greater than U+001F". This is the clearest you can come up with and anybody who has basic knowledge of the Unicode standard Uh...I think you mean U+007F?
:) Perhaps it's because I'm writing to the Unicode crowd, but honestly there are a lot of very intelligent software engineers/standards folks who do not have the "basic knowledge of the Unicode standard" that is being presumed. They want to focus on other parts of their systems or protocols, and when it comes to the "text part", they just hand-wave and say "Unicode!" and call it a day. In particular there is a flow-down effect where terms from one standards body don't match with another standards body, perhaps because they got redefined over time for various reasons. The distinction between "characters", "abstract characters", "code points", and "scalar values" is not intuitively obvious to people without specialized knowledge of text processing issues. The fact that (modern implementations of) UTF-8 encoders and decoders are not supposed to process the surrogate code points (arbitrarily), for example, is a rather advanced topic that presumes knowledge of the interaction between UTF-16, UTF-8, what surrogate code points actually are, and the security implications of so-doing (UTR-36). Furthermore one has to parse the distinction between "well-formed" and "ill-formed". In the twenty minutes since my last post, I got two different responses...and as you pointed out, there are a lot of ways to express what one would like. I would prefer one, uniform way (hence, "standardized way"). Just surveying the various standards that have tried to tackle this distinction with their own organic terminology will probably be revealing. Evidence-based should be the yardstick. 
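The point about surrogate code points can be checked directly. What follows is a minimal sketch in Python, assuming CPython 3's built-in strict UTF-8 codec; the helper names `is_scalar_value` and `encodes_as_utf8` are illustrative only and come from no standard or library.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for code points U+0000..U+10FFFF, excluding surrogates U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def encodes_as_utf8(code_point: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # The strict codec refuses lone surrogates.
        return False


# A surrogate is a code point but not a scalar value, so it has no UTF-8 form.
for cp in (0x0041, 0x007F, 0x0080, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print(f"U+{cp:04X} scalar={is_scalar_value(cp)} utf8={encodes_as_utf8(cp)}")
```

In this sketch the two predicates agree on every sample, which is the sense in which a conforming UTF-8 encoder works on scalar values rather than on arbitrary code points.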
Best regards, Sean From daniel.buenzli at erratique.ch Tue Sep 29 13:02:54 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Tue, 29 Sep 2015 19:02:54 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <560ACAD3.3050606@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> Message-ID: <15768711EDE545B4888F5DEF40F90E1D@erratique.ch> On Tuesday, 29 September 2015 at 18:30, Sean Leonard wrote: > Uh...I think you mean U+007F? :) Yes... see how it was easy to point out that the definition was wrong. It would also have been, if this was code and we were talking about a protocol whose specification was using this notation rather than a new Unicode concept. > Perhaps it's because I'm writing to the Unicode crowd, but honestly > there are a lot of very intelligent software engineers/standards folks > who do not have the "basic knowledge of the Unicode standard" that is > being presumed. They want to focus on other parts of their systems or > protocols, and when it comes to the "text part", they just hand-wave and > say "Unicode!" and call it a day. Introducing more terminology and jargon is not going to help in this case. Make the definitions as obvious as possible and strive for minimality in the exposed concepts. > The fact that (modern implementations of) UTF-8 encoders and decoders are not supposed to process the surrogate code points (arbitrarily), for example, is a > rather advanced topic I wouldn't say this is advanced knowledge, this is basic knowledge any programmer dealing with Unicode text should have. FWIW this [1] is the absolute minimal knowledge I think programmers should have about Unicode (the last section can be skipped it's specific to a programming language). This corresponds to maybe 3 to 4 A4 pages.
If your programmers are not able to grok this small amount of knowledge, hire better ones. Best, Daniel [1] http://erratique.ch/software/uucp/doc/Uucp.html#uminimal From kenwhistler at att.net Tue Sep 29 13:50:40 2015 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 29 Sep 2015 11:50:40 -0700 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <560ACAD3.3050606@seantek.com> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> Message-ID: <560ADD80.3030709@att.net> On 9/29/2015 10:30 AM, Sean Leonard wrote: > On 9/29/2015 9:40 AM, Daniel B?nzli wrote: >> I would say there's already enough terminology in the Unicode world >> to add more to it. This thread already hinted at enough ways of >> expressing what you'd like, the simplest one being "scalar values >> greater than U+001F". This is the clearest you can come up with and >> anybody who has basic knowledge of the Unicode standard > Uh...I think you mean U+007F? :) I agree that "scalar values greater than U+007F" doesn't just trip off the tongue, and while technically accurate, it is bad terminology -- precisely because it begs the question "wtf are 'scalar values'?!" for the average engineer. > > Perhaps it's because I'm writing to the Unicode crowd, but honestly > there are a lot of very intelligent software engineers/standards folks > who do not have the "basic knowledge of the Unicode standard" that is > being presumed. They want to focus on other parts of their systems or > protocols, and when it comes to the "text part", they just hand-wave > and say "Unicode!" and call it a day. ... 
Well, from this discussion, and from my experience as an engineer, I think this comes down to people in other standards, practices, and protocols dealing with the ages old problem of on beyond zebra for characters, where the comfortable assumptions that byte=character break down and people have to special case their code and documentation. Where buffers overrun, where black hat hackers rub their hands in glee, and where engineers exclaim, "Oh gawd! I can't just cast this character, because it's actually an array!"

And nowadays, we are in the age of universal Unicode. All (well, much, anyway) would be cool if everybody were using UTF-32, because then at least we'd be back to 32-bit-word=character, and the programming would be easier. But UTF-32 doesn't play well with existing protocols and APIs and storage and... So instead, we are in the age of "universal Unicode and almost always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII. U+0000..U+007F. Good because they are all single bytes in UTF-8 and because then UTF-8 strings just work like the Computer Science God always intended, and we don't have to do anything special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple bytes to represent in UTF-8 and so break all the simple assumptions about string and buffer length. They make for bugs and more bugs and why oh why do I have to keep dealing with edge cases where character boundaries don't line up with allocated buffer boundaries?!!

I think we can agree that there are two types of characters -- and that those code point ranges correctly identify the sets in question. The problem then just becomes a matter of terminology (in the standards sense of "terminology") -- coming up with usable, clear terms for the two sets.
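The two sets described above can be illustrated with a short sketch. It assumes Python 3 and its built-in UTF-8 codec; the function names `is_ascii` and `utf8_length` and the sample code points are illustrative choices, not terminology from any standard.

```python
def is_ascii(scalar_value: int) -> bool:
    """True for the first set: U+0000..U+007F."""
    return 0x0000 <= scalar_value <= 0x007F


def utf8_length(scalar_value: int) -> int:
    """Number of bytes UTF-8 uses to encode one Unicode scalar value."""
    return len(chr(scalar_value).encode("utf-8"))


# Sample points on either side of the 0x7F / 0x80 boundary.
samples = {
    0x0041: "LATIN CAPITAL LETTER A",
    0x007F: "DELETE (last ASCII code point)",
    0x0080: "first non-ASCII code point",
    0x00E9: "LATIN SMALL LETTER E WITH ACUTE",
    0x20AC: "EURO SIGN",
    0x1F37A: "BEER MUG",
}

for cp, name in samples.items():
    print(f"U+{cp:04X} {name}: ascii={is_ascii(cp)}, utf8_bytes={utf8_length(cp)}")

# A string's length in characters matches its length in bytes only when
# every character is in the first set.
text = "caf\u00e9"
print(len(text), len(text.encode("utf-8")))
```

For these samples, a code point is in the ASCII range exactly when its UTF-8 form is a single byte, which is the buffer-length assumption the message says breaks for everything from U+0080 upward.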
To be good terminology, the terms have to be identifiable and neither too generic ("good characters" and "bad characters") nor too abstruse or wordy ("scalar values less than or equal to U+007F" and "scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8" and "multi-byte UTF-8" might work for engineers, but is a confusing distinction, because UTF-8 as an encoding form is inherently multi-byte, and such terminology would undermine the meaning of UTF-8 itself.

Finally, to be good terminology, the terms need to have some reasonable chance of catching on and actually being used. It is fairly pointless to have a "standardized way" of distinguishing the #1 and #2 types of characters if people either don't know about that standardized way or find it misleading or not helpful, and instead continue groping about with their existing ad hoc terms anyway.

>
> In the twenty minutes since my last post, I got two different
> responses...and as you pointed out, there are a lot of ways to express
> what one would like. I would prefer one, uniform way (hence,
> "standardized way").

Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it in the Unicode glossary, and promote it relentlessly, it might catch on more generally, and solve the problem.

More irreverently, perhaps we could come up with complete neologisms that might be catchy enough to go viral -- at least among the protocol writers and engineers who matter for this. Riffing on the small/big distinction and connecting it to "u-*nichar*" for the engineers, maybe something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a budding terminologist out there who could improve on that suggestion!
At any rate, any formal contribution that suggests coming up with terminology for the #1 and #2 sets should take these considerations under advisement. And unless it suggests something that would pretty easily gain consensus as demonstrably better than the #1 and #2 terms suggested above by Mark, it might not result in any change in actual usage. --Ken From daniel.buenzli at erratique.ch Tue Sep 29 14:27:28 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Tue, 29 Sep 2015 20:27:28 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <560ADD80.3030709@att.net> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net> Message-ID: <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch> On Tuesday, 29 September 2015 at 19:50, Ken Whistler wrote: > I agree that "scalar values greater than U+007F" doesn't just trip off the tongue, > and while technically accurate, it is bad terminology -- precisely because it > begs the question "wtf are 'scalar values'?!" for the average engineer. And an average engineer knows how to lookup definitions, that one being precise and exceptionally well defined in the Unicode glossary -- in stark contrast to the shady (and deceiving for the newbie) notion of "character" that you use subsequently in your message. This is not "bad terminology", it's *precise* terminology and what I would like to see used in protocols and standards. Many programmers I talk to are confused by Unicode because their notion of Unicode "character" is a chaotic mix of scalar values, code points and their various *encodings* (i.e. byte level considerations). Introducing more terminology to talk about that confused idea of Unicode is not going to help.
Educating about the difference between scalar values, code points and their various encodings will. Best, Daniel From richard.wordingham at ntlworld.com Tue Sep 29 15:03:54 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 29 Sep 2015 21:03:54 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <37E989CA57904F04B77DF2820E2919DB@erratique.ch> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> Message-ID: <20150929210354.4cf6b154@JRWUBU2> On Tue, 29 Sep 2015 17:40:47 +0100 Daniel Bünzli wrote: > I would say there's already enough terminology in the Unicode world > to add more to it. This thread already hinted at enough ways of > expressing what you'd like, the simplest one being "scalar values > greater than U+001F". Too wordy and clearly prone to error! Richard. From daniel.buenzli at erratique.ch Tue Sep 29 15:59:49 2015 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Tue, 29 Sep 2015 21:59:49 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <20150929210354.4cf6b154@JRWUBU2> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <20150929210354.4cf6b154@JRWUBU2> Message-ID: <49AB134C2B1D4B73B5FE9F106CF8CA35@erratique.ch> On Tuesday, 29 September 2015 at 21:03, Richard Wordingham wrote: > Too wordy and clearly prone to error! Yes and maybe that "average engineer" does not understand negation. So clearly any of non-ASCII, non-Basic Latin or greater than U+007F cannot fit. Bring in the bureaucrats, new terminology is needed, there are not enough useless definitions in the Unicode standard, let's add a few more.
Daniel From richard.wordingham at ntlworld.com Tue Sep 29 16:27:02 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 29 Sep 2015 22:27:02 +0100 Subject: Concise term for non-ASCII Unicode characters In-Reply-To: <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch> References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net> <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch> Message-ID: <20150929222702.11905051@JRWUBU2> On Tue, 29 Sep 2015 20:27:28 +0100 Daniel Bünzli wrote: > On Tuesday, 29 September 2015 at 19:50, Ken Whistler wrote: > > I agree that "scalar values greater than U+007F" doesn't just trip > > off the tongue, and while technically accurate, it is bad > > terminology -- precisely because it begs the question "wtf are > > 'scalar values'?!" for the average engineer. > > And an average engineer knows how to lookup definitions, that one > being precise and exceptionally well defined in the Unicode glossary > -- in stark contrast to the shady (and deceiving for the newbie) > notion of "character" that you use subsequently in your message. The glossary might fool a 'newbie' (the declared target audience), but it's riddled enough with errors to dispel confidence.
Just looking at the entries before 'ASCII':

OK: 'Abstract character sequence' (if one has a usable understanding of 'abstract character'); 'accent mark', 'acrophonic', 'akshara' (though the spelling with neither an 'h' nor a dot below is weird); 'algorithm', 'alphabet' (though saying that modern Lao and pointed Hebrew use alphabets is probably not very helpful), 'alphabetic' (though it's not obvious to me why ARABIC SUKUN is alphabetic but potentially visible viramas are not), 'alphabetic sorting', 'annotation', 'apparatus criticus', 'Arabic Indic digits' (though are 'European digits' derived from the digits of the eastern part of the Arab world?)

Dodgy:

'Abjad' (living abjads also mark vowels, with some vowels having characters dignified as 'letters'). Does normal Egyptian hieroglyphic writing constitute an abjad?

'Abstract character' - but then the definition makes no sense.

'Abugida' - needs 'consonants' and 'vowels' to be qualified by 'most', otherwise it won't even work for Classical Sanskrit in Devanagari. Vowel letters and visarga are the principal problems.

'ANSI' - I don't think the Windows code pages for UTF-8 and UTF-16 are 'ANSI'.

'Arabic digits' - aren't the European digits used in western Arabic as native as the eastern Arabic digits (U+0660 etc.) used in eastern Arabic?

11 more-or-less OK versus 5 dodgy does not generate a great deal of confidence in the glossary.

I appreciate that the difference between abjad, abugida and alphabet is difficult to capture, as abjads and abugidas can evolve into alphabets.

Richard.
From lists+unicode at seantek.com Tue Sep 29 22:40:48 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Tue, 29 Sep 2015 20:40:48 -0700
Subject: Concise term for non-ASCII Unicode characters
In-Reply-To: <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch>
References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net> <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch>
Message-ID: <560B59C0.1040503@seantek.com>

On 9/29/2015 12:27 PM, Daniel Bünzli wrote:
> On Tuesday, 29 September 2015 at 19:50, Ken Whistler wrote:
>> I agree that "scalar values greater than U+007F" doesn't just trip off the tongue,
>> and while technically accurate, it is bad terminology -- precisely because it
>> begs the question "wtf are 'scalar values'?!" for the average engineer.
> And an average engineer knows how to look up definitions, that one being precise and exceptionally well defined in the Unicode glossary -- in stark contrast to the shady (and deceiving for the newbie) notion of "character" that you use subsequently in your message.
>
> This is not "bad terminology", it's *precise* terminology and what I would like to see used in protocols and standards.
>
> Many programmers I talk to are confused by Unicode because their notion of Unicode "character" is a chaotic mix of scalar values, code points and their various *encodings* (i.e. byte level considerations).

+1

I like the definition of "character" in ASCII:

3.3 Character. A member of a set of elements used for the organization, control, or representation of data.

This, by the way, is the exact same definition as in ISO 646, ISO 2022, and yes, even ISO 10646 (2003).

It was the best of times...
Sean

From asmus-inc at ix.netcom.com Tue Sep 29 23:14:58 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Tue, 29 Sep 2015 21:14:58 -0700
Subject: Concise term for non-ASCII Unicode characters
In-Reply-To: <560B59C0.1040503@seantek.com>
References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net> <3BBF492EEB4647AEA0C90E0EC8FAE6C6@erratique.ch> <560B59C0.1040503@seantek.com>
Message-ID: <560B61C2.7030006@ix.netcom.com>

An HTML attachment was scrubbed...

From lists+unicode at seantek.com Wed Sep 30 00:07:35 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Tue, 29 Sep 2015 22:07:35 -0700
Subject: Concise term for non-ASCII Unicode characters
In-Reply-To: <560ADD80.3030709@att.net>
References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net>
Message-ID: <560B6E17.9050504@seantek.com>

On 9/29/2015 11:50 AM, Ken Whistler wrote:
>
>
> On 9/29/2015 10:30 AM, Sean Leonard wrote:
>> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>>> I would say there's already enough terminology in the Unicode world
>>> to add more to it. This thread already hinted at enough ways of
>>> expressing what you'd like, the simplest one being "scalar values
>>> greater than U+001F". This is the clearest you can come up with and
>>> anybody who has basic knowledge of the Unicode standard
>> Uh...I think you mean U+007F? :)
>
> I agree that "scalar values greater than U+007F" doesn't just trip off the tongue, and while technically accurate, it is bad terminology -- precisely because it begs the question "wtf are 'scalar values'?!" for the average engineer.
>
>>
>> Perhaps it's because I'm writing to the Unicode crowd, but honestly
>> there are a lot of very intelligent software engineers/standards
>> folks who do not have the "basic knowledge of the Unicode standard"
>> that is being presumed. They want to focus on other parts of their
>> systems or protocols, and when it comes to the "text part", they just
>> hand-wave and say "Unicode!" and call it a day. ...
>
> Well, from this discussion, and from my experience as an engineer, I think this comes down to people in other standards, practices, and protocols dealing with the ages old problem of on beyond zebra for characters, where the comfortable assumptions that byte=character break down and people have to special case their code and documentation. Where buffers overrun, where black hat hackers rub their hands in glee, and where engineers exclaim, "Oh gawd! I can't just cast this character, because it's actually an array!"
>
> And nowadays, we are in the age of universal Unicode. All (well, much, anyway) would be cool if everybody were using UTF-32, because then at least we'd be back to 32-bit-word=character, and the programming would be easier. But UTF-32 doesn't play well with existing protocols and APIs and storage and... So instead, we are in the age of "universal Unicode and almost always UTF-8."
>
> So that leaves us with two types of characters:
>
> 1. "Good characters"
>
> These are true ASCII. U+0000..U+007F. Good because they are all single bytes in UTF-8 and because then UTF-8 strings just work like the Computer Science God always intended, and we don't have to do anything special.
>
> 2. "Bad characters"
>
> Everything else: U+0080..U+10FFFF. Bad because they require multiple bytes to represent in UTF-8 and so break all the simple assumptions about string and buffer length.
> They make for bugs and more bugs and why oh why do I have to keep dealing with edge cases where character boundaries don't line up with allocated buffer boundaries?!!
>
> I think we can agree that there are two types of characters -- and that those code point ranges correctly identify the sets in question.
>
> The problem then just becomes a matter of terminology (in the standards sense of "terminology") -- coming up with usable, clear terms for the two sets. To be good terminology, the terms have to be identifiable and neither too generic ("good characters" and "bad characters") nor too abstruse or wordy ("scalar values less than or equal to U+007F" and "scalar values greater than U+007F").
>
> They also need to not be confusing. For example, "single-byte UTF-8" and "multi-byte UTF-8" might work for engineers, but is a confusing distinction, because UTF-8 as an encoding form is inherently multi-byte, and such terminology would undermine the meaning of UTF-8 itself.
>
> Finally, to be good terminology, the terms need to have some reasonable chance of catching on and actually being used. It is fairly pointless to have a "standardized way" of distinguishing the #1 and #2 types of characters if people either don't know about that standardized way or find it misleading or not helpful, and instead continue groping about with their existing ad hoc terms anyway.
>
>>
>> In the twenty minutes since my last post, I got two different
>> responses...and as you pointed out, there are a lot of ways to
>> express what one would like. I would prefer one, uniform way (hence,
>> "standardized way").
>
> Mark's point was that it is hard to improve on what we already have:
>
> 1. ASCII Unicode [characters] (i.e. U+0000..U+007F)
>
> 2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)
>
> If we just highlight that terminology more prominently, emphasize it in the Unicode glossary, and promote it relentlessly, it might catch on more generally, and solve the problem.
>
> More irreverently, perhaps we could come up with complete neologisms that might be catchy enough to go viral -- at least among the protocol writers and engineers who matter for this. Riffing on the small/big distinction and connecting it to "u-*nichar*" for the engineers, maybe something along the lines of:
>
> 1. skinnichar
>
> 2. baloonichar
>
> Well, maybe not those! But you get the idea. I'm sure there is a budding terminologist out there who could improve on that suggestion!
>
> At any rate, any formal contribution that suggests coming up with terminology for the #1 and #2 sets should take these considerations under advisement. And unless it suggests something that would pretty easily gain consensus as demonstrably better than the #1 and #2 terms suggested above by Mark, it might not result in any change in actual usage.

Thank you for this post. Slightly tongue-in-cheek but I think that it captures the issues at play.

Sean

From lists+unicode at seantek.com Wed Sep 30 01:12:25 2015
From: lists+unicode at seantek.com (Sean Leonard)
Date: Tue, 29 Sep 2015 23:12:25 -0700
Subject: Beyond ASCII
In-Reply-To: <560ADD80.3030709@att.net>
References: <55FEC721.7040008@seantek.com> <55FF5494.602@it.aoyama.ac.jp> <55FFBE36.5030104@seantek.com> <560ABA62.3030004@seantek.com> <37E989CA57904F04B77DF2820E2919DB@erratique.ch> <560ACAD3.3050606@seantek.com> <560ADD80.3030709@att.net>
Message-ID: <560B7D49.1030100@seantek.com>

On 9/29/2015 11:50 AM, Ken Whistler wrote:
> At any rate, any formal contribution that suggests coming up with terminology for the #1 and #2 sets should take these considerations under advisement.
The original premise of this thread was (and is) to find the *most concise* term for that range U+0080 - U+10FFFF, regardless of whether that range is for characters, code points, scalar values, or coffee cup icons. Preferably, such a concise term would have support in the Unicode Standard, or in some other standard. I was not looking for a totally new, invented term, but rather a term that has empirical, standards-based support.

A full survey of the Unicode Standard 8.0 finds that the term "beyond ASCII" has textual support:

p. 1 Introduction:
While taking the ASCII character set as its starting point, the Unicode Standard goes far beyond ASCII's limited ability [...]

p. 37 ASCII Transparency:
[UTF-8] maintains transparency for all of the ASCII code points (0x00..0x7F). That means Unicode code points U+0000..U+007F are [thus] indistinguishable from ASCII itself. [...] Beyond the ASCII range of Unicode, many [...] scripts are represented by two bytes [in UTF-8...]

p. 200 Programming Languages:
A limitation of the ISO/ANSI C model is its assumption that characters can always be processed in isolation. Implementations that choose to go beyond the ISO/ANSI C model may find it useful to mix widths within their APIs.
{This formulation is not "beyond ASCII", but uses the preposition "beyond" in the exact same sense, since ASCII is fixed-width and forms an underlying assumption of the ISO/ANSI C model.}

p. 237 Case Mappings:
A number of complications to case mappings occur once the repertoire of characters is expanded beyond ASCII.

p. 677 Han / CJK Unified Ideographs Extension B:
The ideographs in the CJK Unified Ideographs Extension B block represent an additional set of 42,711 unified ideographs beyond the 27,496 included in The Unicode Standard, Version 3.0.
{This formulation uses the preposition "beyond" in the exact same sense, namely, a subsequent range that is beyond the original range.}

Ditto for Extension C, Extension D, Extension E

Finally, (case) "beyond ASCII" is in the Index at p. 237.

Perhaps this thread would have gone differently if the original subject was "Beyond ASCII" instead of...that other one.

Now, I am not saying that the term *must* be "beyond ASCII". However the term "non-ASCII" (with or without "Unicode") has no support in the Unicode Standard 8.0. The only occurrence is the reference to RFC 2047, and in that document, "non-ASCII" clearly means any and every character encoding ever invented, not specifically Unicode.

Another thing is the oxymoron "ASCII Unicode" (the opposite of "non-ASCII Unicode"). Actually ASCII is a formal subset of Unicode...at the beginning. ASCII itself (ANSI X3.4-1986) is a 7-bit character set; it does not limit itself to any particular word length so long as the 7 bits are in those combinations. Therefore U+0000 - U+007F characters encoded in UTF-32 or UTF-16 are in ASCII codes; they are truly ASCII characters. When a bit combination '?' (0x3F) is loaded into a 64-bit register on a CPU, is it still an ASCII character? My view is yes. They are not in ASCII *encoding*, as *encoding* is limited to a sequence of 7-bit or 8-bit combinations (X3.4-1986 Section 2.1.1(1)).

My point here is that to be correct, one ought to use some sort of preposition, namely "ASCII in Unicode" or "ASCII [characters/code points/scalar values] in Unicode"--but if you slice off "in Unicode", you are left with "ASCII" and that is just fine. This is another basis for the proposition that "beyond ASCII" (e.g., "characters beyond ASCII [in Unicode]", "beyond the ASCII range [of Unicode]") makes sense.
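
The two ranges argued over in this thread can be sketched in a few lines of code. This is only an illustration, assuming Python 3; the helper names `is_beyond_ascii` and `utf8_length` are hypothetical and come from no standard. It shows that U+007F is the last scalar value occupying a single byte in UTF-8, and that everything in U+0080..U+10FFFF takes two to four bytes.

```python
def is_beyond_ascii(ch: str) -> bool:
    """True if the single code point lies in U+0080..U+10FFFF (hypothetical helper)."""
    return ord(ch) > 0x7F


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    # Note: a lone surrogate is a code point but not a scalar value,
    # so encoding it here would raise UnicodeEncodeError.
    return len(ch.encode("utf-8"))


# U+0041, U+007F (last ASCII), U+0080 (first beyond ASCII),
# U+00E9, U+0E01 (a Thai letter), U+1F37A (BEER MUG)
for ch in ["A", "\u007F", "\u0080", "\u00E9", "\u0E01", "\U0001F37A"]:
    print(f"U+{ord(ch):04X} beyond_ascii={is_beyond_ascii(ch)} "
          f"utf8_bytes={utf8_length(ch)}")
```

The built-in `str.isascii()` (Python 3.7 and later) answers the complementary question for a whole string.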
Regards,

Sean

From jsoconner at gmail.com Wed Sep 30 11:33:25 2015
From: jsoconner at gmail.com (John O'Conner)
Date: Wed, 30 Sep 2015 16:33:25 +0000
Subject: Unicode in passwords
Message-ID:

I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. My searching of the unicode.org site showed me a general security considerations document (UTR #36) but nothing specific for password policies using Unicode.

Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

Best regards,
John O'Conner

From marc.blanchet at viagenie.ca Wed Sep 30 12:35:05 2015
From: marc.blanchet at viagenie.ca (Marc Blanchet)
Date: Wed, 30 Sep 2015 13:35:05 -0400
Subject: Unicode in passwords
In-Reply-To:
References:
Message-ID: <8317696F-EC58-4F84-8CDE-69A521ECF0FF@viagenie.ca>

On 30 Sep 2015, at 12:33, John O'Conner wrote:

> I'm researching potential problems and best practices for password policies that allow non-Latin-1 Unicode characters. My searching of the unicode.org site showed me a general security considerations document (UTR #36) but nothing specific for password policies using Unicode.
>
> Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

The IETF has been doing work related to this exact issue. You might want to look at RFC 7564 (generic framework) and RFC 7613 (usernames and passwords, used in various IETF protocols).

Marc.
>
> Best regards,
> John O'Conner

From haberg-1 at telia.com Wed Sep 30 15:29:55 2015
From: haberg-1 at telia.com (Hans Åberg)
Date: Wed, 30 Sep 2015 22:29:55 +0200
Subject: Unicode in passwords
In-Reply-To:
References:
Message-ID: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com>

> On 30 Sep 2015, at 18:33, John O'Conner wrote:
>
> Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

On UNIX computers, one computes a hash (like SHA-256), which is then used to authenticate the password up to a high probability. The hash is stored in the open, but it is not known how to compute the password from the hash, so knowing the hash does not easily allow authentication.

So if the password is encoded in say UTF-8 and then hashed, it would seem to take care of most problems.

From clarkcox3 at gmail.com Wed Sep 30 18:15:30 2015
From: clarkcox3 at gmail.com (Clark S. Cox III)
Date: Wed, 30 Sep 2015 16:15:30 -0700
Subject: Unicode in passwords
In-Reply-To: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com>
Message-ID: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com>

> On 2015/09/30, at 13:29, Hans Åberg wrote:
>
>
>> On 30 Sep 2015, at 18:33, John O'Conner wrote:
>>
>> Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?
>
> On UNIX computers, one computes a hash (like SHA-256), which is then used to authenticate the password up to a high probability. The hash is stored in the open, but it is not known how to compute the password from the hash, so knowing the hash does not easily allow authentication.
>
> So if the password is

... normalized and then ...
> encoded in say UTF-8 and then hashed, it would seem to take care of most problems.

You really wouldn’t want “Schlüssel” and “Schlüssel” being different passwords, would you?

(assuming that my mail client and/or OS is not interfering, the first is NFC, while the second is NFD)

From richard.wordingham at ntlworld.com Wed Sep 30 19:23:09 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 1 Oct 2015 01:23:09 +0100
Subject: Unicode in passwords
In-Reply-To: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com>
Message-ID: <20151001012309.4b5f5e85@JRWUBU2>

On Wed, 30 Sep 2015 16:15:30 -0700 "Clark S. Cox III" wrote:

> You really wouldn’t want “Schlüssel” and “Schlüssel” being different
> passwords, would you?

It'd make them slightly safer to write down!

I trust the tradition of truncating Unix passwords to 8 bytes is well and truly defunct - that'd reduce Thai passwords to two characters plus one bit!

Richard.

From jonathan.rosenne at gmail.com Wed Sep 30 23:11:32 2015
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Thu, 1 Oct 2015 07:11:32 +0300
Subject: Unicode in passwords
In-Reply-To: <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com>
References: <4057F09C-5426-40ED-AB3D-356BBBC85276@telia.com> <2E86845B-6A9F-4D5C-A02A-C2C82F4DB28F@gmail.com>
Message-ID: <000601d0fbff$42881070$c7983150$@gmail.com>

For languages such as Java, passwords should be handled as byte arrays rather than strings. This may make it difficult to apply normalization.

Jonathan Rosenne

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Clark S. Cox III
Sent: Thursday, October 01, 2015 2:16 AM
To: Hans Åberg
Cc: unicode at unicode.org; John O'Conner
Subject: Re: Unicode in passwords

On 2015/09/30, at 13:29, Hans Åberg wrote:

On 30 Sep 2015, at 18:33, John O'Conner wrote:

Can you recommend any documents to help me understand potential issues (if any) for password policies and validation methods that allow characters from more "exotic" portions of the Unicode space?

On UNIX computers, one computes a hash (like SHA-256), which is then used to authenticate the password up to a high probability. The hash is stored in the open, but it is not known how to compute the password from the hash, so knowing the hash does not easily allow authentication.

So if the password is

... normalized and then ...

encoded in say UTF-8 and then hashed, it would seem to take care of most problems.

You really wouldn’t want “Schlüssel” and “Schlüssel” being different passwords, would you?

(assuming that my mail client and/or OS is not interfering, the first is NFC, while the second is NFD)
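
The "normalize, then encode as UTF-8, then hash" sequence discussed in this thread can be sketched as follows. This is a minimal illustration assuming Python 3 and its standard `unicodedata` and `hashlib` modules; `password_digest` is a hypothetical name, and the choice of NFC is an assumption made for the example. The bare SHA-256 mirrors the hash named in the thread and is not a recommendation; a real system would normally use a salted, deliberately slow password-hashing scheme.

```python
import hashlib
import unicodedata


def password_digest(password: str, form: str = "NFC") -> str:
    """Normalize, encode as UTF-8, then hash (hypothetical helper).

    NFC is an assumed choice of normalization form for this sketch.
    """
    normalized = unicodedata.normalize(form, password)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


nfc = "Schl\u00FCssel"    # precomposed U+00FC (NFC form)
nfd = "Schlu\u0308ssel"   # 'u' followed by U+0308 COMBINING DIAERESIS (NFD form)

# Without normalization the two spellings hash to different values.
raw_nfc = hashlib.sha256(nfc.encode("utf-8")).hexdigest()
raw_nfd = hashlib.sha256(nfd.encode("utf-8")).hexdigest()
print("raw digests equal:       ", raw_nfc == raw_nfd)

# With normalization applied first, they collapse to the same digest.
print("normalized digests equal:", password_digest(nfc) == password_digest(nfd))
```

Because normalization operates on text rather than bytes, it has to happen before the UTF-8 encoding step, which is the practical difficulty raised above for code that keeps passwords only as byte arrays.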