From alastair at alastairs-place.net Wed Mar 1 03:43:57 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Wed, 1 Mar 2017 09:43:57 +0000 Subject: Northern Khmer on iPhone In-Reply-To: <20170228210056.6e56fcf9@JRWUBU2> References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> Message-ID: <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> On 28 Feb 2017, at 21:00, Richard Wordingham wrote: > > On Tue, 28 Feb 2017 07:37:10 +0000 > Richard Wordingham wrote: > >> Does iPhone support the use of Northern Khmer in Thai script? I would >> count an interface in Thai as support. >> >> The reason I ask is that I tried entering the word ??? > CHARACTER KO KAI, U+0E3A THAI CHARACTER PHINTHU, U+0E35 THAI CHARACTER >> SARA II> 'he' and got a dotted circle. I also got a dotted circle for >> the alternative spelling . > > It's been suggested to me that this is just a font issue. > Unfortunately, it seems that one can't change the font without > jailbreaking the phone. It?s definitely a font issue - the same problem exists on macOS Sierra (if I change the message to Rich Text, such that the font used is Helvetica, I see the same dotted circle problem; the fixed-width font I use, SF Mono, does not have this problem?). The best solution here may be to file a bug report at asking for font support, assuming the program you were using is using one of the Apple supplied fonts. (Also, FYI, iOS applications can - and some do - install and use their own fonts. It?s per-application, though; you can?t install them system-wide.) Kind regards, Alastair. -- http://alastairs-place.net From jean.aurambault at gmail.com Wed Mar 1 14:56:23 2017 From: jean.aurambault at gmail.com (Jean Aurambault) Date: Wed, 1 Mar 2017 12:56:23 -0800 Subject: Translations of city names Message-ID: Hi, I'm looking for (lightweight) libraries to translate city names, potentially country as well (but I know that's available in CLDR/ICU in some ways). Ideally it wouldn't need a database but rely on static assets. I'm wondering if there is any standard that defines a universal city id (similar to country codes). Wikipedia has lots of information on exonyms in different languages. I also found thing like http://www.geonames.org/ that seems to have a complete dataset with translations in many language but the relevant data would need to be extracted. Right now we use a old version of Maxmind library to get geolocated data that has no translation. Looks like new version have some translation but not enough language supported Any recommandation? Best, Jean -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Mar 1 15:37:07 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Wed, 1 Mar 2017 21:37:07 +0000 Subject: Translations of city names In-Reply-To: References: Message-ID: <20170301213707.733696d4@JRWUBU2> On Wed, 1 Mar 2017 12:56:23 -0800 Jean Aurambault wrote: > I'm wondering if there is any standard that defines a universal city > id (similar to country codes). ISO 3166-2 defines codes for some cities, but its uneven. However, what's a city? Does Constantinople exist? Richard. From unicode at lindenbergsoftware.com Thu Mar 2 03:06:57 2017 From: unicode at lindenbergsoftware.com (Norbert Lindenberg) Date: Thu, 2 Mar 2017 18:06:57 +0900 Subject: Northern Khmer on iPhone In-Reply-To: <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> Message-ID: <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com> On iOS, applications can and do install custom fonts for system-wide use, although the installation user experience is pretty bad: http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html Norbert > On Mar 1, 2017, at 18:43 , Alastair Houghton wrote: [?] > (Also, FYI, iOS applications can - and some do - install and use their own fonts. It?s per-application, though; you can?t install them system-wide.) From sisrivas at blueyonder.co.uk Thu Mar 2 04:22:00 2017 From: sisrivas at blueyonder.co.uk (srivas sinnathurai) Date: Thu, 2 Mar 2017 10:22:00 +0000 (GMT) Subject: Translations of city names In-Reply-To: <20170301213707.733696d4@JRWUBU2> References: <20170301213707.733696d4@JRWUBU2> Message-ID: <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> I think there is a telephone area code, throughout the world. > > On 01 March 2017 at 21:37 Richard Wordingham > wrote: > > > On Wed, 1 Mar 2017 12:56:23 -0800 > Jean Aurambault wrote: > > > I'm wondering if there is any standard that defines a universal city > > id (similar to country codes). > > ISO 3166-2 defines codes for some cities, but its uneven. However, > what's a city? Does Constantinople exist? > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Mar 2 05:20:40 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 2 Mar 2017 12:20:40 +0100 Subject: Translations of city names In-Reply-To: <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> Message-ID: Wrong, many countries have largely relaxed their phone number plans by using a single nation wide plan and allowed portability of numbers. Area codes are no longer needed (single call rate nation wide, the rate only depends on operators; and ranges of numbers are allocated also nationwide for value added services; long distance calls are things of the past since the very large adoption of mobile phones, also not located by area but only by country). 2017-03-02 11:22 GMT+01:00 srivas sinnathurai : > I think there is a telephone area code, throughout the world. > > > On 01 March 2017 at 21:37 Richard Wordingham com> wrote: > > > On Wed, 1 Mar 2017 12:56:23 -0800 > Jean Aurambault wrote: > > > I'm wondering if there is any standard that defines a universal city > > id (similar to country codes). > > ISO 3166-2 defines codes for some cities, but its uneven. However, > what's a city? Does Constantinople exist? > > Richard. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sisrivas at blueyonder.co.uk Thu Mar 2 09:19:41 2017 From: sisrivas at blueyonder.co.uk (srivas sinnathurai) Date: Thu, 2 Mar 2017 15:19:41 +0000 (GMT) Subject: Translations of city names In-Reply-To: References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> Message-ID: <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net> Skype for Business,and others cover (free global phone!!) for accounts based on area codes. Microsoft might have a list of this apparently adheres to a global standard. Yes, there is single nationwide plans also available, as addition to area plans. Sinnathurai > On 02 March 2017 at 11:20 Philippe Verdy wrote: > > Wrong, many countries have largely relaxed their phone number plans by > using a single nation wide plan and allowed portability of numbers. Area codes > are no longer needed (single call rate nation wide, the rate only depends on > operators; and ranges of numbers are allocated also nationwide for value added > services; long distance calls are things of the past since the very large > adoption of mobile phones, also not located by area but only by country). > > 2017-03-02 11:22 GMT+01:00 srivas sinnathurai mailto:sisrivas at blueyonder.co.uk >: > > > > > > I think there is a telephone area code, throughout the world. > > > > > > > > > > > > On 01 March 2017 at 21:37 Richard Wordingham > > > > > > wrote: > > > > > > > > > On Wed, 1 Mar 2017 12:56:23 -0800 > > > Jean Aurambault > > mailto:jean.aurambault at gmail.com > wrote: > > > > > > > I'm wondering if there is any standard that defines a > > > > universal city > > > > id (similar to country codes). > > > > > > ISO 3166-2 defines codes for some cities, but its uneven. > > > However, > > > what's a city? Does Constantinople exist? > > > > > > Richard. > > > > > > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Thu Mar 2 09:20:18 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 2 Mar 2017 16:20:18 +0100 Subject: Northern Khmer on iPhone In-Reply-To: <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com> References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com> Message-ID: On Thu, Mar 2, 2017 at 10:06 AM, Norbert Lindenberg < unicode at lindenbergsoftware.com> wrote: > http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html ?Thanks for writing that, Norbert. Sounds a tad painful.? Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From tom at bluesky.org Thu Mar 2 10:01:00 2017 From: tom at bluesky.org (Tom Gewecke) Date: Thu, 2 Mar 2017 09:01:00 -0700 Subject: Northern Khmer on iPhone In-Reply-To: References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com> Message-ID: <8F75DFA8-D34D-4D05-92B3-1C40AB0CB175@bluesky.org> > On Mar 2, 2017, at 8:20 AM, Mark Davis ?? wrote: > > > On Thu, Mar 2, 2017 at 10:06 AM, Norbert Lindenberg > wrote: > http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html > ?Thanks for writing that, Norbert. Sounds a tad painful.? From the standpoint of the ordinary user, adding fonts to iOS is pretty simple, as since iOS 7 there are apps that let you do it for anything you can download or get via email. Of course that is no guarantee that a particular font will work perfectly, and there's also no way to get a downloaded font to substitute for the iOS default font in the many apps where the user is not given any way to choose fonts. -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Thu Mar 2 10:22:57 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Thu, 2 Mar 2017 17:22:57 +0100 Subject: Translations of city names In-Reply-To: References: Message-ID: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> It looks like the community having the expertise to answer such question is the GIS (Geography Information Systems) community, more than the Unicode community. Have you tried asking a question on http://gis.stackexchange.com/ ? Fr?d?ric Le 01/03/2017 ? 21:56, Jean Aurambault a ?crit : > Hi, > > I'm looking for (lightweight) libraries to translate city names, > potentially country as well (but I know that's available in CLDR/ICU > in some ways). Ideally it wouldn't need a database but rely on static > assets. > > I'm wondering if there is any standard that defines a universal city > id (similar to country codes). > > Wikipedia has lots of information on exonyms in different languages. > > I also found thing like http://www.geonames.org/ that seems to have a > complete dataset with translations in many language but the relevant > data would need to be extracted. > > Right now we use a old version of Maxmind library to get geolocated > data that has no translation. Looks like new version have some > translation but not enough language supported > > Any recommandation? > > Best, > Jean > > > From mheijdra at princeton.edu Thu Mar 2 10:29:17 2017 From: mheijdra at princeton.edu (Martin Heijdra) Date: Thu, 2 Mar 2017 16:29:17 +0000 Subject: Translations of city names In-Reply-To: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> Message-ID: <0001012FBBD4FE40857959B0B65DE95B821103B5@CSGMBX212W.pu.win.princeton.edu> Libraries in the US are required to follow the BGN: https://geonames.usgs.gov/. Martin Heijdra -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Fr?d?ric Grosshans Sent: Thursday, March 02, 2017 11:23 AM To: unicode at unicode.org Subject: Re: Translations of city names It looks like the community having the expertise to answer such question is the GIS (Geography Information Systems) community, more than the Unicode community. Have you tried asking a question on http://gis.stackexchange.com/ ? Fr?d?ric Le 01/03/2017 ? 21:56, Jean Aurambault a ?crit : > Hi, > > I'm looking for (lightweight) libraries to translate city names, > potentially country as well (but I know that's available in CLDR/ICU > in some ways). Ideally it wouldn't need a database but rely on static > assets. > > I'm wondering if there is any standard that defines a universal city > id (similar to country codes). > > Wikipedia has lots of information on exonyms in different languages. > > I also found thing like http://www.geonames.org/ that seems to have a > complete dataset with translations in many language but the relevant > data would need to be extracted. > > Right now we use a old version of Maxmind library to get geolocated > data that has no translation. Looks like new version have some > translation but not enough language supported > > Any recommandation? > > Best, > Jean > > > From kenwhistler at att.net Thu Mar 2 11:47:22 2017 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 2 Mar 2017 09:47:22 -0800 Subject: Translations of city names In-Reply-To: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> Message-ID: The UN Group of Experts on Geographical Names (UNGEGN) is also relevant: https://unstats.un.org/unsd/geoinfo/ungegn/default.html They keep up a list of searchable geographical names databases in a wide variety of languages: https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html --Ken On 3/2/2017 8:22 AM, Fr?d?ric Grosshans wrote: > It looks like the community having the expertise to answer such > question is the GIS (Geography Information Systems) community, more > than the Unicode community. Have you tried asking a question on > http://gis.stackexchange.com/ ? From doug at ewellic.org Thu Mar 2 13:31:45 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 02 Mar 2017 12:31:45 -0700 Subject: Translations of city names Message-ID: <20170302123145.665a7a7059d7ee80bb4d670165c8327d.7f4470c2bc.wbe@email03.godaddy.com> Some clarifications... ISO 3166-2 defines code elements for (normally) first-level country subdivisions (states, provinces, regions, districts, etc.), but these almost never correlate in general to cities. In some countries, the name of a subdivision may be the same as that of its capital or another city, but that leaves out all the other cities within that subdivision, and in any case this convention very seldom applies to Northern America. Telephone area codes are not relevant in this regard, because they also may not correlate to cities per se, so again the desired granularity is not available. Area codes in Northern America may apply to an entire state or province, hundreds of thousands of square kilometers in size. (Number portability and calling plans are even less relevant to this.) In addition to the other standards given, there is UN/LOCODE [1], which provides code elements for "trade and transport locations," which may or may not correlate to "cities" depending on your needs. [1] http://www.unece.org/cefact/locode/welcome.html -- Doug Ewell | Thornton, CO, US | ewellic.org From jr at qsm.co.il Thu Mar 2 14:05:59 2017 From: jr at qsm.co.il (Jonathan Rosenne) Date: Thu, 2 Mar 2017 20:05:59 +0000 Subject: Translations of city names In-Reply-To: References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> Message-ID: FWIW, I looked up Copenhagen in the Danish list and received "S?gning gav ikke noget resultat" which means literally "Search produced no result" (I do know that in Danish it is K?benhavn). Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Thursday, March 02, 2017 7:47 PM To: Fr?d?ric Grosshans Cc: unicode at unicode.org Subject: Re: Translations of city names The UN Group of Experts on Geographical Names (UNGEGN) is also relevant: https://unstats.un.org/unsd/geoinfo/ungegn/default.html They keep up a list of searchable geographical names databases in a wide variety of languages: https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html --Ken On 3/2/2017 8:22 AM, Fr?d?ric Grosshans wrote: > It looks like the community having the expertise to answer such > question is the GIS (Geography Information Systems) community, more > than the Unicode community. Have you tried asking a question on > http://gis.stackexchange.com/ ? From jr at qsm.co.il Thu Mar 2 14:12:21 2017 From: jr at qsm.co.il (Jonathan Rosenne) Date: Thu, 2 Mar 2017 20:12:21 +0000 Subject: Translations of city names References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com> Message-ID: P.S. The US database does a good job on Copenhagen. Best Regards, Jonathan Rosenne -----Original Message----- From: Jonathan Rosenne Sent: Thursday, March 02, 2017 10:06 PM To: 'Ken Whistler'; Fr?d?ric Grosshans Cc: unicode at unicode.org; 'navneforskning at hum.ku.dk' Subject: RE: Translations of city names FWIW, I looked up Copenhagen in the Danish list and received "S?gning gav ikke noget resultat" which means literally "Search produced no result" (I do know that in Danish it is K?benhavn). Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Thursday, March 02, 2017 7:47 PM To: Fr?d?ric Grosshans Cc: unicode at unicode.org Subject: Re: Translations of city names The UN Group of Experts on Geographical Names (UNGEGN) is also relevant: https://unstats.un.org/unsd/geoinfo/ungegn/default.html They keep up a list of searchable geographical names databases in a wide variety of languages: https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html --Ken On 3/2/2017 8:22 AM, Fr?d?ric Grosshans wrote: > It looks like the community having the expertise to answer such > question is the GIS (Geography Information Systems) community, more > than the Unicode community. Have you tried asking a question on > http://gis.stackexchange.com/ ? From verdy_p at wanadoo.fr Fri Mar 3 07:01:10 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 3 Mar 2017 14:01:10 +0100 Subject: Translations of city names In-Reply-To: <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net> References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net> Message-ID: At least in the European Union, portability of numbers is open to every customers. And almost everywhere local call rates are disappearing for all operators, going to a situation with a single national rate. What replaces the local call rates is different rates depending on source and target operators or the kind of service (fixed line or mobile) rather than the actual location of callers and callees. Garanti sans virus. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> 2017-03-02 16:19 GMT+01:00 srivas sinnathurai : > Skype for Business,and others cover (free global phone!!) for accounts > based on area codes. > > Microsoft might have a list of this apparently adheres to a global > standard. > > > Yes, there is single nationwide plans also available, as addition to area > plans. > > > Sinnathurai > > > > On 02 March 2017 at 11:20 Philippe Verdy wrote: > > Wrong, many countries have largely relaxed their phone number plans by > using a single nation wide plan and allowed portability of numbers. Area > codes are no longer needed (single call rate nation wide, the rate only > depends on operators; and ranges of numbers are allocated also nationwide > for value added services; long distance calls are things of the past since > the very large adoption of mobile phones, also not located by area but only > by country). > > 2017-03-02 11:22 GMT+01:00 srivas sinnathurai : > > I think there is a telephone area code, throughout the world. > > > On 01 March 2017 at 21:37 Richard Wordingham < > richard.wordingham at ntlworld.com> wrote: > > > On Wed, 1 Mar 2017 12:56:23 -0800 > Jean Aurambault wrote: > > > I'm wondering if there is any standard that defines a universal city > > id (similar to country codes). > > ISO 3166-2 defines codes for some cities, but its uneven. However, > what's a city? Does Constantinople exist? > > Richard. > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon at simon-cozens.org Mon Mar 6 16:48:45 2017 From: simon at simon-cozens.org (Simon Cozens) Date: Tue, 7 Mar 2017 09:48:45 +1100 Subject: Stokoe Notation (sign language) Message-ID: Hello, A few years back, there was a set of questions to the UTC (L2/12-133) asking for direction on encoding Stokoe notation. Did these ever get an answer, and is there anything currently happening with Stokoe encoding? Simon From verdy_p at wanadoo.fr Mon Mar 6 19:59:28 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 7 Mar 2017 02:59:28 +0100 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: And probably the same question could be asked again for the few other sign languages notations (at least those listed in Wikipedia), but I wonder if some of them may just be variants/simplifications of SingWriting, but more usable in handwritten text, or not needing complax layouts for precise reproduction of gesture (in a way similar to alphabets for spoken languages that simplify a lot the actual phonetic representation, or even the phonemic one). It seems that those simplified alphabet-like notations are much easier to encode, than the long waited complex SignWriting notation. In addition they could already use existing font technics without complex development (and already some of them already have working fonts, usable on vaerious systems, so they should already become interoperable). 2017-03-06 23:48 GMT+01:00 Simon Cozens : > Hello, > A few years back, there was a set of questions to the UTC > (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > > Simon > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Mon Mar 6 22:15:42 2017 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 7 Mar 2017 12:15:42 +0800 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: According to Wikipedia, that's what exactly the Stokoe notation is. quoted below: The Stokoe notation is mostly restricted to linguists and academics. The notation is arranged linearly on the page and can be written with a typewriter that has the proper font installed. Unlike SignWriting or the Hamburg Notation System , it is based on the Latin alphabet and is phonemic , being restricted to the symbols needed to meet the requirements of ASL (or extended to BSL, etc.) rather than accommodating all possible signs. For example, there is a single symbol for circling movement, regardless of whether the plane of the movement is horizontal or vertical. *Writing direction* Stokoe notation is written horizontally left to right like the Latin alphabet (plus limited vertical stacking of movement symbols, and some diacritical marks written above or below other symbols). This contrasts with SignWriting , which is written vertically from top to bottom (plus partially free two-dimensional placement of components within the writing of a single sign). 2017?3?7? 10:05 ? "Philippe Verdy" ??? > And probably the same question could be asked again for the few other sign > languages notations (at least those listed in Wikipedia), but I wonder if > some of them may just be variants/simplifications of SingWriting, but more > usable in handwritten text, or not needing complax layouts for precise > reproduction of gesture (in a way similar to alphabets for spoken languages > that simplify a lot the actual phonetic representation, or even the > phonemic one). > > It seems that those simplified alphabet-like notations are much easier to > encode, than the long waited complex SignWriting notation. In addition they > could already use existing font technics without complex development (and > already some of them already have working fonts, usable on vaerious > systems, so they should already become interoperable). > > > 2017-03-06 23:48 GMT+01:00 Simon Cozens : > >> Hello, >> A few years back, there was a set of questions to the UTC >> (L2/12-133) >> asking for direction on encoding Stokoe notation. Did these ever get an >> answer, and is there anything currently happening with Stokoe encoding? >> >> Simon >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Tue Mar 7 11:04:31 2017 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 7 Mar 2017 09:04:31 -0800 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. --Ken From lorna_evans at sil.org Tue Mar 7 11:46:41 2017 From: lorna_evans at sil.org (Lorna Evans) Date: Tue, 7 Mar 2017 11:46:41 -0600 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: Hi Simon, I did a lot of research on Stokoe Notation between 2010-2012. It has primarily been used in dictionaries. When I presented that document (L2/12-133) to UTC, these are the summary notes I took: We only had about 10-15 minutes to discuss Stokoe Notation in the main UTC but they decided to have an ad-hoc meeting at lunch so we had about 40 minutes over lunch on Stokoe. There was no support for encoding it as a script. They feel it "should not be encoded as a script any more than math or music is a script. " It "doesn't make sense to do it as plain text...would be a serious mistake". So, they want it to use all the existing Latin characters and symbols in the standard and just encode new characters as symbols and then use a higher level protocol for the shaping. All the fancy layout can be expressed in MathML. I think I still disagree (I think it should be encoded as a writing system), but Stokoe isn't high on my list of priorities and I haven't had a chance to do further research. Major dictionaries that have been produced using Stokoe Notation are for British Sign Language (BSL), American Sign Language (ASL), Hong Kong Sign Language (HKSL), Signed Swedish (SS), Italian Sign Language (LIS) and Czech Sign Language. I also reviewed books or documents discussing Dutch Sign Language (DSE) and Australian Aboriginal (ASL). Each and every one of these had differing levels of rendering requirements (from very minor all the way to the need for control codes for positioning) because they had "enhanced" the original ASL Stokoe Notation. I feel it's complex enough that it should definitely have further research done on it. I just don't have the time to put on it at this point. (The L2/12-133 document didn't include a review of Czech Sign Language. When I did get a copy of that dictionary I discovered even more enhancements. I never documented those.) Lorna -------- Original Message -------- Subject: Stokoe Notation (sign language) From: Simon Cozens To: unicode Unicode Discussion CC: lorna_evans at sil.org Date: 3/6/2017 4:48 PM > Hello, > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > > Simon From jean.aurambault at gmail.com Tue Mar 7 20:40:13 2017 From: jean.aurambault at gmail.com (Jean Aurambault) Date: Tue, 7 Mar 2017 18:40:13 -0800 Subject: Translations of city names In-Reply-To: References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net> Message-ID: thank you all for your input! Jean On Fri, Mar 3, 2017 at 5:01 AM, Philippe Verdy wrote: > At least in the European Union, portability of numbers is open to every > customers. And almost everywhere local call rates are disappearing for all > operators, going to a situation with a single national rate. > What replaces the local call rates is different rates depending on source > and target operators or the kind of service (fixed line or mobile) rather > than the actual location of callers and callees. > > > Garanti > sans virus. www.avast.com > > <#m_1585433171908523202_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> > > 2017-03-02 16:19 GMT+01:00 srivas sinnathurai : > >> Skype for Business,and others cover (free global phone!!) for accounts >> based on area codes. >> >> Microsoft might have a list of this apparently adheres to a global >> standard. >> >> >> Yes, there is single nationwide plans also available, as addition to area >> plans. >> >> >> Sinnathurai >> >> >> >> On 02 March 2017 at 11:20 Philippe Verdy wrote: >> >> Wrong, many countries have largely relaxed their phone number plans by >> using a single nation wide plan and allowed portability of numbers. Area >> codes are no longer needed (single call rate nation wide, the rate only >> depends on operators; and ranges of numbers are allocated also nationwide >> for value added services; long distance calls are things of the past since >> the very large adoption of mobile phones, also not located by area but only >> by country). >> >> 2017-03-02 11:22 GMT+01:00 srivas sinnathurai > >: >> >> I think there is a telephone area code, throughout the world. >> >> >> On 01 March 2017 at 21:37 Richard Wordingham < >> richard.wordingham at ntlworld.com> wrote: >> >> >> On Wed, 1 Mar 2017 12:56:23 -0800 >> Jean Aurambault wrote: >> >> > I'm wondering if there is any standard that defines a universal city >> > id (similar to country codes). >> >> ISO 3166-2 defines codes for some cities, but its uneven. However, >> what's a city? Does Constantinople exist? >> >> Richard. >> >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Mar 8 09:45:05 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 8 Mar 2017 15:45:05 +0000 (GMT) Subject: Stokoe Notation (sign language) In-Reply-To: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> References: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> Message-ID: <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> Ken Whistler asked: > And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Well, I am not quite congruently in that category, but not far off, so I will answer the question anyway. > Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? Yes, I would. It seems a very worthwhile project. I am not a linguist, though I am interested linguistics. I have very little knowledge of sign language. I do not remember knowing of Stokoe Notation before reading this thread. What interests me about this project and where I feel that I could make a contribution to a group effort is that Ken included the following. > .... figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Now that interests me and is the sort of problem that I enjoy trying to solve. Some time ago there was discussion of encoding Ancient Egyptian and I devised an idea for solving the advanced issues of that encoding. At first glance, the encoding of Stokoe Notation seems to have some similarities to what is needed regarding the encoding of the advanced glyph layout of Ancient Egyptian. I published my ideas, in fact including them as a chapter in my novel. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_009.pdf I used the technique of including the idea in a chapter of the novel as it allows a dialogue of discussion about the ideas. The document has been deposited at the British Library. Today Unicode has tag sequences available as a technique and it might be that by using the ideas in Chapter 9 of my novel, in particular of having a Glyph as a type in the object code of a virtual computer so that glyphs could be scaled, moved and added together, that the implementation would be fairly straightforward by using short pieces of software each expressed as a tag sequence to produce a result. Thus implementing the spatial layout of the system by software in a virtual computer rather than by a sort of hardwired encoding. Ken also wrote: > Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. Well maybe the implementation of that complexity might make a good student project or a good student group project somewhere. I opine that progress is important. William Overington Wednesday 8 March 2017 ----Original message---- >From : kenwhistler at att.net Date : 07/03/2017 - 17:04 (GMTST) To : simon at simon-cozens.org Cc : unicode at unicode.org Subject : Re: Stokoe Notation (sign language) On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. --Ken From petercon at microsoft.com Thu Mar 9 10:49:36 2017 From: petercon at microsoft.com (Peter Constable) Date: Thu, 9 Mar 2017 16:49:36 +0000 Subject: Stokoe Notation (sign language) In-Reply-To: <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> References: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> Message-ID: I opine that opining an opinion is periphrastic, circumlocutious, consumptive, wasteful spending of one's own and others' resources. Just say it. "Progress is important." Thank you for that most insightful of generalizations. /S Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington Sent: Wednesday, March 8, 2017 7:45 AM To: c933103 at gmail.com; kenwhistler at att.net; verdy_p at wanadoo.fr; simon at simon-cozens.org; lorna_evans at sil.org; unicode at unicode.org Subject: Re: Stokoe Notation (sign language) Ken Whistler asked: > And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Well, I am not quite congruently in that category, but not far off, so I will answer the question anyway. > Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? Yes, I would. It seems a very worthwhile project. I am not a linguist, though I am interested linguistics. I have very little knowledge of sign language. I do not remember knowing of Stokoe Notation before reading this thread. What interests me about this project and where I feel that I could make a contribution to a group effort is that Ken included the following. > .... figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Now that interests me and is the sort of problem that I enjoy trying to solve. Some time ago there was discussion of encoding Ancient Egyptian and I devised an idea for solving the advanced issues of that encoding. At first glance, the encoding of Stokoe Notation seems to have some similarities to what is needed regarding the encoding of the advanced glyph layout of Ancient Egyptian. I published my ideas, in fact including them as a chapter in my novel. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_009.pdf I used the technique of including the idea in a chapter of the novel as it allows a dialogue of discussion about the ideas. The document has been deposited at the British Library. Today Unicode has tag sequences available as a technique and it might be that by using the ideas in Chapter 9 of my novel, in particular of having a Glyph as a type in the object code of a virtual computer so that glyphs could be scaled, moved and added together, that the implementation would be fairly straightforward by using short pieces of software each expressed as a tag sequence to produce a result. Thus implementing the spatial layout of the system by software in a virtual computer rather than by a sort of hardwired encoding. Ken also wrote: > Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. Well maybe the implementation of that complexity might make a good student project or a good student group project somewhere. I opine that progress is important. William Overington Wednesday 8 March 2017 ----Original message---- >From : kenwhistler at att.net Date : 07/03/2017 - 17:04 (GMTST) To : simon at simon-cozens.org Cc : unicode at unicode.org Subject : Re: Stokoe Notation (sign language) On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get > an answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. --Ken From petercon at microsoft.com Thu Mar 9 10:56:42 2017 From: petercon at microsoft.com (Peter Constable) Date: Thu, 9 Mar 2017 16:56:42 +0000 Subject: Northern Khmer on iPhone In-Reply-To: <20170228073710.75af64d4@JRWUBU2> References: <20170228073710.75af64d4@JRWUBU2> Message-ID: Too bad more people didn't use Windows Phones, as your word displays as expected on mine. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Monday, February 27, 2017 11:37 PM To: unicode at unicode.org Subject: Northern Khmer on iPhone Does iPhone support the use of Northern Khmer in Thai script? I would count an interface in Thai as support. The reason I ask is that I tried entering the word ??? 'he' and got a dotted circle. I also got a dotted circle for the alternative spelling . This might be an application issue. The application I was using was Line. Richard. From petercon at microsoft.com Fri Mar 10 11:00:55 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 10 Mar 2017 17:00:55 +0000 Subject: "A Programmer's Introduction to Unicode" Message-ID: FYI: http://reedbeta.com/blog/programmers-intro-to-unicode/ The visuals may be the most interesting part. E.g., in the usage heat map, Arabic Presentation Forms-B lights up much more than I would have expected - as much as a lot of emoji. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Fri Mar 10 11:53:27 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Fri, 10 Mar 2017 19:53:27 +0200 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: Message-ID: <20170310175234.GA8291@macbook> On Fri, Mar 10, 2017 at 05:00:55PM +0000, Peter Constable wrote: > FYI: > > http://reedbeta.com/blog/programmers-intro-to-unicode/ > > The visuals may be the most interesting part. E.g., in the usage heat > map, Arabic Presentation Forms-B lights up much more than I would have > expected I often see U+FEFB and other lam-alef ligatures used on social media (I easily spot it because my default font does not have them so they end up using fallback font). My guess is that might be because some keyboard layouts (Xorg, Android?) use them for the lam-alef keys on the keyboard (I?m guilty of doing this for Xorg keyboard layout because it didn?t handle more than one character per key, this was then decomposed back inside XIM input method, but many people don?t use XIM and the decomposition does not happen, it was messy overall). Regards, Khaled From manish at mozilla.com Fri Mar 10 12:55:44 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Fri, 10 Mar 2017 10:55:44 -0800 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: Message-ID: I recently wrote http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ , which sort of addresses the whole hangup programmers have with treating code points as "characters". I also wrote http://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/ that provides a useful list of scripts to check against when figuring out if your design makes sense uniformly across scripts. There's also https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ -Manish On Fri, Mar 10, 2017 at 9:00 AM, Peter Constable wrote: > FYI: > > > > http://reedbeta.com/blog/programmers-intro-to-unicode/ > > > > The visuals may be the most interesting part. E.g., in the usage heat map, > Arabic Presentation Forms-B lights up much more than I would have expected ? > as much as a lot of emoji. > > > > > > > > Peter From jsbien at mimuw.edu.pl Sun Mar 12 00:04:56 2017 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Sun, 12 Mar 2017 07:04:56 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: (Manish Goregaokar's message of "Fri, 10 Mar 2017 10:55:44 -0800") References: Message-ID: <864lyzhvxz.fsf@mimuw.edu.pl> On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes: > I recently wrote > http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ > , which sort of addresses the whole hangup programmers have with > treating code points as "characters". [...] This is just another confirmation that the present Unicode terminology is confusing. Let me remind below a fragment of an old thread about "textels". Best regards Janusz On Thu, Sep 15 2016 at 21:12 CEST, jsbien at mimuw.edu.pl writes: > On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes: > > [...] > >> In the new Swift programming language, which is white-hot in the Apple >> community, Apple is moving toward a model of a transparent, generic >> Unicode that can be ?viewed? as UTF-8, UTF-16, or UTF-32 if necessary, >> but in which a ?character? contains however many code points it needs >> (?e? with a stacked macron, acute accent, and dieresis is >> algorithmically one ?character? in Swift). Moreover, >> e-with-an-acute-accent and e followed by a combining acute accent, for >> example, compare as equal. At present, the underlying code is still >> UTF-16LE. > > For several years I use the name "textel" (text element, in Polish > "tekstel") for such objects. I do it mostly orally in my presentations > for my students, but I used it also in writing e.g. in > http://bc.klf.uw.edu.pl/118/, unfortunately without a proper > definition. A rudymentary definition was provided for me only in my > recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply > (on p. 69) "an elementary text element independently of its Unicode > representation" (meaning in particular composed vs precomposed). I still > hope to formulate sooner or later a more satisfactory definition :-) > > I think Swift confirms that such a notion is really needed. > > Best regards > > Janusz On Wed, Sep 21 2016 at 6:44 CEST, jsbien at mimuw.edu.pl writes: > On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes: >> Janusz Bie? wrote: >> >>> For me it means that Swift's characters are equivalence classes of the >>> set of extended grapheme clusters by canonical equivalence relation. >> >> I still hope we can come to some conclusion on the correct Unicode name >> for this concept. I don't think non-Unicode interpretations of terms >> like "grapheme" are grounds for throwing out "grapheme cluster," > > I agree. > >> but I can see that the equivalence class itself is lacking a name. > > I'glad. > >> >> Note that the Swift definition doesn't say that <00E9> and <0065 0301> >> are identical entities, only that the language compares them as equal. > > I'm fully aware of this. > > Best regards > > Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From manish at mozilla.com Sun Mar 12 13:43:22 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Sun, 12 Mar 2017 11:43:22 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <864lyzhvxz.fsf@mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> Message-ID: > This is just another confirmation that the present Unicode terminology is confusing. I find this to be a symptom of our pedagogy around "characters" in programming; most folks get taught that characters are bytes are code points, especially because many languages try to make this the case. The name "grapheme cluster" could be improved upon, but it's not the primary source of this confusion. -Manish On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bie? wrote: > On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes: >> I recently wrote >> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ >> , which sort of addresses the whole hangup programmers have with >> treating code points as "characters". > > [...] > > This is just another confirmation that the present Unicode terminology > is confusing. Let me remind below a fragment of an old thread about > "textels". > > Best regards > > Janusz > > > On Thu, Sep 15 2016 at 21:12 CEST, jsbien at mimuw.edu.pl writes: >> On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes: >> >> [...] >> >>> In the new Swift programming language, which is white-hot in the Apple >>> community, Apple is moving toward a model of a transparent, generic >>> Unicode that can be ?viewed? as UTF-8, UTF-16, or UTF-32 if necessary, >>> but in which a ?character? contains however many code points it needs >>> (?e? with a stacked macron, acute accent, and dieresis is >>> algorithmically one ?character? in Swift). Moreover, >>> e-with-an-acute-accent and e followed by a combining acute accent, for >>> example, compare as equal. At present, the underlying code is still >>> UTF-16LE. >> >> For several years I use the name "textel" (text element, in Polish >> "tekstel") for such objects. I do it mostly orally in my presentations >> for my students, but I used it also in writing e.g. in >> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper >> definition. A rudymentary definition was provided for me only in my >> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply >> (on p. 69) "an elementary text element independently of its Unicode >> representation" (meaning in particular composed vs precomposed). I still >> hope to formulate sooner or later a more satisfactory definition :-) >> >> I think Swift confirms that such a notion is really needed. >> >> Best regards >> >> Janusz > > On Wed, Sep 21 2016 at 6:44 CEST, jsbien at mimuw.edu.pl writes: >> On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes: >>> Janusz Bie? wrote: >>> >>>> For me it means that Swift's characters are equivalence classes of the >>>> set of extended grapheme clusters by canonical equivalence relation. >>> >>> I still hope we can come to some conclusion on the correct Unicode name >>> for this concept. I don't think non-Unicode interpretations of terms >>> like "grapheme" are grounds for throwing out "grapheme cluster," >> >> I agree. >> >>> but I can see that the equivalence class itself is lacking a name. >> >> I'glad. >> >>> >>> Note that the Swift definition doesn't say that <00E9> and <0065 0301> >>> are identical entities, only that the language compares them as equal. >> >> I'm fully aware of this. >> >> Best regards >> >> Janusz > > > -- > , > Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) > Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > From jsbien at mimuw.edu.pl Sun Mar 12 14:02:28 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sun, 12 Mar 2017 20:02:28 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> Message-ID: <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> Quote/Cytat - Manish Goregaokar (Sun 12 Mar 2017 07:43:22 PM CET): >> This is just another confirmation that the present Unicode terminology > is confusing. > > I find this to be a symptom of our pedagogy around "characters" in > programming; most folks get taught that characters are bytes are code > points, especially because many languages try to make this the case. > The name "grapheme cluster" could be improved upon, but it's not the > primary source of this confusion. I agree that it's not the primary source. However the pedagogy depends on the terminology used. If the basic notion has to be referred in a cumbersome way as "extended grapheme cluster" then it is easier to talk about "Unicode characters" despite the fact that they have a rather loose relation to real-life/user-perceived characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From richard.wordingham at ntlworld.com Sun Mar 12 15:10:22 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 12 Mar 2017 20:10:22 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> Message-ID: <20170312201022.7ec8d858@JRWUBU2> On Sun, 12 Mar 2017 20:02:28 +0100 "Janusz S. Bien" wrote: > If the basic notion has to be referred in a cumbersome way as > "extended grapheme cluster" then it is easier to talk about "Unicode > characters" despite the fact that they have a rather loose relation > to real-life/user-perceived characters. The notion that extended grapheme clusters corresponds to user-perceived characters is also rather dodgy. Whereas it may work for French, it is getting very dubious by the time one adds Hebrew cantillation marks or Vedic accentuation. The Thais revolted when their preposed vowels were joined with the following consonant in the same extended grapheme cluster, and Unicode had to revoke that union. Richard. From jsbien at mimuw.edu.pl Mon Mar 13 05:31:28 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 11:31:28 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170312201022.7ec8d858@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> Message-ID: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Quote/Cytat - Richard Wordingham (Sun 12 Mar 2017 09:10:22 PM CET): > On Sun, 12 Mar 2017 20:02:28 +0100 > "Janusz S. Bien" wrote: > >> If the basic notion has to be referred in a cumbersome way as >> "extended grapheme cluster" then it is easier to talk about "Unicode >> characters" despite the fact that they have a rather loose relation >> to real-life/user-perceived characters. > > The notion that extended grapheme clusters corresponds to > user-perceived characters is also rather dodgy. The idea is not mine, but it appears from time to time on the list in a more or less explicit way. > Whereas it may work > for French, it is getting very dubious by the time one adds Hebrew > cantillation marks or Vedic accentuation. The Thais revolted when > their preposed vowels were joined with the following consonant in the > same extended grapheme cluster, and Unicode had to revoke that union. Just yet another reason for introducing the notion of textel? Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From jsbien at mimuw.edu.pl Mon Mar 13 06:35:01 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 12:35:01 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> Message-ID: <20170313123501.16162ll9e9d5bqhx@mail.mimuw.edu.pl> Quote/Cytat - William_J_G Overington (Mon 13 Mar 2017 12:24:13 PM CET): > Prof. Janusz S. Bie? wrote: > >> Just yet another reason for introducing the notion of textel? > > I opine that it would be a good idea to introduce several new words, > of which textel would be one, with each such new word having a > precisely-defined meaning so that in precise discussions of > programming techniques people could discuss the situation without > needing to use any of the words character, code point, grapheme > cluster. > > How many such new words would be needed? In my paper (in Polish) http://bc.klf.uw.edu.pl/480/ I propose also the term "texton" meaning a code point from a specific subset, not yet fully defined, but including at least the components of composite characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From wjgo_10009 at btinternet.com Mon Mar 13 06:24:13 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 13 Mar 2017 11:24:13 +0000 (GMT) Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Message-ID: <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> Prof. Janusz S. Bie? wrote: > Just yet another reason for introducing the notion of textel? I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster. How many such new words would be needed? I remember how in electronics the introduction of the term Hertz to be used instead of cycles per second helped discussions. After the introduction of the term Hertz it became easy to refer to twenty cycles of a fifty Hertz signal without confusion over one's meaning. So introducing several new precisely-defined words now could help lots of discussions in the future. Perhaps, apart from textel, the definitions could be produced first and then people can decide, for each such definition, which new word would be a good word to have that definition. The recent introduction into Unicode of ZWJ sequences for some emoji and the introduction into Unicode of tag sequences applied to a base character does could mean that the introducing of such new words becomes of increasing importance due to the programming implications of those recently introduced techniques. William Overington Monday 13 March 2017 From asmusf at ix.netcom.com Mon Mar 13 12:00:08 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 13 Mar 2017 10:00:08 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Message-ID: <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Mar 13 12:15:31 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 18:15:31 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> Message-ID: <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 06:00:08 PM CET): [...] This (or similar) scenarios indicate the impossibility to come to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel". I agree that it is impossible to come to a single, universal definition of text elements, but it seems possible to reach a consensus on a kind of the least common denominator of them and call it "textel" or something else. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From d3ck0r at gmail.com Mon Mar 13 12:55:18 2017 From: d3ck0r at gmail.com (J Decker) Date: Mon, 13 Mar 2017 10:55:18 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. https://blog.golang.org/strings Doesn't solve the problem for composited codepoints though... texel looks to be defined as a graphic element already. TEXture ELement. On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bien wrote: > Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 > 06:00:08 PM CET): > > [...] > > This (or similar) scenarios indicate the impossibility to come to a > single, universal definition of a "textel" -- the main reason why this > term is of lower utility than "pixel". > > I agree that it is impossible to come to a single, universal definition > of text elements, but it seems possible to reach a consensus on a kind of > the least common denominator of them and call it "textel" or something else. > > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~ > jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Mar 13 13:02:39 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 19:02:39 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: <20170313190239.12215ot2zpq9i2m7@mail.mimuw.edu.pl> Quote/Cytat - J Decker (Mon 13 Mar 2017 06:55:18 PM CET): > texel looks to be defined as a graphic element already. TEXture ELement. I'm aware of it, but homonymy/polysemy is something we have to live with. I think there is no risk of confusing texture elements with text elements, despite the fact that 'texture' and 'text' have similar origin. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From alastair at alastairs-place.net Mon Mar 13 14:18:00 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 13 Mar 2017 19:18:00 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> On 13 Mar 2017, at 17:55, J Decker wrote: > > I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. > https://blog.golang.org/strings IMO, returning code points by index is a mistake. It over-emphasises the importance of the code point, which helps to continue the notion in some developers? minds that code points are somehow ?characters?. It also leads to people unnecessarily using UCS-4 as an internal representation, which seems to have very few advantages in practice over UTF-16. > Doesn't solve the problem for composited codepoints though... > > texel looks to be defined as a graphic element already. TEXture ELement. Yes, but I thought the proposal was ?textel?, with the extra ?t?. Re-using ?texel? would be quite inappropriate; there are certainly people who work on rendering software who would strongly object to that, for very good reasons. I would caution, however, that there?s already a lot of terminology associated with Unicode, perhaps for understandable reasons, but if the word ?textel? is going to have a definition that differs from (say) an extended grapheme cluster, I think a great deal of consideration should be given to what exactly that definition should be. We already have ?characters?, code units, code points, combining sequences, graphemes, grapheme clusters, extended grapheme clusters and probably other things I?ve missed off that list. Merely adding yet another bit of terminology isn?t going to fix the problem of developers misunderstanding or simply not being aware of the correct terminology or of some aspect of Unicode?s behaviour. Kind regards, Alastair. -- http://alastairs-place.net From khaledhosny at eglug.org Mon Mar 13 16:10:11 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Mon, 13 Mar 2017 23:10:11 +0200 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> Message-ID: <20170313211011.GE1429@macbook> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: > On 13 Mar 2017, at 17:55, J Decker wrote: > > > > I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. > > https://blog.golang.org/strings > > IMO, returning code points by index is a mistake. It over-emphasises > the importance of the code point, which helps to continue the notion > in some developers? minds that code points are somehow ?characters?. > It also leads to people unnecessarily using UCS-4 as an internal > representation, which seems to have very few advantages in practice > over UTF-16. But there are many text operations that require access to Unicode code points. Take for example text layout, as mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic. Regards, Khaled From richard.wordingham at ntlworld.com Mon Mar 13 16:47:04 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 13 Mar 2017 21:47:04 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313211011.GE1429@macbook> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> Message-ID: <20170313214704.55372dfb@JRWUBU2> On Mon, 13 Mar 2017 23:10:11 +0200 Khaled Hosny wrote: > But there are many text operations that require access to Unicode code > points. Take for example text layout, as mapping characters to glyphs > and back has to operate on code points. The idea that you never need > to work with code points is too simplistic. There are advantages to interpreting and operating on text as though it were in form NFD. However, there are still cases where one needs fractions of a character, such as word boundaries in Sanskrit, though I think the locations are liable to be specified in a language-specific form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it in at least 4 ways. Richard. From manish at mozilla.com Mon Mar 13 17:26:00 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Mon, 13 Mar 2017 15:26:00 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313214704.55372dfb@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it. -Manish On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote: > On Mon, 13 Mar 2017 23:10:11 +0200 > Khaled Hosny wrote: > >> But there are many text operations that require access to Unicode code >> points. Take for example text layout, as mapping characters to glyphs >> and back has to operate on code points. The idea that you never need >> to work with code points is too simplistic. > > There are advantages to interpreting and operating on text as though it > were in form NFD. However, there are still cases where one needs > fractions of a character, such as word boundaries in Sanskrit, though I > think the locations are liable to be specified in a language-specific > form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it > in at least 4 ways. > > Richard. From richard.wordingham at ntlworld.com Mon Mar 13 18:48:37 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 13 Mar 2017 23:48:37 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: <20170313234837.5d891338@JRWUBU2> On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard. From mark at kli.org Mon Mar 13 19:20:25 2017 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 13 Mar 2017 20:20:25 -0400 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: A word ending in A *or* AA preceding a word beginning in A *or* AA will all coalesce to a single AA in Sanskrit. That's four possibilities, and that doesn't count a word ending in a consonant preceding a word beginning in AA, which would be written the same. My memory is rusty, so I should actually be looking things up, but I think these are valid constructions: ? + ??????? ? ???????? ? + ??????? ? ???????? (and indeed, ??????? is the upasarga ? plus ???????, so there too the A + AA coalesced.) I should probably find you examples for all the other possibilities. Sanskrit external vowel sandhi is comparatively straightforward (compared to consonant sandhi), and it frequently loses information. A *or* AA plus I is E; A *or* AA plus U is O (you need A + O to get AU). ~mark On 03/13/2017 06:26 PM, Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. > -Manish > > > On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham > wrote: >> On Mon, 13 Mar 2017 23:10:11 +0200 >> Khaled Hosny wrote: >> >>> But there are many text operations that require access to Unicode code >>> points. Take for example text layout, as mapping characters to glyphs >>> and back has to operate on code points. The idea that you never need >>> to work with code points is too simplistic. >> There are advantages to interpreting and operating on text as though it >> were in form NFD. However, there are still cases where one needs >> fractions of a character, such as word boundaries in Sanskrit, though I >> think the locations are liable to be specified in a language-specific >> form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it >> in at least 4 ways. >> >> Richard. From richard.wordingham at ntlworld.com Mon Mar 13 20:56:23 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 01:56:23 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: <20170314015623.446cb440@JRWUBU2> On Mon, 13 Mar 2017 20:20:25 -0400 "Mark E. Shoulson" wrote: > Sanskrit external vowel sandhi is comparatively > straightforward (compared to consonant sandhi), and it frequently > loses information. A *or* AA plus I is E; A *or* AA plus U is O (you > need A + O to get AU). Indeed, E can not only be A or AA plus I or II: it can also be E + A. In the latter case avagraha is usual, at least in European practice. (Would that generally be locale sa_Deva_GB?) I'd like advice on modern Indian practice, and on the spacing and syllable division. I've seen a claim that avagraha always belongs with the preceding vowel, but I'm not sure that that rule applies in this case. In a similar fashion, O can -AS + A-, an interesting case of visarga sandhi. However, I'm not sure that one would want to *divide* the E or O. Richard. From richard.wordingham at ntlworld.com Mon Mar 13 21:03:56 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 02:03:56 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> Message-ID: <20170314020356.26ff5e89@JRWUBU2> On Mon, 13 Mar 2017 19:18:00 +0000 Alastair Houghton wrote: > IMO, returning code points by index is a mistake. It over-emphasises > the importance of the code point, which helps to continue the notion > in some developers? minds that code points are somehow ?characters?. > It also leads to people unnecessarily using UCS-4 as an internal > representation, which seems to have very few advantages in practice > over UTF-16. The problem is that UTF-16 based code can very easily overlook the handling of surrogate pairs, and one very easily get confused over what string lengths mean. Richard. From manish at mozilla.com Tue Mar 14 00:57:03 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Mon, 13 Mar 2017 22:57:03 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> <20170313234837.5d891338@JRWUBU2> Message-ID: Ah, it was what I thought you were talking about -- I wasn't aware they were considered word boundaries :) Thanks for the links! On Mar 13, 2017 4:54 PM, "Richard Wordingham" < richard.wordingham at ntlworld.com> wrote: On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard. -------------- next part -------------- An HTML attachment was scrubbed... URL: From alastair at alastairs-place.net Tue Mar 14 03:44:01 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Tue, 14 Mar 2017 08:44:01 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313211011.GE1429@macbook> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> Message-ID: <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> On 13 Mar 2017, at 21:10, Khaled Hosny wrote: > > On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: >> On 13 Mar 2017, at 17:55, J Decker wrote: >>> >>> I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. >>> https://blog.golang.org/strings >> >> IMO, returning code points by index is a mistake. It over-emphasises >> the importance of the code point, which helps to continue the notion >> in some developers? minds that code points are somehow ?characters?. >> It also leads to people unnecessarily using UCS-4 as an internal >> representation, which seems to have very few advantages in practice >> over UTF-16. > > But there are many text operations that require access to Unicode code > points. Take for example text layout, as mapping characters to glyphs > and back has to operate on code points. The idea that you never need to > work with code points is too simplistic. I didn?t say you never needed to work with code points. What I said is that there?s no advantage to UCS-4 as an encoding, and that there?s no advantage to being able to index a string by code point. As it happens, I?ve written the kind of code you cite as an example, including glyph mapping and OpenType processing, and the fact is that it?s no harder to do it with a UTF-16 string than it is with a UCS-4 string. Yes, certainly, surrogate pairs need to be decoded to map to glyphs; but that?s a *trivial* matter, particularly as the code point to glyph mapping is not 1:1 or even 1:N - it?s N:M, so you already need to cope with being able to map multiple code units in the string to multiple glyphs in the result. Kind regards, Alastair. -- http://alastairs-place.net From alastair at alastairs-place.net Tue Mar 14 03:51:18 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Tue, 14 Mar 2017 08:51:18 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314020356.26ff5e89@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170314020356.26ff5e89@JRWUBU2> Message-ID: <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> On 14 Mar 2017, at 02:03, Richard Wordingham wrote: > > On Mon, 13 Mar 2017 19:18:00 +0000 > Alastair Houghton wrote: > >> IMO, returning code points by index is a mistake. It over-emphasises >> the importance of the code point, which helps to continue the notion >> in some developers? minds that code points are somehow ?characters?. >> It also leads to people unnecessarily using UCS-4 as an internal >> representation, which seems to have very few advantages in practice >> over UTF-16. > > The problem is that UTF-16 based code can very easily overlook the > handling of surrogate pairs, and one very easily get confused over what > string lengths mean. Yet the same problem exists for UCS-4; it could very easily overlook the handling of combining characters. As for string lengths, string lengths in code points are no more meaningful than string lengths in UTF-16 code units. They don?t tell you anything about the number of user-visible characters; or anything about the width the string will take up if rendered on the display (even in a fixed-width font); or anything about the number of glyphs that a given string might be transformed into by glyph mapping. The *only* think a string length of a Unicode string will tell you is the number of code units. Kind regards, Alastair. -- http://alastairs-place.net From steffen at sdaoden.eu Tue Mar 14 07:21:27 2017 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Tue, 14 Mar 2017 13:21:27 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> Message-ID: <20170314122127.g2gcS%steffen@sdaoden.eu> Alastair Houghton wrote: |On 13 Mar 2017, at 21:10, Khaled Hosny wrote: |> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: |>> On 13 Mar 2017, at 17:55, J Decker wrote: |>>> |>>> I liked the Go implementation of character type - a rune type - \ |>>> which is a codepoint. and strings that return runes from by index. |>>> https://blog.golang.org/strings |>> |>> IMO, returning code points by index is a mistake. It over-emphasises |>> the importance of the code point, which helps to continue the notion |>> in some developers? minds that code points are somehow ?characters?. |>> It also leads to people unnecessarily using UCS-4 as an internal |>> representation, which seems to have very few advantages in practice |>> over UTF-16. |> |> But there are many text operations that require access to Unicode code |> points. Take for example text layout, as mapping characters to glyphs |> and back has to operate on code points. The idea that you never need to |> work with code points is too simplistic. | |I didn?t say you never needed to work with code points. What I said \ |is that there?s no advantage to UCS-4 as an encoding, and that there?s \ Well, you do have eleven bits for flags per codepoint, for example. |no advantage to being able to index a string by code point. As it \ With UTF-32 you can take the very codepoint and look up Unicode classification tables. |happens, I?ve written the kind of code you cite as an example, including \ |glyph mapping and OpenType processing, and the fact is that it?s no \ |harder to do it with a UTF-16 string than it is with a UCS-4 string. \ | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \ |but that?s a *trivial* matter, particularly as the code point to glyph \ |mapping is not 1:1 or even 1:N - it?s N:M, so you already need to cope \ |with being able to map multiple code units in the string to multiple \ |glyphs in the result. If you have to iterate over a string to perform some high-level processing then UTF-8 is a choice almost equally fine, for the very same reasons you bring in. And if the usage pattern "hotness" pictures that this thread has shown up at the beginning is correct, then the size overhead of UTF-8 that the UTF-16 proponents point out turns out to be a flop. But i for one gave up on making a stand against UTF-16 or BOMs. In fact i have turned to think UTF-16 is a pretty nice in-memory representation, and it is a small step to get from it to the real codepoint that you need to decide what something is, and what has to be done with it. I don't know whether i would really use it for this purpose, though, i am pretty sure that my core Unicode functions will (start to /) continue to use UTF-32, because the codepoint to codepoint(s) is what is described, and onto which anything else can be implemented. I.e., you can store three UTF-32 codepoints in a single uint64_t, and i would shoot myself in the foot if i would make this accessible via an UTF-16 or UTF-8 converter, imho; instead, i (will) make it accessible directly as UTF-32, and that serves equally well all other formats. Of course, if it is clear that you are UTF-16 all-through-the-way then you can save the conversion, but (the) most (widespread) Uni(x|ces) are UTF-8 based and it looks as if that would stay. Yes, yes, you can nonetheless use UTF-16, but it will most likely not safe you something on the database side due to storage alignment requirements, and the necessity to be able to access data somewhere. You can have a single index-lookup array and a dynamically sized database storage which uses two-byte alignment, of course, then i can imagine UTF-16 is for the better. I never looked how ICU does it, but i have been impressed by sheer data facts ^.^ --steffen From doug at ewellic.org Tue Mar 14 10:14:33 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 14 Mar 2017 08:14:33 -0700 Subject: "A Programmer's Introduction to Unicode" Message-ID: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> Steffen Nurpmeso wrote: >> I didn?t say you never needed to work with code points. What I said >> is that there?s no advantage to UCS-4 as an encoding, and that > > Well, you do have eleven bits for flags per codepoint, for example. That's not UCS-4; that's a custom encoding. (any UCS-4 code unit) & 0xFFE00000 == 0 -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Tue Mar 14 10:35:48 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 14 Mar 2017 16:35:48 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> References: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> Message-ID: Per definition yes, but UTC-4 is not Unicode. As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which would allow 32 planes instead of just the 17 first ones). I suppose he meant 21 bits, not 11 bits which covers only a small part of the BMP. 2017-03-14 16:14 GMT+01:00 Doug Ewell : > Steffen Nurpmeso wrote: > > >> I didn?t say you never needed to work with code points. What I said > >> is that there?s no advantage to UCS-4 as an encoding, and that > > > > Well, you do have eleven bits for flags per codepoint, for example. > > That's not UCS-4; that's a custom encoding. > > (any UCS-4 code unit) & 0xFFE00000 == 0 > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 14 11:15:38 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 14 Mar 2017 09:15:38 -0700 Subject: "A Programmer's Introduction to Unicode" Message-ID: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> Philippe Verdy wrote: >>> Well, you do have eleven bits for flags per codepoint, for example. >> >> That's not UCS-4; that's a custom encoding. >> >> (any UCS-4 code unit) & 0xFFE00000 == 0 (changing to "UTF-32" per Ken's observation) > Per definition yes, but UTC-4 is not Unicode. I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting held in 1989? > As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not > Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which > would allow 32 planes instead of just the 17 first ones). I used bitwise arithmetic strictly to address Steffen's premise that the 11 "unused bits" in a UTF-32 code unit were available to store metadata about the code point. Of course UTF-32 does not allow 0x110000 through 0x1FFFFF either. > I suppose he meant 21 bits, not 11 bits which covers only a small part > of the BMP. No, his comment "you do have eleven bits for flags per codepoint" pretty clearly referred to using the "extra" 11 bits beyond what is needed to hold the Unicode scalar value. -- Doug Ewell | Thornton, CO, US | ewellic.org From richard.wordingham at ntlworld.com Tue Mar 14 15:28:33 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 20:28:33 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170314020356.26ff5e89@JRWUBU2> <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> Message-ID: <20170314202833.08eb9d55@JRWUBU2> On Tue, 14 Mar 2017 08:51:18 +0000 Alastair Houghton wrote: > On 14 Mar 2017, at 02:03, Richard Wordingham > wrote: > > > > On Mon, 13 Mar 2017 19:18:00 +0000 > > Alastair Houghton wrote: > > The problem is that UTF-16 based code can very easily overlook the > > handling of surrogate pairs, and one very easily get confused over > > what string lengths mean. > > Yet the same problem exists for UCS-4; it could very easily overlook > the handling of combining characters. That's a different issue. I presume you mean the issues of canonical equivalence and detecting text boundaries. Again, there is the problem of remembering to consider the whole surrogate pair when using UTF-16. (I suppose this could be largely handled by avoiding the concept of arrays.) Now, the supplementary characters where these issues arise are very infrequently used. An error in UTF-16 code might easily not come to attention, whereas a problem with UCS-4 (or UTF-8) comes to light as soon as one handles Thai or IPA. > As for string lengths, string > lengths in code points are no more meaningful than string lengths in > UTF-16 code units. They don?t tell you anything about the number of > user-visible characters; or anything about the width the string will > take up if rendered on the display (even in a fixed-width font); or > anything about the number of glyphs that a given string might be > transformed into by glyph mapping. The *only* think a string length > of a Unicode string will tell you is the number of code units. A string length in codepoints does have the advantage of being independent of encoding. I'm actually using an index for UTF-16 text (I don't know whether its denominated in codepoints or code units) to index into the UTF-8 source code. However, the number of code units is the more commonly used quantity, as it tells one how much memory is required for simple array storage. Richard. From steffen at sdaoden.eu Wed Mar 15 05:40:54 2017 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Wed, 15 Mar 2017 11:40:54 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> References: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> Message-ID: <20170315104054.tMouD%steffen@sdaoden.eu> "Doug Ewell" wrote: |Philippe Verdy wrote: |>>> Well, you do have eleven bits for flags per codepoint, for example. |>> |>> That's not UCS-4; that's a custom encoding. |>> |>> (any UCS-4 code unit) & 0xFFE00000 == 0 | |(changing to "UTF-32" per Ken's observation) | |> Per definition yes, but UTC-4 is not Unicode. | |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting |held in 1989? | |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which |> would allow 32 planes instead of just the 17 first ones). | |I used bitwise arithmetic strictly to address Steffen's premise that the |11 "unused bits" in a UTF-32 code unit were available to store metadata |about the code point. Of course UTF-32 does not allow 0x110000 through |0x1FFFFF either. | |> I suppose he meant 21 bits, not 11 bits which covers only a small part |> of the BMP. | |No, his comment "you do have eleven bits for flags per codepoint" pretty |clearly referred to using the "extra" 11 bits beyond what is needed to |hold the Unicode scalar value. It surely is a weak argument for a general string encoding. But sometimes, and for local use cases it surely is valid. You could store the wcwidth(3) plus a graphem codepoint count both in these bits of the first codepoint of a cluster, for example, and, then, that storage detail hidden under an access method interface. --steffen From 637275 at gmail.com Fri Mar 17 11:53:43 2017 From: 637275 at gmail.com (Rebecca T) Date: Fri, 17 Mar 2017 12:53:43 -0400 Subject: Combining solidus above for transcription of poetic meter Message-ID: When transcribing poetic meter (scansion ), it is common to use two symbols above the line (usually a breve [U+306 ?] for stressed syllables and a solidus / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: ? / ? / ? / ? / ? / When I consider how my light is spent (John Milton, On His Blindness) Other symbols used in place of the breve are a cross / x (U+D8 ? or U+78 x) or bullet (U+B7 ? or U+2022 ?). This approach, however, is problematic; the lack of a combining slash above character means that two lines of text must be used, and any non-monospaced font (or any platform where multiple consecutive spaces are truncated into one by default, such as HTML) makes keeping the annotations properly aligned with the text difficult or impossible ? depending on your email client, the above example may be entirely misaligned. Being able to use combining diacritics for scansion would make these problems obsolete and enable a semantic transcription of meter. Would a proposal to add a combining solidus above (and possibly a combining reversed solidus above to support Hamer, Wright, and Trager-Smith notations) be supported? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Fri Mar 17 12:27:47 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Fri, 17 Mar 2017 17:27:47 GMT Subject: Combining solidus above for transcription of poetic meter References: Message-ID: On 2017-03-17, Rebecca T <637275 at gmail.com> wrote: > When transcribing poetic meter (scansion >), it is common to use two symbols > above the line (usually a breve [U+306 ?] for stressed syllables and a > solidus > / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: Other way round, as you illustrate > This approach, however, is problematic; the lack of a combining slash above > character means that two lines of text must be used, and any non-monospaced > font (or any platform where multiple consecutive spaces are truncated into > one It won't help to have a "combining solidus a long way above" (which is what you really want) unless you also have "combining breve a long way above". If you are happy to use a typographically normal combining breve for the unstressed syllables, you should be happy to use a typographically normal acute accent for the stressed syllable. > by default, such as HTML) makes keeping the annotations properly aligned > with > the text difficult or impossible ? depending on your email client, the > above > example may be entirely misaligned. Being able to use combining diacritics > for > scansion would make these problems obsolete and enable a semantic > transcription > of meter. If you're working in a situation where you don't have either markup control or the facility to use plain monospaced text, then just use normal breves and acutes. It's not clear to me that laying out aligned text (for which there are many other applications than scansion, e.g. interlinear translation) is something best achieved with combining characters! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From nobody_uses at outlook.com Fri Mar 17 13:46:45 2017 From: nobody_uses at outlook.com (eduardo marin) Date: Fri, 17 Mar 2017 18:46:45 +0000 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: You would need to propose the entire set of symbols, like the caret the reverse solidus and the x above, furthermore you would need to make the solidus small so it doesn't interfere with the line of text above. So go for it. ________________________________ De: Rebecca T <637275 at gmail.com> Enviado: viernes, 17 de marzo de 2017 10:53 a. m. Para: Unicode Public Asunto: Combining solidus above for transcription of poetic meter When transcribing poetic meter (scansion), it is common to use two symbols above the line (usually a breve [U+306 ?] for stressed syllables and a solidus / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: ? / ? / ? / ? / ? / When I consider how my light is spent (John Milton, On His Blindness) Other symbols used in place of the breve are a cross / x (U+D8 ? or U+78 x) or bullet (U+B7 ? or U+2022 ?). This approach, however, is problematic; the lack of a combining slash above character means that two lines of text must be used, and any non-monospaced font (or any platform where multiple consecutive spaces are truncated into one by default, such as HTML) makes keeping the annotations properly aligned with the text difficult or impossible ? depending on your email client, the above example may be entirely misaligned. Being able to use combining diacritics for scansion would make these problems obsolete and enable a semantic transcription of meter. Would a proposal to add a combining solidus above (and possibly a combining reversed solidus above to support Hamer, Wright, and Trager-Smith notations) be supported? -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:03:12 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:03:12 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: Isn't this a use case for interlinear annotations ? What is the current status of interlinear encoding? We were told that the encoded codepoints for these are more or less deprecated (but in HTML there's still interlinear annotation supported by ruby notations). In these annotations, we don't need any diacritics, we could just use base symbols. Garanti sans virus. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > On 2017-03-17, Rebecca T <637275 at gmail.com> wrote: > > When transcribing poetic meter (scansion > >), it is common to use two > symbols > > above the line (usually a breve [U+306 ?] for stressed syllables and a > > solidus > > / slash [U+2F /] for unstressed syllables) to indicate stress patterns. > Ex: > > Other way round, as you illustrate > > > This approach, however, is problematic; the lack of a combining slash > above > > character means that two lines of text must be used, and any > non-monospaced > > font (or any platform where multiple consecutive spaces are truncated > into > > one > > It won't help to have a "combining solidus a long way above" (which is > what you really want) unless you also have "combining breve a long way > above". > If you are happy to use a typographically normal combining breve for > the unstressed syllables, you should be happy to use a typographically > normal acute accent for the stressed syllable. > > > by default, such as HTML) makes keeping the annotations properly aligned > > with > > the text difficult or impossible ? depending on your email client, the > > above > > example may be entirely misaligned. Being able to use combining > diacritics > > for > > scansion would make these problems obsolete and enable a semantic > > transcription > > of meter. > > If you're working in a situation where you don't have either markup > control or the facility to use plain monospaced text, then just use > normal breves and acutes. > It's not clear to me that laying out aligned text (for which there are > many other applications than scansion, e.g. interlinear translation) > is something best achieved with combining characters! > > > > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:10:43 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:10:43 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > If you are happy to use a typographically normal combining breve for > the unstressed syllables, you should be happy to use a typographically > normal acute accent for the stressed syllable. > You've understood the reverse! the stressed syllable in those notation uses a breve, the unstressed syllables use a slash/solidus (which many look very similar to an acute accent, but means here exactly the opposite). However using acute accents that are already used in many langauges for vowel distinctions (independantly of stress) would cause problems. It would be better to use the IPA stress mark that looks like a vertical tick just before the syllable (i.e. before its leading consonnant and not on top of its central vowel): these marks are not combining, they are regular spacing symbols. The proposal discusses about *some* specific use where symbols that look like diacritics may be used in a row just above the actual text (in that case it should not be confused with the actual accents). That's why I think this better fits with interlinear annotations (there will be some vertical margin between the notation and the text using its native diacritics, and the interlinear stress marks will align horizontally without colliding wit h the text whose diacritics would have variable placement, not aligned horizontally but depending on base letters or the presence of other diacritics). Garanti sans virus. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:16:17 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:16:17 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: Final note: the HTML ruby syntax (their standard tags) is not supported by MediaWiki, for your example article in English Wikipedia (but there are some templates that could simulate ruby notation, using equivalent CSS to which the ruby notation should have a default mapping, as specified in an annex of the HTML standard suggesting a default CSS stylesheet for standard HTML tags). 2017-03-17 20:10 GMT+01:00 Philippe Verdy : > 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > >> If you are happy to use a typographically normal combining breve for >> the unstressed syllables, you should be happy to use a typographically >> normal acute accent for the stressed syllable. >> > > You've understood the reverse! the stressed syllable in those notation > uses a breve, the unstressed syllables use a slash/solidus (which many look > very similar to an acute accent, but means here exactly the opposite). > However using acute accents that are already used in many langauges for > vowel distinctions (independantly of stress) would cause problems. > > It would be better to use the IPA stress mark that looks like a vertical > tick just before the syllable (i.e. before its leading consonnant and not > on top of its central vowel): these marks are not combining, they are > regular spacing symbols. > > The proposal discusses about *some* specific use where symbols that look > like diacritics may be used in a row just above the actual text (in that > case it should not be confused with the actual accents). > > That's why I think this better fits with interlinear annotations (there > will be some vertical margin between the notation and the text using its > native diacritics, and the interlinear stress marks will align horizontally > without colliding wit h the text whose diacritics would have variable > placement, not aligned horizontally but depending on base letters or the > presence of other diacritics). > > > > > Garanti > sans virus. www.avast.com > > <#m_2934369818200883392_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:23:16 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:23:16 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: An article for you to read that privides some basic guides and a presentaiton of the concept and its use in HTML: https://en.wikipedia.org/wiki/Ruby_character Then look at CSS 2.0 for specifications. In Unicode, 3 format control characters were encoded for this (U+FFF9...U+FFFB), but that supports only a minimalist subset of the ruby feature and which are (as far as I know) poorly supported in browsers (almost no one use them, not even for the common ruby text used in Asian languages, notably in Japanese for the Furigana notations using kanas above sinographic Kanji text, or in Chinese for the Bopomofo or Latin notations above sinographic text found in educational books for children). Garanti sans virus. www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> 2017-03-17 20:16 GMT+01:00 Philippe Verdy : > Final note: the HTML ruby syntax (their standard tags) is not supported by > MediaWiki, for your example article in English Wikipedia (but there are > some templates that could simulate ruby notation, using equivalent CSS to > which the ruby notation should have a default mapping, as specified in an > annex of the HTML standard suggesting a default CSS stylesheet for standard > HTML tags). > > 2017-03-17 20:10 GMT+01:00 Philippe Verdy : > >> 2017-03-17 18:27 GMT+01:00 Julian Bradfield : >> >>> If you are happy to use a typographically normal combining breve for >>> the unstressed syllables, you should be happy to use a typographically >>> normal acute accent for the stressed syllable. >>> >> >> You've understood the reverse! the stressed syllable in those notation >> uses a breve, the unstressed syllables use a slash/solidus (which many look >> very similar to an acute accent, but means here exactly the opposite). >> However using acute accents that are already used in many langauges for >> vowel distinctions (independantly of stress) would cause problems. >> >> It would be better to use the IPA stress mark that looks like a vertical >> tick just before the syllable (i.e. before its leading consonnant and not >> on top of its central vowel): these marks are not combining, they are >> regular spacing symbols. >> >> The proposal discusses about *some* specific use where symbols that look >> like diacritics may be used in a row just above the actual text (in that >> case it should not be confused with the actual accents). >> >> That's why I think this better fits with interlinear annotations (there >> will be some vertical margin between the notation and the text using its >> native diacritics, and the interlinear stress marks will align horizontally >> without colliding wit h the text whose diacritics would have variable >> placement, not aligned horizontally but depending on base letters or the >> presence of other diacritics). >> >> >> >> >> Garanti >> sans virus. www.avast.com >> >> <#m_-5720946395316280878_m_2934369818200883392_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Fri Mar 17 14:41:29 2017 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 17 Mar 2017 12:41:29 -0700 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: On 3/17/2017 10:27 AM, Julian Bradfield wrote: > If you're working in a situation where you don't have either markup > control or the facility to use plain monospaced text, then just use > normal breves and acutes. > It's not clear to me that laying out aligned text (for which there are > many other applications than scansion, e.g. interlinear translation) > is something best achieved with combining characters! I concur with Julian here. In fact, the very wiki article on scansion cited by Rebecca makes it clear that this is an interlinear type of annotation that in principle can use many *other* symbols, including x's (or multiplication signs), digits, circumflexes, and other symbols. Furthermore, the application of the scansion marks is to *syllables* and not to individual letters, which further enhances the case for interlinear representation. The simplest implementation of that is precisely as done in that wiki: force the interlinear examples into a monospace font. For simple transposing of an interlinear scansion into a single-line plain text representation, either combining breves and acutes (and circumflexes and graves, ...) can be used and/or spacing versions of breves (and circumflexes...) plus ordinary slashes and backslashes can be dropped into the syllabified text. --Ken > From jcb+unicode at inf.ed.ac.uk Fri Mar 17 15:36:13 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Fri, 17 Mar 2017 20:36:13 +0000 (GMT) Subject: Combining solidus above for transcription of poetic meter References: Message-ID: On 2017-03-17, Philippe Verdy wrote: > 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > >> If you are happy to use a typographically normal combining breve for >> the unstressed syllables, you should be happy to use a typographically >> normal acute accent for the stressed syllable. >> > > You've understood the reverse! the stressed syllable in those notation uses > a breve, the unstressed syllables use a slash/solidus (which many look very > similar to an acute accent, but means here exactly the opposite). I have understood the situation as it actually is (and indeed as it is described in the Wikipedia article). *As I pointed out*, had you bothered to read what I wrote, the OP accidentally reversed the standard notation, in which / indicates a stressed syllable, and a breve an unstressed. Hence there is no clash with the (e.g.) Spanish use of an acute to indicate stress. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From boldewyn at gmail.com Fri Mar 17 15:44:15 2017 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 17 Mar 2017 21:44:15 +0100 Subject: New tool unidump Message-ID: Hi, for my work on codepoints.net and Emojipedia I found myself repeatedly in a place, where I needed some tool like hexdump to inspect the content of a string. However, instead of raw bytes I am more interested in the code points that the string is composed of. So I wrote this tool. I reasoned, that it might come in handy for other people on this list. It is, conveniently, named unidump and can be installed via pip (pip3, that is, because it needs Python 3): pip3 install unidump The source code is available on Github, https://github.com/Codepoints/unidump, and the tool is MIT licensed. The README on Github also explains some other use cases, like counting code points in a file (as opposed to bytes) or using it as a replacement for strings(1). If you have any comment, feedback, bug report or other questions, I'm glad to answer any of those. Cheers and have a nice weekend, Manuel From everson at evertype.com Fri Mar 17 15:53:24 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 17 Mar 2017 20:53:24 +0000 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: <375186EF-815E-4847-987B-03E94A6C1BBB@evertype.com> http://www.brill.com/files/brill.nl/special_scripts_metrical_characters_unicode.pdf From manish at mozilla.com Fri Mar 17 18:43:04 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Fri, 17 Mar 2017 16:43:04 -0700 Subject: New tool unidump In-Reply-To: References: Message-ID: https://r12a.github.io/uniview/ https://r12a.github.io/apps/conversion/ are excellent tools for this, as well, if you're in a situation where you can copy into a web form. This looks useful for commandline stuff, though, thanks! -Manish On Fri, Mar 17, 2017 at 1:44 PM, Manuel Strehl wrote: > Hi, > > for my work on codepoints.net and Emojipedia I found myself repeatedly > in a place, where I needed some tool like hexdump to inspect the content > of a string. However, instead of raw bytes I am more interested in the > code points that the string is composed of. So I wrote this tool. > > I reasoned, that it might come in handy for other people on this list. > It is, conveniently, named unidump and can be installed via pip (pip3, > that is, because it needs Python 3): > > pip3 install unidump > > The source code is available on Github, > https://github.com/Codepoints/unidump, and the tool is MIT licensed. The > README on Github also explains some other use cases, like counting code > points in a file (as opposed to bytes) or using it as a replacement for > strings(1). > > If you have any comment, feedback, bug report or other questions, I'm > glad to answer any of those. > > Cheers and have a nice weekend, > Manuel From jsbien at mimuw.edu.pl Sat Mar 18 00:42:05 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sat, 18 Mar 2017 06:42:05 +0100 Subject: New tool unidump In-Reply-To: References: Message-ID: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> Quote/Cytat - Manuel Strehl (Fri 17 Mar 2017 09:44:15 PM CET): > Hi, > > for my work on codepoints.net and Emojipedia I found myself repeatedly > in a place, where I needed some tool like hexdump to inspect the content > of a string. However, instead of raw bytes I am more interested in the > code points that the string is composed of. So I wrote this tool. Is somebody maintaining a list of such utilities? There is a page http://www.unicode.org/resources/online-tools.html but I remember that earlier a page on the site used to be links to the programs mentioned in 2012 "Tool to convert characters to character names", in particular to Bill Poser's uniutils (http://billposer.org/Software/unidesc.html) and the orphaned unihist by a student of mine (https://bitbucket.org/jsbien/unihistext). I'm unable to find them now. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From 637275 at gmail.com Sun Mar 19 16:46:28 2017 From: 637275 at gmail.com (Rebecca T) Date: Sun, 19 Mar 2017 17:46:28 -0400 Subject: New tool unidump In-Reply-To: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> References: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> Message-ID: I maintain a list of various Unicode tools and resources at unicode.9999yea.rs and always welcome new additions! On Sat, Mar 18, 2017 at 1:42 AM, Janusz S. Bien wrote: > Quote/Cytat - Manuel Strehl (Fri 17 Mar 2017 > 09:44:15 PM CET): > > Hi, >> >> for my work on codepoints.net and Emojipedia I found myself repeatedly >> in a place, where I needed some tool like hexdump to inspect the content >> of a string. However, instead of raw bytes I am more interested in the >> code points that the string is composed of. So I wrote this tool. >> > > Is somebody maintaining a list of such utilities? > > There is a page > > http://www.unicode.org/resources/online-tools.html > > but I remember that earlier a page on the site used to be links to the > programs mentioned in 2012 "Tool to convert characters to character names", > in particular to Bill Poser's uniutils (http://billposer.org/Software > /unidesc.html) and the orphaned unihist by a student of mine ( > https://bitbucket.org/jsbien/unihistext). I'm unable to find them now. > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~ > jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.lukyanov at yspu.org Mon Mar 20 04:27:35 2017 From: a.lukyanov at yspu.org (Andrey Lukyanov) Date: Mon, 20 Mar 2017 12:27:35 +0300 Subject: New tool unidump In-Reply-To: References: Message-ID: <3b657ee35ea3e5ff075363d9dd2cc7ea@mail> apropos unidump: It would be nice to add the option of printing not only numbers, but also character names and other info from the NamesList.txt file. I am using a homemade program at my computer: $ typecode fc25 66 63 32 35 $ typecode -l fc25 0066 LATIN SMALL LETTER F 0063 LATIN SMALL LETTER C 0032 DIGIT TWO 0035 DIGIT FIVE $ typecode -f fc25 0066 LATIN SMALL LETTER F 0063 LATIN SMALL LETTER C 0032 DIGIT TWO ~ 0032 FE0E text style ~ 0032 FE0F emoji style 0035 DIGIT FIVE ~ 0035 FE0E text style ~ 0035 FE0F emoji style From c933103 at gmail.com Tue Mar 21 07:12:10 2017 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 21 Mar 2017 20:12:10 +0800 Subject: Standaridized variation sequences for the Desert alphabet? Message-ID: According to the wikipedia page for the Desert alphabet, there're critism that in the unicode chart some of the letter encoded for the alphabet used the 1855 design instead of 1859 deisgn of those characters. Would it be a good idea to make ?standardized variation sequences for those characters so that they can be displayed eitherway upon users' wish? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 21 12:41:04 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 21 Mar 2017 10:41:04 -0700 Subject: Standaridized variation sequences for the Desert =?UTF-8?Q?alphabet=3F?= Message-ID: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> gfb hjjhjh wrote: > According to the wikipedia page for the Desert alphabet, there're > critism that in the unicode chart some of the letter encoded for the > alphabet used the 1855 design instead of 1859 deisgn of those > characters. Would it be a good idea to make ?standardized variation > sequences for those characters so that they can be displayed eitherway > upon users' wish? Almost any letter in any script can have glyph variations that don't represent a change in semantics. A Deseret font could easily, and conformantly, be constructed with whatever set of glyphs the designer wishes to show, just as it could for a Latin-script font. -- Doug Ewell | Thornton, CO, US | ewellic.org From jameskasskrv at gmail.com Tue Mar 21 18:17:11 2017 From: jameskasskrv at gmail.com (James Kass) Date: Tue, 21 Mar 2017 15:17:11 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: https://en.wikipedia.org/wiki/Deseret_alphabet An interesting article. The "Encodings" section illustrates the differences between the older and newer forms of the two letters. Doug Ewell wrote, > A Deseret font could easily, and conformantly, be > constructed with whatever set of glyphs the designer > wishes to show, just as it could for a Latin-script > font. If the user community needs to preserve the distinction in plain-text, then variation selection is the right approach. Best regards, James Kass From prosfilaes at gmail.com Tue Mar 21 18:55:26 2017 From: prosfilaes at gmail.com (David Starner) Date: Tue, 21 Mar 2017 23:55:26 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On Tue, Mar 21, 2017 at 4:50 PM James Kass wrote: > If the user community needs to preserve the distinction in plain-text, > then variation selection is the right approach. > True. However, the user community is tiny, and I suspect that those variation selectors would never get used. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Wed Mar 22 10:47:30 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 15:47:30 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: The right first thing to do is to examine the letterforms and determine on structural grounds whether there is a case to be made for encoding. Beesley claimed in 2002 that the glyphs used for EW [ju] and OI [??] changed between 1855 and 1859. Well, OK. 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. That?s encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. If there is evidence outside of the Wikipedia for the 1859 letters, they should be encoded as new letters, because their design shows them to be ligatures of different base characters. That means they?re not glyph variants of the currently encoded letters. Michael Everson From wjgo_10009 at btinternet.com Wed Mar 22 10:54:39 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 22 Mar 2017 15:54:39 +0000 (GMT) Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <3780339.48221.1490198079154.JavaMail.defaultUser@defaultHost> >> If the user community needs to preserve the distinction in plain-text, then variation selection is the right approach. > True. However, the user community is tiny, and I suspect that those variation selectors would never get used. I do not use Deseret myself. I opine that encoding the variation selector sequences would be good. My reason for that opinion is because I opine that Unicode should provide for such situations where they are known to exist, even if the usage of the encoding may be very rare. Am I correct in thinking that making use of such a variation selector encoding would be a font issue rather than an operating system issue? Unicode is intended to be a long-lasting standardized system, so hopefully adding the variation selector sequences into The Unicode Standard now would provide support for a very long time. Am I correct in thinking that the cost of adding the variation selector sequences into The Unicode Standard would be very small? William Overington Wednesday 22 March 2017 From jenkins at apple.com Wed Mar 22 11:50:26 2017 From: jenkins at apple.com (John H. Jenkins) Date: Wed, 22 Mar 2017 10:50:26 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. From everson at evertype.com Wed Mar 22 12:44:04 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 17:44:04 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> Message-ID: <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> On 22 Mar 2017, at 16:50, John H. Jenkins wrote: > > My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. There?s identity in terms of intended usage (two diphthongs), and identity in terms of the origin of the characters (ligatures from different sources). That kind of etymology is indeed something that we take into account when encoding characters. > In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. I think I have to stand by my glyph analysis > There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. Dunno what you are referring to here. > It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. I would oppose such a change given the origin of the four characters we have discussed. The old EW and OI and the new EW and OI are clearly *different* letters. Michael From jameskasskrv at gmail.com Wed Mar 22 15:26:31 2017 From: jameskasskrv at gmail.com (James Kass) Date: Wed, 22 Mar 2017 12:26:31 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> Message-ID: Michael Everson wrote, > The old EW and OI and the new EW and OI are > clearly *different* letters. "Different" versus "variant"? Michael's analysis seems correct. If Deseret was not already in the Standard, a new proposal for its encoding including eight characters covering the two dipthongs would not be amiss, would it? An alternative would be to use the ZWJ mechanism to indicate a preference for the desired letters. My opinion that variation selectors would be the right approach was based upon concerns about existing data getting "broken". But, if there isn't any existing data... Best regards, James Kass From everson at evertype.com Wed Mar 22 16:33:39 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 21:33:39 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> Message-ID: On 22 Mar 2017, at 20:26, James Kass wrote: > Michael Everson wrote, > >> The old EW and OI and the new EW and OI are clearly *different* letters. > > "Different" versus "variant?? Yes, different. All of them share the SHORT I [?] stroke but the base characters are ?? ?? (1855) and ?? ?? (1859). > Michael's analysis seems correct. If Deseret was not already in the Standard, a new proposal for its encoding including eight characters covering the two dipthongs would not be amiss, would it? Capital and small ?? ?? ?? ?? are already encoded. If the other four are required, nothing prevents them from being proposed and added. > An alternative would be to use the ZWJ mechanism to indicate a preference for the desired letters. Joining what? We encoded ?? ?? ?? ?? explicitly, not as ligatures, though they are in origin ligatures. > My opinion that variation selectors would be the right approach was based upon concerns about existing data getting "broken". But, if there isn't any existing data? If ?? is in origin a ligature of ???? and the 1859 one is in origin a ligature of ???? then the 1855 and 1859 letters are **NOT** ?variants? of one another. They are *different* letters in origin, regardless of their intended use. The choice to use 1855 EW or 1859 EW is a matter of *spelling*, not glyph substitution. If the later letters are really required, they should be added to the standard. We should not abandon the good precedent we have for character identification just for expedience. That?d be a way to turn the UCS into a glyph registry. :-( Michael Everson From prosfilaes at gmail.com Wed Mar 22 16:39:27 2017 From: prosfilaes at gmail.com (David Starner) Date: Wed, 22 Mar 2017 21:39:27 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On Wed, Mar 22, 2017 at 8:54 AM Michael Everson wrote: > If there is evidence outside of the Wikipedia for the 1859 letters, they > should be encoded as new letters, because their design shows them to be > ligatures of different base characters. That means they?re not glyph > variants of the currently encoded letters. > Does "?ussia" require a new Latin letter because the way R was written has a different origin than the normal R? There's huge variation in Latin script including all sorts of different glyphs, and I suspect ?ussia is way more common than any use of the Deseret script. There's the same characters here, written in different ways. The glyphs may come from a different origin, but it's encoding the same idea. If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Wed Mar 22 19:03:44 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 00:03:44 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On 22 Mar 2017, at 21:39, David Starner wrote: > > Does "?ussia" require a new Latin letter because the way R was written has a different origin than the normal R? But it doesn?t. It?s the Latin letter R turned backwards by a designer for a logo. We wouldn?t encode that, because it?s a logo. > There's huge variation in Latin script including all sorts of different glyphs, and I suspect ?ussia is way more common than any use of the Deseret script. In order to represent that logo, people use the Cyrillic letter ?, as you know. > There's the same characters here, written in different ways. No, it?s not. Its the same diphthong (a sound) written with different letters. > The glyphs may come from a different origin, but it's encoding the same idea. We don?t encode diphthongs. We encode the elements of writing systems. The ?idea? here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ?? (1859 OI). Those ligatures are not glyph variants of one another. You might as well say that ? and ? are glyph variants of one another. > If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. I do not agree with that analysis. The ligatures and their constituent parts are distinct and distinctive. In fact, it might have been that the choice for revision was to improve the underlying phonology. In any case, there?s no way that the bottom pair in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg can be considered to be ?glyph variants? of the top pair. Usage is one thing. Character identity is another. ? is not ?. A ligature of ?? + ?? is not a ligature of ?? + ??. Michael Everson From charupdate at orange.fr Wed Mar 22 19:20:14 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 23 Mar 2017 01:20:14 +0100 (CET) Subject: Flaw on Side View vs Front View Emoji Pairs? Message-ID: <700225421.24225.1490228414657.JavaMail.www@wwinf1f14> Here is an issue that admittedly is unsignificant when compared to on-going world events, but I need to work on some documents to be finished these days. Some transport emoji pairs appear to have been encoded at the same time (6.0), but have their glyphs swapped in some current font(s). These include: U+1F68C BUS U+1F68D ONCOMING BUS U+1F692 FIRE ENGINE U+1F6F1 ONCOMING FIRE ENGINE U+1F693 POLICE CAR U+1F694 ONCOMING POLICE CAR U+1F695 TAXI U+1F696 ONCOMING TAXI U+1F697 AUTOMOBILE U+1F698 ONCOMING AUTOMOBILE While on cellphones, the first are side views (source: iemoji.com), the latter ones are conformant front views. By contrast, web browsers on Windows use a font or fonts that show the first in front view, while the others are missing. I note that both are ?fully conformant? to the Standard, so far as the name is a mere identifier, not a descriptor, and the glyphs in the charts have little of a prescription. At least, whenever the name is generic as of perspective, any designer of somewhat related glyphs can claim conformance, and Unicode has to endorse the resulting flaw. I note, too, that ?oncoming? is often misunderstood as carrying a connotation of dynamics, whereas in reality, many vehicles are more iconic in front view, while others stand more out in side view. Was it imaginable to be precise and call them simply: U+1F68C BUS SIDE VIEW U+1F68D BUS FRONT VIEW U+1F692 FIRE ENGINE SIDE VIEW U+1F6F1 FIRE ENGINE FRONT VIEW U+1F693 POLICE CAR SIDE VIEW U+1F694 POLICE CAR FRONT VIEW U+1F695 TAXI SIDE VIEW U+1F696 TAXI FRONT VIEW U+1F697 AUTOMOBILE SIDE VIEW U+1F698 AUTOMOBILE FRONT VIEW Or did the first ones already exist in both views, so that it was desirable to add one more character for each one of them to make sure to get front views? That would imply that fonts with the first in front view don?t need to support the second characters, treated as mere glyph variants. In any case we seem to have some drawbacks to choose from data interchange flaws and document rendering flaws. Does the original proposer or anybody else have any clues on how the set was intended, and how to fix the discrepancy? Regards, Marcel From duerst at it.aoyama.ac.jp Thu Mar 23 00:54:03 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 23 Mar 2017 14:54:03 +0900 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Hello Michael, others, [Fixed script name in subject.] On 2017/03/23 09:03, Michael Everson wrote: > On 22 Mar 2017, at 21:39, David Starner wrote: >> There's the same characters here, written in different ways. > > No, it?s not. Its the same diphthong (a sound) written with different letters. I think this may well be the *historically* correct analysis. And that may have some influence on how to encode this, but it shouldn't be dominant. What's most important is (past and) *current use*. If the distinction is an orthographic one (e.g. different words being written with different shapes), then that's definitely a good indication for splitting. On the other hand, if fonts (before/outside Unicode) only include one variant at the time, if people read over the variant without much ado, if people would be surprised to find both corresponding variants in one and the same text (absent font variations), if there are examples where e.g. the variant is adjusted in quotes from texts that used the 'old' variant inside a text with the 'new' variants, and so on, then all these would be good indications that this is, for actual usage purposes, just a font difference, and should therefore best be handled as such. The closes to the current case that I was able to find was the German ?. It has roots in both an ss and an sz (to be precise, an ?s and an ?z) ligature (see https://en.wikipedia.org/wiki/?). And indeed in some fonts, its right part looks more like an s, and in other fonts more like a z (and in lower case, more often like an s, but in upper case, much more like a (cursive) Z). Nevertheless, there is only one character (or two if you count upper case) encoded, because anything else would be highly confusing to virtually all users. What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. Regards, Martin. >> The glyphs may come from a different origin, but it's encoding the same idea. > > We don?t encode diphthongs. We encode the elements of writing systems. The ?idea? here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ?? (1859 OI). > > Those ligatures are not glyph variants of one another. You might as well say that ? and ? are glyph variants of one another. > >> If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. > > I do not agree with that analysis. The ligatures and their constituent parts are distinct and distinctive. In fact, it might have been that the choice for revision was to improve the underlying phonology. In any case, there?s no way that the bottom pair in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg can be considered to be ?glyph variants? of the top pair. Usage is one thing. Character identity is another. ? is not ?. A ligature of ?? + ?? is not a ligature of ?? + ??. > > Michael Everson > . From prosfilaes at gmail.com Thu Mar 23 01:28:26 2017 From: prosfilaes at gmail.com (David Starner) Date: Thu, 23 Mar 2017 06:28:26 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On Wed, Mar 22, 2017 at 5:09 PM Michael Everson wrote: > On 22 Mar 2017, at 21:39, David Starner wrote: > > > > Does "?ussia" require a new Latin letter because the way R was written > has a different origin than the normal R? > > But it doesn?t. It?s the Latin letter R turned backwards by a designer for > a logo. We wouldn?t encode that, because it?s a logo. > What logo? I honestly don't know what logo you're talking about, but a quick Google search confirms it's used outside of a logo. I was thinking of http://www.sjgames.com/gurps/books/Russia/img/cover_lg.jpg which actually doesn't use the reversed R, but uses other Cyrillic characters. > We don?t encode diphthongs. We encode the elements of writing systems. The > ?idea? here is represented by one ligature of ?? + ?? (1855 EW), one > ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one > ligature of ?? + ?? (1859 OI). > If they're ligatures, they should be encoded as ligatures; if they're indivisible characters, then their glyph forms are of less interest. > Those ligatures are not glyph variants of one another. You might as well > say that ? and ? are glyph variants of one another. > ? and ? have contrasting use; they're used in the same text in distinct ways. Note that n and v? are considered glyph variants of each other, because v? is used in Sutterlin in exactly the places that n is used in typewritten versions of the text. > ? is not ?. > ? is not ? even when they are printed in fonts that make it nearly impossible to tell them apart. It has nothing to do with the glyphs or how those glyphs were created, it's because they're used in different ways. The example of Sutterlin strikes me as quite relevant here; characters get all sorts of weird shapes in handwriting. Sometimes they end up immortalized in printing, and then they usually get encoded. Usually not. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskasskrv at gmail.com Thu Mar 23 04:33:39 2017 From: jameskasskrv at gmail.com (James Kass) Date: Thu, 23 Mar 2017 01:33:39 -0800 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: Martin J. D?rst wrote, > What is right for Deseret has to be decided by > and for Deseret users, rather than by script > historians. The Universal Character Set is used by everyone, including script historians. While modern day deployment of the script is determined by its users, the proper encoding of the script should be detemined by character encoders based upon expert input from all interested parties. Best regards, James Kass From otto.stolz at uni-konstanz.de Thu Mar 23 05:23:27 2017 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Thu, 23 Mar 2017 11:23:27 +0100 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> Hello Michael, others, On 2017/03/23 09:03, Michael Everson wrote: > Its the same diphthong (a sound) written with different > letters. Am 23.03.2017 um 06:54 schrieb Martin J. D?rst: > I think this may well be the *historically* correct analysis. And that > may have some influence on how to encode this, but it shouldn't be > dominant. > > What's most important is (past and) *current use*. Same issue as with German sharp S: The blackletter ??? derives from an ?-z ligature (thence its German name ?Eszet?), whilst the Roman type ??? derives from an ?-s ligature. Still, we encode both variants as identical letters. I?ve got a print from 1739 with legends in both German (blackletter) and French (Roman italics), comprising both types of ligatures in one single document. Best wishes, Otto From richard.wordingham at ntlworld.com Thu Mar 23 06:21:28 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 23 Mar 2017 11:21:28 +0000 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> Message-ID: <20170323112128.49075cb0@JRWUBU2> On Thu, 23 Mar 2017 11:23:27 +0100 Otto Stolz wrote: > Same issue as with German sharp S: The blackletter ??? derives from an > ?-z ligature (thence its German name ?Eszet?), whilst the Roman type > ??? derives from an ?-s ligature. Still, we encode both variants as > identical letters. I?ve got a print from 1739 with legends in both > German (blackletter) and French (Roman italics), comprising both types > of ligatures in one single document. There's another, lesser German analogy. If I understand correctly, in some styles the diaeresis and umlaut marks may be distinguished visually. While it is permissible to use CGJ to mark the difference, the TUS claims (TUS 9.0 p833, in Section 23.2) that CGJ does not affect rendering, except for the direct effect of blocking canonical reordering. (This does appear to be in contrast to its seemingly archaic effect in inhibiting line-breaking.) However, combining marks are, by policy, unified more readily than letters. Richard. From verdy_p at wanadoo.fr Thu Mar 23 07:26:43 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 23 Mar 2017 13:26:43 +0100 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: 2017-03-23 6:54 GMT+01:00 Martin J. D?rst : > Hello Michael, others, > > On 2017/03/23 09:03, Michael Everson wrote: > >> On 22 Mar 2017, at 21:39, David Starner wrote: >> > > There's the same characters here, written in different ways. >>> >> >> No, it?s not. Its the same diphthong (a sound) written with different >> letters. >> > > The closes to the current case that I was able to find was the German ?. > It has roots in both an ss and an sz (to be precise, an ?s and an ?z) > ligature (see https://en.wikipedia.org/wiki/?). And indeed in some fonts, > its right part looks more like an s, and in other fonts more like a z (and > in lower case, more often like an s, but in upper case, much more like a > (cursive) Z). Nevertheless, there is only one character (or two if you > count upper case) encoded, because anything else would be highly confusing > to virtually all users. > This is a good case for encoding explicit variants, including for the two German ?, to distinguish letter forms in historic (medieval?) texts where ?s and ?z were more distinguished. This does not require disuynification, and fonts that can have both forms can choose the correct glyph to use for each variant, and take a default form for the unified character depending on the contextual language (if it is detected) or based on the font style itself (if it was initially designed for a specific language, notably in medieval styles). > What is right for Deseret has to be decided by and for Deseret users, > rather than by script historians. > In historic texts it is not clear which letter form is better than the other, and historic Deseret was basically for a single language (but there may have been regional variants prefering a form instead of the other). I think that now the distinction is in fact more recent, where some eople will want to distinguish them for new uses with dinstinctions. Here also a variant encoding would solve these special cases but we should not disunify the character (and in fact there's not a lot of fonts except for fancy usages, such as trying to mimic handwritten styles for specific authors about how they draw these shapes; I've not seen however any conclusive case of distinction in typesetted texts). In fact we are in a situation similar to the case of shapes for decimal digits like 4 (open or closed), 7 (with an overstriking bar or none), or 0 (with an overstriking slash or dot, or none), 3 (with an angular or circle top part), or letters like g (with a curled leg drawn counterclockwise, or just a bottom foot from right to left: here a distinctive shape was encoded for the IPA symbol) > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Thu Mar 23 08:32:46 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 13:32:46 +0000 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: <540B0347-175E-4B73-9420-2E6410202995@evertype.com> > On 23 Mar 2017, at 05:54, Martin J. D?rst wrote: > > Hello Michael, others, > > [Fixed script name in subject.] > > On 2017/03/23 09:03, Michael Everson wrote: >> On 22 Mar 2017, at 21:39, David Starner wrote: > >>> There's the same characters here, written in different ways. >> >> No, it?s not. Its the same diphthong (a sound) written with different letters. > > I think this may well be the *historically* correct analysis. And that may have some influence on how to encode this, but it shouldn't be dominant. Well, Martin, maybe you?re comfortable with shifting goalposts, but we have used historically correct analysis to identify characters in the past and to continue with this precedent is consistent with good practice. > What's most important is (past and) *current use*. If the distinction is an orthographic one (e.g. different words being written with different shapes), then that's definitely a good indication for splitting. It *is* an orthographic one. For one thing, the 1859 glyphs look NOTHING LIKE the 1855 glyphs. > On the other hand, if fonts (before/outside Unicode) only include one variant at the time, if people read over the variant without much ado, if people would be surprised to find both corresponding variants in one and the same text (absent font variations), if there are examples where e.g. the variant is adjusted in quotes from texts that used the 'old' variant inside a text with the 'new' variants, and so on, then all these would be good indications that this is, for actual usage purposes, just a font difference, and should therefore best be handled as such. Um, yeah. Why have Unicode at all? I mean people in Georgia were happy with ASCII-based font hacks. Lots of people are still using them. Sure, people put up with the unification of Coptic and Greek. Just font differences. Yeah. > The closes to the current case that I was able to find was the German ?. It has roots in both an ss and an sz (to be precise, an ?s and an ?z) ligature (see https://en.wikipedia.org/wiki/?). And indeed in some fonts, its right part looks more like an s, and in other fonts more like a z (and in lower case, more often like an s, but in upper case, much more like a (cursive) Z). Nevertheless, there is only one character (or two if you count upper case) encoded, because anything else would be highly confusing to virtually all users. The situation of the Deseret diphthong letters isn?t anything like German ?. Yes, you can analyse it as something like ?s and ??, but THOSE LOOK VERY NEARLY ALIKE. Ignoring the stroke of SHORT I which is the same for all the Deseret letters being discussed, we have EW represented by ?? and ?? (which look nothing alike) and OI represented by ?? and ?? (which look nothing alike). A unification of these as ?glyph variants? is perverse and not consistent with the way we have encoded things in the past. > What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. Odd. That view doesn?t seem to be applicable to CJK unification. Michael From everson at evertype.com Thu Mar 23 08:48:37 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 13:48:37 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> On 23 Mar 2017, at 06:28, David Starner wrote: > > Does "?ussia" require a new Latin letter because the way R was written has a different origin than the normal R? > > But it doesn?t. It?s the Latin letter R turned backwards by a designer for a logo. We wouldn?t encode that, because it?s a logo. > > What logo? Oh, sorry. ?Toys ? Us? which is what I saw when I saw your ??ussia?. > I honestly don't know what logo you're talking about, but a quick Google search confirms it's used outside of a logo. I was thinking of http://www.sjgames.com/gurps/books/Russia/img/cover_lg.jpg which actually doesn't use the reversed R, but uses other Cyrillic characters. Decorative display type and font play on book covers is a very different thing from the development of the Deseret alphabet we are discussing here. >> We don?t encode diphthongs. We encode the elements of writing systems. The ?idea? here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ?? (1859 OI). > > If they're ligatures, they should be encoded as ligatures; if they're indivisible characters, then their glyph forms are of less interest. We don?t encode ligatures. We encode letters which are historically derived from ligation. That?s what the existing EW and OI are, and that?s what the 1859 revised letters were. >> Those ligatures are not glyph variants of one another. You might as well say that ? and ? are glyph variants of one another. > > ? and ? have contrasting use; they're used in the same text in distinct ways. That happens to be the case, but the analogy has to do with the origin of the ligatures. > Note that n and v? are considered glyph variants of each other, because v? is used in Sutterlin in exactly the places that n is used in typewritten versions of the text. It?s n and ? in S?tterlin, not n and v?. > ? is not ? even when they are printed in fonts that make it nearly impossible to tell them apart. It has nothing to do with the glyphs or how those glyphs were created, it's because they're used in different ways. It was an analogy about the structural development of the ligated letters. > The example of Sutterlin strikes me as quite relevant here; characters get all sorts of weird shapes in handwriting. Sometimes they end up immortalized in printing, and then they usually get encoded. Usually not. Again: The source of 1855 EW and OI uses *different* letters than the 1859 EW and OI do. This wasn?t accidental. It?s not hard to puzzle out or to see. This isn?t random or even systematic natural development of handwriting styles. It was a principled revision done on the basis of phonetic analysis. English diphthongs EW and OI were first represented by ligatures representing [?u?] and [??], and then later by ligatures representing [??] and [???]. Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. Michael Everson From prosfilaes at gmail.com Thu Mar 23 17:03:02 2017 From: prosfilaes at gmail.com (David Starner) Date: Thu, 23 Mar 2017 22:03:02 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Thu, Mar 23, 2017 at 6:54 AM Michael Everson wrote: > Again: The source of 1855 EW and OI uses *different* letters than the 1859 > EW and OI do. This wasn?t accidental. It?s not hard to puzzle out or to > see. This isn?t random or even systematic natural development of > handwriting styles. It was a principled revision done on the basis of > phonetic analysis. English diphthongs EW and OI were first represented by > ligatures representing [?u?] and [??], and then later by ligatures > representing [??] and [???]. > Sutterlin was created by Ludwig S?tterlin in 1915. There's lots of principled revision going on all the time in the world's scripts that doesn't get recorded by Unicode, and this goes double for young constructed scripts, where people are playing around with them. > Indeed I would say to John Jenkins and Ken Beesley that the richness of > the history of the Deseret alphabet would be impoverished by treating the > 1859 letters as identical to the 1855 letters. > And yet the richness of the history of the Latin alphabet is not impoverished by treating https://commons.wikimedia.org/wiki/File:I_littera_in_manuscripto.jpg (a monocase Latin cursive) as identical to part of the modern Latin-script alphabet, which besides casing, has split the i/j and u/v on the basis of phonetic analysis? -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Fri Mar 24 06:34:41 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 24 Mar 2017 20:34:41 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 2017/03/23 22:48, Michael Everson wrote: > Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). It may be that actual users of Deseret read these character variants the same way most of us would read serif vs. sans-serif variants: I.e. unless we are designers or typographers, we don't actually consciously notice the difference. If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none. The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. Regards, Martin. From duerst at it.aoyama.ac.jp Fri Mar 24 06:41:14 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 24 Mar 2017 20:41:14 +0900 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <540B0347-175E-4B73-9420-2E6410202995@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> <540B0347-175E-4B73-9420-2E6410202995@evertype.com> Message-ID: On 2017/03/23 22:32, Michael Everson wrote: >> What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. > > Odd. That view doesn?t seem to be applicable to CJK unification. Well, it may not seem to you, but actually it is. I have had a lot of discussions with Japanese and others about Han unification (mostly in the '90ies), and have studied the history and principles of Han unification in quite some detail. To summarize it, Han unification unifies very much exactly those cases where an average user, in average texts, would consider two forms "the same" (i.e. exchangeable). Exceptions are due to the round trip rule. It also separates very much exactly those cases where an average user, for average texts, may not consider two forms equivalent. If necessary, I can go into further details, but I would have to dig quite deeply for some of the sources. Regards, Martin. From everson at evertype.com Fri Mar 24 09:37:51 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 24 Mar 2017 14:37:51 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 24 Mar 2017, at 11:34, Martin J. D?rst wrote: > > On 2017/03/23 22:48, Michael Everson wrote: > >> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. > > Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). He is. He transcribes texts into Deseret. I?ve published three of them (Alice, Looking-Glass, and Snark). > It may be that actual users of Deseret read these character variants the same way most of us would read serif vs. sans-serif variants: I.e. unless we are designers or typographers, we don't actually consciously notice the difference. I am a designer and typographer, and I?ve worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. > If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none. Actually neither of the ligature-letters are used in our Carrollian Deseret volumes. > The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. You know, Martin, I *have* been doing this for the last two decades. I?m well aware of what a font is and can do. I?m also aware of what principles we have used for determining character identity. I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. Michael Everson From everson at evertype.com Fri Mar 24 11:11:53 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 24 Mar 2017 16:11:53 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 23 Mar 2017, at 22:03, David Starner wrote: > On Thu, Mar 23, 2017 at 6:54 AM Michael Everson wrote: >> Again: The source of 1855 EW and OI uses *different* letters than the 1859 EW and OI do. This wasn?t accidental. It?s not hard to puzzle out or to see. This isn?t random or even systematic natural development of handwriting styles. It was a principled revision done on the basis of phonetic analysis. English diphthongs EW and OI were first represented by ligatures representing [?u?] and [??], and then later by ligatures representing [??] and [???]. > > Sutterlin was created by Ludwig S?tterlin in 1915. There's lots of principled revision going on all the time in the world's scripts that doesn't get recorded by Unicode, and this goes double for young constructed scripts, where people are playing around with them. What?s your point? S?tterlin didn?t invent new letters. Both n and u look a lot alike, and so the latter was marked with a breve, but in the 15th-century Cornish manuscript I was working with at the British Library last week both n and u look a lot alike. This has nothing to do with the origin or identity of two sets of letters used for diphthongs in Deseret. >> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. > > And yet the richness of the history of the Latin alphabet is not impoverished by treating https://commons.wikimedia.org/wiki/File:I_littera_in_manuscripto.jpg (a monocase Latin cursive) as identical to part of the modern Latin-script alphabet, which besides casing, has split the i/j and u/v on the basis of phonetic analysis? Your question has, again, nothing to do with the matter in hand. While it is true that the shapes of the Latin letters in that manuscript differ from the shapes which we use today, their identity as letters (and their Old Italic and Phoenician forerunners) is not in question. Inscriptional Latin from that same period is still quite familiar to us. That i and j are distinguished in that handwritten text isn?t surprising. Centuries later in Europe the j graph was extremely common in numbers (as in xiij ?13?). It?s true that it wasn?t until 1524 that i and j were specifically distinguished *as* separate letters in Italy; this distinction was formally made in English in 1633. But this isn?t analogous to the ligature-based letters used for diphthongs in Deseret. And we *can* distinguish i and j in that Latin text, because we have separate characters encoded for it. And we *have* encoded many other Latin ligature-based letters and sigla of various kinds for the representation of medieval European texts. Indeed, that?s just a stronger argument for distinguishing the ligature-based letters for Deseret, I think. Michael Everson From verdy_p at wanadoo.fr Fri Mar 24 12:31:04 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Mar 2017 18:31:04 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: 2017-03-24 17:11 GMT+01:00 Michael Everson : > On 23 Mar 2017, at 22:03, David Starner wrote: > > On Thu, Mar 23, 2017 at 6:54 AM Michael Everson > wrote: > >> Again: The source of 1855 EW and OI uses *different* letters than the > 1859 EW and OI do. This wasn?t accidental. It?s not hard to puzzle out or > to see. This isn?t random or even systematic natural development of > handwriting styles. It was a principled revision done on the basis of > phonetic analysis. English diphthongs EW and OI were first represented by > ligatures representing [?u?] and [??], and then later by ligatures > representing [??] and [???]. > > > > Sutterlin was created by Ludwig S?tterlin in 1915. There's lots of > principled revision going on all the time in the world's scripts that > doesn't get recorded by Unicode, and this goes double for young constructed > scripts, where people are playing around with them. > > What?s your point? S?tterlin didn?t invent new letters. Both n and u look > a lot alike, and so the latter was marked with a breve, but in the > 15th-century Cornish manuscript I was working with at the British Library > last week both n and u look a lot alike. This has nothing to do with the > origin or identity of two sets of letters used for diphthongs in Deseret. > There's a counter example of precedent for the German umlaut which was unfortunately unified with the diaeresis, even if its origin (and still its current semantic) is that of a combining letter e and where it does not play the phonetic role of a diaresis (i.e. the separation of two vowels to avoid creating digrams for a single phonem represented by pairs of letters). So "?" in German is cognate to the "ae" digram, similar to the "ai" digram used in French (or to the "?" ligature used other languages, sometimes as a distinct letter of their basic alphabet), it contains no phonetic diaeresis as there's a single phonem, and no diphtong (like "a?" in French where this is a true diaeresis to break the interpretation as the digram "ai"). Same remark for "?" in German cognate to the digram "oe" (or the ligatured letter "?" in other languages or the variant "?" in Nordic languages), and "?" cognate to "ue". But Unicode just prefered to keep the roundtrip compatiblity with earlier 8-bit encodings (including existing ISO 8859 and DIN standards) so that "?" in German and French also have the same canonical decomposition even if the diacritic is a diaeresis in French and an umlaut in German, with different semantics and origins. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Mar 24 13:33:44 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 24 Mar 2017 11:33:44 -0700 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert =?UTF-8?Q?alphabet=3F=29?= Message-ID: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Philippe Verdy wrote: > But Unicode just prefered to keep the roundtrip compatiblity with > earlier 8-bit encodings (including existing ISO 8859 and DIN > standards) so that "?" in German and French also have the same > canonical decomposition even if the diacritic is a diaeresis in French > and an umlaut in German, with different semantics and origins. Was this only about compatibility, or perhaps also that the two signs look identical and that disunifying them would have caused endless confusion and misuse among users? -- Doug Ewell | Thornton, CO, US | ewellic.org From haberg-1 at telia.com Fri Mar 24 14:23:52 2017 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Fri, 24 Mar 2017 20:23:52 +0100 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: <9C0619FD-9DB9-43C0-AFC7-74564446AC03@telia.com> > On 24 Mar 2017, at 19:33, Doug Ewell wrote: > > Philippe Verdy wrote: > >> But Unicode just prefered to keep the roundtrip compatiblity with >> earlier 8-bit encodings (including existing ISO 8859 and DIN >> standards) so that "?" in German and French also have the same >> canonical decomposition even if the diacritic is a diaeresis in French >> and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? The Swedish letters ??? are simplified ligatures, and not diacritic marks. For ??, in handwritten script style, a tilde, the same as Spanish ?, which is also a simplified ligature. From verdy_p at wanadoo.fr Fri Mar 24 14:34:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Mar 2017 20:34:53 +0100 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: Given the history of characters and the initial desire to be forward compatible with previous ISO standards, I am convinced that there was no other choice than preserving the unification, otherwise it would have been impossible to reliably remap the zillions documents and databases or applications that were using ISO8859, and other related Windows, MacOS and IBM codepages for OEMs or for EBCDIC. And with the developement of Internet and the disire in both Unicode and ISO 10646 to leave the first page of code points in the UCS and ISO8859-1 fully compatible code for code (and the fact that there was no variant of ISO8859-1 standardized for Germany, Switzerland, Austria, Belgium and Luxembourg, that did not request it (causing nightmares notably in the last three countries, and a lot of legacy softwares on Windows and MacOS needing such bijective mapping; finally the Unicode Consortium initially was developed separately from the IUSO standard and merged later, and at that time, Microsofot and IBM were the most active members and did not want to introduce incompatibilities and causing troubles for other vendors). Later there was a clear statement to keep the basic character properties, stable, and it became impossisble to change the canonical equivalences (after the bad experience found when mlerging efforts between Unicode and ISO notably for enconding Hangul, and a strong initial resistance by China that wanted to develop its own GB standard). Encoding stability is now a rule that will be extremely hard to break. Note: umlauts and diaeresis have not always looked the same, confusion started lately between both during the middle of the 20th century and the starting development of computing. It would have been impossible to reach a large adoption of the UCS without such compromizes (and it took additional years after both projects joined their efforts, before ISO finally closed its working group on legacy 8-bit character sets, and stopped accepting any new variants; ISO 8859-15 was one of the last failed attempt to standardize a new 8-bit encoding, that finally almost nobody really used as they no longer needed it; China resigned as well and finalized the roundtrip mapping of its GB 18030 competing encoding with the UCS, so mappings for GB 18030 no longer needs new updates: any new encoding in the UCS is immediately encoded as well in GB without modifying any line of code or data, and any software or document compatiblle with the UCS should be imediately compatible with the GB 18030 standard required in PR China; I don't know if Hong Kong authorities made the same statement for its HKCS standard before it reunified with China, or if Taiwan made a similar decision; however Japan is adding new encodings in its JIS standard, pushed by national vendors, and the UCS still has delays for accepting these additions and not all is accepted, but in this area, there's a local subcommity constantly negociating with Asian vendors and reporting its efforts to Unicode and ISO). About umlauts and diaeresis I'm not sure they were always looking the same. If we try to encode old German, Hungarian or Czech texts, we may find some discrepencies or ambiguities (but there's still no mechanism to distinguish when an umlaut is really desired and a diaeresis is destired instead if they don't look the same in historic script variants). We cannot encode these using "variants" but possibly we may be using some combining controls such as CGJ (encoded after the precombined letter or after the base letter+diaresis, because of canonical equivalences it cannot be in the middle). Or may be, only for historic texts, we could add a combining lowercase e as an alternative to the existing diaeresis. 2017-03-24 19:33 GMT+01:00 Doug Ewell : > Philippe Verdy wrote: > > > But Unicode just prefered to keep the roundtrip compatiblity with > > earlier 8-bit encodings (including existing ISO 8859 and DIN > > standards) so that "?" in German and French also have the same > > canonical decomposition even if the diacritic is a diaeresis in French > > and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Mar 25 09:09:10 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 25 Mar 2017 14:09:10 +0000 Subject: Status of Thai Angkhandiao Message-ID: <20170325140910.35f01687@JRWUBU2> Thai has two identical or very similar punctuation-like characters, 'paiyan noi' (?????????), definitely encoded as ? U+0E2F THAI CHARACTER PAIYANNOI, and 'angkhan diao' (often transliterated 'angkhandeaw') (????????????). Paiyan noi is an abbreviation mark, historically the same in name as ? U+17D8 KHMER SIGN BEYYAL, which however corresponds in form and meaning to the Thai sequence 'paiyan yai' - ???. Angkhandiao is historically a single danda, contrasting with the double danda U+0E5A THAI CHARACTER ANGKHANKHU. (They are both very little used in modern Thai.) One piece of evidence that paiyannoi and angkhandiao are two separate characters is that ISO 11940 uses different glyphs for them and prescribes different transliterations for them: ? U+01C0 LATIN LETTER DENTAL CLICK for angkhandiao ? U+01C1 LATIN LETTER LATERAL CLICK for U+0E5A THAI CHARACTER ANGKHANKHU ? U+01C2 LATIN LETTER ALVEOLAR CLICK for U+0E2F THAI CHARACTER PAIYANNOI (I would have said that U+0964 DEVANAGARI DANDA and U+0965 DEVANAGARI DOUBLE DANDA would have been better for the first two, but these are declared (Script_Extensions property) not to be used as part of the Latin script, though I thought they were used for Sanskrit.) Has Unicode ever ruled on whether U+0E2F includes angkhandiao? Richard. From prosfilaes at gmail.com Sat Mar 25 17:15:28 2017 From: prosfilaes at gmail.com (David Starner) Date: Sat, 25 Mar 2017 22:15:28 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Fri, Mar 24, 2017 at 9:17 AM Michael Everson wrote: > And we *can* distinguish i and j in that Latin text, because we have > separate characters encoded for it. And we *have* encoded many other Latin > ligature-based letters and sigla of various kinds for the representation of > medieval European texts. Indeed, that?s just a stronger argument for > distinguishing the ligature-based letters for Deseret, I think. > And I'd argue that a good theoretical model of the Latin script makes ?, ? and a? the same character, distinguished only by the font. This is complicated by combining characters mostly identified by glyph, and the fact that while ? and a? may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Mar 25 21:24:18 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 26 Mar 2017 04:24:18 +0200 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: 2017-03-25 23:15 GMT+01:00 David Starner : > On Fri, Mar 24, 2017 at 9:17 AM Michael Everson > wrote: > >> And we *can* distinguish i and j in that Latin text, because we have >> separate characters encoded for it. And we *have* encoded many other Latin >> ligature-based letters and sigla of various kinds for the representation of >> medieval European texts. Indeed, that?s just a stronger argument for >> distinguishing the ligature-based letters for Deseret, I think. >> > > And I'd argue that a good theoretical model of the Latin script makes ?, ? > and a? the same character, distinguished only by the font. This is > complicated by combining characters mostly identified by glyph, and the > fact that while ? and a? may be the same character across time, there are > people wanting to distinguish them in the same text today, and in both > cases the theoretical falls to the practical. In this case, there are no > combining character issues and there's nobody needing to use the two forms > in the same text. > Thats a good point: any disunification requires showing examples of contrasting uses. Now depending on individual publications, authors would use one character or the other according to their choice, and the encoding will respect it. If we need further unification for matching texts in the samer language across periods of time or authors, collation (UCA) can provide help: this is already what it does in modern German with the digram "ae" and the letter "?" which are orthographic variants not distinguished by the language but by authors' preference. -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Sun Mar 26 03:12:27 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Sun, 26 Mar 2017 17:12:27 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> On 2017/03/26 11:24, Philippe Verdy wrote: > Thats a good point: any disunification requires showing examples of > contrasting uses. Fully agreed. We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > Now depending on individual publications, authors would > use one character or the other according to their choice, and the encoding > will respect it. If we need further unification for matching texts in the > samer language across periods of time or authors, collation (UCA) can > provide help: this is already what it does in modern German with the digram > "ae" and the letter "?" which are orthographic variants not distinguished > by the language but by authors' preference. Well, in most cases, but not e.g. for names. Goethe is not spelled G?the. Regards, Martin. From wl at gnu.org Sun Mar 26 03:17:48 2017 From: wl at gnu.org (Werner LEMBERG) Date: Sun, 26 Mar 2017 10:17:48 +0200 (CEST) Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <20170326.101748.844147132286739377.wl@gnu.org> > Well, in most cases, but not e.g. for names. Goethe is not spelled > G?the. Have a look into `Grimmsches W?rterbuch' to see the opposite :-) Werner From eik at iki.fi Sun Mar 26 04:07:15 2017 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 26 Mar 2017 12:07:15 +0300 Subject: VS: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <000301d2a610$5cf1e550$16d5aff0$@fi> I tend to agree with Martin, Philippe and others in questioning the disunification. Sincerely, Erkki I. Kolehmainen -----Alkuper?inen viesti----- L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Martin J. D?rst L?hetetty: 26. maaliskuuta 2017 11:12 Vastaanottaja: verdy_p at wanadoo.fr; David Starner Kopio: Michael Everson; unicode Unicode Discussion Aihe: Re: Standaridized variation sequences for the Desert alphabet? On 2017/03/26 11:24, Philippe Verdy wrote: > Thats a good point: any disunification requires showing examples of > contrasting uses. Fully agreed. We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > Now depending on individual publications, authors would use one > character or the other according to their choice, and the encoding > will respect it. If we need further unification for matching texts in > the samer language across periods of time or authors, collation (UCA) > can provide help: this is already what it does in modern German with > the digram "ae" and the letter "?" which are orthographic variants not > distinguished by the language but by authors' preference. Well, in most cases, but not e.g. for names. Goethe is not spelled G?the. Regards, Martin. From duerst at it.aoyama.ac.jp Sun Mar 26 04:37:49 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Sun, 26 Mar 2017 18:37:49 +0900 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: On 2017/03/25 03:33, Doug Ewell wrote: > Philippe Verdy wrote: > >> But Unicode just prefered to keep the roundtrip compatiblity with >> earlier 8-bit encodings (including existing ISO 8859 and DIN >> standards) so that "?" in German and French also have the same >> canonical decomposition even if the diacritic is a diaeresis in French >> and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? I'm not sure to what extent this was explicitly discussed when Unicode was created. The fact that the first 256 code points are identical to those in ISO-8859-1 was used as a big selling point when Unicode was first introduced. It may well have been that for Unicode, there was no discussion at all in this area, because ISO-8859-1 was already so well established. And for ISO-8859-1, space was an important concern. Ideally, both Islandic and Turkish (and the letters missed for French) would have been covered, but that wasn't possible. Disunifying diaeresis and umlaut would have been an unaffordable luxury. The above reasons mask any inherent reasons for why diaeresis and umlaut would have been unified or not if the decision had been argued purely "on the merit". But having used both German and French, and e.g. looking at the situation in Switzerland, where it was important to be able to write both French and German on the same typewriter, I would definitely argue that disunifying them would have caused endless confusion and errors among users. Also, it was argued a few mails ago that diaeresis and umlaut don't look exactly the same. I remember well that when Apple introduced its first laser printers, there were widespread complaints that the fonts (was it Helvetica, Times Roman, and Palatino?) unified away the traditional differences in the cuts of these typefaces for different languages. So to quite some extent, in the relevant period (i.e. 1970ies/80ies), the differences between diaeresis and umlaut may be due to design differences in the cuts for different languages (e.g. French and German). Nobody would have disunified some basic letters because they may have looked slightly different in cuts for different languages, and so people may also have been just fine with unifying diaeresis and umlaut. (German fonts e.g. may have contained a '?' for use e.g. with "Citro?n", but the dots on that '?' will have been the same shape as '?', '?', and '?' umlauts for design consistency, and the other way round for French). Regards, Martin. From everson at evertype.com Sun Mar 26 08:06:41 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:06:41 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 25 Mar 2017, at 22:15, David Starner wrote: > > And I'd argue that a good theoretical model of the Latin script makes ?, ? and a? the same character, distinguished only by the font. Fortunately for the users of our standard, we don?t do this. > This is complicated by combining characters mostly identified by glyph, and the fact that while ? and a? may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. I?m fairly sure that a person citing a medieval document using a? may very well also need to write this alongside Swedish or German using ?. Michael Everson From everson at evertype.com Sun Mar 26 08:15:03 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:15:03 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> > On 26 Mar 2017, at 09:12, Martin J. D?rst wrote: > >> Thats a good point: any disunification requires showing examples of >> contrasting uses. > > Fully agreed. The default position is NOT ?everything is encoded unified until disunified?. The characters in question have different and undisputed origins, undisputed. We?ve encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) > We haven't yet heard of any contrasting uses for the letter shapes we are discussing. Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. Please try to remember that. (It?s a bit shocking to have to remind people of this. Michael Everson From everson at evertype.com Sun Mar 26 08:18:37 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:18:37 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <000301d2a610$5cf1e550$16d5aff0$@fi> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: On 26 Mar 2017, at 10:07, Erkki I Kolehmainen wrote: > > I tend to agree with Martin, Philippe and others in questioning the disunification. You may, but you give no evidence or discussion about it, so... In any case it?s not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. The origin of all of the characters as ligatures of other characters isn?t questioned. The right thing to do is to add the missing characters, not to invalidate any font that uses the 1855 characters by claiming that the 1855 and 1859 characters are ?the same?. Michael Everson From prosfilaes at gmail.com Sun Mar 26 08:32:07 2017 From: prosfilaes at gmail.com (David Starner) Date: Sun, 26 Mar 2017 13:32:07 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Sun, Mar 26, 2017 at 6:12 AM Michael Everson wrote: > On 25 Mar 2017, at 22:15, David Starner wrote: > > > > And I'd argue that a good theoretical model of the Latin script makes ?, > ? and a? the same character, distinguished only by the font. > > Fortunately for the users of our standard, we don?t do this. > You've yet to come up with users to whom these Deseret letters are relevant. I?m fairly sure that a person citing a medieval document using a? may very > well also need to write this alongside Swedish or German using ?. > I'm fairly sure that a person citing an early 20th century Germany document may well feel the need to cite it in Fraktur. In both cases, I believe that's going above and beyond the identity of the characters involved, but in your case, people do contrast the a? with ?, and the user case has been made. Show me the users who want to use these Deseret letters contrastingly. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Mar 26 08:37:11 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:37:11 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <01091990-405C-46B6-B5F6-F89CD42BA820@evertype.com> On 26 Mar 2017, at 14:32, David Starner wrote: >>> And I'd argue that a good theoretical model of the Latin script makes ?, ? and a? the same character, distinguished only by the font. >> >> Fortunately for the users of our standard, we don?t do this. > > You've yet to come up with users to whom these Deseret letters are relevant. You might imagine it takes time to identify problems and address them. >> I?m fairly sure that a person citing a medieval document using a? may very well also need to write this alongside Swedish or German using ?. > > I'm fairly sure that a person citing an early 20th century Germany document may well feel the need to cite it in Fraktur. Fraktur is a whole-font substitition (modulo the ligatures). This is not the same thing as an editor choosing w or ?. Imagine if we had unified those two. After all, they both represent the same sound, right? (Shudder.) > In both cases, I believe that's going above and beyond the identity of the characters involved, but in your case, people do contrast the a? with ?, and the user case has been made. Show me the users who want to use these Deseret letters contrastingly. Do try to be less dismissive. Firstly, *I* have published entire books in Deseret and so I myself have a legitimate interest. In the second, Iam in fact beginning discussions with relevant experts. Michael Everson From asmusf at ix.netcom.com Sun Mar 26 10:45:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 08:45:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Mar 26 10:47:51 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 16:47:51 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> Message-ID: <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > > The latter is patent nonsense, because ? and a? are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. He refers to the shape of the diacritical marks. Michael Everson From asmusf at ix.netcom.com Sun Mar 26 10:59:42 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 08:59:42 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> Message-ID: <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> On 3/26/2017 8:47 AM, Michael Everson wrote: >> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >> >> The latter is patent nonsense, because ? and a? are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. > He refers to the shape of the diacritical marks. I see the issue: the font selected on my end made the "e" look like an "o", which completely changed my understanding of what he tried to communicate. A./ > > Michael Everson > From asmusf at ix.netcom.com Sun Mar 26 11:02:22 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 09:02:22 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: On 3/26/2017 6:18 AM, Michael Everson wrote: > On 26 Mar 2017, at 10:07, Erkki I Kolehmainen wrote: >> I tend to agree with Martin, Philippe and others in questioning the disunification. > You may, but you give no evidence or discussion about it, so... > > In any case it?s not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. Calling them "characters" is pre-judging the issue, don't you think? We know that these are different shapes, but that they stand for the same text elements. A./ > The origin of all of the characters as ligatures of other characters isn?t questioned. The right thing to do is to add the missing characters, not to invalidate any font that uses the 1855 characters by claiming that the 1855 and 1859 characters are ?the same?. > > Michael Everson > From everson at evertype.com Sun Mar 26 11:20:04 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:20:04 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> Message-ID: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > > The priority in encoding has to be with allowing distinctions in modern texts, or distinctions that matter to modern users of historic writing systems. Beyond that, theoretical analysis of typographical evolution can give some interesting insight, but I would be in the camp that does not accord them a status as primary rationale for encoding decisions. Our rationales are NOT ranked in the way you suggest. A variety of criteria are applied. > Thus, critical need for contrasting use of the glyph distinctions would have to be established before it makes sense to discuss this further. Precedent for such needs is well-established. Consider the Latin Extended-D block. Sometimes it is editorial preference, and that?s not even always universal. > I see no principled objection to having a font choice result in a noticeable or structural glyph variation for only a few elements of an alphabet. We have handle-a vs. bowl-a as well as hook-g vs. loop-g in Latin, and fonts routinely select one or the other. Well, Asmus, we encode a and ? as well as g and ? and ?. And we do not consider ? and ? and ? to be things that ought to be distinguished by variation selectors. (I am of course well aware of IPA usage.) Whole-font switching is well understood. But character origin has always been taken into account. Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) > (It is only for usage outside normal text that the distinction between these forms matters). What?s ?normal? text? ?Normal? text in Latin probably doesn?t use the characters from the Latin Extended-D block. > While the Deseret forms are motivated by their pronunciation, I'm not necessarily convinced that the distinction has any practical significance that is in any way different than similar differences in derivation (e.g. for long s-s or long-s-z for German esszett). One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. > This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /ju?/ or /??/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. >> This is complicated by combining characters mostly identified by glyph, and the fact that while ? and a? may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. > > huh? He?s wrong there, as I pointed out. A text in German may write an older Clavieru?bung in a citation alongside the normal spelling Klavier?bung. The choice of spelling is key. Michael Everson From everson at evertype.com Sun Mar 26 11:20:42 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:20:42 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> Message-ID: <3D06831A-7C66-44F3-8113-C8E2612B775F@evertype.com> > On 26 Mar 2017, at 16:59, Asmus Freytag wrote: > > On 3/26/2017 8:47 AM, Michael Everson wrote: >>> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >>> >>> The latter is patent nonsense, because ? and a? are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. >> He refers to the shape of the diacritical marks. > > I see the issue: the font selected on my end made the "e" look like an "o", which completely changed my understanding of what he tried to communicate. Ah, yes. M From everson at evertype.com Sun Mar 26 11:23:06 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:23:06 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> On 26 Mar 2017, at 17:02, Asmus Freytag wrote: > > On 3/26/2017 6:18 AM, Michael Everson wrote: > >> In any case it?s not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. > > Calling them "characters" is pre-judging the issue, don't you think? No, I don?t think so. > We know that these are different shapes, but that they stand for the same text elements. No, they don?t. Those diphthongs can also be represented in other ways in Deseret. I?ve never accepted the view that ?everything is already encoded and everything new is a disunification? which seems to be a pretty common view. Michael Everson From doug at ewellic.org Sun Mar 26 12:19:08 2017 From: doug at ewellic.org (Doug Ewell) Date: Sun, 26 Mar 2017 11:19:08 -0600 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: Philippe Verdy wrote: > Or may be, only for historic texts, we could add a combining lowercase > e as an alternative to the existing diaeresis. Something like U+0364 COMBINING LATIN SMALL LETTER E, maybe? -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Sun Mar 26 12:20:27 2017 From: doug at ewellic.org (Doug Ewell) Date: Sun, 26 Mar 2017 11:20:27 -0600 Subject: Standaridized variation sequences for the Desert alphabet? Message-ID: Michael Everson wrote: > One practical consequence of changing the chart glyphs now, for > instance, would be that it would invalidate every existing Deseret > font. Adding new characters would not. I thought the chart glyphs were not normative. -- Doug Ewell | Thornton, CO, US | ewellic.org From everson at evertype.com Sun Mar 26 12:33:00 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 18:33:00 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: Message-ID: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> On 26 Mar 2017, at 18:20, Doug Ewell wrote: > > Michael Everson wrote: > >> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > > I thought the chart glyphs were not normative. Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature instead? Michael. From asmusf at ix.netcom.com Sun Mar 26 15:39:38 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 13:39:38 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> Message-ID: <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> On 3/26/2017 10:33 AM, Michael Everson wrote: > On 26 Mar 2017, at 18:20, Doug Ewell wrote: >> Michael Everson wrote: >> >>> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. >> I thought the chart glyphs were not normative. > Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature instead? If there was a tradition of writing W like omega, then switching the chart glyphs to that alternative tradition would be something that is at least not inconceivable -- even if perhaps not advisable. For letters, their primary identity is not given by their shape, but their position / function in the alphabet. That's why making Gaelic style and Fraktur a font switch works at all, even if that is not perfect (viz, ligatures in Fraktur). In the Deseret case, making this alternation a font choice would tend to preserve the content of all documents. Making this an encoding difference would indeed invalidate some documents. Finally, if this was in major, modern use, adding these code points would have grave consequences for security. A./ From richard.wordingham at ntlworld.com Sun Mar 26 15:48:15 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 26 Mar 2017 21:48:15 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> Message-ID: <20170326214815.5bd7eadb@JRWUBU2> On Sun, 26 Mar 2017 18:33:00 +0100 Michael Everson wrote: > On 26 Mar 2017, at 18:20, Doug Ewell wrote: > > Michael Everson wrote: > >> One practical consequence of changing the chart glyphs now, for > >> instance, would be that it would invalidate every existing Deseret > >> font. Adding new characters would not. > > I thought the chart glyphs were not normative. > Come on, Doug. The letter W is a ligature of V and V. But sure, the > glyphs are only informative, so why don?t we use an OO ligature > instead? A script-stlye font might legitimately use a glyph that looks like a small omega for U+0077 LATIN SMALL LETTER W. Small omega, of course, is an ?? ligature. More to the point, a font may legitimately use the same glyphs for U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G. A more serious issue is the multiple forms of U+014A LATIN CAPITAL LETTER ENG, for which the underlying unity comes from their being the capital form of U+014B LATIN SMALL LETTER ENG. Are there not serious divergences with the shapes of the Syriac letters? Richard. From everson at evertype.com Sun Mar 26 15:51:43 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 21:51:43 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> Message-ID: <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> On 26 Mar 2017, at 21:39, Asmus Freytag wrote: >> Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature instead? > > If there was a tradition of writing W like omega, then switching the chart glyphs to that alternative tradition would be something that is at least not inconceivable -- even if perhaps not advisable. You know, Asmus, no analogy is perfect. But mine was a discussion of letters derived from ligatures, and yours is just a random note about shape. > For letters, their primary identity is not given by their shape, but their position / function in the alphabet. This isn?t really something you can turn into an axiom, much as you would like to. Position in the alphabet may very WIDELY from language to language. As can function. The Latin letter c can mean /k s t? ts ? ? ?/? > That's why making Gaelic style and Fraktur a font switch works at all, even if that is not perfect (viz, ligatures in Fraktur). Font style isn?t the same thing in this context. The historical letters used to make the 1855 ligatures are *different* letters than those used for the 1859 ligatures. > In the Deseret case, making this alternation a font choice would tend to preserve the content of all documents. No, since it?s a question of *spelling*. Some documents use a ligature-letter for the diphthong /ju?/. Some documents use two separate letters for the same diphthong. So there?s no ?standardized? spelling that works for all text that would be affected here. (Spelling for English wasn?t standardized anyway in historical Deseret texts and there is much variety.) > Making this an encoding difference would indeed invalidate some documents. Right now the 1859 characters aren?t representable. Deciding to change the chart glyphs to 1859 glyphs would just destabilize EVERY current Deseret font. That?s not something we should do. > Finally, if this was in major, modern use, adding these code points would have grave consequences for security. Why? They?re not visually similar to the existing characters. So spoofing wouldn?t be an issue. Michael Everson From everson at evertype.com Sun Mar 26 15:56:14 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 21:56:14 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <20170326214815.5bd7eadb@JRWUBU2> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <20170326214815.5bd7eadb@JRWUBU2> Message-ID: On 26 Mar 2017, at 21:48, Richard Wordingham wrote: >> Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature= instead? > > A script-stlye font might legitimately use a glyph that looks like a small omega for U+0077 LATIN SMALL LETTER W. As I said to Asmus, my analogy was about ligatures made from underlying letters. Yours doesn?t apply because it?s just talking about glyph shapes. > Small omega, of course, is an ?? ligature. True. :-) Isn?t history wonderful? > More to the point, a font may legitimately use the same glyphs for U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G. A good font will still find a way to distinguish them. :-) > A more serious issue is the multiple forms of U+014A LATIN CAPITAL LETTER ENG, for which the underlying unity comes from their being the capital form of U+014B LATIN SMALL LETTER ENG. We could have, and should have, solved this problem *long ago* by encoding LATIN CAPITAL LETTER AFRICAN ENG and LATIN SMALL LETTER AFRICAN ENG. > Are there not serious divergences with the shapes of the Syriac letters? That is analogous to Roman/Gaelic/Fraktur. That analogy doesn?t apply to these Deseret characters; it?s not a whole-script gestalt. Michael Everson From asmusf at ix.netcom.com Sun Mar 26 16:16:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:16:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> Message-ID: <5bc936f5-7513-8258-8709-26ee6c41d7ea@ix.netcom.com> On 3/26/2017 9:20 AM, Michael Everson wrote: > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >> The priority in encoding has to be with allowing distinctions in modern texts, or distinctions that matter to modern users of historic writing systems. Beyond that, theoretical analysis of typographical evolution can give some interesting insight, but I would be in the camp that does not accord them a status as primary rationale for encoding decisions. > Our rationales are NOT ranked in the way you suggest. A variety of criteria are applied. And the way you weigh the criteria? > >> Thus, critical need for contrasting use of the glyph distinctions would have to be established before it makes sense to discuss this further. > Precedent for such needs is well-established. Consider the Latin Extended-D block. Sometimes it is editorial preference, and that?s not even always universal. I think the Latin Extended-D block may have its own problems. However, Latin as a script caters to so many varied levels of users, from ordinary text to scholarly notations that it really cannot be used to settle this issue. > >> I see no principled objection to having a font choice result in a noticeable or structural glyph variation for only a few elements of an alphabet. We have handle-a vs. bowl-a as well as hook-g vs. loop-g in Latin, and fonts routinely select one or the other. > Well, Asmus, we encode a and ? as well as g and ? and ?. And we do that for reasons that are very different from preserving the early and possibly transient history of a minor script. > And we do not consider ? and ? and ? to be things that ought to be distinguished by variation selectors. (I am of course well aware of IPA usage.) Yes, and the absence of such usage in the current example makes all the difference. > Whole-font switching is well understood. But character origin has always been taken into account. Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) Apparently not only in the standard, because they show as different in the plaintext view of this message. > >> (It is only for usage outside normal text that the distinction between these forms matters). > What?s ?normal? text? ?Normal? text in Latin probably doesn?t use the characters from the Latin Extended-D block. "ordinary" text, if you like, reflecting standard orthographies. As opposed to notational systems. > >> While the Deseret forms are motivated by their pronunciation, I'm not necessarily convinced that the distinction has any practical significance that is in any way different than similar differences in derivation (e.g. for long s-s or long-s-z for German esszett). > One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. No, if we state that both glyphs are alternates for the same character *and if we decide, to _not_ add variation selectors* the choice is where it belongs: with the font maker. > >> In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). > Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. If the underlying text element is the same, font switching can be the correct choice. > >> This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. > Character identity is not defined by any single criterion. Make it the "primary" criterion then. > Moreover, in Deseret, it is not the case that all texts which contain the diphthong /ju?/ or /??/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. Yes, and those other spellings are not affected. > >>> This is complicated by combining characters mostly identified by glyph, and the fact that while ? and a? may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. >> huh? > He?s wrong there, as I pointed out. A text in German may write an older Clavieru?bung in a citation alongside the normal spelling Klavier?bung. The choice of spelling is key. That would have to be a very specialized text. But to claim that this needs to be possible in German in plaintext for the case of such a quote is more than a stretch. If there is a critical need for such texts *as plain text* in Deseret, that would be a curious fact, but perhaps decisive. A./ From asmusf at ix.netcom.com Sun Mar 26 16:20:26 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:20:26 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> Message-ID: <3a9ea98e-6615-688e-3ce9-b6a41a34ebc6@ix.netcom.com> An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Mar 26 16:30:04 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:30:04 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: On 3/26/2017 9:23 AM, Michael Everson wrote: > On 26 Mar 2017, at 17:02, Asmus Freytag wrote: >> On 3/26/2017 6:18 AM, Michael Everson wrote: >> >>> In any case it?s not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. >> Calling them "characters" is pre-judging the issue, don't you think? > No, I don?t think so. I really think it is. > >> We know that these are different shapes, but that they stand for the same text elements. > No, they don?t. Those diphthongs can also be represented in other ways in Deseret. Having alternative ways to represent these doesn't invalidate or affect my argument. > > I?ve never accepted the view that ?everything is already encoded and everything new is a disunification? which seems to be a pretty common view. I would not say I aspire to the view you quote. If you encode a certain shape, it may get used for a range of text elements. This would (de facto) encode these text elements via that shape. If it is later felt that the given shape should not be used for the full range of text elements, then you could say that the "implicit" unification based on the usage (or, if you will, "fallback usage") was mistaken and should be better handled by two (or more) shapes. This represents a "de-facto" disunification. However, where I part from your description is the "everything is already encoded". That would not be the case anywhere a range of text elements cannot be represented at all. Your statement also implies a "correctly encoded" or "successfully encoded" which is different from "there's an encoding that some people use as a fallback", which, if disunification should prove proper later on, would be a better way of describing what was the original situation. Perhaps the point is subtle, but it is important. In the current case, you have the opposite, to wit, the text elements are unchanged, but you would like to add alternate code elements to represent what are, ultimately, the same text elements. That's not disunification, but dual encoding. A./ From jameskasskrv at gmail.com Sun Mar 26 23:58:51 2017 From: jameskasskrv at gmail.com (James Kass) Date: Sun, 26 Mar 2017 20:58:51 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: Asmus Freytag wrote, > In the current case, you have the opposite, > to wit, the text elements are unchanged, but > you would like to add alternate code elements > to represent what are, ultimately, the same > text elements. That's not disunification, but > dual encoding. If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word. Best regards, James Kass From duerst at it.aoyama.ac.jp Mon Mar 27 00:42:40 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 14:42:40 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> Message-ID: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> On 2017/03/26 22:15, Michael Everson wrote: > >> On 26 Mar 2017, at 09:12, Martin J. D?rst wrote: >> >>> Thats a good point: any disunification requires showing examples of >>> contrasting uses. >> >> Fully agreed. > > The default position is NOT ?everything is encoded unified until disunified?. Neither it's "everything is encoded separately unless it's unified". > The characters in question have different and undisputed origins, undisputed. If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post. > We?ve encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) Fine. I (and others) have also given quite a few analogies, none of them perfect, but most if not all of them pertinent. >> We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > > Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. Sorry, but where did I say that it's the only criterion? I don't think it's the only criterion. On the other hand, I also don't think that historical origin is or should be the only criterion. Unfortunately, much of what you wrote gave me the impression that you may think that historical origin is the only criterion, or a criterion that trumps all others. If you don't think so, it would be good if you could confirm this. If you think so, it would be good to know why. > Please try to remember that. (It?s a bit shocking to have to remind people of this. You don't have to remind me, at least. I have mentioned "usability for average users in average contexts" and "contrasting use" as criteria, and I have also in earlier mail acknowledged history as a (not the) criterion, and have mentioned legacy/roundtrip issues. I'm sure there are others. Regards, Martin. From duerst at it.aoyama.ac.jp Mon Mar 27 02:05:12 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 16:05:12 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> Message-ID: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> On 2017/03/27 01:20, Michael Everson wrote: > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs. Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the ? shape, with two horizontal bars), whereas the H/T/V columns show the ? shape (two downwards slanted bars) for the "MEAT" radical and the ? shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/... In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference. > One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. Independent of whether the chart glyphs get changed, couldn't we just add a note "also # in some fonts" (where # is the other variant). That would make sure that nobody could claim "this font is wrong" based on the charts. (Even if a general claim that the chart glyphs aren't normative applies to all charts anyway.) >> In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). > > Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly. Because in such cases, the focus may not be on a change to one or a few letter shapes, but the focus may be on a change of the overall style, which induces a change of letter shape in some letters. The roman/italic a/? and g/? distinctions (the later code points only used to show the distinction in plain text, which could as well be done descriptively), as well as a large number of distinctions in Han fonts, come to my mind. I'm quite sure other scripts have similar phenomena. >> This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. > > Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /ju?/ or /??/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. This is interesting information. You are saying that in actual practice, there is a choice between writing ???? (two letters for a diphthong) and writing ??. In the same location, is ???? (the base for the historically later shape variant of ??; please note that this may actually be written ????; there's some inconsistency in order between the above cited sentence and the text below copied from an earlier mail) also used as a spelling variant? Overall, we may have up to four variants, of which three are currently explicitly supported in Unicode. Are all of these used as spelling variants? Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)? And what informs this choice? If we have any historic metal types, are there examples where a font contains both ligature variants? (Please note that because ??, ??, and ?? are available as individual letters, it's very difficult to think about the two-letter sequences as anything else than spellings, but that doesn't necessarily carry over to the ligatures.) And then the same questions, with parallel (or not parallel) answers, for ??/??/??. Regards, Martin. Text copied from earlier mail by Michael: >>>> 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. That?s encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. >>> From jameskasskrv at gmail.com Mon Mar 27 03:04:44 2017 From: jameskasskrv at gmail.com (James Kass) Date: Mon, 27 Mar 2017 00:04:44 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: Martin J. D?rst responded to Michael Everson, > Overall, we may have up to four variants, of which > three are currently explicitly supported in Unicode. Yes. > Are all of these used as spelling variants? Is there another possible use? > Is the choice of variant up to the author (for which > variants), or is it the editor or printer who makes > the choice (for which variants)? The author, see below. > And what informs this choice? Personal preference and/or spelling reform as well as whether the material was machine printed or hand written. > If we have any historic metal types, are there > examples where a font contains both ligature > variants? Apparently not. John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. Best regards, James Kass From jameskasskrv at gmail.com Mon Mar 27 03:23:39 2017 From: jameskasskrv at gmail.com (James Kass) Date: Mon, 27 Mar 2017 00:23:39 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> Message-ID: Martin J. D?rst responded to Michael Everson, > Unfortunately, much of what you wrote gave me the > impression that you may think that historical origin > is the only criterion, or a criterion that trumps all > others. If you don't think so, it would be good if you > could confirm this. If you think so, it would be good > to know why. Historical origin is always a good starting point. The importance of history cannot be overstated. Without it, the other criteria would not exist. Historical origin wouldn't override evidence of contrasting use in this case because such evidence would be "icing on the cake". > ... I have mentioned "usability for average users in > average contexts" and "contrasting use" as criteria, > and I have also in earlier mail acknowledged history > as a (not the) criterion, and have mentioned legacy/ > roundtrip issues. I'm sure there are others. Adding a few historic letters should seldom have any effect on "usability for average users in average contexts". Whether it does in this case remains to be seen. Legacy and roundtrip issues are important because backwards-compatibility supports history. Concerns in this case appear to be hypothetical. Best regards, James Kass From duerst at it.aoyama.ac.jp Mon Mar 27 03:29:21 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 17:29:21 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> On 2017/03/24 23:37, Michael Everson wrote: > On 24 Mar 2017, at 11:34, Martin J. D?rst wrote: >> >> On 2017/03/23 22:48, Michael Everson wrote: >> >>> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. >> >> Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). > > He is. He transcribes texts into Deseret. I?ve published three of them (Alice, Looking-Glass, and Snark). Great to know. Given that, I'd assume that you'd take his input a bit more serious. Here's what he wrote: >>>> My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. >>>> >> It may be that actual users of Deseret read these character variants the same way most of us would read serif vs. sans-serif variants: I.e. unless we are designers or typographers, we don't actually consciously notice the difference. > > I am a designer and typographer, and I?ve worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. That's fine, and not disputed at all. That's exactly why I'm looking for input from other people. As an analogy, assume we had a famous type designer coming to this list and request that we encode old-style digits separately from roman digits, e.g. arguing that this might simplify the production of fonts. We would understand this request, but we would still deny it because based on our day-to-day use of digits, we would understand that at large (i.e. for the average user) the convenience of having only one code point for a given digit weights stronger than the convenience of separate code points for the type designer. We are looking for similar input from "average users" for Deseret. >> If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none. > > Actually neither of the ligature-letters are used in our Carrollian Deseret volumes. Ok. That means that these don't provide any information on the discussion at hand (whether to unify or disunify the ligature shapes). >> The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. > > You know, Martin, I *have* been doing this for the last two decades. I?m well aware of what a font is and can do. Great. So you know that present-day font technology would allow us to handle the different shapes in at least any of the following ways: 1) Separate characters for separate shapes, both shapes in same font 2) Variant selectors, one or both shapes in same font 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font 4) Font selection, different fonts for different shapes Does that knowledge in any way suggest one particular solution? > I?m also aware of what principles we have used for determining character identity. Which, as we have been working out in other mails, are indeed a collection of principles, one of which is history of shape derivation. > I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. One of the principles of CJK unification is that minor differences are ignored if they are not semantically relevant. For CJK, 'minor' is important, because otherwise, many users wouldn't be able to recognize the shapes as having the same semantics/usage. The qualification 'minor' is less important for an alphabet. In general, the more established and well-known an alphabet is, the wider the variations of glyph shapes that may be tolerated. The question I'm trying to get an answer for for Deseret is whether current actual script users see the shape variation as just substitutable glyphs of the same letter, or inherently different letters. The answer to this question is not the *only* criterion for deciding whether to encode further Deseret letters, but I think it's an important criterion. And the answer that John has given seems to point in a very clear direction for this question. Regards, Martin. From tfujiwar at redhat.com Mon Mar 27 04:00:21 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Mar 2017 18:00:21 +0900 Subject: different version of common/annotations/ja.xml Message-ID: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Hi, Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. E.g. ? | ?? | ???? instead of ??? | ???? | ???? I think the committed version is useful without input method and it follows other languages. Thanks, Fujiwara From jcb+unicode at inf.ed.ac.uk Mon Mar 27 04:14:17 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Mon, 27 Mar 2017 10:14:17 +0100 (BST) Subject: Standaridized variation sequences for the Desert alphabet? References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: While I hesitate to dive in to this argument, Martin makes one comment where I think a point of principle arises: On 2017-03-27, =?UTF-8?Q?Martin_J._D=c3=bcrst?= wrote: [Michael wrote] >> You know, Martin, I *have* been doing this for the last two decades. I?m well aware of what a font is and can do. > > Great. So you know that present-day font technology would allow us to > handle the different shapes in at least any of the following ways: > > 1) Separate characters for separate shapes, both shapes in same font > 2) Variant selectors, one or both shapes in same font > 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font > 4) Font selection, different fonts for different shapes > > Does that knowledge in any way suggest one particular solution? As I've observed before, the intention is that we are stuck with Unicode for as long as our civilization endures, be that 5000 years or 50 years. I contend, therefore, that no decision about Unicode should take into account any ephemeral considerations such as this year's electronic font technology, and that therefore it's not even useful to mention them. All you should need to say is "these letters are too insignificant to merit encoding, and those who believe they need to be able to distinguish them in plain text will just have to use other means, such as ZWJ with the components of the ligature". (I'm not saying that's my view, by the way - I'm more of a splitter than a lumper, and on the basis of this thread, I'm probably on the "encode" side.) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From mark at macchiato.com Mon Mar 27 04:48:25 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 27 Mar 2017 11:48:25 +0200 Subject: different version of common/annotations/ja.xml In-Reply-To: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Message-ID: By "committed strings", you mean the hiragana phonetic reading? Mark On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara wrote: > Hi, > > Do you have any chances to create a different version of ja.xml of the > Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > That file includes Hiragana only but I'd need another file which has the > committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it > follows other languages. > > Thanks, > Fujiwara > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Mon Mar 27 05:04:27 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Mar 2017 19:04:27 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Message-ID: <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> On 03/27/17 18:48, Mark Davis ??-san wrote: > By "committed strings", you mean the hiragana phonetic reading? Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. After users select one of the converted strings, the converted strings are committed on the text. I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. Fujiwara > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > > That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it follows other languages. > > Thanks, > Fujiwara > > From everson at evertype.com Mon Mar 27 06:39:56 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 12:39:56 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: On 27 Mar 2017, at 05:58, James Kass wrote: > > Asmus Freytag wrote, > >> In the current case, you have the opposite, to wit, the text elements are unchanged, but you would like to add alternate code elements >> to represent what are, ultimately, the same text elements. That's not disunification, but dual encoding. > > If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same > word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word. Asmus also changes the terms of the discussion by introducing the vague and undefined term ?text element?. Michael Everson From everson at evertype.com Mon Mar 27 07:07:19 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 13:07:19 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> Message-ID: <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> On 27 Mar 2017, at 06:42, Martin J. D?rst wrote: >> The default position is NOT ?everything is encoded unified until disunified?. > > Neither it's "everything is encoded separately unless it's unified?. These Deseret letters aren?t encoded. For my part I wasn?t made aware of them in 2004 when they were written about. My view is ?Ah, here?s something. is it encoded? No. Is it a glyph variant of something encoded? No." >> The characters in question have different and undisputed origins, undisputed. > > If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post. And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word ?character? when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it?s as thought nothing had ever been encoded before. >> We?ve encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) > > Fine. I (and others) have also given quite a few analogies, none of them perfect, but most if not all of them pertinent. The sharp s analogy wasn?t useful because whether ?s or ?z users can?t tell either and don?t care. No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ?s. And what Antiiqua fonts do, well, you get this: https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg And there?s nothing unrecognizable about the ?? (< ?? (= ?z)) ligature there. The situation in Deseret is different. Other analogies had to do with normal shape variation, not shapes derived from underlying ligatures. Analogies are never perfect but I don?t think the ones offered were pertinent. Underlying ligature difference is indicative of character identity. Particularly when two resulting ligatures are SO different from one another as to be unrecognizable. And that is the case with EW on the left and OI on the right here: https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg The lower two letterforms are in no way ?glyph variants? of the upper two letterforms. Apart from the stroke of the SHORT I ?? they share nothing in common ? because they come from different sources and are therefore different characters. >>> We haven't yet heard of any contrasting uses for the letter shapes we are discussing. >> >> Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. > > Sorry, but where did I say that it's the only criterion? I don't think it's the only criterion. On the other hand, I also don't think that historical origin is or should be the only criterion. Neither do I, but it has been a very clear precedent for many character distinctions and that is useful precedent. > Unfortunately, much of what you wrote gave me the impression that you may think that historical origin is the only criterion, or a criterion that trumps all others. If you don't think so, it would be good if you could confirm this. If you think so, it would be good to know why. Character origin is intimately related to character identity. Even where superficial similarity is concerned; I had to prove character origin for the disunification of YOGH from EZH long long ago and I?ve done the same over and over again for many characters and even full scripts. Sometimes characters are used and then become disused. MOST of the Bamum characters we have encoded aren?t in modern use today, but they were encoded for historical concerns. >> Please try to remember that. (It?s a bit shocking to have to remind people of this. > > You don't have to remind me, at least. I have mentioned "usability for average users in average contexts" and "contrasting use" as criteria, and I have also in earlier mail acknowledged history as a (not the) criterion, and have mentioned legacy/roundtrip issues. I'm sure there are others. I don?t think that ANY user of Deseret is all that ?average?. Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on ? just as we have medievalists who do the same kind of work. I?m about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters. Michael Everson From everson at evertype.com Mon Mar 27 07:59:40 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 13:59:40 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> On 27 Mar 2017, at 08:05, Martin J. D?rst wrote: >> Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) > > "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs. I do stipulate not being a CJK expert. But those are indeed different due to their origins, however similar their shapes are. > Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the ? shape, with two horizontal bars), whereas the H/T/V columns show the ? shape (two downwards slanted bars) for the "MEAT" radical and the ? shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/? They are still always very similar, right? > In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference. That?s interesting. >> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > > Independent of whether the chart glyphs get changed, couldn't we just add a note "also # in some fonts" (where # is the other variant). Well, no. First, ALL fonts currently use the 1855 letterforms based on ligatures ???? and ????, so a decree that those code positions would Second, the letterforms resulting from the ligations are just nothing alike > That would make sure that nobody could claim "this font is wrong" based on the charts. (Even if a general claim that the chart glyphs aren't normative applies to all charts anyway.) As James Kass said: "If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word." >> Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. > > Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly. Do come up with examples if you have any. > Because in such cases, the focus may not be on a change to one or a few letter shapes, but the focus may be on a change of the overall style, which induces a change of letter shape in some letters. To be honest I really don?t follow this reasoning. https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg isn?t just some ?glyph variation?. They are entirely different glyphs with entirely different origins. I can think of no instance where we have "unified? such wildly different glyphs. > The roman/italic a/? and g/? distinctions (the later code points only used to show the distinction in plain text, which could as well be done descriptively), Aa and ?? are used contrastively for different sounds in some languages and in the IPA. ?? is not, to my knowledge, used contrastively with Gg (except that ? can only mean /?/, while orthographic g can mean /?/, /d?/, /x/ etc. But g vs ? is reasonably analogous to ?? and ???? being used for /ju?/. > as well as a large number of distinctions in Han fonts, come to my mind. I'm quite sure other scripts have similar phenomena. Again, spelling of all kinds varies greatly in Deseret texts. I?ll try with another example using some Latin glyphs. ?Poison? can be written ???????????? POIZ?N in Deseret, or it can be written ?????????? P?Z?N or it can be written ???????? P?Z?N. That?s three different spellings, not two. (I used O with a bar to mimic the bar of Deseret SHORT I ??). >> Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /ju?/ or /??/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. > > This is interesting information. You are saying that in actual practice, there is a choice between writing ???? (two letters for a diphthong) and writing ??. In the same location, is ???? (the base for the historically later shape variant of ??; please note that this may actually be written ????; No, that?s not correct. Poison can be written with ???? or it can be written with ?? (in origin a ligature of ????) or it can be written with ????. Unligated, the three spellings would be different: ???????????? /po?z?n/ and ???????????? /p??z?n/ and ???????????? /p???z?n/. Despite this, with the ligatures, the pronunciation would be /po?z?n/ whether ???????????? or ?????????? or ????????. > there's some inconsistency in order between the above cited sentence and the text below copied from an earlier mail) also used as a spelling variant? I don?t think so. > Overall, we may have up to four variants, No, we don?t. See above. And the same goes for the /ju?/ ligatures. The word tube /tju?b/ can be written TY?B ???????? or ?????? or ????. But the unligated the sequences would be pronounced differently: ???????? /tju?b/ and ???????? /t?u?b/ and ???????? /t??b/. > of which three are currently explicitly supported in Unicode. The characters and are not encoded. > Are all of these used as spelling variants? In principle, what I have shown above is accurate. I can?t do a corpus search for actual examples. > Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)? In a handwritten manuscript obviously the choice is the author?s. As to historical printing, printers may have > And what informs this choice? If we have any historic metal types, are there examples where a font contains both ligature variants? Ken Beesley have samples of a metal font (the 1857 St Luois punches) which had both ?? and ????; I don?t know what other sorts were in that font. > (Please note that because ??, ??, and ?? are available as individual letters, it's very difficult to think about the two-letter sequences as anything else than spellings, but that doesn't necessarily carry over to the ligatures.) See above. > And then the same questions, with parallel (or not parallel) answers, for ??/??/??. See above. Michael Everson > Regards, Martin. > > > Text copied from earlier mail by Michael: > > >>>> > 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. > > 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. > > That?s encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: > > 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. > > 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. > >>> From everson at evertype.com Mon Mar 27 08:02:00 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 14:02:00 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: On 27 Mar 2017, at 09:04, James Kass wrote: > John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. Not quite true. Such detail will be for the proposal. Michael From everson at evertype.com Mon Mar 27 08:49:54 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 14:49:54 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> On 27 Mar 2017, at 09:29, Martin J. D?rst wrote: >> He is. He transcribes texts into Deseret. I?ve published three of them (Alice, Looking-Glass, and Snark). > > Great to know. Given that, I'd assume that you'd take his input a bit more serious. I?m discussing it now, offline, with him and Ken. > Here's what he wrote: > > >>>> > My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. That begs the whole question of character identity. He?s simply saying what you and Asmus also said. But when you dig into it further, there?s more to the story, as we have found out. > In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. There was indeed type cut for these. What?s not found is a full alphabet chart showing some of the ligated letters, but that?s a different question. > It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. > >>>> Now that further research has been done, I?ll be discussing this with John and Ken with regard to putting together a proposal which will support the two ligating letterform characters as well as some other historical Deseret characters, some used in an important English-Hopi lexicon which was recently published. (I await my copy of that.) >> I am a designer and typographer, and I?ve worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. > > That's fine, and not disputed at all. That's exactly why I'm looking for input from other people. Well, all right, but I didn?t use either ?? or ?? in my editions apart from the entry in the chart in the front matter. > As an analogy, assume we had a famous type designer coming to this list and request that we encode old-style digits separately from roman digits, e.g. arguing that this might simplify the production of fonts. I don?t see how this analogy could possibly apply. Once again the 1859 ligature-characters look nothing at all like the 1855 one, which speaks to their unique identity as characters. Moreover, encoded digits are used by billions of people daily. > We would understand this request, but we would still deny it because based on our day-to-day use of digits, we would understand that at large (i.e. for the average user) the convenience of having only one code point for a given digit weights stronger than the convenience of separate code points for the type designer. I?m not suggesting encoding characters for ?convenience?. I?m suggesting that there is a character-identity issue here, based both on the origin of the characters and of their vasty different appearance from other characters encoded in the standard. > We are looking for similar input from "average users" for Deseret. The encoding of historic characters is for ?expert users? working with historical material, not necessarily ?average users? who might be composing blog entries. >> Actually neither of the ligature-letters are used in our Carrollian Deseret volumes. > > Ok. That means that these don't provide any information on the discussion at hand (whether to unify or disunify the ligature shapes). I didn?t even know about the 1859 ligatures until this week. All this proves is that John didn?t use any ligatures when he transcribed the texts. >> You know, Martin, I *have* been doing this for the last two decades. I?m well aware of what a font is and can do. > > Great. So you know that present-day font technology would allow us to handle the different shapes in at least any of the following ways: > > 1) Separate characters for separate shapes, both shapes in same font We shouldn?t do that for shapes so different and with clearly different origins. > 2) Variant selectors, one or both shapes in same font Pseudo-encoding, useful for subtle variation but not for something as big as this. I am not an enemy of variation selectors. In fact I?m preparing a nice proposal for some standardized sequences. It would not apply here, because they glyph identity of the letters is too distinct. > 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font Font trickery. Not portable. Not supported by most apps. > 4) Font selection, different fonts for different shapes We really don?t do this just for one or two characters in a script. > Does that knowledge in any way suggest one particular solution? None of this discussion has convinced me that these letters are variants of existing characters. >> I?m also aware of what principles we have used for determining character identity. > > Which, as we have been working out in other mails, are indeed a collection of principles, one of which is history of shape derivation. That and spelling. The only counterargument seems to be ?they are diphthongs? but we don?t encode sounds, we encode the elements of writings systems. The 1859 ligated letterforms are not in any way glyph variants of the 1855 ligated letterforms. They?re completely different letterforms, having only the diagonal stroke of the ?? in common. >> I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. > > One of the principles of CJK unification is that minor differences are ignored if they are not semantically relevant. For CJK, 'minor' is important, because otherwise, many users wouldn't be able to recognize the shapes as having the same semantics/usage. These would not be unified according to CJK principles: https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg > The qualification 'minor' is less important for an alphabet. In general, the more established and well-known an alphabet is, the wider the variations of glyph shapes that may be tolerated. The question I'm trying to get an answer for for Deseret is whether current actual script users see the shape variation as just substitutable glyphs of the same letter, or inherently different letters. > > The answer to this question is not the *only* criterion for deciding whether to encode further Deseret letters, but I think it's an important criterion. And the answer that John has given seems to point in a very clear direction for this question. John?s view was a first statement before many questions were asked and before research into the matter had commenced, really. I?ll get back to you after working with John and Ken some more. Michael Everson From alastair at alastairs-place.net Mon Mar 27 09:04:17 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 27 Mar 2017 15:04:17 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> On 27 Mar 2017, at 10:14, Julian Bradfield wrote: > > I contend, therefore, that no decision about Unicode should take into > account any ephemeral considerations such as this year's electronic > font technology, and that therefore it's not even useful to mention > them. I?d disagree with that, for two reasons: 1. Unicode has to be usable *today*; it?s no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don?t have that now. If it isn?t usable today for any given purpose, people won?t use it for that, and will adopt alternative solutions (like using images to represent text). 2. ?This year?s electronic font technology? is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future. There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there?s built-in support in the standard text edit control), but again I think it?s reasonable to expect support to grow over time, rather than being removed. I don?t think it?s unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems. All that said, I?d definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That?s very much not my field. Kind regards, Alastair. -- http://alastairs-place.net From alastair at alastairs-place.net Mon Mar 27 09:32:55 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 27 Mar 2017 15:32:55 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> Message-ID: <3F53B8D8-31EB-41EC-B672-1AD03A415C67@alastairs-place.net> On 27 Mar 2017, at 14:49, Michael Everson wrote: >> 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font > > Font trickery. Not portable. Not supported by most apps. I wouldn?t describe it as ?trickery? or ?not portable?. Features like stylistic alternates are part of the OpenType specification, and actually have quite widespread support in Mac software (check out the Typography panel, which you can get to from the system Font Panel). On Windows and Linux, support is more limited, though software that uses the newer DirectWrite or Pango APIs to render text should find it straightforward enough. I don?t know how this bears on the discussion about Deseret (that?s outside my area of expertise), but as a software developer I?d certainly *prefer* to see font features used (rather than, say, assigning a new code point or using variation selectors) where the primary difference is in the rendering rather than the meaning. Kind regards, Alastair. -- http://alastairs-place.net From irgendeinbenutzername at gmail.com Mon Mar 27 09:44:59 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Mon, 27 Mar 2017 16:44:59 +0200 Subject: Encoding of old compatibility characters Message-ID: I?ve recently developed an interest in old legacy text encodings and noticed that there are various characters in several sets that don?t have a Unicode equivalent. I had already started research into these encodings to eventually prepare a proposal until I realised I should probably ask on the mailing list first whether it is likely the UTC will be interested in those characters before I waste my time on a project that won?t achieve anything in the end. The character sets in question are ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI calculator sets. So far I?ve only analyzed the ZX80 set in great detail, revealing 32 characters not in the UCS. Most characters are pseudo-graphics, simple pictographs or inverted variants of other characters. Now, one of Unicode?s declared goals is to enable round-trip compatibility with legacy encodings. We?ve accumulated a lot of weird stuff over the years in the pursuit of this goal. So it would be natural to assume that the unencoded characters from the mentioned sets would also be eligible for inclusion in the UCS. On the other hand, those encodings are for the most part older than Unicode and so far there seems to have been little interest in them from the UTC or WG2, or any of their contributors. Something tells me that if these character sets were important enough to consider for inclusion, they would have been encoded a long time ago along with all the other stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. Obviously the character sets in question don?t receive much use nowadays (and some weren?t even that relevant in their time, either), which leads to me wonder whether further putting work into this proposal would be worth it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 09:51:07 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 15:51:07 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <9DAB9816-CC38-4FA0-82A1-B1BF2BFFECDD@evertype.com> On 27 Mar 2017, at 15:44, Charlotte Buff wrote: > > I?ve recently developed an interest in old legacy text encodings and noticed that there are various characters in several sets that don?t have a Unicode equivalent. I had already started research into these encodings to eventually prepare a proposal until I realised I should probably ask on the mailing list first whether it is likely the UTC will be interested in those characters before I waste my time on a project that won?t achieve anything in the end. It?s hard to say without knowing what the characters are. Michael Everson From irgendeinbenutzername at gmail.com Mon Mar 27 10:48:16 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Mon, 27 Mar 2017 17:48:16 +0200 Subject: Encoding of old compatibility characters Message-ID: > It?s hard to say without knowing what the characters are. For the ZX80, the missing characters include five block elements (top and bottom halfs of MEDIUM SHADE, as well as their inverse counterparts), and inverse/negative squared variants of European digits and the following symbols: " ? $ : ? ( ) - + * / = < > ; , . Negative squared digits may be unifiable with negative circled digits. ATASCII includes inverse variants of box drawing characters. I have to check whether some other pictographs are unifiable with existing characters. PETSCII includes some box drawings and vertical scan lines that are probably not unifiable. Atari ST includes two simple pictographs that were used as graphical UI elements. They look like a negative, low diagonal stroke and a negative diamond respectively. It also has six characters that together form logos which I wasn?t going to propose. TI calculators include a single character for a superscript minus 1. I don?t have a lot of information available about this set at the moment. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at apple.com Mon Mar 27 10:56:16 2017 From: jenkins at apple.com (John H. Jenkins) Date: Mon, 27 Mar 2017 09:56:16 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: > On Mar 27, 2017, at 2:04 AM, James Kass wrote: > >> >> If we have any historic metal types, are there >> examples where a font contains both ligature >> variants? > > Apparently not. > > John H. Jenkins mentioned early in this thread that these ligatures > weren't used in printed materials and were not part of the official > Deseret set. They were only used in manuscript. > This is correct. Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:03:36 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:03:36 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> On 27 Mar 2017, at 16:56, John H. Jenkins wrote: >> John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. > > This is correct. Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. Michael Everson From jenkins at apple.com Mon Mar 27 11:07:25 2017 From: jenkins at apple.com (John H. Jenkins) Date: Mon, 27 Mar 2017 10:07:25 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: > On Mar 27, 2017, at 9:56 AM, John H. Jenkins wrote: > > >> On Mar 27, 2017, at 2:04 AM, James Kass > wrote: >> >>> >>> If we have any historic metal types, are there >>> examples where a font contains both ligature >>> variants? >> >> Apparently not. >> >> John H. Jenkins mentioned early in this thread that these ligatures >> weren't used in printed materials and were not part of the official >> Deseret set. They were only used in manuscript. >> > > This is correct. Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. > This should teach me to double-check before posting. Apparently, the earlier typeface *did* include all forty letters; it just didn't use these two. I don't know what glyphs were used. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:20:19 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:20:19 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com> On 27 Mar 2017, at 17:07, John H. Jenkins wrote: > This should teach me to double-check before posting. The research is a lot of fun. Can?t wait till I get Ken?s book next week. > Apparently, the earlier typeface *did* include all forty letters; it just didn't use these two. I don't know what glyphs were used. What I understood is that typefaces included the letters but there?s no *chart* that contains both 1859 letters. Ken transcribes into modern type a letter by Shelton dated 1859, in which ?boy? is written ??, ?few? as ??, ?truefully? [sic] as ????????????, and ?you? as ??. Fascinating stuff. Michael Everson From markus.icu at gmail.com Mon Mar 27 11:49:19 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 09:49:19 -0700 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. In my opinion, this is a good use of the Private Use Area among a very small group of people. See also https://en.wikipedia.org/wiki/ConScript_Unicode_Registry Best regards, markus ? PS: I had a ZX 81, then a Commodore 64, then an Atari ST, and at school used a Commodore PET... -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:49:48 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:49:48 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> Message-ID: <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com> On 27 Mar 2017, at 15:04, Alastair Houghton wrote: > 1. Unicode has to be usable *today*; it?s no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don?t have that now. If it isn?t usable today for any given purpose, people won?t use it for that, and will adopt alternative solutions (like using images to represent text). Nothing?s easier than representing encoded characters. :-) > 2. ?This year?s electronic font technology? is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future. There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there?s built-in support in the standard text edit control), but again I think it?s reasonable to expect support to grow over time, rather than being removed. Sorry, but typographic control of that sort is grand for typesetting, where you can select ranges of text and language-tag it (assuming your program accepts and supports all the language tags you might need (which they don?t)) and you can select fonts which have all the trickery baked into them (hardly any do) and then? can you use this in file names? In your plain-text databases? In your text messages? > I don?t think it?s unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems. They exist. And can be useful for some things. I think that historic origin of the Deseret diphthong letters and the importance these options have for the study of Deseret orthographic choices throughout the early period of its use. > All that said, I?d definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That?s very much not my field. Michael Everson From gwalla at gmail.com Mon Mar 27 12:08:33 2017 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 27 Mar 2017 17:08:33 +0000 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: Apple IIs also had inverse-video letters, and some had "MouseText" pseudographics used to simulate a Mac-like GUI in text mode. I know that a couple of fonts from Kreative put these in the PUA and Nishiki-Teki follows their lead. On Mon, Mar 27, 2017 at 9:25 AM Charlotte Buff < irgendeinbenutzername at gmail.com> wrote: > > It?s hard to say without knowing what the characters are. > > For the ZX80, the missing characters include five block elements (top and > bottom halfs of MEDIUM SHADE, as well as their inverse counterparts), and > inverse/negative squared variants of European digits and the following > symbols: " ? $ : ? ( ) - + * / = < > ; , . > Negative squared digits may be unifiable with negative circled digits. > > ATASCII includes inverse variants of box drawing characters. I have to > check whether some other pictographs are unifiable with existing characters. > > PETSCII includes some box drawings and vertical scan lines that are > probably not unifiable. > > Atari ST includes two simple pictographs that were used as graphical UI > elements. They look like a negative, low diagonal stroke and a negative > diamond respectively. It also has six characters that together form logos > which I wasn?t going to propose. > > TI calculators include a single character for a superscript minus 1. I > don?t have a lot of information available about this set at the moment. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 12:16:15 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 18:16:15 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <9AA9A227-EE3D-4B06-B3F4-F2CB606351E7@evertype.com> On 27 Mar 2017, at 18:08, Garth Wallace wrote: > > Apple IIs also had inverse-video letters, and some had "MouseText" pseudographics used to simulate a Mac-like GUI in text mode. > > I know that a couple of fonts from Kreative put these in the PUA and Nishiki-Teki follows their lead. I think it?s better to be inclusive rather than exclusive. PUA isn?t stable, and marginal as this stuff may be, we stuff encoded that is far more marginal? nothing more frustrating than expecting something and finding it missing. Michael Everson From kenwhistler at att.net Mon Mar 27 12:18:03 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 10:18:03 -0700 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> On 3/27/2017 7:44 AM, Charlotte Buff wrote: > Now, one of Unicode?s declared goals is to enable round-trip > compatibility with legacy encodings. We?ve accumulated a lot of weird > stuff over the years in the pursuit of this goal. So it would be > natural to assume that the unencoded characters from the mentioned > sets [ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI > calculator sets] would also be eligible for inclusion in the UCS. Actually, it wouldn't be. The original goal was to ensure round-trip compatibility with *important* legacy character encodings, *for which there was a need to convert legacy data, and/or an ongoing need to representation of text for interchange*. From Unicode 1.0: "The Unicode standard includes the character content of all major International Standards approved and published before December 31, 1990... [long list ensues] ... and from various industry standards in common use (such as code pages and character sets from Adobe, Apple, IBM, Lotus, Microsoft, WordPerfect, Xerox and others)." Even as long ago as 1990, artifacts such as the Atari ST set were considered obsolete antiquities, and did not rise to the level of the kind of character listings that we considered when pulling together the original repertoire. And there are several observations to be made about the "weird stuff" we have accumulated over the years in the pursuit of compatibility. A lot of stuff that was made up out of whole cloth, rather than being justified by existing, implemented character sets used in information interchange at the time, came from the 1991/1992 merger process between the Unicode Standard and the ISO/IEC 10646 drafts. That's how Unicode acquired blocks full of Arabic ligatures, for example. Other, subsequent additions of small (or even largish) sets of oddball "characters" that don't fit the prototypical sets of characters for scripts and/or well-behaved punctuation and symbols, typically have come in with argued cases for the continued need in current text interchange, for complete coverage. For example, that is how we ended up filling out Zapf dingbats with some glyph pieces that had been omitted in the initial repertoire for that block. More recently, of course, the continued importance of Wingdings and Webdings font encodings on the Windows platform led the UTC to filling out the set of graphical dingbats to cover those sets. And of course, we first started down the emoji track because of the need to interchange text originating from widely deployed Japanese carrier sets implemented as extensions to Shift-JIS. I don't think the early calculator character sets, or sets for the Atari ST and similar early consumer computer electronics fit the bill, precisely because there isn't a real text data interchange case to be made for character encoding. Many of the elements you have mentioned, for example, like the inverse/negative squared versions of letters and symbols, are simply idiosyncratic aspects of the UI for the devices, in an era when font generators were hard coded and very primitive indeed. Documenting these early uses, and pointing out parts of the UI and character usage that aren't part of the character repertoire in the Unicode Standard seems an interesting pursuit to me. But absent a true textual data interchange issue for these long-gone, obsolete devices, I don't really see a case to be made for spending time in the UTC defining a bunch of compatibility characters to encode for them. --Ken From everson at evertype.com Mon Mar 27 12:18:34 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 18:18:34 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <4A85FDE2-C9BF-4C72-B7C5-05FC9A477DC8@evertype.com> On 27 Mar 2017, at 17:49, Markus Scherer wrote: > > I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. That doesn?t mean that the few people who may need the characters now or in the centuries to come shouldn?t have them. If we?ve encoded some characters like these for compatibility, it?s only fair to be thorough. > In my opinion, this is a good use of the Private Use Area among a very small group of people. I?d say not, since they?d be using some encoded characters and having to augment it with some PUA characters. > See also https://en.wikipedia.org/wiki/ConScript_Unicode_Registry That?s not for this sort of thing at all at all. The UCS is for this sort of thing. Michael Everson > ?PS: I had a ZX 81, then a Commodore 64, then an Atari ST, and at school used a Commodore PET... Lucky man. :-) From kojiishi at gmail.com Mon Mar 27 12:25:36 2017 From: kojiishi at gmail.com (Koji Ishii) Date: Tue, 28 Mar 2017 02:25:36 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> Message-ID: I think he meant Kanji/Han ideographic by "committed string". 2017-03-27 19:04 GMT+09:00 Takao Fujiwara : > On 03/27/17 18:48, Mark Davis ??-san wrote: > >> By "committed strings", you mean the hiragana phonetic reading? >> > > Hiragana is used to the raw text of the phonetic reading by the Japanese > input method before the conversion. > After users select one of the converted strings, the converted strings are > committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the > raw text as the converted result in the input method. > > Fujiwara > > >> Mark >> ////// >> >> On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > > wrote: >> >> Hi, >> >> Do you have any chances to create a different version of ja.xml of >> the Japanese emoji annotation? >> http://unicode.org/cldr/trac/browser/tags/latest/common/anno >> tations/ja.xml >> > otations/ja.xml> >> >> That file includes Hiragana only but I'd need another file which has >> the committed strings, likes ja_convert.xml. >> E.g. >> ? | ?? | ???? >> >> instead of >> >> ??? | ???? | ???? >> >> I think the committed version is useful without input method and it >> follows other languages. >> >> Thanks, >> Fujiwara >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 13:44:06 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 20:44:06 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> References: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> Message-ID: TI caculators are not antique tools, and when I see how most calculators for Android or Windows 10 are now, they are not as usable as the scientific calculators we had in the past. I know at least one excellent calculator that works with Android and Windows and finally has the real look and feel of a true calculator, and that display correct labels and excellent formulas (with the conventional 2D layout), my favorite is now "HyperCalc" (it has a free version and a paid version). The Android version is a bit more advanced. The paid version has only a few additional features not so needed (such as themes). The interface is clear, and there are several input modes for expressions. When you look at the default Calculator of Windows 10 it has never been worse than what it is now (it was much better in Windows 7 or before, even if it had many limitations). Also entering expressions in Excel is really antique, and many functions have stupid limitations (in addition, spreadsheets are not even portable across versions of Office or don't render the same, and sometimes unexpectedly produce different results). But this is not at all a problem of character encoding: we don't need Unicode at all to create a convenient UI in such applications. Even with a web_based interface, you can do a lot with HTML canvas and SVG and have a scalable UI without having to use dirty text tricks or using PUA fonts. 2017-03-27 19:18 GMT+02:00 Ken Whistler : > > On 3/27/2017 7:44 AM, Charlotte Buff wrote: > >> Now, one of Unicode?s declared goals is to enable round-trip >> compatibility with legacy encodings. We?ve accumulated a lot of weird stuff >> over the years in the pursuit of this goal. So it would be natural to >> assume that the unencoded characters from the mentioned sets [ATASCII, >> PETSCII, the ZX80 set, the Atari ST set, and the TI calculator sets] would >> also be eligible for inclusion in the UCS. >> > > Actually, it wouldn't be. > > The original goal was to ensure round-trip compatibility with *important* > legacy character encodings, *for which there was a need to convert legacy > data, and/or an ongoing need to representation of text for interchange*. > > From Unicode 1.0: "The Unicode standard includes the character content of > all major International Standards approved and published before December > 31, 1990... [long list ensues] ... and from various industry standards in > common use (such as code pages and character sets from Adobe, Apple, IBM, > Lotus, Microsoft, WordPerfect, Xerox and others)." > > Even as long ago as 1990, artifacts such as the Atari ST set were > considered obsolete antiquities, and did not rise to the level of the kind > of character listings that we considered when pulling together the original > repertoire. > > And there are several observations to be made about the "weird stuff" we > have accumulated over the years in the pursuit of compatibility. A lot of > stuff that was made up out of whole cloth, rather than being justified by > existing, implemented character sets used in information interchange at the > time, came from the 1991/1992 merger process between the Unicode Standard > and the ISO/IEC 10646 drafts. That's how Unicode acquired blocks full of > Arabic ligatures, for example. > > Other, subsequent additions of small (or even largish) sets of oddball > "characters" that don't fit the prototypical sets of characters for scripts > and/or well-behaved punctuation and symbols, typically have come in with > argued cases for the continued need in current text interchange, for > complete coverage. For example, that is how we ended up filling out Zapf > dingbats with some glyph pieces that had been omitted in the initial > repertoire for that block. More recently, of course, the continued > importance of Wingdings and Webdings font encodings on the Windows platform > led the UTC to filling out the set of graphical dingbats to cover those > sets. And of course, we first started down the emoji track because of the > need to interchange text originating from widely deployed Japanese carrier > sets implemented as extensions to Shift-JIS. > > I don't think the early calculator character sets, or sets for the Atari > ST and similar early consumer computer electronics fit the bill, precisely > because there isn't a real text data interchange case to be made for > character encoding. Many of the elements you have mentioned, for example, > like the inverse/negative squared versions of letters and symbols, are > simply idiosyncratic aspects of the UI for the devices, in an era when font > generators were hard coded and very primitive indeed. > > Documenting these early uses, and pointing out parts of the UI and > character usage that aren't part of the character repertoire in the Unicode > Standard seems an interesting pursuit to me. But absent a true textual data > interchange issue for these long-gone, obsolete devices, I don't really see > a case to be made for spending time in the UTC defining a bunch of > compatibility characters to encode for them. > > --Ken > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Mar 27 14:17:20 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 12:17:20 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> announcements at Unicode dot org wrote: > ? and new regional flags for England, Scotland, and Wales. It's not clear from this text, nor from the table in Section C.1.1 of the draft, what the status is of flag emoji tag sequences other than the three above. I read the relevant section a couple of times and could not figure out how a "standard sequence" differs from a non-standard one, or how ordinary users are supposed to know the difference. The term "standard sequence" appears nowhere in the draft except as a table header. Vendors always have the option of supporting or not supporting a glyph for any code point or sequence -- note 4 in Section C.1 and the second sentence in C.1.1 both reinforce this long-standing principle -- so there must be something more here. -- Doug Ewell | Thornton, CO, US | ewellic.org ?? From verdy_p at wanadoo.fr Mon Mar 27 15:30:58 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 22:30:58 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: 2017-03-27 21:17 GMT+02:00 Doug Ewell : > announcements at Unicode dot org wrote: > > > ? and new regional flags for England, Scotland, and Wales. > > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the codes used do not specify clearly about which region code standard they are refering to. We just see that it's an ISO3166-1 country/territory code followed directly (without separator) by sequences of letter/digits, all of them converted to RIS and surrounded by a the same initial emeoji code and the DEL from RIS. The problem is how to choose the codes for the letter/digits in the second part, if they ever come from ISO3166-2 after dropping the hypen separator (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB) or somewhere else. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 27 15:34:09 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 13:34:09 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On 3/27/2017 12:17 PM, Doug Ewell wrote: > announcements at Unicode dot org wrote: > >> ? and new regional flags for England, Scotland, and Wales. > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > > I read the relevant section a couple of times and could not figure out > how a "standard sequence" differs from a non-standard one, or how > ordinary users are supposed to know the difference. The term "standard > sequence" appears nowhere in the draft except as a table header. The terminology is still a bit in flux, which is why the text of UTS #51 is still under review, before being finalized at the UTC meeting in May. But the data for Emoji 5.0 is final, and there are precisely 3 "emoji tag sequences" in the relevant data file: http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt As for how "users" are supposed to know the difference. Well, they don't. What matters is that the data file that the "implementers" will use has these 3 emoji tag sequences in it, so that is quite likely what everybody will see added to their phones. The "users" will just see 3 more flags. And if they want a flag of California (or whatever), then they need to badger the platform vendors, who will then come back to the Emoji SC, saying, "Help! We need to add a flag of California, or people won't buy our phones!" And if a flag of California (or Pomerania or ...) then gets added to the list of emoji tag sequences in a future version of the data, there is a good chance that the "users" will then see the difference, because that flag will appear on their phones eventually. Anybody could *attempt* to convey a flag of Pomerania (a rather handsome black gryphon on a yellow background, btw) with an emoji tag sequence right now, I suppose. Good luck on any input support or actual interoperability or availability in any font on any standard platform, however. You'd just get fallback display. If conveying flags of Pomerania is in your near term future, I'd advise sticking to images. ;-) --Ken > > Vendors always have the option of supporting or not supporting a glyph > for any code point or sequence -- note 4 in Section C.1 and the second > sentence in C.1.1 both reinforce this long-standing principle -- so > there must be something more here. > From verdy_p at wanadoo.fr Mon Mar 27 15:39:10 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 22:39:10 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Note also that ISO3166-2 is far from being stable, and this could contradict Unicode encoding stability: it would then be required to ensure this stability by only allowing sequences that are effectively registered in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt (independantly of the registration ins ISO3166-2), and nothing is said if ever ISO3166-2 obsoletes some codes and then some years later decide to reassign these codes to new entities: it should not be possible to do the same thing in Emoji sequences, and specific assignments will need to be made in the Unicode database. Note also that most rencetly created administrative divisions do not really adopt any flag, but if flags are used they may be reusing flags from older historic entities... or they could adopt only a logo (with legal protection, not really suitable from encoding in the UCS as it won't be possible to define any "representative glyph" without asking for permission to the relevant authorities for displaying some design, possibly simplified) We still lack an encoding standard for vexillologists. And for now only "Flags of the World" proposes some encoding (not based strictly and only on ISO3166). I think that the UTC should try contacting authors of Flags of the World and seek for advice there: we are speaking here about regional flags (we can exclude some graphical variants such as civil vs. navy flags vs honorific flags) 2017-03-27 22:30 GMT+02:00 Philippe Verdy : > > > 2017-03-27 21:17 GMT+02:00 Doug Ewell : > >> announcements at Unicode dot org wrote: >> >> > ? and new regional flags for England, Scotland, and Wales. >> >> It's not clear from this text, nor from the table in Section C.1.1 of >> the draft, what the status is of flag emoji tag sequences other than the >> three above. >> > > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the > codes used do not specify clearly about which region code standard they are > refering to. We just see that it's an ISO3166-1 country/territory code > followed directly (without separator) by sequences of letter/digits, all of > them converted to RIS and surrounded by a the same initial emeoji code and > the DEL from RIS. > > The problem is how to choose the codes for the letter/digits in the second > part, if they ever come from ISO3166-2 after dropping the hypen separator > (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB) > or somewhere else. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 27 16:19:38 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 14:19:38 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <33d67dda-3758-34ef-f5a3-a5ff4f669843@att.net> On 3/27/2017 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to > ensure this stability by only allowing sequences that are effectively > registered in > http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt > (independantly of the registration ins ISO3166-2), and nothing is said > if ever ISO3166-2 obsoletes some codes and then some years later > decide to reassign these codes to new entities: it should not be > possible to do the same thing in Emoji sequences, and specific > assignments will need to be made in the Unicode database. > These emoji tag sequences don't derive their stability from ISO 3166-2. The emoji tag sequences depend on: CLDR Unicode Locale Identifiers, and more specifically, for these subregions, on the unicode_subdivision_id: http://unicode.org/reports/tr35/index.html#unicode_subdivision_id And the data for that is here: http://unicode.org/repos/cldr/tags/latest/common/validity/subdivision.xml The stability for such tags is baked into the CLDR repository, as I understand it. By the way, if anybody is looking, Pomerania is there: "plpm" among the 4925 other valid unicode_subdivision_id values. So: Flag of Pomerania = 1F3F4 E0070 E006C E0070 E006D E007F But alas, that is not a *valid* emoji tag sequence (yet), so no soup for you! --Ken From richard.wordingham at ntlworld.com Mon Mar 27 16:32:25 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 27 Mar 2017 22:32:25 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <20170327223225.73cb528e@JRWUBU2> On Mon, 27 Mar 2017 13:34:09 -0700 Ken Whistler wrote: > And if a flag of > California (or Pomerania or ...) then gets added to the list of emoji > tag sequences in a future version of the data, there is a good chance > that the "users" will then see the difference, because that flag will > appear on their phones eventually. Indeed, why isn't the flag of Texas there already so as to terminate the abuse of . Technically, at least, it has the justification of being a formerly independent country, though I don't know that they have any national teams. Is anyone working on the issue of flags for the whole of Ireland? Different sports have their own 'national' flags. Pomerania will be a bit tricky, as it isn't any recent administrative division. Richard. From doug at ewellic.org Mon Mar 27 16:39:53 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 14:39:53 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327143953.665a7a7059d7ee80bb4d670165c8327d.9232a4d4ec.wbe@email03.godaddy.com> Ken Whistler wrote: > As for how "users" are supposed to know the difference. Well, they > don't. What matters is that the data file that the "implementers" will > use has these 3 emoji tag sequences in it, so that is quite likely > what everybody will see added to their phones. The "users" will just > see 3 more flags. So, no provision for a UI like the one I'm building, to let users select a region or subdivision and generate the corresponding sequence? Mmh. Well, anyway. > And if they want a flag of California (or whatever), then they need to > badger the platform vendors, who will then come back to the Emoji SC, > saying, "Help! We need to add a flag of California, or people won't > buy our phones!" The way nobody will buy their phones unless they support all 5 skin tones for all 3 flavors of "vampire" or "elf" or "fairy" or "person in lotus position"? Those are also generative mechanisms, but not limited to just a couple of combinations deemed worthy. If flags have to be added one by one, a lot of them (including the really useful ones, like California and Bavaria) will probably never happen. -- Doug Ewell | Thornton, CO, US | ewellic.org From frederic.grosshans at gmail.com Mon Mar 27 16:46:34 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Mon, 27 Mar 2017 23:46:34 +0200 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> An example of a legacy character successfully encoded recently is ? U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. It came from the Soviet standard GOST 10859-64 and the German standard ALCOR. And was proposed by Leo Broukhis in this proposal http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a discussion on this mailing list here http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where Ken Whistler was already sceptical about the usefulness of this encoding. Le 27/03/2017 ? 16:44, Charlotte Buff a ?crit : > I?ve recently developed an interest in old legacy text encodings and > noticed that there are various characters in several sets that don?t > have a Unicode equivalent. I had already started research into these > encodings to eventually prepare a proposal until I realised I should > probably ask on the mailing list first whether it is likely the UTC > will be interested in those characters before I waste my time on a > project that won?t achieve anything in the end. > > The character sets in question are ATASCII, PETSCII, the ZX80 set, the > Atari ST set, and the TI calculator sets. So far I?ve only analyzed > the ZX80 set in great detail, revealing 32 characters not in the UCS. > Most characters are pseudo-graphics, simple pictographs or inverted > variants of other characters. > > Now, one of Unicode?s declared goals is to enable round-trip > compatibility with legacy encodings. We?ve accumulated a lot of weird > stuff over the years in the pursuit of this goal. So it would be > natural to assume that the unencoded characters from the mentioned > sets would also be eligible for inclusion in the UCS. On the other > hand, those encodings are for the most part older than Unicode and so > far there seems to have been little interest in them from the UTC or > WG2, or any of their contributors. Something tells me that if these > character sets were important enough to consider for inclusion, they > would have been encoded a long time ago along with all the other stuff > in Block Elements, Box Drawings, Miscellaneous Symbols etc. > > Obviously the character sets in question don?t receive much use > nowadays (and some weren?t even that relevant in their time, either), > which leads to me wonder whether further putting work into this > proposal would be worth it. From doug at ewellic.org Mon Mar 27 16:50:54 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 14:50:54 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> Philippe Verdy wrote: > We still lack an encoding standard for vexillologists. And for now > only "Flags of the World" proposes some encoding (not based strictly > and only on ISO3166). I think that the UTC should try contacting > authors of Flags of the World and seek for advice there: we are > speaking here about regional flags (we can exclude some graphical > variants such as civil vs. navy flags vs honorific flags) As Philippe knows, because he and I had this discussion in 2012 and again in 2013: - I have already contacted FOTW. - They have no such encoding, except 3166-1 for countries and the 2-by-3 information code, and they have never proposed one. - I think such a standard would be a great idea, but - I don't think this is any of UTC's business and I'll bet they agree. -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Mon Mar 27 16:53:13 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 23:53:13 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327223225.73cb528e@JRWUBU2> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <20170327223225.73cb528e@JRWUBU2> Message-ID: And the new region of Normandie still has no formal code, but it reuses a flag that was used by one of the two former regions. Technically I don't see that as a problem except that people may want to display that flag using the code for the former region and semantically this is different (and also different from the former Duchy before it was partly annexed by France and left the Channel Islands to the new English Crown in the Middle Age. If we are concerned only by encoding modern entities, anyway if these sequences are encoded, there will be nobody to restrict their reuse for past entities (jsut kike Unicode cannot rule against the use of a capital Greek Alpha replacing a Capital Latin A, or the fancy use of Latin for "ASCII art", as Unicode does not encode orthographies or languages). Once a sequence is registered, even if it is intended to represent a modern entity, anyone will be using them as they want. This gives also a hint about why encoding stability will be important. But as we know, the regional or national entities are changing their flags and sometimes reusing former flags from other entities. Sooner or later, there will be confusion. I would suggest that if renderers have the capability of rendering colorful flags and provide an UI, at least they should be also rendering some hints, notably the underlying code or a name if available, using for example mousehover events to explain these flags and their intended usage: if a former flag is reused by another entity, that new entity should have its own encoding and the former flags should not be affected (its displayed hint should still indicate a reference to their former meaning). 2017-03-27 23:32 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Mon, 27 Mar 2017 13:34:09 -0700 > Ken Whistler wrote: > > > And if a flag of > > California (or Pomerania or ...) then gets added to the list of emoji > > tag sequences in a future version of the data, there is a good chance > > that the "users" will then see the difference, because that flag will > > appear on their phones eventually. > > Indeed, why isn't the flag of Texas there already so as to terminate > the abuse of . Technically, at least, it has the > justification of being a formerly independent country, though I don't > know that they have any national teams. > > Is anyone working on the issue of flags for the whole of Ireland? > Different sports have their own 'national' flags. > > Pomerania will be a bit tricky, as it isn't any recent administrative > division. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedberg at apple.com Mon Mar 27 16:54:04 2017 From: pedberg at apple.com (Peter Edberg) Date: Mon, 27 Mar 2017 14:54:04 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <8DB7A85C-3892-4208-A609-86709564D4D8@mac.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <8DB7A85C-3892-4208-A609-86709564D4D8@mac.com> Message-ID: <5925AB8F-03F4-4FC1-9571-827CB5555AF2@apple.com> (this time from the correct account) Philippe and others, http://www.unicode.org/reports/tr51/tr51-11.html#valid-emoji-tag-sequences refers to CLDR data for the list of valid subregion sequences, see http://unicode.org/reports/tr35/index.html#Validity CLDR data will maintain stable sequences in the event that ISO 3166-2 data changes. - Peter E > On Mar 27, 2017, at 1:39 PM, Philippe Verdy > wrote: > > Note also that ISO3166-2 is far from being stable, and this could contradict Unicode encoding stability: it would then be required to ensure this stability by only allowing sequences that are effectively registered in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt (independantly of the registration ins ISO3166-2), and nothing is said if ever ISO3166-2 obsoletes some codes and then some years later decide to reassign these codes to new entities: it should not be possible to do the same thing in Emoji sequences, and specific assignments will need to be made in the Unicode database. > > Note also that most rencetly created administrative divisions do not really adopt any flag, but if flags are used they may be reusing flags from older historic entities... or they could adopt only a logo (with legal protection, not really suitable from encoding in the UCS as it won't be possible to define any "representative glyph" without asking for permission to the relevant authorities for displaying some design, possibly simplified) > > We still lack an encoding standard for vexillologists. And for now only "Flags of the World" proposes some encoding (not based strictly and only on ISO3166). I think that the UTC should try contacting authors of Flags of the World and seek for advice there: we are speaking here about regional flags (we can exclude some graphical variants such as civil vs. navy flags vs honorific flags) > > > 2017-03-27 22:30 GMT+02:00 Philippe Verdy >: > > > 2017-03-27 21:17 GMT+02:00 Doug Ewell >: > announcements at Unicode dot org wrote: > > > ? and new regional flags for England, Scotland, and Wales. > > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the codes used do not specify clearly about which region code standard they are refering to. We just see that it's an ISO3166-1 country/territory code followed directly (without separator) by sequences of letter/digits, all of them converted to RIS and surrounded by a the same initial emeoji code and the DEL from RIS. > > The problem is how to choose the codes for the letter/digits in the second part, if they ever come from ISO3166-2 after dropping the hypen separator (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB ) or somewhere else. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 16:54:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 23:54:53 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> References: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> Message-ID: So it's up to the UTC to create this encoding: this new relase is a start for a new vexillology registry (within encoded sequences) which creates a new standard for them. 2017-03-27 23:50 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > We still lack an encoding standard for vexillologists. And for now > > only "Flags of the World" proposes some encoding (not based strictly > > and only on ISO3166). I think that the UTC should try contacting > > authors of Flags of the World and seek for advice there: we are > > speaking here about regional flags (we can exclude some graphical > > variants such as civil vs. navy flags vs honorific flags) > > As Philippe knows, because he and I had this discussion in 2012 and > again in 2013: > > - I have already contacted FOTW. > - They have no such encoding, except 3166-1 for countries and the 2-by-3 > information code, and they have never proposed one. > - I think such a standard would be a great idea, but > - I don't think this is any of UTC's business and I'll bet they agree. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Mon Mar 27 17:05:28 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Tue, 28 Mar 2017 00:05:28 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> Message-ID: <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> Another example, about to be encoded, it the GOUP MARK, used on old IBM computers (proposal: ML threads: http://www.unicode.org/mail-arch/unicode-ml/y2015-m01/0040.html , and http://unicode.org/mail-arch/unicode-ml/y2007-m05/0367.html ) Le 27/03/2017 ? 23:46, Fr?d?ric Grosshans a ?crit : > An example of a legacy character successfully encoded recently is ? > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. > > > Le 27/03/2017 ? 16:44, Charlotte Buff a ?crit : >> I?ve recently developed an interest in old legacy text encodings and >> noticed that there are various characters in several sets that don?t >> have a Unicode equivalent. I had already started research into these >> encodings to eventually prepare a proposal until I realised I should >> probably ask on the mailing list first whether it is likely the UTC >> will be interested in those characters before I waste my time on a >> project that won?t achieve anything in the end. >> >> The character sets in question are ATASCII, PETSCII, the ZX80 set, >> the Atari ST set, and the TI calculator sets. So far I?ve only >> analyzed the ZX80 set in great detail, revealing 32 characters not in >> the UCS. Most characters are pseudo-graphics, simple pictographs or >> inverted variants of other characters. >> >> Now, one of Unicode?s declared goals is to enable round-trip >> compatibility with legacy encodings. We?ve accumulated a lot of weird >> stuff over the years in the pursuit of this goal. So it would be >> natural to assume that the unencoded characters from the mentioned >> sets would also be eligible for inclusion in the UCS. On the other >> hand, those encodings are for the most part older than Unicode and so >> far there seems to have been little interest in them from the UTC or >> WG2, or any of their contributors. Something tells me that if these >> character sets were important enough to consider for inclusion, they >> would have been encoded a long time ago along with all the other >> stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. >> >> Obviously the character sets in question don?t receive much use >> nowadays (and some weren?t even that relevant in their time, either), >> which leads to me wonder whether further putting work into this >> proposal would be worth it. > > From doug at ewellic.org Mon Mar 27 17:08:27 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 15:08:27 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327150827.665a7a7059d7ee80bb4d670165c8327d.2fd8210960.wbe@email03.godaddy.com> Ken Whistler wrote: > By the way, if anybody is looking, Pomerania is there: "plpm" among > the 4925 other valid unicode_subdivision_id values. So: > > Flag of Pomerania = 1F3F4 E0070 E006C E0070 E006D E007F > > But alas, that is not a *valid* emoji tag sequence (yet), so no soup > for you! This is a major letdown, after almost two years following the progress of flag tag sequences, to find that the arguments that "these three flags are special because they appear in international sports" have won the day and the others are demoted to "non-standard." That was never implied in any of the published UTC documents before. I've collected well over 800 subdivision flags, and I'm sure there are hundreds more, each with its own proud constituency. Vendors don't want to bother adding a glyph for Saskatchewan or Neuqu?n or Yamagata? They don't have to; they never had to. But now they're essentially being told not to. This was the only aspect of emoji I had the slightest interest in. Boo. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Mon Mar 27 17:10:17 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 15:10:17 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327151017.665a7a7059d7ee80bb4d670165c8327d.fb76d7ed31.wbe@email03.godaddy.com> Philippe Verdy wrote: > So it's up to the UTC to create this encoding: this new relase is a > start for a new vexillology registry (within encoded sequences) which > creates a new standard for them. Fine. If you think you can persuade UTC that this is within their scope, go ahead. Let us know how that works out. -- Doug Ewell | Thornton, CO, US | ewellic.org From jr at qsm.co.il Mon Mar 27 17:43:17 2017 From: jr at qsm.co.il (Jonathan Rosenne) Date: Mon, 27 Mar 2017 22:43:17 +0000 Subject: Encoding of old compatibility characters In-Reply-To: <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> Message-ID: GROUP MARK Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Fr?d?ric Grosshans Sent: Tuesday, March 28, 2017 1:05 AM To: unicode Subject: Re: Encoding of old compatibility characters Another example, about to be encoded, it the GOUP MARK, used on old IBM computers (proposal: ML threads: http://www.unicode.org/mail-arch/unicode-ml/y2015-m01/0040.html , and http://unicode.org/mail-arch/unicode-ml/y2007-m05/0367.html ) Le 27/03/2017 ? 23:46, Fr?d?ric Grosshans a ?crit : > An example of a legacy character successfully encoded recently is ? > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. > > > Le 27/03/2017 ? 16:44, Charlotte Buff a ?crit : >> I?ve recently developed an interest in old legacy text encodings and >> noticed that there are various characters in several sets that don?t >> have a Unicode equivalent. I had already started research into these >> encodings to eventually prepare a proposal until I realised I should >> probably ask on the mailing list first whether it is likely the UTC >> will be interested in those characters before I waste my time on a >> project that won?t achieve anything in the end. >> >> The character sets in question are ATASCII, PETSCII, the ZX80 set, >> the Atari ST set, and the TI calculator sets. So far I?ve only >> analyzed the ZX80 set in great detail, revealing 32 characters not in >> the UCS. Most characters are pseudo-graphics, simple pictographs or >> inverted variants of other characters. >> >> Now, one of Unicode?s declared goals is to enable round-trip >> compatibility with legacy encodings. We?ve accumulated a lot of weird >> stuff over the years in the pursuit of this goal. So it would be >> natural to assume that the unencoded characters from the mentioned >> sets would also be eligible for inclusion in the UCS. On the other >> hand, those encodings are for the most part older than Unicode and so >> far there seems to have been little interest in them from the UTC or >> WG2, or any of their contributors. Something tells me that if these >> character sets were important enough to consider for inclusion, they >> would have been encoded a long time ago along with all the other >> stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. >> >> Obviously the character sets in question don?t receive much use >> nowadays (and some weren?t even that relevant in their time, either), >> which leads to me wonder whether further putting work into this >> proposal would be worth it. > > From markus.icu at gmail.com Mon Mar 27 18:33:43 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 16:33:43 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 1:34 PM, Ken Whistler wrote: > Anybody could *attempt* to convey a flag of Pomerania (a rather handsome > black gryphon on a yellow background, btw) with an emoji tag sequence right > now, I suppose. I suppose not. Since it's bound to ISO 3166 subdivision codes (possibly with CLDR additions), it would have to be "demv" for https://en.wikipedia.org/wiki/Mecklenburg-Vorpommern or codes for adjacent regions in Poland. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Mon Mar 27 18:35:18 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 16:35:18 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to ensure > this stability by only allowing sequences that are effectively registered > in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt > (independantly of the registration ins ISO3166-2), and nothing is said if > ever ISO3166-2 obsoletes some codes and then some years later decide to > reassign these codes to new entities: it should not be possible to do the > same thing in Emoji sequences, and specific assignments will need to be > made in the Unicode database. > The emoji sequences are stable. Please read http://www.unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences and follow the links to the CLDR spec and data. Let SD be the result of mapping each character in the tag_spec to a character in [0-9a-z] by subtracting 0xE0000. 1. SD must then be a specification as per [CLDR ] of either a Unicode subdivision_id ( data ) or a 3-digit unicode_region_subtag ( data ), and 2. SD must have CLDR idStatus equal to "regular" or "deprecated". markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 18:58:24 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 01:58:24 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: This only describes the sequences encoded with 2 characters, not the newer longer sequences for flags of subnational regions. the unicode_region_subtag data does not contain anything about the flags for the first 3 regions in GB. 2017-03-28 1:35 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy > wrote: > >> Note also that ISO3166-2 is far from being stable, and this could >> contradict Unicode encoding stability: it would then be required to ensure >> this stability by only allowing sequences that are effectively registered >> in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt >> (independantly of the registration ins ISO3166-2), and nothing is said if >> ever ISO3166-2 obsoletes some codes and then some years later decide to >> reassign these codes to new entities: it should not be possible to do the >> same thing in Emoji sequences, and specific assignments will need to be >> made in the Unicode database. >> > > The emoji sequences are stable. Please read http://www.unicode.org/ > reports/tr51/proposed.html#valid-emoji-tag-sequences and follow the links > to the CLDR spec and data. > > Let SD be the result of mapping each character in the tag_spec to a > character in [0-9a-z] by subtracting 0xE0000. > > > 1. SD must then be a specification as per [CLDR > ] of either > a Unicode subdivision_id > > (data > ) > or a 3-digit unicode_region_subtag > ( > data > ), > and > 2. SD must have CLDR idStatus equal to "regular" or "deprecated". > > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Mar 27 19:04:06 2017 From: prosfilaes at gmail.com (David Starner) Date: Tue, 28 Mar 2017 00:04:06 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: On Mon, Mar 27, 2017 at 1:34 AM Martin J. D?rst wrote: > The qualification 'minor' is less important for an alphabet. In general, > the more established and well-known an alphabet is, the wider the > variations of glyph shapes that may be tolerated. > My problem with that is that a new script is likely to have wider variation in properties. It invites people to tinker, with the possibility that any new changes have a chance to become popular. And variants that show up in Latin script, like http://www.gutenberg.org/files/20130/20130-h/20130-h.htm , don't tend to get encoded unless they have serious support. When the discussion of the Hopi-English dictionary comes up, I'm reminded that the Siouian alphabet for Latin, https://commons.wikimedia.org/wiki/File:BAE-Siouan_Alphabet.png , was rejected for encoding, at least on this list, because it was only used in one set of publications that were distributed to every major library in the US, unlike the Hopi dictionary that was stuck in an archive somewhere. -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Mon Mar 27 19:06:40 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 17:06:40 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 4:58 PM, Philippe Verdy wrote: > This only describes the sequences encoded with 2 characters, not the newer > longer sequences for flags of subnational regions. the > unicode_region_subtag data does not contain anything about the flags for > the first 3 regions in GB. > Please read again what I quoted, and do follow the links. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 19:06:36 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 02:06:36 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Also these yellow statements from the initial proposal are contradicting what is now published in TR51: "UN" and "EU" are accepted even if they are "macroregions", not satisfying the quoted condition 2 in the proposed update. 2017-03-28 1:58 GMT+02:00 Philippe Verdy : > This only describes the sequences encoded with 2 characters, not the newer > longer sequences for flags of subnational regions. the > unicode_region_subtag data does not contain anything about the flags for > the first 3 regions in GB. > > 2017-03-28 1:35 GMT+02:00 Markus Scherer : > >> On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy >> wrote: >> >>> Note also that ISO3166-2 is far from being stable, and this could >>> contradict Unicode encoding stability: it would then be required to ensure >>> this stability by only allowing sequences that are effectively registered >>> in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt >>> (independantly of the registration ins ISO3166-2), and nothing is said if >>> ever ISO3166-2 obsoletes some codes and then some years later decide to >>> reassign these codes to new entities: it should not be possible to do the >>> same thing in Emoji sequences, and specific assignments will need to be >>> made in the Unicode database. >>> >> >> The emoji sequences are stable. Please read >> http://www.unicode.org/reports/tr51/proposed.html#valid- >> emoji-tag-sequences and follow the links to the CLDR spec and data. >> >> Let SD be the result of mapping each character in the tag_spec to a >> character in [0-9a-z] by subtracting 0xE0000. >> >> >> 1. SD must then be a specification as per [CLDR >> ] of >> either a Unicode subdivision_id >> >> (data >> ) >> or a 3-digit unicode_region_subtag >> >> (data >> ), >> and >> 2. SD must have CLDR idStatus equal to "regular" or "deprecated". >> >> >> markus >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 19:09:26 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 02:09:26 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I followed the links. Check your links, you are referencing the proposal, and this contradicts the published version 4.0 of TR51. Where is stability ? 2017-03-28 2:06 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 4:58 PM, Philippe Verdy > wrote: > >> This only describes the sequences encoded with 2 characters, not the >> newer longer sequences for flags of subnational regions. the >> unicode_region_subtag data does not contain anything about the flags for >> the first 3 regions in GB. >> > > Please read again what I quoted, and do follow the links. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 19:19:06 2017 From: everson at evertype.com (Michael Everson) Date: Tue, 28 Mar 2017 01:19:06 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: I?ll look into whatever you?re on about the other ?minor? script, but with regard to what you?ve said below, I?m fairly sure I encoded the missing characters there. I believe it was A7AE and A7B0, capital letters turned K and T used in that orthography. There is a problem with turned P and p in that orthography, though, but no one has ever chosen to look at that. But apart from dealing with the turned p, I do not believe it?s correct to say that that alphabet was ?rejected?. Oh, there is a problem with the turned cedilla above; that seems to be missing too. > On 28 Mar 2017, at 01:04, David Starner wrote: > > When the discussion of the Hopi-English dictionary comes up, I'm reminded that the Siouian alphabet for Latin, https://commons.wikimedia.org/wiki/File:BAE-Siouan_Alphabet.png , was rejected for encoding, at least on this list, because it was only used in one set of publications that were distributed to every major library in the US, unlike the Hopi dictionary that was stuck in an archive somewhere. From mark at kli.org Mon Mar 27 19:22:04 2017 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 27 Mar 2017 20:22:04 -0400 Subject: Encoding of old compatibility characters In-Reply-To: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> Message-ID: <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> On 03/27/2017 05:46 PM, Fr?d?ric Grosshans wrote: > An example of a legacy character successfully encoded recently is ? > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. Aw, but ? is awesome! It's much cooler-looking and more visually understandable than "e" for exponent notation. In some code I've been playing around with I support it as a valid alternative to "e". ~mark From markus.icu at gmail.com Mon Mar 27 19:28:10 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 17:28:10 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy wrote: > I followed the links. Check your links, you are referencing the proposal, > and this contradicts the published version 4.0 of TR51. Where is stability ? > Of course I am pointing to the proposal. The version of TR 51 under review adds a mechanism that didn't exist before. It's an addition, not a contradiction. Once it's there it will be stable. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 20:38:44 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 03:38:44 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I try to summarize the situation for France, There are some missing codes France m?tropolitaine (deprecated: [fx]): D?partements m?tropolitains: [fr01~19 fr2a~b fr21~68 fr70-95] (unchanged) [fr6d] Rh?ne (d?partement) (missing, included in [fr69]?) Statuts particuliers: [fr69] Rh?ne (circonscription d?partementale) [fr6m] M?tropole de Lyon (missing, included in [fr69]?) R?gions m?tropolitaines: [frara] Auvergne-Rh?ne-Alpes (new) - Auvergne (former) (deprecated: [frc]) - Rh?ne-Alpes (former) (deprecated: [frv]) [frbfc] Bourgogne-Franche-Comt? (new) - Bourgogne (former) (deprecated: [frd]) - Franche-Comt? (former) (deprecated: [fri]) [frbre] Bretagne (unchanged) (deprecated: [fre]) [frcor] Corse (collectivit? territoriale de) (deprecated: [frh]) [frcvl] Centre-Val de Loire (deprecated: [frf]) [frges] Grand-Est (new) - Alsace (former) (deprecated: [fra]) - Champagne-Ardenne (former) (deprecated: [frg]) - Franche-Comt? (former) (deprecated: [frm]) [frhdf] Hauts-de-France (new) - Nord-Pas-de-Calais (former) (deprecated: [fro]) - Picardie (former) (deprecated: [frs]) [fridf] ?le-de-France (deprecated: [frj]) [frnaq] Nouvelle-Aquitaine (new) - Aquitaine (former) (deprecated: [frb]) - Limousin (former) (deprecated: [frl) - Poitou-Charentes (former) (deprecated: [frt]) [frnor] Normandie (new) - Basse-Normandie (former) (deprecated: [frp]) - Haute-Normandie (former) (deprecated: [frq]) [frocc] Occitanie (new) - Languedoc-Roussillon (former) (deprecated: [frk]) - Midi-Pyr?n?es (former) (deprecated: [frn]) [frpac] Provence-Alpes-Cote d'Azur (deprecated: [fru]) [frpdl] Pays de la Loire (deprecated: [frr]) D?partements/r?gions d'outre-mer (DOM/ROM): [gp] Guadeloupe (d?partement) (deprecated: [frgp]) [frgua] Guadeloupe (r?gion) [mq] Martinique (d?partement) (deprecated: [frmq]) [frmar] Martinique (ancienne r?gion) (missing?) [gf] Guyane (d?partement) (deprecated: [frgf]) [frguy] Guyane (ancienne r?gion) (missing?) [yt] Mayotte (d?partement) (deprecated: [fryt]) [frmay] Mayotte (ancienne collectivit?) [re] La R?union (d?partement) (deprecated: [frre]) [frlre] La R?union (r?gion) Autres outre-mers: Collectivit?s d'outre-mer (COM): [bl] Saint-Barth?lemy (deprecated: [frbl]) [mf] Saint-Martin (partie fran?aise) (deprecated: [frmf]) [pf] Polyn?sie fran?aise (deprecated: [frpf]) [pm] Saint-Pierre-et-Miquelon (deprecated: [frpm]) [tf] Terres australes et antarctiques fran?aises (deprecated: [frtf]) [wf] Wallis-et-Futuna (deprecated: [frwf]) Statuts particuliers: [nc] Nouvelle-Cal?donie (deprecated: [frnc]) [cp] Clipperton (deprecated: [frcp]) 2017-03-28 2:28 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy > wrote: > >> I followed the links. Check your links, you are referencing the proposal, >> and this contradicts the published version 4.0 of TR51. Where is stability ? >> > > Of course I am pointing to the proposal. The version of TR 51 under review > adds a mechanism that didn't exist before. It's an addition, not a > contradiction. Once it's there it will be stable. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Tue Mar 28 00:46:59 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Tue, 28 Mar 2017 14:46:59 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> Message-ID: <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> It would be combinations of Hiragana, Katakana, Kanji. On 03/28/17 02:25, Koji Ishii-san wrote: > I think he meant Kanji/Han ideographic by "committed string". > > 2017-03-27 19:04 GMT+09:00 Takao Fujiwara >: > > On 03/27/17 18:48, Mark Davis ??-san wrote: > > By "committed strings", you mean the hiragana phonetic reading? > > > Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. > After users select one of the converted strings, the converted strings are committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. > > Fujiwara > > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara >> wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > > > > That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it follows other languages. > > Thanks, > Fujiwara > > > > From mark at macchiato.com Tue Mar 28 00:57:52 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 07:57:52 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: To add to what Ken and Markus said: like many other identifiers, there are a number of different categories. 1. *Ill-formed: *"$1" 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence, but is not *valid* according to http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences . 3. *Valid, but not recommended: "usca". *Corresponds to the valid Unicode subdivision code for California according to http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. 4. *Recommended:* "gbsct". Corresponds to the valid Unicode subdivision code for Scotland, and *is* listed in http://unicode.org/Public/emoji/5.0/. As Ken says, the terminology is a little bit in flux for term 'recommended'. TR51 is still open for comment, although we won't make any changes that would invalidate http://unicode.org/Public/emoji/5.0/. ==== I would also encourage people to look at the slides on http://unicode.org/emoji/, together with the speaker notes, since some of those slides present this very issue. I'm sure the people on this list will have some useful comments for improvements. Another item: with Tayfun's help, we updated http://unicode.org/press/emoji.html. If people have any feedback on other articles that should be on that list, please let us know... Mark Mark On Tue, Mar 28, 2017 at 2:28 AM, Markus Scherer wrote: > On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy > wrote: > >> I followed the links. Check your links, you are referencing the proposal, >> and this contradicts the published version 4.0 of TR51. Where is stability ? >> > > Of course I am pointing to the proposal. The version of TR 51 under review > adds a mechanism that didn't exist before. It's an addition, not a > contradiction. Once it's there it will be stable. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 01:12:17 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 08:12:17 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: (I'm sure you know this, Philippe, but a reminder for others: as far as the Unicode projects go, discussions on this list have no effect unless they are turned into a submission (UTC or Emoji proposal, CLDR or ICU ticket).) If you see any problems in the CLDR data, please file a ticket at http://unicode.org/cldr/trac/newticket. Please only include the problem cases. (Note that it is *not* a goal for CLDR to include all ISO subdivisions going back in time; just back to 2015-09. And even there, if an ISO subdivision is introduced after the start of a CLDR version, but retracted before that version releases, it won't be included. If retracted in a later version, it is moved to the deprecated set.) Mark 2017-03-28 3:38 GMT+02:00 Philippe Verdy : > I try to summarize the situation for France, There are some missing codes > > France m?tropolitaine (deprecated: [fx]): > D?partements m?tropolitains: > [fr01~19 fr2a~b fr21~68 fr70-95] (unchanged) > [fr6d] Rh?ne (d?partement) (missing, included > in [fr69]?) > Statuts particuliers: > [fr69] Rh?ne (circonscription d?partementale) > [fr6m] M?tropole de Lyon (missing, included > in [fr69]?) > R?gions m?tropolitaines: > [frara] Auvergne-Rh?ne-Alpes (new) > - Auvergne (former) (deprecated: [frc]) > - Rh?ne-Alpes (former) (deprecated: [frv]) > [frbfc] Bourgogne-Franche-Comt? (new) > - Bourgogne (former) (deprecated: [frd]) > - Franche-Comt? (former) (deprecated: [fri]) > [frbre] Bretagne (unchanged) (deprecated: [fre]) > [frcor] Corse (collectivit? territoriale de) (deprecated: [frh]) > [frcvl] Centre-Val de Loire (deprecated: [frf]) > [frges] Grand-Est (new) > - Alsace (former) (deprecated: [fra]) > - Champagne-Ardenne (former) (deprecated: [frg]) > - Franche-Comt? (former) (deprecated: [frm]) > [frhdf] Hauts-de-France (new) > - Nord-Pas-de-Calais (former) (deprecated: [fro]) > - Picardie (former) (deprecated: [frs]) > [fridf] ?le-de-France (deprecated: [frj]) > [frnaq] Nouvelle-Aquitaine (new) > - Aquitaine (former) (deprecated: [frb]) > - Limousin (former) (deprecated: [frl) > - Poitou-Charentes (former) (deprecated: [frt]) > [frnor] Normandie (new) > - Basse-Normandie (former) (deprecated: [frp]) > - Haute-Normandie (former) (deprecated: [frq]) > [frocc] Occitanie (new) > - Languedoc-Roussillon (former) (deprecated: [frk]) > - Midi-Pyr?n?es (former) (deprecated: [frn]) > [frpac] Provence-Alpes-Cote d'Azur (deprecated: [fru]) > [frpdl] Pays de la Loire (deprecated: [frr]) > D?partements/r?gions d'outre-mer (DOM/ROM): > [gp] Guadeloupe (d?partement) (deprecated: [frgp]) > [frgua] Guadeloupe (r?gion) > [mq] Martinique (d?partement) (deprecated: [frmq]) > [frmar] Martinique (ancienne r?gion) (missing?) > [gf] Guyane (d?partement) (deprecated: [frgf]) > [frguy] Guyane (ancienne r?gion) (missing?) > [yt] Mayotte (d?partement) (deprecated: [fryt]) > [frmay] Mayotte (ancienne collectivit?) > [re] La R?union (d?partement) (deprecated: [frre]) > [frlre] La R?union (r?gion) > Autres outre-mers: > Collectivit?s d'outre-mer (COM): > [bl] Saint-Barth?lemy (deprecated: [frbl]) > [mf] Saint-Martin (partie fran?aise) (deprecated: [frmf]) > [pf] Polyn?sie fran?aise (deprecated: [frpf]) > [pm] Saint-Pierre-et-Miquelon (deprecated: [frpm]) > [tf] Terres australes et antarctiques fran?aises (deprecated: [frtf]) > [wf] Wallis-et-Futuna (deprecated: [frwf]) > Statuts particuliers: > [nc] Nouvelle-Cal?donie (deprecated: [frnc]) > [cp] Clipperton (deprecated: [frcp]) > > > 2017-03-28 2:28 GMT+02:00 Markus Scherer : > >> On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy >> wrote: >> >>> I followed the links. Check your links, you are referencing the >>> proposal, and this contradicts the published version 4.0 of TR51. Where is >>> stability ? >>> >> >> Of course I am pointing to the proposal. The version of TR 51 under >> review adds a mechanism that didn't exist before. It's an addition, not a >> contradiction. Once it's there it will be stable. >> markus >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 01:20:04 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 08:20:04 +0200 Subject: different version of common/annotations/ja.xml In-Reply-To: <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> Message-ID: Ah, yes. Sorry for my confusion. One main purpose for the short names is for TTS, and for that I think people felt that the reading was more useful. However, it would probably be better for the keywords to have the normal spelling. You might consider filing a ticket at http://unicode.org/cldr/trac/newticket with a proposal for change. Mark On Tue, Mar 28, 2017 at 7:46 AM, Takao Fujiwara wrote: > It would be combinations of Hiragana, Katakana, Kanji. > > On 03/28/17 02:25, Koji Ishii-san wrote: > >> I think he meant Kanji/Han ideographic by "committed string". >> >> 2017-03-27 19:04 GMT+09:00 Takao Fujiwara > tfujiwar at redhat.com>>: >> >> On 03/27/17 18:48, Mark Davis ??-san wrote: >> >> By "committed strings", you mean the hiragana phonetic reading? >> >> >> Hiragana is used to the raw text of the phonetic reading by the >> Japanese input method before the conversion. >> After users select one of the converted strings, the converted >> strings are committed on the text. >> I mean the major conversion of ja.xml is useful instead of >> remembering the raw text as the converted result in the input method. >> >> Fujiwara >> >> >> Mark >> ////// >> >> On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara < >> tfujiwar at redhat.com > tfujiwar at redhat.com >> >> wrote: >> >> Hi, >> >> Do you have any chances to create a different version of >> ja.xml of the Japanese emoji annotation? >> http://unicode.org/cldr/trac/browser/tags/latest/common/anno >> tations/ja.xml >> > otations/ja.xml> >> > otations/ja.xml >> > otations/ja.xml>> >> >> That file includes Hiragana only but I'd need another file >> which has the committed strings, likes ja_convert.xml. >> E.g. >> ? | ?? | ???? >> >> instead of >> >> ??? | ???? | >> ???? >> >> I think the committed version is useful without input method >> and it follows other languages. >> >> Thanks, >> Fujiwara >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Mar 28 01:32:03 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 15:32:03 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> Message-ID: On 2017/03/28 01:03, Michael Everson wrote: > On 27 Mar 2017, at 16:56, John H. Jenkins wrote: > The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. Good to have some actual examples. However, the example at hand does, as far as I understand it, not necessarily support separate encoding. While it mixes 1855 and 1859, it contains only one of the ligature variants each. Indeed, it could be taken as support for the theory that the top and bottom row ligatures in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg were used interchangeably, and that the 1857 St Louis punches just made one particular choice of glyph selection. What would give a strong argument would be the *concurrent* existence of *corresponding* ligatures in the same font, or the concurrent (even better, contrasting) use of corresponding ligatures in the same text. Regards, Martin. What's interesting (weird?) is that the "1859" OI appears in 1857 punches. Time travel? Or is the label "1859" a misnomer or just a convention? From tfujiwar at redhat.com Tue Mar 28 02:49:52 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Tue, 28 Mar 2017 16:49:52 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> Message-ID: <85acb454-de31-51ce-369c-44c900c9a7c6@redhat.com> Thanks, I will file that ticket. I'd like to have another version of ja.xml for both TTS and non-TTS. Fujiwara On 03/28/17 15:20, Mark Davis ??-san wrote: > Ah, yes. Sorry for my confusion. > > One main purpose for the short names is for TTS, and for that I think people felt that the reading was more useful. However, it would probably be > better for the keywords to have the normal spelling. You might consider filing a ticket at http://unicode.org/cldr/trac/newticket with a proposal for > change. > > Mark > ////// > > On Tue, Mar 28, 2017 at 7:46 AM, Takao Fujiwara > wrote: > > It would be combinations of Hiragana, Katakana, Kanji. > > On 03/28/17 02:25, Koji Ishii-san wrote: > > I think he meant Kanji/Han ideographic by "committed string". > > 2017-03-27 19:04 GMT+09:00 Takao Fujiwara >>: > > On 03/27/17 18:48, Mark Davis ??-san wrote: > > By "committed strings", you mean the hiragana phonetic reading? > > > Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. > After users select one of the converted strings, the converted strings are committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. > > Fujiwara > > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > > >>> wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > > > > >> > > That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it follows other languages. > > Thanks, > Fujiwara > > > > > > From frederic.grosshans at gmail.com Tue Mar 28 04:18:24 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Tue, 28 Mar 2017 11:18:24 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> Message-ID: <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> Le 28/03/2017 ? 02:22, Mark E. Shoulson a ?crit : > Aw, but ? is awesome! It's much cooler-looking and more visually > understandable than "e" for exponent notation. In some code I've been > playing around with I support it as a valid alternative to "e". I Agree 1?3 times with you on this ! Fr?d?ric From verdy_p at wanadoo.fr Tue Mar 28 04:33:41 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 11:33:41 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> Message-ID: Ideally a smart text renderer could as well display that glyph with a leading multiplication sign (a mathematical middle dot) and implicitly convert the following digits (and sign) as real superscript/exponent (using contextual substitution/positioning like for Eastern Arabic/Urdu), without necessarily writing the 10 base with smaller digits. Without it, people will want to use 20? to mean it is the decimal number twenty and not hexadecimal number thirty two. 2017-03-28 11:18 GMT+02:00 Fr?d?ric Grosshans : > Le 28/03/2017 ? 02:22, Mark E. Shoulson a ?crit : > >> Aw, but ? is awesome! It's much cooler-looking and more visually >> understandable than "e" for exponent notation. In some code I've been >> playing around with I support it as a valid alternative to "e". >> > > I Agree 1?3 times with you on this ! > > Fr?d?ric > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joan at montane.cat Tue Mar 28 04:56:02 2017 From: joan at montane.cat (=?UTF-8?Q?Joan_Montan=C3=A9?=) Date: Tue, 28 Mar 2017 11:56:02 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : > To add to what Ken and Markus said: like many other identifiers, there are > a number of different categories. > > 1. *Ill-formed: *"$1" > 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to > http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence > , > but is not *valid* according to http://unicode.org/reports/tr5 > 1/proposed.html#valid-emoji-tag-sequences > > . > 3. *Valid, but not recommended: "usca". *Corresponds to the valid > Unicode subdivision code for California according to > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > > and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. > 4. *Recommended:* "gbsct". Corresponds to the valid Unicode > subdivision code for Scotland, and *is* listed in > http://unicode.org/Public/emoji/5.0/ > . > > As Ken says, the terminology is a little bit in flux for term > 'recommended'. TR51 is still open for comment, although we won't make any > changes that would invalidate http://unicode.org/Public/emoji/5.0/. > Just two remarks 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) arises something like chicken-egg problem. Vendors don't easily add new subdivision-flags (because they aren't recommended), and Unicode doesn't recommend new subdivision flags (because they aren't supported by vendors). 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be valid, but not recommended, Unicode subdivisions codes eligible? For instances, say, could someone adopt California, Texas, Pomerania, or Catalonia flags? Regards, Joan Montan? -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 05:32:55 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 12:32:55 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: ?Good questions.? On Tue, Mar 28, 2017 at 11:56 AM, Joan Montan? wrote: > 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) > arises something like chicken-egg problem. Vendors don't easily add new > subdivision-flags (because they aren't recommended), and Unicode doesn't > recommend new subdivision flags (because they aren't supported by vendors). > ? That isn't really the case. In particular, vendors can propose adding additional subdivisions to the recommended list. The UTC Consideration?s would come into play in assessing those proposals.? So it is certainly possible for there to be (say) a flag of Texas or Catalonia appearing in an Emoji 6.0 release this year. Similarly, Microsoft could propose adding the ninja cat ZWJ sequences. > 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be > valid, but not recommended, Unicode subdivisions codes eligible? For > instances, say, could someone adopt California, Texas, Pomerania, or > Catalonia flags? > ?We only support the recommended list for adoptions. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Mar 28 05:36:40 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 12:36:40 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I note this in TR32 *3.2 Unicode Locale Identifier * EBNF ABNF unicode_locale_id = unicode_language_id (transformed_extensions unicode_locale_extensions? | unicode_locale_extensions? transformed_extensions?) ; = unicode_language_id ([trasformed_extensions [unicode_locale_extensions]] / [unicode_locale_extensions [transformed_extensions]]) * first there's a typo in the ABNF syntax ("trasformed") * the syntax is not strictly equivalent, or the ABNF is unnecessarily not context-free It should better be: EBNF ABNF unicode_locale_id = unicode_language_id (transformed_extensions unicode_locale_extensions? | unicode_locale_extensions transformed_extensions?)?; = unicode_language_id [transformed_extensions [unicode_locale_extensions] / unicode_locale_extensions [transformed_extensions]] 2017-03-28 11:56 GMT+02:00 Joan Montan? : > > > 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : > >> To add to what Ken and Markus said: like many other identifiers, there >> are a number of different categories. >> >> 1. *Ill-formed: *"$1" >> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to >> http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence >> , >> but is not *valid* according to http://unicode.org/reports/tr5 >> 1/proposed.html#valid-emoji-tag-sequences >> >> . >> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >> Unicode subdivision code for California according to >> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >> g-sequences >> >> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >> subdivision code for Scotland, and *is* listed in >> http://unicode.org/Public/emoji/5.0/ >> . >> >> As Ken says, the terminology is a little bit in flux for term >> 'recommended'. TR51 is still open for comment, although we won't make any >> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >> > > Just two remarks > > 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) > arises something like chicken-egg problem. Vendors don't easily add new > subdivision-flags (because they aren't recommended), and Unicode doesn't > recommend new subdivision flags (because they aren't supported by vendors). > > 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be > valid, but not recommended, Unicode subdivisions codes eligible? For > instances, say, could someone adopt California, Texas, Pomerania, or > Catalonia flags? > > > Regards, > Joan Montan? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Mar 28 05:39:13 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 19:39:13 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> Message-ID: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> Hello Michael, others, On 2017/03/27 21:07, Michael Everson wrote: > On 27 Mar 2017, at 06:42, Martin J. D?rst wrote: > >>> The characters in question have different and undisputed origins, undisputed. >> >> If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post. > > And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word ?character? when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it?s as thought nothing had ever been encoded before. I didn't say that you have to change words. I just said that I could agree to a slightly differently worded phrase. And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion. > The sharp s analogy wasn?t useful because whether ?s or ?z users can?t tell either and don?t care. Sorry, but that was exactly the point of this analogy. As to "can't tell", it's easy to ask somebody to look at an actual ? letter and say whether the right part looks more like an s or like a z. On the other hand, users of Deseret may or may not ignore the difference between the 1855 and 1859 shapes when they read. Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different. > No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ?s. And what Antiiqua fonts do, well, you get this: > > https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg Yes. And we are just starting to collect evidence for Deseret fonts. > And there?s nothing unrecognizable about the ?? (< ?? (= ?z)) ligature there. Well, not to somebody used to it. But non-German users quite often use a Greek ? where they should use a ?, so it's no surprise people don't distinguish the ?s and ?z derived glyphs. > The situation in Deseret is different. The graphic difference is definitely bigger, so to an outsider, it's definitely quite impossible to identify the pairs of shapes. But that does in no way mean that these have to be seen as different characters (rather than just different glyphs) by insiders (actual users). To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters. Similar for many fantasy fonts, and for people not very familiar with the Latin script. > Underlying ligature difference is indicative of character identity. Particularly when two resulting ligatures are SO different from one another as to be unrecognizable. And that is the case with EW on the left and OI on the right here: > https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg > > The lower two letterforms are in no way ?glyph variants? of the upper two letterforms. Apart from the stroke of the SHORT I ?? they share nothing in common ? because they come from different sources and are therefore different characters. The range of what can be a glyph variant is quite wide across scripts and font styles. Just that the shapes differ widely, or that the origin is different, doesn't make this conclusive. > Character origin is intimately related to character identity. In most cases, yes. But it's not a given conclusion. > I don?t think that ANY user of Deseret is all that ?average?. Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on ? just as we have medievalists who do the same kind of work. I?m about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters. No, your work wouldn't be impossible. It might be quite a bit more difficult, but not impossible. I have written papers about Han ideographs and Japanese text processing where I had to create my own fonts (8-bit, with mostly random assignments of characters because these were one-off jobs), or fake things with inline bitmap images (trying to get information on the final printer resolution and how many black pixels wide a stem or crossbar would have to be to avoid dropouts, and not being very successful). I have heard the argument that some character variant is needed because of research, history,... quite a few times. If a character has indeed been historically used in a contrasting way, this is definitely a good argument for encoding. But if a character just looked somewhat different a few (hundreds of) years ago, that doesn't make such a good argument. Otherwise, somebody may want to propose new codepoints for Bodoni and Helvetica,... Regards, Martin. From mark at macchiato.com Tue Mar 28 05:49:39 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 12:49:39 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: ?Thanks. Probably best as: unicode_locale_id = unicode_language_id ( transformed_extensions unicode_locale_extensions? | unicode_locale_extensions transformed_extensions? )? ;? even clearer would be two steps: unicode_locale_id = unicode_language_id extensions? ; extensions = transformed_extensions unicode_locale_extensions? | unicode_locale_extensions transformed_extensions? ; ?Could you file a CLDR ticket on this? ? Mark On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy wrote: > I note this in TR32 > *3.2 Unicode Locale Identifier > * > > EBNF > ABNF > > unicode_locale_id > = > unicode_language_id > (transformed_extensions > unicode_locale_extensions? > | unicode_locale_extensions? > transformed_extensions?) ; = unicode_language_id > ([trasformed_extensions > [unicode_locale_extensions]] > / [unicode_locale_extensions > [transformed_extensions]]) > > * first there's a typo in the ABNF syntax ("trasformed") > * the syntax is not strictly equivalent, or the ABNF is unnecessarily not > context-free > > It should better be: > > EBNF > ABNF > > unicode_locale_id > = > unicode_language_id > (transformed_extensions > unicode_locale_extensions? > | unicode_locale_extensions > transformed_extensions?)?; = unicode_language_id > [transformed_extensions > [unicode_locale_extensions] > / unicode_locale_extensions > [transformed_extensions]] > > > > 2017-03-28 11:56 GMT+02:00 Joan Montan? : > >> >> >> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >> >>> To add to what Ken and Markus said: like many other identifiers, there >>> are a number of different categories. >>> >>> 1. *Ill-formed: *"$1" >>> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to >>> http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence >>> , >>> but is not *valid* according to http://unicode.org/reports/tr5 >>> 1/proposed.html#valid-emoji-tag-sequences >>> >>> . >>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>> Unicode subdivision code for California according to >>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>> g-sequences >>> >>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>> subdivision code for Scotland, and *is* listed in >>> http://unicode.org/Public/emoji/5.0/ >>> . >>> >>> As Ken says, the terminology is a little bit in flux for term >>> 'recommended'. TR51 is still open for comment, although we won't make any >>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>> >> >> Just two remarks >> >> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) >> arises something like chicken-egg problem. Vendors don't easily add new >> subdivision-flags (because they aren't recommended), and Unicode doesn't >> recommend new subdivision flags (because they aren't supported by vendors). >> >> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >> valid, but not recommended, Unicode subdivisions codes eligible? For >> instances, say, could someone adopt California, Texas, Pomerania, or >> Catalonia flags? >> >> >> Regards, >> Joan Montan? >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 05:59:00 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 12:59:00 +0200 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> Message-ID: On Tue, Mar 28, 2017 at 12:39 PM, Martin J. D?rst wrote: ?....? No, your work wouldn't be impossible. It might be quite a bit more > difficult, but not impossible. I have written papers about Han ideographs > and Japanese text processing where I had to create my own fonts (8-bit, > with mostly random assignments of characters because these were one-off > jobs), or fake things with inline bitmap images (trying to get information > on the final printer resolution and how many black pixels wide a stem or > crossbar would have to be to avoid dropouts, and not being very successful). > > I have heard the argument that some character variant is needed because of > research, history,... quite a few times. If a character has indeed been > historically used in a contrasting way, this is definitely a good argument > for encoding. But if a character just looked somewhat different a few > (hundreds of) years ago, that doesn't make such a good argument. Otherwise, > somebody may want to propose new codepoints for Bodoni and Helvetica,... > ?I agree with Martin. Moreover, his last paragraphs are getting at the crux of the matter. Unicode is not a registry of glyphs for letters, nor should try to be. Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape. We do not need to capture all of the shapes in https://upload.wikimedia.org/wikipedia/commons/f/fc/Gebrochene_Schriften.png simply because somebody is going to "publish a volume full of" those shapes. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From ian.clifton at chem.ox.ac.uk Tue Mar 28 06:00:25 2017 From: ian.clifton at chem.ox.ac.uk (Ian Clifton) Date: Tue, 28 Mar 2017 12:00:25 +0100 Subject: Encoding of old compatibility characters In-Reply-To: (Philippe Verdy's message of "Tue, 28 Mar 2017 11:33:41 +0200") References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> Message-ID: <4q7f39oed2.fsf@chem.ox.ac.uk> Philippe Verdy writes: > Ideally a smart text renderer could as well display that glyph with a > leading multiplication sign (a mathematical middle dot) and implicitly > convert the following digits (and sign) as real superscript/exponent > (using contextual substitution/positioning like for Eastern > Arabic/Urdu), without necessarily writing the 10 base with smaller > digits. Actually, I would see this as putting unnecessary clutter back in! I would say the advantage of the ? notation, introduced with Algol 60, is that it subsumes and makes implicit the multiplication and exponentiation operators, resulting in a visually compact denotation of a real number in ?scientific notation?, and it does so with a single symbol that hints at its own meaning. I?ve used ? a couple of times, without explanation, in my own emails?without, as far as I?m aware, causing any misunderstanding. > Without it, people will want to use 20? to mean it is the decimal > number twenty and not hexadecimal number thirty two. Yes, this ambiguity is a drawback. Hopefully, the use cases should be sufficiently different that real confusion would be unlikely (and of course, normally, U+23E8 should never be used to denote decimal number base). -- Ian Clifton ? ?: +44 1865 275677 Chemistry Research Laboratory ?: +44 1865 285002 Oxford University ??: ian.clifton at chem.ox.ac.uk Mansfield Road Oxford OX1 3TA UK From verdy_p at wanadoo.fr Tue Mar 28 06:01:47 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 13:01:47 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I just filed the bug in the CLDR contact form. 2017-03-28 12:49 GMT+02:00 Mark Davis ?? : > ?Thanks. Probably best as: > > unicode_locale_id = unicode_language_id > ( transformed_extensions unicode_locale_extensions? > | unicode_locale_extensions transformed_extensions? )? > ;? > > even clearer would be two steps: > > unicode_locale_id = unicode_language_id extensions? ; > > extensions = transformed_extensions unicode_locale_extensions? > | unicode_locale_extensions transformed_extensions? ; > > ?Could you file a CLDR ticket on this? > > ? > Mark > > On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy > wrote: > >> I note this in TR32 >> *3.2 Unicode Locale Identifier >> * >> >> EBNF >> ABNF >> >> unicode_locale_id >> = >> unicode_language_id >> (transformed_extensions >> unicode_locale_extensions? >> | unicode_locale_extensions? >> transformed_extensions?) ; = unicode_language_id >> ([trasformed_extensions >> [unicode_locale_extensions]] >> / [unicode_locale_extensions >> [transformed_extensions]]) >> >> * first there's a typo in the ABNF syntax ("trasformed") >> * the syntax is not strictly equivalent, or the ABNF is unnecessarily not >> context-free >> >> It should better be: >> >> EBNF >> ABNF >> >> unicode_locale_id >> = >> unicode_language_id >> (transformed_extensions >> unicode_locale_extensions? >> | unicode_locale_extensions >> transformed_extensions?)?; = unicode_language_id >> [transformed_extensions >> [unicode_locale_extensions] >> / unicode_locale_extensions >> [transformed_extensions]] >> >> >> >> 2017-03-28 11:56 GMT+02:00 Joan Montan? : >> >>> >>> >>> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >>> >>>> To add to what Ken and Markus said: like many other identifiers, there >>>> are a number of different categories. >>>> >>>> 1. *Ill-formed: *"$1" >>>> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to >>>> http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence >>>> , >>>> but is not *valid* according to http://unicode.org/reports/tr5 >>>> 1/proposed.html#valid-emoji-tag-sequences >>>> >>>> . >>>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>>> Unicode subdivision code for California according to >>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>> g-sequences >>>> >>>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >>>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>>> subdivision code for Scotland, and *is* listed in >>>> http://unicode.org/Public/emoji/5.0/ >>>> . >>>> >>>> As Ken says, the terminology is a little bit in flux for term >>>> 'recommended'. TR51 is still open for comment, although we won't make any >>>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>>> >>> >>> Just two remarks >>> >>> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode >>> site) arises something like chicken-egg problem. Vendors don't easily add >>> new subdivision-flags (because they aren't recommended), and Unicode >>> doesn't recommend new subdivision flags (because they aren't supported by >>> vendors). >>> >>> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >>> valid, but not recommended, Unicode subdivisions codes eligible? For >>> instances, say, could someone adopt California, Texas, Pomerania, or >>> Catalonia flags? >>> >>> >>> Regards, >>> Joan Montan? >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Mar 28 06:26:38 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 20:26:38 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> Message-ID: <06a3292e-a275-8b0d-d03d-5c76e0870977@it.aoyama.ac.jp> I agree with Alstair. The list of font technology options was mostly to show that there are already a lot of options (some might even say too many), so font technology doesn't really limit our choices. Regards, Martin. On 2017/03/27 23:04, Alastair Houghton wrote: > On 27 Mar 2017, at 10:14, Julian Bradfield wrote: >> >> I contend, therefore, that no decision about Unicode should take into >> account any ephemeral considerations such as this year's electronic >> font technology, and that therefore it's not even useful to mention >> them. > > I?d disagree with that, for two reasons: > > 1. Unicode has to be usable *today*; it?s no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don?t have that now. If it isn?t usable today for any given purpose, people won?t use it for that, and will adopt alternative solutions (like using images to represent text). > > 2. ?This year?s electronic font technology? is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future. There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there?s built-in support in the standard text edit control), but again I think it?s reasonable to expect support to grow over time, rather than being removed. > > I don?t think it?s unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems. > > All that said, I?d definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That?s very much not my field. > > Kind regards, > > Alastair. > > -- > http://alastairs-place.net From duerst at it.aoyama.ac.jp Tue Mar 28 06:33:25 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 20:33:25 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com> Message-ID: On 2017/03/28 01:49, Michael Everson wrote: > Sorry, but typographic control of that sort is grand for typesetting, where you can select ranges of text and language-tag it (assuming your program accepts and supports all the language tags you might need (which they don?t)) and you can select fonts which have all the trickery baked into them (hardly any do) and then? can you use this in file names? In your plain-text databases? In your text messages? Do you think that the 1855/1859 distinction is needed in file names? In text messages? It may help in some kinds of databases, but it may also be possible to just tag each piece of text in the database with "1855" or "1859" if that distinction is important (e.g. for historical documents). As far as I understand, we are still looking for actual texts that use both shapes of the same ligature concurrently. Regards, Martin. From duerst at it.aoyama.ac.jp Tue Mar 28 06:38:22 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 20:38:22 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com> Message-ID: <40e15c1e-b175-6dc3-d0df-3eb07e5f0eb8@it.aoyama.ac.jp> On 2017/03/28 01:20, Michael Everson wrote: > Ken transcribes into modern type a letter by Shelton dated 1859, in which ?boy? is written ??, ?few? as ??, ?truefully? [sic] as ????????????, and ?you? as ??. These are all 1859 variants, yes? That would just show that these variants existed (which I think nobody in this discussion has doubted), but not that there was contrasting use. And is that letter hand-written or printed? Regards, Martin. From duerst at it.aoyama.ac.jp Tue Mar 28 07:10:58 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 21:10:58 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> Message-ID: On 2017/03/27 21:59, Michael Everson wrote: > On 27 Mar 2017, at 08:05, Martin J. D?rst wrote: > >>> Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) >> >> "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs. > > I do stipulate not being a CJK expert. But those are indeed different due to their origins, however similar their shapes are. Except for the radicals themselves, I haven't found a contrasting pair. What I think we would need to find to influence the current argumentation (except for general "history is important", which I think we all agree) is a case of a character that originally existed both with a MEAT radical and a MOON radical, but has only a single usage. Then whether there were one or two code points would provide an analog for the situation we have at hand. Also note that there is a difference in meaning. The characters with MEAT radicals mostly refer to body parts and organs. The characters with MOON radicals are mostly time-related. >> Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the ? shape, with two horizontal bars), whereas the H/T/V columns show the ? shape (two downwards slanted bars) for the "MEAT" radical and the ? shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/? > > They are still always very similar, right? Similarity is in the eye of the beholder (or the script). Sometimes, a little dot or hook is irrelevant. Sometimes it's the single difference that makes it a totally different character. >> In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference. > > That?s interesting. Not necessarily for the poor Taiwanese students, and not necessarily for the Japanese who try to find a character in a dictionary ordered by radical :-(. >>> Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. >> >> Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly. > > Do come up with examples if you have any. I had the following in mind: >> The roman/italic a/? and g/? distinctions (the later code points only used to show the distinction in plain text, which could as well be done descriptively), > > Aa and ?? are used contrastively for different sounds in some languages and in the IPA. ?? is not, to my knowledge, used contrastively with Gg (except that ? can only mean /?/, while orthographic g can mean /?/, /d?/, /x/ etc. But g vs ? is reasonably analogous to ?? and ???? being used for /ju?/. The contrastive use *in some languages or notations* (IPA) is the reason these are separately encoded. The fact that these are not contrastively used in most major languages is responsible for the fact that they don't use different code points when used in these languages. It would be a real hassle to have to change from g to ? when switching e.g. from Times Roman to Times Italic. In Deseret, we are still missing any contrastive usage, so that suggests to be careful with encoding. >> as well as a large number of distinctions in Han fonts, come to my mind. It's difficult to show these distinctions, because they are NOT separately encoded, but three-stroke and four-stroke grass radical is the most well known. > And the same goes for the /ju?/ ligatures. The word tube /tju?b/ can be written TY?B ???????? or ?????? or ????. But the unligated the sequences would be pronounced differently: ???????? /tju?b/ and ???????? /t?u?b/ and ???????? /t??b/. Ah, I see. So we seem to have five different ways (counting the two ligature variants) of writing the same word, with three different pronunciations. The important question is whether the two ligatures do imply any difference in pronunciation (as opposed to time of writing or author/printer preference), i.e. whether the ligated sequences ?????? or ???? are pronounced differently (not by a phonologist but by an average user). >> Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)? > > In a handwritten manuscript obviously the choice is the author?s. As to historical printing, printers may have Did you want to write something more here? >> And what informs this choice? If we have any historic metal types, are there examples where a font contains both ligature variants? > > Ken Beesley have samples of a metal font (the 1857 St Luois punches) which had both ?? and ????; I don?t know what other sorts were in that font. As I explained in another post, that may just be a 1855/1859 hybrid. Regards, Martin. From mark at macchiato.com Tue Mar 28 07:22:36 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 14:22:36 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Thanks Mark On Tue, Mar 28, 2017 at 1:01 PM, Philippe Verdy wrote: > I just filed the bug in the CLDR contact form. > > 2017-03-28 12:49 GMT+02:00 Mark Davis ?? : > >> ?Thanks. Probably best as: >> >> unicode_locale_id = unicode_language_id >> ( transformed_extensions unicode_locale_extensions? >> | unicode_locale_extensions transformed_extensions? >> )? ;? >> >> even clearer would be two steps: >> >> unicode_locale_id = unicode_language_id extensions? ; >> >> extensions = transformed_extensions unicode_locale_extensions? >> | unicode_locale_extensions transformed_extensions? ; >> >> ?Could you file a CLDR ticket on this? >> >> ? >> Mark >> >> On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy >> wrote: >> >>> I note this in TR32 >>> *3.2 Unicode Locale Identifier >>> * >>> >>> EBNF >>> ABNF >>> >>> unicode_locale_id >>> = >>> unicode_language_id >>> (transformed_extensions >>> unicode_locale_extensions? >>> | unicode_locale_extensions? >>> transformed_extensions?) ; = unicode_language_id >>> ([trasformed_extensions >>> [unicode_locale_extensions]] >>> / [unicode_locale_extensions >>> [transformed_extensions]]) >>> >>> * first there's a typo in the ABNF syntax ("trasformed") >>> * the syntax is not strictly equivalent, or the ABNF is unnecessarily >>> not context-free >>> >>> It should better be: >>> >>> EBNF >>> ABNF >>> >>> unicode_locale_id >>> = >>> unicode_language_id >>> (transformed_extensions >>> unicode_locale_extensions? >>> | unicode_locale_extensions >>> transformed_extensions?)?; = unicode_language_id >>> [transformed_extensions >>> [unicode_locale_extensions] >>> / unicode_locale_extensions >>> [transformed_extensions]] >>> >>> >>> >>> 2017-03-28 11:56 GMT+02:00 Joan Montan? : >>> >>>> >>>> >>>> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >>>> >>>>> To add to what Ken and Markus said: like many other identifiers, there >>>>> are a number of different categories. >>>>> >>>>> 1. *Ill-formed: *"$1" >>>>> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according >>>>> to http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_ >>>>> sequence, but is not *valid* according to >>>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>>> g-sequences >>>>> >>>>> . >>>>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>>>> Unicode subdivision code for California according to >>>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>>> g-sequences >>>>> >>>>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/ >>>>> . >>>>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>>>> subdivision code for Scotland, and *is* listed in >>>>> http://unicode.org/Public/emoji/5.0/ >>>>> . >>>>> >>>>> As Ken says, the terminology is a little bit in flux for term >>>>> 'recommended'. TR51 is still open for comment, although we won't make any >>>>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>>>> >>>> >>>> Just two remarks >>>> >>>> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode >>>> site) arises something like chicken-egg problem. Vendors don't easily add >>>> new subdivision-flags (because they aren't recommended), and Unicode >>>> doesn't recommend new subdivision flags (because they aren't supported by >>>> vendors). >>>> >>>> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >>>> valid, but not recommended, Unicode subdivisions codes eligible? For >>>> instances, say, could someone adopt California, Texas, Pomerania, or >>>> Catalonia flags? >>>> >>>> >>>> Regards, >>>> Joan Montan? >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Tue Mar 28 07:38:38 2017 From: everson at evertype.com (Michael Everson) Date: Tue, 28 Mar 2017 13:38:38 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> Message-ID: <852E1F82-A015-4616-BA59-8AABBF4FABC2@evertype.com> On 28 Mar 2017, at 07:32, Martin J. D?rst wrote: > On 2017/03/28 01:03, Michael Everson wrote: >> On 27 Mar 2017, at 16:56, John H. Jenkins wrote: > >> The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. > > Good to have some actual examples. However, the example at hand does, as far as I understand it, not necessarily support separate encoding. Of course it does. > While it mixes 1855 and 1859, it contains only one of the ligature variants each. It?s a smoke proof taken from some metal sorts. It shows that at least these two characters were in that font. > Indeed, it could be taken as support for the theory that the top and bottom row ligatures in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg were used interchangeably, and that the 1857 St Louis punches just made one particular choice of glyph selection. "Letters to represent the same diphthong? does not mean ?letters used interchangeably?. These letters have entirely different histories. They are not similar to one another. They are not ?glyph variants? of one another by ANY measure of character identity that I have learned in two decades of this work, where I have examined and successfully proposed a great many characters. Martin, your scepticism just doesn?t convince. It seems like it?s scepticism for its own sake. You only have to, you know, use your EYES to see that 1855 EW looks NOTHING LIKE 1859 EW. Doesn?t matter if they?re used to represent the same sound. That doesn?t mean they?re in free variation. In fact, what it looks like is that early texts may use some letters, later texts may use other letters, and a few texts This is a matter of SPELLING. Of the choice the author makes. It may be important for dating a manuscript. Representing texts as they are written is as important for early Deseret as it is for medieval Latin, to researchers who care to represent the text as it was without normalizing it to one thing or another. > What would give a strong argument would be the *concurrent* existence of *corresponding* ligatures in the same font, or the concurrent (even better, contrasting) use of corresponding ligatures in the same text. Well, ain?t it just too bad that the accident of history has not left us complete print shops with all the fonts that were ever used for Deseret. The origin of these four letters as ligatures of four distinct letters with SHORT I is the right argument for character identity. Recognizability is also a strong argument. We used that when we encoded Phoenician, though some people argued that Semitic studies would collapse if we didn?t treat Phoenician as a font variant of Hebrew. Maybe those of you who don?t have to face the ever-moving bar of encoding criteria over and over again don?t remember that stuff. > What's interesting (weird?) is that the "1859" OI appears in 1857 punches. Time travel? Or is the label "1859" a misnomer or just a convention? I think 1859 refers to a particular publication. Michael Everson From asmusf at ix.netcom.com Tue Mar 28 08:09:00 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 28 Mar 2017 06:09:00 -0700 Subject: Encoding of old compatibility characters In-Reply-To: <4q7f39oed2.fsf@chem.ox.ac.uk> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> Message-ID: <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> An HTML attachment was scrubbed... URL: From everson at evertype.com Tue Mar 28 08:56:28 2017 From: everson at evertype.com (Michael Everson) Date: Tue, 28 Mar 2017 14:56:28 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> Message-ID: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> On 28 Mar 2017, at 11:39, Martin J. D?rst wrote: >> And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word ?character? when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it?s as thought nothing had ever been encoded before. > > I didn't say that you have to change words. I just said that I could agree to a slightly differently worded phrase. An ? ligature is a ligature of a and of e. It is not some sort of pretzel. What Deseret has is this: 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE * officially named ?ew? in the code chart * used for ew in earlier texts 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE * officially named ?oi? in the code chart * used for oi in earlier texts 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE * used for oi in later texts 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE * used for ew in later texts Don?t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character. Don?t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character. To do so is to show no understanding of the history of writing systems at all. You?re smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here. > And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion. The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don?t think we haven?t observed this. >> The sharp s analogy wasn?t useful because whether ?s or ?z users can?t tell either and don?t care. > > Sorry, but that was exactly the point of this analogy. As to "can't tell", it's easy to ask somebody to look at an actual ? letter and say whether the right part looks more like an s or like a z. By ?can?t tell? I mean ?recognize as essentially the same letterform?. The streetsigns in some German cities use a very ?? if you look at it and know anything about typography. Most people probably don?t notice. They see ? and that?s precisely because ?s and ?? look very much alike. > On the other hand, users of Deseret may or may not ignore the difference between the 1855 and 1859 shapes when they read. The people who wrote the manuscripts are dead. Most readers and writers of Deseret today use the shapes that are in their fonts, which are those in the Unicode charts, and most texts published today don?t use the EW and OI ligatures at all, because that?s John Jenkins? editorial practice. The need to distinguish these letters (which are distinguished because of their history as letterforms, not because of the diphthong) is no different from the reason we encoded these ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?. Scholars required those. Manuscripts may contain them side by side. Or their usage may be separated by hundreds of kilometres or hundreds of years. There is no difference. There were pages of discussion as to WHY scholars needed the medievalist characters. The counter argument was ?Why not normalize?? We had similar pages of discussion as to WHY Uralicists needed the great many characters we encoded for them. Why is it that you people can encode BROCCOLI on the basis of nothing but ?people might like it? but we cannot use sound existing precedent to encode characters which (while similar in use to other characters) are an index of orthographic change in a historical script and orthography? There are plenty of ?glyph variations? in early Deseret texts vis ? vis which I?d ignore. This isn?t one of them. > Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different. Martin, there is no answer to this unless you can read the minds of people who are dead a century or more. Therefore it is not a useful criterion, and the other criteria (letter origin, spelling choice) are the indices which must guide our understanding. The result of those criteria is that there are four characters here, not two. > No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ?s. And what Antiiqua fonts do, well, you get this: >> >> https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg > > Yes. And we are just starting to collect evidence for Deseret fonts. Well you aren?t going to get full repertoires from the 19th-century lead type because they don?t exist. We have what we have of them, and we have the manuscripts. As to modern digital typefaces, there are NONE which support the 1859 letters. And I?ve seen most of them. >> And there?s nothing unrecognizable about the ?? (< ?? (= ?z)) ligature there. > > Well, not to somebody used to it. But non-German users quite often use a Greek ? where they should use a ?, so it's no surprise people don't distinguish the ?s and ?z derived glyphs. I?ve received German texts which used Greek ?. But that?s not the point. People don?t distinguish the ?s and ?? glyphs because they look pretty much the same AND there?s no reason to distinguish them. A world of difference between that and the Deseret LETTERs WITH STROKE. >> The situation in Deseret is different. > > The graphic difference is definitely bigger, For pity?s sake, Martin. ?? ?? look NOTHING ALIKE. And ?? and ?? look NOTHING ALIKE. This isn?t anything like ?s and ?? and ?z and ?. > so to an outsider, it's definitely quite impossible to identify the pairs of shapes. But that does in no way mean that these have to be seen as different characters (rather than just different glyphs) by insiders (actual users). They had a script reform and they cut new type. The did this on purpose. Note that in their ligatures they shifted from SHORT AH and LONG OO to LONG AH and SHORT OO. > To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters. I do not believe you. If this were true menus in restaurants and public signage on shops wouldn?t have Fraktur at all. It?s true that sometimes the orthography on such things is bad, as where they don?t use ligatures correctly or the ? at all. I?ll stipulate that few Germans can read S?tterlin or similar hands. :-) > Similar for many fantasy fonts, and for people not very familiar with the Latin script. What?s a fantasy font? And what does this have to do with supporting the encoding in plain text of historical documents in the Deseret script? >> The lower two letterforms are in no way ?glyph variants? of the upper two letterforms. Apart from the stroke of the SHORT I ?? they share nothing in common ? because they come from different sources and are therefore different characters. > > The range of what can be a glyph variant is quite wide across scripts and font styles. Just that the shapes differ widely, or that the origin is different, doesn't make this conclusive. LONG OO WITH STROKE is not a glyph variant of SHORT OO WITH STROKE. LONG AH WITH STROKE is not a glyph variant of SHORT AH WITH STROKE. >> I don?t think that ANY user of Deseret is all that ?average?. Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on ? just as we have medievalists who do the same kind of work. I?m about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters. > > No, your work wouldn't be impossible. It might be quite a bit more difficult, but not impossible. No. Wrong. Wrong, wrong, wrong. No, Martin. We encoded the Latin characters on the basis of good arguments. You do NOT get to invalidate that, or to pretend that the encoding of those characters was a mistake, or anything like it. Many scholars ? including myself ? use these characters, and that is what the Universal Character Set is for. Also, apparently, it is for pictures of BROCCOLI. > I have written papers about Han ideographs and Japanese text processing where I had to create my own fonts (8-bit, with mostly random assignments of characters because these were one-off jobs), or fake things with inline bitmap images (trying to get information on the final printer resolution and how many black pixels wide a stem or crossbar would have to be to avoid dropouts, and not being very successful). All of use make use of nonce glyphs for examples. That?s not the same as making an edition of a medieval Cornish text, or of a Mormon diary. We do NOT want to have to use font trickery > I have heard the argument that some character variant is needed because of research, history,... quite a few times. If a character has indeed been historically used in a contrasting way, Contrast may be geographical or temporal. > this is definitely a good argument for encoding. But if a character just looked somewhat different a few (hundreds of) years ago, Also, LATIN LETTER D WITH STROKE is a different letter from LATIN LETTER T WITH STROKE. Why? Because the underlying letters are different. And it?s no different for Deseret. Your suggestion that LONG AH WITH STROKE and SHORT AH WITH STROKE are the same character is unsupportable. > that doesn't make such a good argument. Otherwise, somebody may want to propose new codepoints for Bodoni and Helvetica,? This suggestion is nonsense. On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: > ?I agree with Martin. > > Moreover, his last paragraphs are getting at the crux of the matter. Unicode is not a registry of glyphs for letters, nor should try to be. DESERET LETTER LONG AH WITH STROKE is not a glyph variant of DESERET LETTER SHORT AH WITH STROKE. > Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape. Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding?s sake. > We do not need to capture all of the shapes in https://upload.wikimedia.org/wikipedia/commons/f/fc/Gebrochene_Schriften.png simply because somebody is going to "publish a volume full of" those shapes. That analogy has nothing to do with the discussion about the Deseret letters. On 28 Mar 2017, at 12:33, Martin J. D?rst wrote: > Do you think that the 1855/1859 distinction is needed in file names? In text messages? It may help in some kinds of databases, but it may also be possible to just tag each piece of text in the database with "1855" or "1859" if that distinction is important (e.g. for historical documents). As far as I understand, we are still looking for actual texts that use both shapes of the same ligature concurrently. I think that this is the sort of distinction that should be made in plain text, yes. The 1859 letters are not "glyph variants? of the 1855 letters by any criterion in the history of writing systems that I recognize. On 2017/03/28 01:20, Michael Everson wrote: >> Ken transcribes into modern type a letter by Shelton dated 1859, in which ?boy? is written ??, ?few? as ??, ?truefully? [sic] as ????????????, and ?you? as ??. > > These are all 1859 variants, yes? Yes, it was one letter written by one person at one sitting and he used one orthography and he didn?t mix it with the other orthography. > That would just show that these variants existed (which I think nobody in this discussion has doubted), but not that there was contrasting use. And is that letter hand-written or printed? They had a script reform. At first Mormons used the letter SHORT AH WITH STROKE [??] for /??/ and then later they used LONG AH WITH STROKE [???] for /??/. And at first Mormons used the letter LONG OO WITH STROKE [?u?] for /ju?/ and then later they used SHORT OO WITH STROKE [??] for /ju?/. And some Mormons didn?t use either, they just wrote the diphthongs with digraphs of other letters. On 28 Mar 2017, at 13:10, Martin J. D?rst wrote: >> And the same goes for the /ju?/ ligatures. The word tube /tju?b/ can be written TY?B ???????? or ?????? or ????. But the unligated the sequences would be pronounced differently: ???????? /tju?b/ and ???????? /t?u?b/ and ???????? /t??b/. > > Ah, I see. So we seem to have five different ways (counting the two ligature variants) of writing the same word, That?s called spelling. > with three different pronunciations. No, that?s wrong. I give those transcriptions to show the usual meanings of the Deseret letters. So if you were going to write ?tube? /tju?b/ you would write ???????? or ?????? or ????. In the second sentence I show that while the ligated letters ?? and can be used for /ju?/ the unligated sequences ???? and ???? would in principle be pronounced /?u?/ and /??/ respectively. Obviously the pronunciation of the word ?tube? would not have changed for speakers of English in Mormon territories in the middle of the 19th century. (Of course many dialects of English in North America now have /tu?b/ rather than /tju?b/ but that is not relevant here. > The important question is whether the two ligatures do imply any difference in pronunciation (as opposed to time of writing or author/printer preference), i.e. whether the ligated sequences ?????? or ???? are pronounced differently (not by a phonologist but by an average user). No, it?s spelling. Michael Everson From richard.wordingham at ntlworld.com Tue Mar 28 11:14:35 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 28 Mar 2017 17:14:35 +0100 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> Message-ID: <20170328171435.39d8bd40@JRWUBU2> On Tue, 28 Mar 2017 21:10:58 +0900 "Martin J. D?rst" wrote: (in Re: Standaridized variation sequences for the Desert alphabet?) > On 2017/03/27 21:59, Michael Everson wrote: > > Aa and ?? are used contrastively for different sounds in some > > languages and in the IPA. ?? is not, to my knowledge, used > > contrastively with Gg (except that ? can only mean /?/, while > > orthographic g can mean /?/, /d?/, /x/ etc. But g vs ? is > > reasonably analogous to ?? and ???? being used for /ju?/. > The contrastive use *in some languages or notations* (IPA) is the > reason these are separately encoded. I thought that reason is that at the time, the IPA proscribed the use of the two-storey 'g' in phonetic notation. They have since relented. This was disunification on the basis that one form simply looks wrong. Which writing system contrasts the two? Richard. From asmusf at ix.netcom.com Tue Mar 28 11:30:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 28 Mar 2017 09:30:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> On 3/28/2017 6:56 AM, Michael Everson wrote: > An ? ligature is a ligature of a and of e. It is not some sort of pretzel. We need a pretzel emoji. A./ From frederic.grosshans at gmail.com Tue Mar 28 11:35:41 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Tue, 28 Mar 2017 18:35:41 +0200 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: <20170328171435.39d8bd40@JRWUBU2> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: Le 28/03/2017 ? 18:14, Richard Wordingham a ?crit : > On Tue, 28 Mar 2017 21:10:58 +0900 > "Martin J. D?rst" wrote: > (in Re: Standaridized variation sequences for the Desert alphabet?) > >> On 2017/03/27 21:59, Michael Everson wrote: >>> Aa and ?? are used contrastively for different sounds in some >>> languages and in the IPA. ?? is not, to my knowledge, used >>> contrastively with Gg (except that ? can only mean /?/, while >>> orthographic g can mean /?/, /d?/, /x/ etc. But g vs ? is >>> reasonably analogous to ?? and ???? being used for /ju?/. >> The contrastive use *in some languages or notations* (IPA) is the >> reason these are separately encoded. > [...] > Which writing system contrasts the two? I had found in 2013 a G? contrast in mathematical notations of an old (1952) physics book (see http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0092.html) Fr?d?ric From verdy_p at wanadoo.fr Tue Mar 28 11:47:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 18:47:53 +0200 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> Message-ID: 2017-03-28 18:30 GMT+02:00 Asmus Freytag : > On 3/28/2017 6:56 AM, Michael Everson wrote: > >> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. >> > We need a pretzel emoji. We need a broken tooth emoji too ! -------------- next part -------------- An HTML attachment was scrubbed... URL: From jknappen at web.de Tue Mar 28 11:52:24 2017 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Tue, 28 Mar 2017 18:52:24 +0200 Subject: Aw: Re: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2>, Message-ID: An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Tue Mar 28 12:26:16 2017 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Tue, 28 Mar 2017 17:26:16 +0000 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: I don't think it is a script capital G, but I admit it is arguable. One of the reasons is that the related variables s and ? are not script capital. If you're interested, I could check in the book if script capital are used in this book for other notations. Le mar. 28 mars 2017 ? 18:52, "J?rg Knappen" a ?crit : This is a script capital G or, in TeX notation, {\cal G}. It reflects the use of multiple styles of the same underlying alhabet in mathematics and sciences. It is not a capital script g (note the different ordering of capital and script). --J?rg Knappen I had found in 2013 a G? contrast in mathematical notations of an old (1952) physics book (see http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0092.html) Fr?d?ric -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedberg at apple.com Tue Mar 28 12:30:04 2017 From: pedberg at apple.com (Peter Edberg) Date: Tue, 28 Mar 2017 10:30:04 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> Message-ID: <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> > On Mar 28, 2017, at 9:30 AM, Asmus Freytag wrote: > > On 3/28/2017 6:56 AM, Michael Everson wrote: >> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. > We need a pretzel emoji. Already in Unicode 10 / emoji 5.0: http://www.unicode.org/emoji/charts/emoji-released.html#1f968 > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 28 13:41:38 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 28 Mar 2017 11:41:38 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Mark Davis wrote: > 3. Valid, but not recommended: "usca". Corresponds to the valid > Unicode subdivision code for California according to >?http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. "Not recommended" is no better and no less disappointing than "not standard." Both phrases imply strongly that the sequence, while syntactically valid, should not be used. Burying a disclaimer that "implementations can support them, but they may not interoperate well" in the speaker's notes of slide 38 of a 53-page presentation does nothing to change this perception. "Even though it is possible to support the US states, or any subset of them, implementations don?t have to." Well, of course they don't. Implementations don't have to support the three British flags either if they don't want to, or any national flags or other emoji, or any particular character for that matter. The superfluous statement is easily reduced to "Don't do this." Joan Montan?'s return to the list to comment on this issue was interesting because of a post from February 2015, in which Andrea Giammarchi reported [1] on Joan's request [2] for Twitter to support flags for specific "active online communities" that happened to have a TLD, by stringing three or more Regional Indicator Symbols together: > [S][C][O][T] --> it shows Scottish flag > [C][Y][M][R][U] --> it shows a Welsh flag > [B][Z][H] --> it shows a Breton flag > [C][A][T] --> it shows Catalan flag > [E][U][S] --> it shows a Basque flag > [G][A][L] --> it shows a Gallician flag [1] http://www.unicode.org/mail-arch/unicode-ml/y2015-m02/0039.html [2] https://github.com/twitter/twemoji/issues/40 Of course this approach was incompatible with conformant use of RIS; visit [2] with an RIS-conformant browser to see the inadvertently displayed flags of Seychelles, Cyprus, Belize, Canada, etc. I don't know if the ensuing thread helped inspire ESC to pursue the present mechanism involving sequences of Plane 14 tags -- the earliest mention I can find is PRI #299, just a few months later -- but the intent seemed straightforward and sensible: provide an official, conformant mechanism to support a recognized user need, with a suitable fallback strategy, rather than encouraging users via inaction to adopt a non-conformant and broken solution. Unfortunately, the follow-up turned out to be "... and then discourage THAT mechanism as well, except in a couple of selected cases, and tell people to use stickers instead." If this story sounds vaguely familiar to old-timers, it's exactly the path that was followed the last time Plane 14 tag characters were under discussion, between 1998 and 2000: someone wrote an RFC to embed language tags in plain text using invalid UTF-8 sequences; Unicode responded by introducing a proper, conformant mechanism to use Plane 14 characters instead; then the conformant replacement mechanism itself was deprecated and users were told to use out-of-band tagging, exactly what the original RFC sought to avoid. "Not recommended," "not standard," "not interoperable," or any other term ESC settles on for the 5000+ valid flag sequences that are not England, Scotland, and Wales is just a short, easy step away from deprecation for these as well. -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Tue Mar 28 15:09:19 2017 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 28 Mar 2017 13:09:19 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix. netcom.com> <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> Message-ID: <62deac46-7dbd-6e35-8c28-cb2628799848@ix.netcom.com> On 3/28/2017 10:30 AM, Peter Edberg wrote: > >> On Mar 28, 2017, at 9:30 AM, Asmus Freytag > > wrote: >> >> On 3/28/2017 6:56 AM, Michael Everson wrote: >>> An ? ligature is a ligature of a and of e. It is not some sort of >>> pretzel. >> We need a pretzel emoji. > > Already in Unicode 10 / emoji 5.0: > http://www.unicode.org/emoji/charts/emoji-released.html#1f968 No, like the ae, so a half eaten one. :) > >> A./ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Mar 28 15:17:43 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 28 Mar 2017 13:17:43 -0700 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Tue Mar 28 16:29:44 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 28 Mar 2017 22:29:44 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <20170328222944.3c53914c@JRWUBU2> On Tue, 28 Mar 2017 11:41:38 -0700 "Doug Ewell" wrote: > "Not recommended," "not standard," "not interoperable," or any other > term ESC settles on for the 5000+ valid flag sequences that are not > England, Scotland, and Wales is just a short, easy step away from > deprecation for these as well. It's certainly on the cards that the sequence for the Scottish flag will be deprecated in favour of an RI sequence. Richard. From markus.icu at gmail.com Tue Mar 28 18:52:04 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 28 Mar 2017 16:52:04 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: On Tue, Mar 28, 2017 at 11:41 AM, Doug Ewell wrote: > Mark Davis wrote: > > > 3. Valid, but not recommended: "usca". Corresponds to the valid > > Unicode subdivision code for California according to > > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > > and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. > > "Not recommended" is no better and no less disappointing than "not > standard." Both phrases imply strongly that the sequence, while > syntactically valid, should not be used. > I think the distinction between "valid" and "recommended" is confusing terminology-wise, but it does make sense to have a distinction between "valid" and "we know that one or more vendors are motivated to show these sequences as single glyphs". "valid" is clearly defined, and then there is a subset of valid that's listed in a catalog. Just like anyone is free to string some characters together with intervening ZWJ, but it is useful to have a catalog of sequences that are, or are going to be, in actual use, so that it is known which sequences are likely to work more or less the same on some set of devices. This right now is the right time to propose better wording in the spec so that implementers like you don't feel like they may get the rug pulled from under them down the road. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Mar 28 20:02:24 2017 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 28 Mar 2017 21:02:24 -0400 Subject: Encoding of old compatibility characters In-Reply-To: References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> Message-ID: <2e4a5a86-8b45-acd9-4e80-1f4a31f55805@kli.org> I don't think I want my text renderer to be *that* smart. If I want ?, I'll put ?. If I want a multiplication sign or something, I'll put that. Without the multiplication sign, it's still quite understandable, more so than just "e". It is valid for a text rendering engine to render "g" with one loop or two. I don't think it's valid for it to render "g" as "xg" or "-g" or anything else. The ? character looks like it does. You don't get to add multiplication signs to it because you THINK you know what I'm saying with it. And using 20? to mean "twenty base ten" sounds perfectly reasonable to me also. ~mark On 03/28/2017 05:33 AM, Philippe Verdy wrote: > Ideally a smart text renderer could as well display that glyph with a > leading multiplication sign (a mathematical middle dot) and implicitly > convert the following digits (and sign) as real superscript/exponent > (using contextual substitution/positioning like for Eastern > Arabic/Urdu), without necessarily writing the 10 base with smaller > digits. > Without it, people will want to use 20? to mean it is the decimal > number twenty and not hexadecimal number thirty two. > > 2017-03-28 11:18 GMT+02:00 Fr?d?ric Grosshans > >: > > Le 28/03/2017 ? 02:22, Mark E. Shoulson a ?crit : > > Aw, but ? is awesome! It's much cooler-looking and more > visually understandable than "e" for exponent notation. In > some code I've been playing around with I support it as a > valid alternative to "e". > > > I Agree 1?3 times with you on this ! > > Fr?d?ric > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Mar 28 20:07:45 2017 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 28 Mar 2017 21:07:45 -0400 Subject: Encoding of old compatibility characters In-Reply-To: <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> Message-ID: <601652f3-eb4e-e0a9-b592-416998f3c38f@kli.org> On 03/28/2017 09:09 AM, Asmus Freytag wrote: > On 3/28/2017 4:00 AM, Ian Clifton wrote: >> I?ve used ? a couple of times, without explanation, in my own >> emails?without, as far as I?m aware, causing any misunderstanding. > > Works especially well, whenever it renders as a box with 23E8 inscribed! > > A./ > I ? Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Mar 28 20:31:39 2017 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 28 Mar 2017 21:31:39 -0400 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <550b86a8-111e-bfd9-bc82-40c9a3584b1e@kli.org> Kind of have to agree with Doug here. Either support the mechanism or don't. Saying "wellllllll, you CAN do this if you WANT to" always implies a "...but you probably shouldn't." Why even bother making it a possibility? On 03/28/2017 02:41 PM, Doug Ewell wrote: > "Even though it is possible to support the US states, or any subset of > them, implementations don?t have to." Well, of course they don't. > Implementations don't have to support the three British flags either if > they don't want to, or any national flags or other emoji, or any > particular character for that matter. The superfluous statement is > easily reduced to "Don't do this." That's a pretty good re-statement. ~mark From duerst at it.aoyama.ac.jp Tue Mar 28 21:32:37 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 11:32:37 +0900 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <3b8ac649-cf66-9670-dfb3-41af15e0dde0@it.aoyama.ac.jp> Hello Doug, On 2017/03/29 03:41, Doug Ewell wrote: > If this story sounds vaguely familiar to old-timers, it's exactly the > path that was followed the last time Plane 14 tag characters were under > discussion, between 1998 and 2000: someone wrote an RFC to embed > language tags in plain text using invalid UTF-8 sequences; Unicode > responded by introducing a proper, conformant mechanism to use Plane 14 > characters instead; then the conformant replacement mechanism itself was > deprecated and users were told to use out-of-band tagging, exactly what > the original RFC sought to avoid. I think there is some missing information here. First, the original proposal that used invalid UTF-8 sequences never was an RFC, only an Internet Draft. But what's more important, the protocol that motivated all this work (ACAP) never went anywhere. Nor did any other use of the plane 14 language tag characters get any kind of significant traction. That lead to depreciation, because it would have been a bad idea to let people think that the information in these taggings would actually be used. For some people (including me), that was always seen as the likely outcome; the language tag characters were mostly introduced as a defensive mechanism (way better than invalid UTF-8) rather than something we hoped everybody would jump on. Putting them on plane 14 (which meant that it would be four bytes for each character, and therefore quite a lot of bytes for each tag) was part of that message. > "Not recommended," "not standard," "not interoperable," or any other > term ESC settles on for the 5000+ valid flag sequences that are not > England, Scotland, and Wales is just a short, easy step away from > deprecation for these as well. I think the situation is vastly different here. First, the Consortium never officially 'activated' any subdivision flags, so it would be impossible to deprecate them. Second, we already see some pressure (on this list) to 'recommend' more of these, and I guess the vendors and the Consortium will give in to this pressure, even if slowly and to some extent quite reluctantly. It's anyone's bet in what time frame and order e.g. the flags of California and Texas will be 'recommended'. But I have personally no doubt that these (and quite a few others) will eventually make it, even if I have mixed feelings about that. Regards, Martin. From duerst at it.aoyama.ac.jp Tue Mar 28 21:38:52 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 11:38:52 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> Message-ID: On 2017/03/29 01:47, Philippe Verdy wrote: > 2017-03-28 18:30 GMT+02:00 Asmus Freytag : > >> On 3/28/2017 6:56 AM, Michael Everson wrote: >> >>> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. >>> >> We need a pretzel emoji. > > We need a broken tooth emoji too ! I prefer soft pretzels! Regards, Martin. From leob at mailcom.com Tue Mar 28 23:41:51 2017 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 28 Mar 2017 21:41:51 -0700 Subject: Encoding of old compatibility characters In-Reply-To: <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> Message-ID: On Tue, Mar 28, 2017 at 6:09 AM, Asmus Freytag wrote: > On 3/28/2017 4:00 AM, Ian Clifton wrote: > > I?ve used ? a couple of times, without explanation, in my own > emails?without, as far as I?m aware, causing any misunderstanding. > > Works especially well, whenever it renders as a box with 23E8 inscribed! > Are you still using Windows 7 or RedHat 5, or something equally old? Newer systems have ? out of the box. Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Mar 29 04:59:58 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 29 Mar 2017 10:59:58 +0100 (BST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> References: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> Message-ID: <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> Mark E. Shoulson wrote: > Kind of have to agree with Doug here. Either support the mechanism or don't. Saying "wellllllll, you CAN do this if you WANT to" always implies a "...but you probably shouldn't." Why even bother making it a possibility? Mark's use of wellllllll made me smile and brightened my day, because it resonated with my use, in a different context, of wolllll near the end of the last page of Chapter 16 of my novel. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_016.pdf A PDF document of size 31.01 kilobytes. Returning to what Doug and Mark wrote. When I read things like "not recommended" I imagine a situation where someone who is employed by a large information technology company being the person who actually sits down with the specification documents and makes a decision as to what to encode. That person is probably not one of the people who is in charge of running the company. So the person may well have an annual review meeting with people several steps up the hierarchy of the company, people who can promote, grudgingly continue to employ, or sack the employee. So I imagine the possibility of, at that meeting, the question of "Why did you implement all of those flags in our product?" being asked. The employee then explains his or her thinking, a desire to help end users and to have compatibility with communication with devices made by other manufacturers and for it all to be colourful and fun. The employee is then asked if he or she knew that implementation was not recommended. Did he or she know of that and went the other way thinking he or she knew better or had he or she not read that part of the documentation. So maybe the employee takes such a possible scenario into account when deciding whether to implement the flags in the first place. Relying on "not recommended" is safer. If the people higher up get letters from consumers asking for implementation and they ask for it to be done, then good, that would be enjoyable, but why be the one who could be criticised. I also imagine a scenario that instead of the "not recommended" that the advice might have been that it would be great and progressive if lots of flags were implemented in lots of products and it would be great if it could be done as soon as possible, by this summer if possible, ready for displaying at the conference in the autumn and to help that along here are some links to some free-to-use open source artwork that Unicode Inc. is making available in case you want to use it and here are some links to some free-to-use open source OpenType font glyph substitution code that Unicode Inc. is making available in case you want to use it. Well, why not? :-) William Overington Wednesday 29 March 2017 From duerst at it.aoyama.ac.jp Wed Mar 29 05:12:19 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 19:12:19 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: Hello everybody, Let me start with a short summary of where I think we are at, and how we got there. - The discussion started out with two letters, with two letter forms each. There is explicit talk of the 40-letter alphabet and glyphs in the Wikipedia page, not of two different letters. - That suggests that IF this script is in current use, and the shapes for these diphthongs are interchangeable (for those who use the script day-to-day, not for meta-purposes such as historic and typographic texts), keeping things unified is preferable. - As far as we have heard (in the course of the discussion, after questioning claims made without such information), it seems that: - There may not be enough information to understand how the creators and early users of the script saw this issue, on a scale that may range between "everybody knows these are the same, and nobody cares too much who uses which, even if individual people may have their preferences in their handwriting" to something like "these are different choices, and people wouldn't want their texts be changed in any way when published". - Similarly, there seem to be not enough modern practitioners of the script using the ligatures that could shed any light on the question asked in the previous item in a historical context, first apparently because there are not that many modern practitioners at all, and second because modern practitioners seem to prefer spelling with individual letters rather than using the ligatures. - IF the above is true, then it may be that these ligatures are mostly used for historic purposes only, in which case it wouldn't do any harm to present-day users if they were separated. If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners, not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice. In that light, some more comments lower down. On 2017/03/28 22:56, Michael Everson wrote: > On 28 Mar 2017, at 11:39, Martin J. D?rst wrote: > An ? ligature is a ligature of a and of e. It is not some sort of pretzel. Yes. But it's important that we know that because we have been faced with many cases where "?" and "ae" were used interchangeably. For somebody not knowing the (extended) Latin alphabet and its usages, they might easily see more of a pretzel and less of 'a' and 'e'. I might try some experiments with some of my students (although I'm using "formul?" in my lecture notes, and so they might already be too familiar with the "?"). Also, if it were the case that shapes like "?" and "?" were used interchangeably across all uses of the Latin alphabet, I'm quite sure we would encode it with one code point rather than two, even if some researchers might claim that the later was derived from an "o" rather than an "?", or even if we knew it was derived from an "o" (as we know for the ?). > What Deseret has is this: > > 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE > * officially named ?ew? in the code chart > * used for ew in earlier texts > 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE > * officially named ?oi? in the code chart > * used for oi in earlier texts > 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE > * used for oi in later texts > 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE > * used for ew in later texts Currently, it has this: 10426 ?? DESERET CAPITAL LETTER OI 10427 ?? DESERET CAPITAL LETTER EW My personal opinion is that names are mostly hints, and not too much should be read into them, but if anything, the names in the current charts would suggest that the encoding is for the 39th/40th letter of the Deseret alphabet, whatever its shape, not for some particular shape. And you know as well as I do that we can't change names. So if we split, we might end up with something like: 10426 ?? DESERET CAPITAL LETTER OI 10427 ?? DESERET CAPITAL LETTER EW 1xxxx DESERET CAPITAL LETTER VARIANT OI 1xxxx DESERET CAPITAL LETTER VARIANT EW > Don?t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character. > > Don?t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character. We have just established that there are no characters with such names in the standard. It's not the names or the history that I'm arguing. > To do so is to show no understanding of the history of writing systems at all. What I'd agree to is that cases where shapes with different historical origins merge and get treated as one and the same character are quite a lot rarer than cases where they don't merge. But we have seen cases where such a merge happens. ? is one of them. There are quite a few in Han (not surprising because there are tons of ideographs there to begin with). But that experience doesn't mean that we have to rush to a conclusion without examining as much of the evidence as we can get hold of. > You?re smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here. Skepticism is when presented with options without background facts is a virtue in my opinion. >> And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion. > > The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don?t think we haven?t observed this. As for BROCCOLI that you mention later and other emoji, first I would like to make clear that I don't use emoji personally nor do I push for their encoding. But what's important for the discussion at hand is that when it comes to emoji, the question of whether we should unify or disunify BROCCOLI and CAULIFLOWER (just a hypothetical example) isn't as important. That's because there is no preexisting user community that would be seriously inconvenienced the way it would happen if we suddenly disunified the ?s/?z ligature, or suddenly unified "?" and "?". Emoji are a hopeless hodgepodge, where users click on what they see, and hope that it shows close enough to what they meant at the other end or after a few years. >> Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different. > > Martin, there is no answer to this unless you can read the minds of people who are dead a century or more. Thanks for telling us, finally. >> To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters. > > I do not believe you. It's true. When younger, I tried to read some old books written in Fraktur. It was hard work. Most of the lower letters were okay, but the ? and the f were easy to confuse, and the k is also confusing. A lot of guessing was needed for upper case. I'm quite sure most people these days couldn't easily identify upper case letters in isolation. Of course, context helps a lot. > If this were true menus in restaurants and public signage on shops wouldn?t have Fraktur at all. It?s true that sometimes the orthography on such things is bad, as where they don?t use ligatures correctly or the ? at all. Shops and newspapers (e.g. NYT) and the like rely a lot on a logo effect. And the situation may be slightly different in Germany and in Switzerland. > I?ll stipulate that few Germans can read S?tterlin or similar hands. :-) Definitely agreed! > On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: > >> ?I agree with Martin. >> Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape. > > Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding?s sake. And coming to a discussion like this out of a concern for modern practitioners of the script (even if it seems, after a lot of discussion, that there aren't that many of these, and the issue at hand may indeed not concern them that much) is not some sort of attempt to unify things for unification's sake. Regards, Martin. From everson at evertype.com Wed Mar 29 08:08:59 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 29 Mar 2017 14:08:59 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: Martin, It?s as though you?d not participated in this work for many years, really. > On 29 Mar 2017, at 11:12, Martin J. D?rst wrote: > > Hello everybody, > > Let me start with a short summary of where I think we are at, and how we got there. > > - The discussion started out with two letters, with two letter forms each. There is explicit talk of the 40-letter alphabet and glyphs in the Wikipedia page, not of two different letters. SO WHAT? Alphabets have ?letters? in them. ?Letters? are not ?characters?. In Welsh, ?ch? and ?dd? and ?ll? are ?letters?. > - That suggests that IF this script is in current use, You don?t even know? You?re kidding, right? > and the shapes for these diphthongs are interchangeable It does NOT ?suggest? that at all. > (for those who use the script day-to-day, not for meta-purposes such as historic and typographic texts), keeping things unified is preferable. Deseret was a spelling reform replacement alphabet used for a period of time by the Mormons in what is now Utah. It is structurally very similar to Pitman?s Phonotypic alphabets. Alphabets. There were many revisions of those. Some of them used letterforms we have encoded today, for IPA for instance. Some used letterforms we?d hardly recognize, and we?d never, ever consider them to be glyph variants of the IPA letters. > - As far as we have heard (in the course of the discussion, after questioning claims made without such information), it seems that: Yeah, it doesn?t ?seem? anything but a whole lot of special pleading to bolster your rigid view that the glyphs in question can be interchangeable because of the sounds they may represent. > - There may not be enough information to understand how the creators and early users of the script saw this issue, Um, yeah. As if there were for Phoenician, or Luwian hieroglyphs, right? > on a scale that may range between "everybody knows these are the same, and nobody cares too much who uses which, even if individual people may have their preferences in their handwriting" to something like "these are different choices, and people wouldn't want their texts be changed in any way when published?. We know what the diphthongs were. We know that the script had a spelling reform where some characters were abandoned in favour of other characters. There was at least one font wh And there is lots of handwriting in which people write what they want to write, in the non-Latin alphabet they learned. As far as your guessing what people had in their minds about what they were writing, and as to your speculation about what the very few printers who had Deseret type might have done with such manuscripts, well, it is all reine Phantasie on your part. Oh! Look! There was a spelling reform. I should write ?Fantasie?, shouldn?t I? Wait! I can have spell-check dictionaries suit my preference! Wow! That?s amazing! > - Similarly, there seem to be not enough modern practitioners of the script using the ligatures that could shed any light on the question asked in the previous item in a historical context, Completely irrelevant. Nobody worried about the number of modern users of the Insular letters we encoded. Why put such a constraints on users of Deseret? ?? ?? ?? ? ?? ?? ??. > first apparently because there are not that many modern practitioners at all, and second because modern practitioners seem to prefer spelling with individual letters rather than using the ligatures. This is equally ridiculous. John Jenkins chooses not write the digraphs in the works which he transcribed, because that?s what *he* chooses. He doesn?t speak for anyone else who may choose to write in Deseret, and your assumption that ?modern practitioners? do this is groundless. It also ignores the fact that the script had a reform and that the value of separate encodings for the various characters is of value to those studying the provenance and orthographic practices of those who wrote Deseret when it was in active use. This is exactly the same thing as the medievalist Latin abbreviation and other characters we encoded. There is neither sense nor logic nor utility in trying to argue for why editors of Deseret documents shouldn?t have the same kinds of tools that medievalists have. And as far as medievalist concerns go, many of the characters are used by relatively few researchers. Some of the characters we encoded are used all over Europe at many times. Some are used only by Nordicists, some by Celticists, and some by subsets within the Nordicist and Celticist communities. > - IF the above is true, then it may be that these ligatures are mostly used for historic purposes only, in which case it wouldn't do any harm to present-day users if they were separated. Harm? What harm? Recently the UTC looked at a proposal for capital letters for ? and ?. Evidence for their existence was shown. One person on the call to the UTC said he didn?t think anyone needed them. Two of us do need them. I needed them last weekend and I had to use awkward workarounds. They weren?t accepted. There wasn?t any good rationale for the rejection. I mean, the letters exist. Case is a normal function of the script. But they weren?t accepted. For the guy who didn?t think he needed them, well, so what? If they?re encoded, he doesn?t have to use them. Harm to present-day users? I agree with you. Any modern-day user creating new texts who doesn?t like to use the diphthong letters doesn?t have to use them. Any modern-day user trying to represent historic texts accurately, however, can?t, because not all the letters are encoded. > If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners, People who use Deseret use it to for historical purposes and for cultural reasons. Everybody in Utah reads English in standard Latin orthography. > not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice. I didn?t ?come up? with separate historical derivations for the four characters in question. It is entirely obvious that LONG AH, SHORT AH, LONG OO, and SHORT OO are variously combined with the stroke of SHORT I. Entirely obvious. There is no other interpretation. > In that light, some more comments lower down. > > On 2017/03/28 22:56, Michael Everson wrote: >> On 28 Mar 2017, at 11:39, Martin J. D?rst wrote: > >> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. > > Yes. But it's important that we know that because we have been faced with many cases where "?" and "ae" were used interchangeably. Irrelevant. This is just spelling. It?s no different than colour/color or maximize/maximise or aluminium/aluminum. > For somebody not knowing the (extended) Latin alphabet and its usages, they might easily see more of a pretzel and less of 'a' and 'e'. I might try some experiments with some of my students (although I'm using "formul?" in my lecture notes, and so they might already be too familiar with the "??). You have missed the point fabulously. The point was that the ? ligature can be easily identified as being made of A and of E. And the four Deseret characters can easily be identified as being made of LONG AH, SHORT AH, LONG OO, and SHORT OO with the stroke of SHORT I. > Also, if it were the case that shapes like "?" and "?" were used interchangeably across all uses of the Latin alphabet, I'm quite sure we would encode it with one code point rather than two, even if some researchers might claim that the later was derived from an "o" rather than an "?", or even if we knew it was derived from an "o" (as we know for the ?). I don?t agree, and there are hundreds of >> What Deseret has is this: >> >> 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE >> * officially named ?ew? in the code chart >> * used for ew in earlier texts >> 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE >> * officially named ?oi? in the code chart >> * used for oi in earlier texts >> 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE >> * used for oi in later texts >> 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE >> * used for ew in later texts > > Currently, it has this: > > 10426 ?? DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW You are being deliberately obtuse. Note that I stated clearly ?officially named ?ew/oi? in the code chart?. > My personal opinion is that names are mostly hints, and not too much should be read into them, I do not share this opinion. > but if anything, the names in the current charts would suggest that the encoding is for the 39th/40th letter of the Deseret alphabet, whatever its shape, not for some particular shape. You make too much of these numbers, but then there are charts of the 38-letter alphabet and charts of the 40-letter alphabet, but those numbers have to do with the number of English phonemes represented in Phonotypy and in Deseret, and with the augmentation of that by the addition of letters which represent phonemes. > And you know as well as I do that we can't change names. So if we split, we might end up with something like: > > 10426 ?? DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW > > 1xxxx DESERET CAPITAL LETTER VARIANT OI > > 1xxxx DESERET CAPITAL LETTER VARIANT EW I?m pretty sure we will propose the names LONG AH WITh STROKE and SHORT OO WITH STROKE. The two un-encoded characters are used for the *diphthongs* oi and ew but they are not ?variants? of the other letters. We do not require matching names here. Compare LATIN LETTER YR and LATIN LETTER SMALL CAPITAL R. Compare LATIN CAPITAL LETTER HWAIR and LATIN SMALL LETTER HV. >> Don?t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character. >> >> Don?t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character. > > We have just established that there are no characters with such names in the standard. It's not the names or the history that I'm arguing. You?re being obtuse again. Fine. Don?t go trying to tell me that EW and SHORT OO WITH STROKE are glyph variants of the same character. Don?t go trying to tell me that LONG AH WITH STROKE and OI are glyph variants of the same character. They?re not. The origin of all those letterforms is obvious, and we do not encode sounds, we encode the elements of writing systems. >> To do so is to show no understanding of the history of writing systems at all. > > What I'd agree to is that cases where shapes with different historical origins merge and get treated as one and the same character are quite a lot rarer than cases where they don't merge. They didn?t merge in Deseret. They had a reform, removing some characters and adding some other characters. > But we have seen cases where such a merge happens. ? is one of them. That?s even arguable because ?? only really occurs in the whole-font Fraktur style. It?s pretty rare to see it in Antiqua. Of course it must be attested there, but it?s by no means common. > There are quite a few in Han (not surprising because there are tons of ideographs there to begin with). > > But that experience doesn't mean that we have to rush to a conclusion without examining as much of the evidence as we can get hold of. I haven?t rushed to a conclusion. I?ve made a thorough analysis. >> You?re smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here. > > Skepticism is when presented with options without background facts is a virtue in my opinion. Your argument seemed to be based solely on the use of the letters for the sounds, ignoring the historical derivation and the facts of the spelling reform in Deseret. >> The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don?t think we haven?t observed this. > > As for BROCCOLI that you mention later and other emoji, first I would like to make clear that I don't use emoji personally nor do I push for their encoding. I *do* use emoji and I have devised many emoji which are now in current use. I do find that the process for adding symbols to the UCS (which is not the same thing as giving symbols the emoji property) is not functioning particularly well at present. > But what's important for the discussion at hand is that when it comes to emoji, the question of whether we should unify or disunify BROCCOLI and CAULIFLOWER (just a hypothetical example) isn't as important. Eventually we will have CABBAGE, and then some people will need to use ZWJ to join CABBAGE and KNIFE so that sauerkraut can be represented, and then other people will need to use ZWJ to join CABBAGE and HOT PEPPER for kimchi, and in Ireland we?ve got bacon and cabbage of course, and... Heh. > That's because there is no preexisting user community that would be seriously inconvenienced the way it would happen if we suddenly disunified the ?s/?z ligature, or suddenly unified "?" and "?". Emoji are a hopeless hodgepodge, where users click on what they see, and hope that it shows close enough to what they meant at the other end or after a few years. No one using Deseret will be inconvenienced by adding additional historical characters for the already historical script. Anyone using modern Deseret fonts *would* be inconvenience by unifying the LONG-AH-WITH-STROKE and SHORT-AH-WITH-STROKE characters and the LONG-OO-WITH-STROKE and SHORT-OO-WITH-STROKE characters, I think. No current fonts that I know of have the 1859 glyphs, apart from private fonts Ken Beesley used for his own work. >>> Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different. >> >> Martin, there is no answer to this unless you can read the minds of people who are dead a century or more. > > Thanks for telling us, finally. What on earth do you mean? I have withheld no secrets. I?ve objected to your wilful unification of characters with obviously different origins. >>> To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters. >> >> I do not believe you. > > It's true. When younger, I tried to read some old books written in Fraktur. It was hard work. Most of the lower letters were okay, but the ? and the f were easy to confuse, and the k is also confusing. A lot of guessing was needed for upper case. I'm quite sure most people these days couldn't easily identify upper case letters in isolation. Of course, context helps a lot. It?s not the easiest thing but it does not take all that much to accustom oneself to it. >> If this were true menus in restaurants and public signage on shops wouldn?t have Fraktur at all. It?s true that sometimes the orthography on such things is bad, as where they don?t use ligatures correctly or the ? at all. > > Shops and newspapers (e.g. NYT) and the like rely a lot on a logo effect. And the situation may be slightly different in Germany and in Switzerland. People can read the menus and the public signage nevertheless. Fraktur is not so unbelievably different that it?s entirely opaque. >> I?ll stipulate that few Germans can read S?tterlin or similar hands. :-) > > Definitely agreed! I learned to write S?tterlin. Going back and reading something written takes work too? > > >> On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: >> >>> ?I agree with Martin. > >>> Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape. >> >> Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding?s sake. > > And coming to a discussion like this out of a concern for modern practitioners of the script (even if it seems, after a lot of discussion, that there aren't that many of these, and the issue at hand may indeed not concern them that much) is not some sort of attempt to unify things for unification's sake. I think you made a lot of assumptions about ?modern practitioners? which you didn?t disclose. A proposal will be forthcoming. I want to thank several people who have written to me privately supporting my position with regard to this topic on this list. I can only say that supporting me in public is more useful than supporting me in private. Michael From asmusf at ix.netcom.com Wed Mar 29 09:04:21 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 29 Mar 2017 07:04:21 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: <700f1d2c-daf2-470a-b75c-166812805887@ix.netcom.com> Martin, thanks for the careful summary. As in all these cases it is possible to argue from different premises, so I would, unfortunately, not expect that this discussion will reach the consensus of all parties. In the end, Unicode is made for the modern user, whether they are native users of a script, or modern users archiving or discussing historic texts. The specific principles used in each encoding decision matter, but only insofar as the result works for the modern (and future!) users of the standard. A./ PS: as to modern use of Fraktur -- many fonts for black-letter logos are modified to help modern readers recognize the words. On 3/29/2017 3:12 AM, Martin J. D?rst wrote: > Hello everybody, > > Let me start with a short summary of where I think we are at, and how > we got there. > > - The discussion started out with two letters, > with two letter forms each. There is explicit talk of the > 40-letter alphabet and glyphs in the Wikipedia page, not > of two different letters. > - That suggests that IF this script is in current use, and the > shapes for these diphthongs are interchangeable (for those > who use the script day-to-day, not for meta-purposes such > as historic and typographic texts), keeping things unified > is preferable. > - As far as we have heard (in the course of the discussion, > after questioning claims made without such information), > it seems that: > - There may not be enough information to understand how the > creators and early users of the script saw this issue, > on a scale that may range between "everybody knows these > are the same, and nobody cares too much who uses which, > even if individual people may have their preferences in > their handwriting" to something like "these are different > choices, and people wouldn't want their texts be changed > in any way when published". > - Similarly, there seem to be not enough modern practitioners > of the script using the ligatures that could shed any > light on the question asked in the previous item in a > historical context, first apparently because there are not > that many modern practitioners at all, and second because > modern practitioners seem to prefer spelling with > individual letters rather than using the ligatures. > - IF the above is true, then it may be that these ligatures > are mostly used for historic purposes only, in which case > it wouldn't do any harm to present-day users if they were separated. > > If the above is roughly correct, then it's important that we reached > that conclusion after explicitly considering the potential of a split > to create inconvenience and confusion for modern practitioners, not > after just looking at the shapes only, coming up with separate > historical derivations for each of them, and deciding to split because > history is way more important than modern practice. > > In that light, some more comments lower down. > > On 2017/03/28 22:56, Michael Everson wrote: >> On 28 Mar 2017, at 11:39, Martin J. D?rst >> wrote: > >> An ? ligature is a ligature of a and of e. It is not some sort of >> pretzel. > > Yes. But it's important that we know that because we have been faced > with many cases where "?" and "ae" were used interchangeably. For > somebody not knowing the (extended) Latin alphabet and its usages, > they might easily see more of a pretzel and less of 'a' and 'e'. I > might try some experiments with some of my students (although I'm > using "formul?" in my lecture notes, and so they might already be too > familiar with the "?"). > > Also, if it were the case that shapes like "?" and "?" were used > interchangeably across all uses of the Latin alphabet, I'm quite sure > we would encode it with one code point rather than two, even if some > researchers might claim that the later was derived from an "o" rather > than an "?", or even if we knew it was derived from an "o" (as we know > for the ?). > > >> What Deseret has is this: >> >> 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE >> * officially named ?ew? in the code chart >> * used for ew in earlier texts >> 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE >> * officially named ?oi? in the code chart >> * used for oi in earlier texts >> 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE >> * used for oi in later texts >> 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE >> * used for ew in later texts > > Currently, it has this: > > 10426 ?? DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW > > My personal opinion is that names are mostly hints, and not too much > should be read into them, but if anything, the names in the current > charts would suggest that the encoding is for the 39th/40th letter of > the Deseret alphabet, whatever its shape, not for some particular shape. > > And you know as well as I do that we can't change names. So if we > split, we might end up with something like: > > 10426 ?? DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW > > 1xxxx DESERET CAPITAL LETTER VARIANT OI > > 1xxxx DESERET CAPITAL LETTER VARIANT EW > > >> Don?t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH >> STROKE are glyph variants of the same character. >> >> Don?t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH >> STROKE are glyph variants of the same character. > > We have just established that there are no characters with such names > in the standard. It's not the names or the history that I'm arguing. > > >> To do so is to show no understanding of the history of writing >> systems at all. > > What I'd agree to is that cases where shapes with different historical > origins merge and get treated as one and the same character are quite > a lot rarer than cases where they don't merge. But we have seen cases > where such a merge happens. ? is one of them. There are quite a few in > Han (not surprising because there are tons of ideographs there to > begin with). > > But that experience doesn't mean that we have to rush to a conclusion > without examining as much of the evidence as we can get hold of. > > >> You?re smarter than that. So are Asmus and Mark and Erkki and any of >> the other sceptics who have chimed in here. > > Skepticism is when presented with options without background facts is > a virtue in my opinion. > > >>> And as for precedent, the fact that we have encoded a lot of >>> characters in Unicode doesn't mean that we can encode more >>> characters without checking each and every single case very >>> carefully, as we are doing in this discussion. >> >> The UTC encodes a great many characters without checking them at all, >> or even offering documentation on them to SC2. Don?t think we haven?t >> observed this. > > As for BROCCOLI that you mention later and other emoji, first I would > like to make clear that I don't use emoji personally nor do I push for > their encoding. > > But what's important for the discussion at hand is that when it comes > to emoji, the question of whether we should unify or disunify BROCCOLI > and CAULIFLOWER (just a hypothetical example) isn't as important. > That's because there is no preexisting user community that would be > seriously inconvenienced the way it would happen if we suddenly > disunified the ?s/?z ligature, or suddenly unified "?" and "?". Emoji > are a hopeless hodgepodge, where users click on what they see, and > hope that it shows close enough to what they meant at the other end or > after a few years. > > >>> Of course they will easily see different shapes, but what's >>> important isn't the shapes, it's what they associate it with. If for >>> them, it's just two shapes for one and the same 40th letter of the >>> Deseret alphabet, then that is a strong suggestion for not encoding >>> separately, even if the shapes look really different. >> >> Martin, there is no answer to this unless you can read the minds of >> people who are dead a century or more. > > Thanks for telling us, finally. > > >>> To use another analogy, many people these days (me included) would >>> have difficulties identifying Fraktur letters, in particular if they >>> show up just as individual letters. >> >> I do not believe you. > > It's true. When younger, I tried to read some old books written in > Fraktur. It was hard work. Most of the lower letters were okay, but > the ? and the f were easy to confuse, and the k is also confusing. A > lot of guessing was needed for upper case. I'm quite sure most people > these days couldn't easily identify upper case letters in isolation. > Of course, context helps a lot. > >> If this were true menus in restaurants and public signage on shops >> wouldn?t have Fraktur at all. It?s true that sometimes the >> orthography on such things is bad, as where they don?t use ligatures >> correctly or the ? at all. > > Shops and newspapers (e.g. NYT) and the like rely a lot on a logo > effect. And the situation may be slightly different in Germany and in > Switzerland. > >> I?ll stipulate that few Germans can read S?tterlin or similar hands. :-) > > Definitely agreed! > > >> On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: >> >>> ?I agree with Martin. > >>> Simply because someone used a particular shape at some time to mean >>> a letter doesn't mean that Unicode should encode a letter for that >>> shape. >> >> Coming to a forum like this out of a concern for the corpus of >> Deseret literature is not some sort of attempt to encode things for >> encoding?s sake. > > And coming to a discussion like this out of a concern for modern > practitioners of the script (even if it seems, after a lot of > discussion, that there aren't that many of these, and the issue at > hand may indeed not concern them that much) is not some sort of > attempt to unify things for unification's sake. > > > Regards, Martin. > From markus.icu at gmail.com Wed Mar 29 11:09:30 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 29 Mar 2017 09:09:30 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> References: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> Message-ID: I think "recommended" could be renamed to "(expected to be) widely implemented". markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Mar 29 15:09:25 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 13:09:25 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> Markus Scherer wrote: > I think "recommended" could be renamed to "(expected to be) widely > implemented". That's a modest improvement; it shifts from an advisory health warning not to use certain sequences to what it is, speculation that some sequences will be far better supported in the field than others. I still don't see why this distinction is necessary. It's not made for other emoji or non-emoji. I have no fonts for Tai Tham,? which has been in Unicode since 2009, but I don't see any warnings against using Tai Tham because someone like me might not have a font for it. ? No, I'm not looking for one; that isn't the point. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Wed Mar 29 15:12:11 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 13:12:11 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Martin J. D?rst wrote: > I think there is some missing information here. First, the original > proposal that used invalid UTF-8 sequences never was an RFC, only an > Internet Draft. Yes, you're right. I realized that a minute after "Send" but didn't think it changed the story enough to justify a correction. For the curious, the I-D is at https://www.ietf.org/archive/id/draft-ietf-acap-mlsf-01.txt . > But what's more important, the protocol that motivated all this work > (ACAP) never went anywhere. Nor did any other use of the plane 14 > language tag characters get any kind of significant traction. That > lead to depreciation, because it would have been a bad idea to let > people think that the information in these taggings would actually be > used. Is that common practice in Unicode, that if something doesn't gain significant traction in the comparatively short term, it becomes a candidate for deprecation? > For some people (including me), that was always seen as the likely > outcome; the language tag characters were mostly introduced as a > defensive mechanism (way better than invalid UTF-8) rather than > something we hoped everybody would jump on. Putting them on plane 14 > (which meant that it would be four bytes for each character, and > therefore quite a lot of bytes for each tag) was part of that message. I understand the "defensive" aspect of trying to prevent people from abusing Unicode, especially in the 1997?1998 time frame when UTF-8 was still new and people didn't realize the cost of tampering with it. But if you're going to build a mechanism at all, it seems peculiar to define it in full but then discourage its intended use at the outset, or to build it in such a way that users will find it difficult or unpalatable to use. > I think the situation is vastly different here. First, the Consortium > never officially 'activated' any subdivision flags, so it would be > impossible to deprecate them. The Emoji 5.0 mechanism of using tag sequences for three subdivision flags was announced earlier this week. The specification grudgingly allows, but non-recommends, use of the mechanism for any other flags. It is that grudging allowance that could be deprecated, not any of the specific flags. > Second, we already see some pressure (on this list) to 'recommend' > more of these, and I guess the vendors and the Consortium will give in > to this pressure, even if slowly and to some extent quite reluctantly. > It's anyone's bet in what time frame and order e.g. the flags of > California and Texas will be 'recommended'. But I have personally no > doubt that these (and quite a few others) will eventually make it, > even if I have mixed feelings about that. Then what was the benefit of "not recommending" them in the first place? Why is it a problem if vendors look at the list of 5100 or so subdivisions, or even the small subset that actually have flags, and think, "OMG, look at all those new flags we'll be forced to support"? Is this any different from when a new CJK extension or other large block of characters is added? I would think vendors could make their own business decisions about what flags to support. "Hmm, yeah, definitely Texas, maybe Lombardy, not so sure about Colorado, probably not Guna Yala." I don't see why they had to be essentially told what to support and what not to. -- Doug Ewell | Thornton, CO, US | ewellic.org From jenkins at apple.com Wed Mar 29 15:45:27 2017 From: jenkins at apple.com (John H. Jenkins) Date: Wed, 29 Mar 2017 14:45:27 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: > On Mar 29, 2017, at 4:12 AM, Martin J. D?rst wrote: > > Let me start with a short summary of where I think we are at, and how we got there. > > - The discussion started out with two letters, > with two letter forms each. There is explicit talk of the > 40-letter alphabet and glyphs in the Wikipedia page, not > of two different letters. > - That suggests that IF this script is in current use, and the > shapes for these diphthongs are interchangeable (for those > who use the script day-to-day, not for meta-purposes such > as historic and typographic texts), keeping things unified > is preferable. > - As far as we have heard (in the course of the discussion, > after questioning claims made without such information), > it seems that: > - There may not be enough information to understand how the > creators and early users of the script saw this issue, > on a scale that may range between "everybody knows these > are the same, and nobody cares too much who uses which, > even if individual people may have their preferences in > their handwriting" to something like "these are different > choices, and people wouldn't want their texts be changed > in any way when published". I see this part of the problem more one of proper transcription of existing materials, and less of one of what the original authors saw the issues as. Handwritten material is very important in the study of 19th century LDS history, and although the materials actually in the DA are scant (at best), the peculiarities of the spelling can be instructive. As such, I certainly agree that being able to transcribe material "faithfully" is important. I'm not an expert in this area, though, so I can't speak for myself whether this separate encoding or variation selectors or some other mechanism is the best way to provide support for this. I'm more than happy to defer to Michael and other people who *are* experts. If paleographers think separate encoding is best, then I'm for separate encoding. > - Similarly, there seem to be not enough modern practitioners > of the script using the ligatures that could shed any > light on the question asked in the previous item in a > historical context, first apparently because there are not > that many modern practitioners at all, and second because > modern practitioners seem to prefer spelling with > individual letters rather than using the ligatures. Well, as one of the people in this camp, and as Michael has pointed out, I eschew use of these letters altogether. I restrict myself to the 1869 version of the alphabet, which is used in virtually all of the printed materials and has only thirty-eight letters. > - IF the above is true, then it may be that these ligatures > are mostly used for historic purposes only, in which case > it wouldn't do any harm to present-day users if they were separated. > > If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners, not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice. Fortunately, since the existing Deseret block is full, any separately encoded entities will have to be put somewhere else, making it easier to document the nature and purpose of the symbols involved. Not that we can be confident that it will help. (http://www.deseretalphabet.info/XKCD/1726.html ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Wed Mar 29 15:55:49 2017 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 29 Mar 2017 13:55:49 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> References: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Message-ID: On 3/29/2017 1:12 PM, Doug Ewell wrote: > I would think vendors could make their own business decisions about what > flags to support. "Hmm, yeah, definitely Texas, maybe Lombardy, not so > sure about Colorado, probably not Guna Yala." I don't see why they had > to be essentially told what to support and what not to. I think you have it approximately backwards. It isn't the UTC telling the vendors "what to support and what not to" -- it was the vendors saying "this is what we need to support, and we'd like to not do it in a haphazard way, so let's tell the UTC what we want them to document in the data for UTS #51." You are correct that the vendors can make their own business decisions. And apparently as of now, Microsoft, for whatever reason, has made its business decision not to support flag emoji *at all* on its phones. O.k., that is their decision. So no Texas, no Lombardy, no Colorado, no Guna Yala, but also no Japan, no Great Britain, no Scotland... Other vendors have decided *to* support flag emoji on their phone platforms. O.k., that is their decision. *But*, the ones who do have flags on their phones don't want to be in the situation where the iPhone has a flag of Scotland which then shows up as a flag tofu on an Android phone, but an Android phone has a flag of Texas which then shows up as a flag tofu on on iPhone, etc., etc. That way leads to customer complaint madness, with 1000's (hundreds of 1000's?) of complaints: "My phone is screwed up, fix it!" Or maybe you want the job on the consumer complaint line about that topic. ;-) --Ken From andrewcwest at gmail.com Wed Mar 29 16:00:59 2017 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 29 Mar 2017 22:00:59 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> References: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> Message-ID: On 29 March 2017 at 21:09, Doug Ewell wrote: > >> I think "recommended" could be renamed to "(expected to be) widely >> implemented". > > That's a modest improvement; it shifts from an advisory health warning > not to use certain sequences to what it is, speculation that some > sequences will be far better supported in the field than others. I don't think that would work. http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt explicitly lists just the three subdivision flags for England, Scotland and Wales under Emoji Tag Sequences, which indicates that they are special in an undefined way that none of the thousands of other potential subdivision flag tag sequences are. There must be a higher bar for inclusion in the Emoji data files than simply that they are expected to be widely implemented. Their inclusion in the Emoji data files and the Emoji charts (http://www.unicode.org/emoji/charts/emoji-ordering.html) must indicate that only these three tag sequences are recommended or sanctioned by the UTC. (In case anyone thinks I support the current situation, let me state that I disagree vehemently with the UTC decision to only "recommend" these three particular subdivision flag tag sequences.) Andrew From doug at ewellic.org Wed Mar 29 16:07:20 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 14:07:20 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> Ken Whistler wrote: > *But*, the ones who do have flags on their phones don't want to be in > the situation where the iPhone has a flag of Scotland which then shows > up as a flag tofu on an Android phone, but an Android phone has a flag > of Texas which then shows up as a flag tofu on on iPhone, etc., etc. > That way leads to customer complaint madness, with 1000's (hundreds of > 1000's?) of complaints: "My phone is screwed up, fix it!" Doesn't this same problem exist for other emoji, or non-emoji, that are supported on some phones but not others? What's the customer service resolution in those cases? -- Doug Ewell | Thornton, CO, US | ewellic.org From christoph.paeper at crissov.de Wed Mar 29 16:17:58 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 29 Mar 2017 23:17:58 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Mark Davis ?? : > On Tue, Mar 28, 2017 at 11:56 AM, Joan Montan? wrote: > > > 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) > > arises something like chicken-egg problem. Vendors don't easily add new > > subdivision-flags (because they aren't recommended), and Unicode doesn't > > recommend new subdivision flags (because they aren't supported by vendors). > ? > That isn't really the case. In particular, vendors can propose adding > additional subdivisions to the recommended list. Awesome, "vendors" can do that. (._.m) If I made an open-source emoji font that contained flags for all of the 5000ish ISO 3166-2 codes that actually map to one, would I automatically be considered a vendor? Do I need to have to pay 18000(?) dollars a year for full membership first? (That's peanuts for multi-billion dollar companies, but unaffordable for most individuals and many FOSS projects.) Someone could try to push such an edit onto Emojione, Twemoji or Noto Emoji, but something tells me none of the maintainers would accept flag PRs by random users unless UTR/UTS#51 already recommended them. - - - - <- The last one currently already has support for UK countries, US states and Canadian provinces. Go figure. > The UTC Consideration?s ... would come into play in assessing those proposals. >? So it is certainly possible for there to be (say) a flag of Texas or >Catalonia > appearing in an Emoji 6.0 release this year. Those are desired, for sure, but so are emoji flags for Kurdistan, Confederated States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, Communism, Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian Aboriginals, and many more. Of these, only the Kurdish and the Sami flag *may* be covered by Unicode Emoji 5.0+ (possibly with multiple codes) until yet another (Tag-based) scheme is adopted. > Similarly, Microsoft could propose adding the ninja cat ZWJ sequences. I still fail to see how it is a good or smart thing to have to maintain Emoji Tag Sequences *and* Emoji ZWJ Sequences, when adopting the latter for flags would have had at least the following advantages: - actually useful fallback - application beyond ISO 3166 restrictions From kenwhistler at att.net Wed Mar 29 16:22:32 2017 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 29 Mar 2017 14:22:32 -0700 Subject: Traction and Deprecation (was: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> References: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Message-ID: <82f78fdc-42e7-0ad4-a8fc-717f732b05a9@att.net> On 3/29/2017 1:12 PM, Doug Ewell wrote: > Is that common practice in Unicode, that if something doesn't gain > significant traction in the comparatively short term, it becomes a > candidate for deprecation? If a mechanism was dodgy in the first place and was dubious as a part of plain text, then yes. If a mechanism is clearly a necessary part of the text model, but takes a while to catch on, because it is inherently complicated to implement and roll out, then no. Remember, it took a good part of a decade for significant support of combining marks to start appearing in Unicode implementations. Even longer for fairly good support of the Indic rendering models. If you are worried about the emoji tag sequence mechanism, then I'd say no. Once the use of regional indicator symbols caught on to represent flag emoji, that basically settled the question of whether pictographic symbols for flags were a part of plain text. Once the emoji tag sequences are rolled out for the regional flags (a process I can surmise is happening even now as we debate this), there will be no going back. You can be guaranteed, given the current attention to Brexit, that the tag sequence for the Scotland flag, at least, will leap up the emoji frequency list almost immediately. And that data will end up being supported essentially forever. --Ken From christoph.paeper at crissov.de Wed Mar 29 16:34:18 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 29 Mar 2017 23:34:18 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328222944.3c53914c@JRWUBU2> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> <20170328222944.3c53914c@JRWUBU2> Message-ID: <1002744966.74666.1490823258995.JavaMail.open-xchange@app09.ox.hosteurope.de> Richard Wordingham : > "Doug Ewell" wrote: > > > "Not recommended," "not standard," "not interoperable," or any other > > term ESC settles on for the 5000+ valid flag sequences that are not > > England, Scotland, and Wales is just a short, easy step away from > > deprecation for these as well. *Sigh* Instead of 26 RIS characters and all the TAGs, Unicode should have added a single new character: U+2065 Flag Code Joiner. > It's certainly on the cards that the sequence for the Scottish flag will > be deprecated in favour of an RI sequence. Which would very likely be U+1F1E6-1F1E7 ???? 'AB' for Alba, because all other intuitive alpha-2 code elements are either reserved or already assigned. From beckiergb at gmail.com Wed Mar 29 16:52:15 2017 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Wed, 29 Mar 2017 14:52:15 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On Wed, Mar 29, 2017 at 2:17 PM, Christoph P?per < christoph.paeper at crissov.de> wrote: > If I made an open-source emoji font that contained flags for all of the > 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be > considered a > vendor? Do I need to have to pay 18000(?) dollars a year for full > membership > first? (That's peanuts for multi-billion dollar companies, but > unaffordable for > most individuals and many FOSS projects.) > ... Those are desired, for sure, but so are emoji flags for Kurdistan, > Confederated > States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, > Communism, > Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian Aboriginals, > and > many more. Of these, only the Kurdish and the Sami flag *may* be covered by > Unicode Emoji 5.0+ (possibly with multiple codes) until yet another > (Tag-based) > scheme is adopted. > Heh, I actually started an open-source emoji font that kinda does this: https://github.com/kreativekorp/vexillo It encodes not only some subdivision flags using sequences like [usca], [ustx], and [caqc], but a whole lot of nowhere-near-standardized-for-encoding flags under the XX code, such as [xxcascadia], [xxconlangesperanto], [xxpridebisexual], [xxpridetrans], etc. And hey, it works already in OS X 10.8+ and Firefox, even if it makes text selection a little dodgy. :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Wed Mar 29 17:31:41 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 29 Mar 2017 15:31:41 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> References: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> Message-ID: <90cdaf94-281c-0bd5-7c0e-e56f0041ae9e@ix.netcom.com> On 3/29/2017 2:07 PM, Doug Ewell wrote: > Ken Whistler wrote: > >> *But*, the ones who do have flags on their phones don't want to be in >> the situation where the iPhone has a flag of Scotland which then shows >> up as a flag tofu on an Android phone, but an Android phone has a flag >> of Texas which then shows up as a flag tofu on on iPhone, etc., etc. >> That way leads to customer complaint madness, with 1000's (hundreds of >> 1000's?) of complaints: "My phone is screwed up, fix it!" > Doesn't this same problem exist for other emoji, or non-emoji, that are > supported on some phones but not others? What's the customer service > resolution in those cases? > Sure, let them go form a consortium and agree on which ones are in the recommended set. But why form a new consortium if you have one already where they are all members? Agreeing on recommended level of support in the sense of "best practice" is something that is done for many of the specifications, for example some of the algorithms. A useful guide in evaluating whether it's appropriate to "recommend" something is to treat it as if it was mandatory, but with a costly override option: if you decide to go against the recommendation you'd better have a really solid reason. Recommending to vendors to support a minimal set is one thing. Recommending to users to only use sequences from that set / or vendors to not extend coverage beyond the minimum is something else. Both use the word "recommendation" but the flavor is rather different (which becomes more obvious when you re-phrase as I suggested). That seems to be the source of the disconnect. A./ From verdy_p at wanadoo.fr Wed Mar 29 17:40:19 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 00:40:19 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: Note: in your collection you say that the EU flag is the flag of the European Union, actually it is a flag for Europe at whole, made and proposed since long by the CoE, Council of Europe (not the european Union that still did not exist, and not even the EEC or even the CECA that were also created after the European Council. The European Union displays the EC flag **under permission** permanently granted by the European Council. The non-EU members that are CoE members, or that were invited by the CoE, have a legal right to display it (so it includes as well Turkey since ever as it was a founding member of CoE, also Russia, Belarus even if its seat in the EC is suspended, Ukraine, Kazakhstan, Morocco, Vatican, Andorra, Iceland, Switzerland, Liechtenstein, Norway...). When the CECA was created (and later the European communities) it had initially no flag, but it rapidly started to reuse the European flag proposed by the EC, because every member of the European Community was also a member of EC, In ISO 3166-1 however the "EU" code was granted to the European Union (for legal reasons related to some WIPO standards with specific rules enforced throughout the EU, plus optionally some volunteer countries in the EEA). It usually displays the flag adopted by the CoE. There's no ISO 3166-1 code for Europe at the whole (does it exist legally if we can't clearly define its borders?) or the CoE itself (which has a logo derived now from the European flag, but distinctive and reserved as a logo and not encodable. Note that there's also a flag for a wider region with 56 countries covered by the EBU (Eurovision Broadcast union), including for example Israel, Palestine, Armenia, Georgia, Syria, Lebanon, Morocco, Algeria, Tunisia, Libya and Egypt (not to be confused by the logos used by the Eurovision song contest: these logos are not flags). However the EBU still does not include Kazahstan. The EBU howver is a private organization, and its "flag" looking like a blue "(O)" on white, is in fact a logo and not encodable. Another logo was used in the past that looked similar to the European flag with stars on a circle (this old logo, initially monochromatic using white stars on grey, slightly modernized, is still visible along with some video test patterns at start of some Eurovision broadcasts). 2017-03-29 23:52 GMT+02:00 Rebecca Bettencourt : > On Wed, Mar 29, 2017 at 2:17 PM, Christoph P?per < > christoph.paeper at crissov.de> wrote: > >> If I made an open-source emoji font that contained flags for all of the >> 5000ish >> ISO 3166-2 codes that actually map to one, would I automatically be >> considered a >> vendor? Do I need to have to pay 18000(?) dollars a year for full >> membership >> first? (That's peanuts for multi-billion dollar companies, but >> unaffordable for >> most individuals and many FOSS projects.) >> > > ... > > Those are desired, for sure, but so are emoji flags for Kurdistan, >> Confederated >> States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, >> Communism, >> Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian >> Aboriginals, and >> many more. Of these, only the Kurdish and the Sami flag *may* be covered >> by >> Unicode Emoji 5.0+ (possibly with multiple codes) until yet another >> (Tag-based) >> scheme is adopted. >> > > Heh, I actually started an open-source emoji font that kinda does this: > > https://github.com/kreativekorp/vexillo > > It encodes not only some subdivision flags using sequences like [usca], > [ustx], and [caqc], but a whole lot of nowhere-near-standardized-for-encoding > flags under the XX code, such as [xxcascadia], [xxconlangesperanto], > [xxpridebisexual], [xxpridetrans], etc. > > And hey, it works already in OS X 10.8+ and Firefox, even if it makes text > selection a little dodgy. :) > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irgendeinbenutzername at gmail.com Wed Mar 29 17:52:03 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Thu, 30 Mar 2017 00:52:03 +0200 Subject: Unicode Emoji 5.0 characters now final Message-ID: Ken Whistler wrote: > *But*, the ones who do have flags on their > phones don't want to be in the situation where the iPhone has a flag of > Scotland which then shows up as a flag tofu on an Android phone, but an > Android phone has a flag of Texas which then shows up as a flag tofu on > on iPhone, etc., etc. That way leads to customer complaint madness, with > 1000's (hundreds of 1000's?) of complaints: "My phone is screwed up, fix > it!" And this is where the problem becomes even worse. Because there are no ?flag tofus? for 3166-2 regions. Unlike Regional Indicator Sequences, the fallback for all unsupported tag sequences looks exactly the same and carries absolutely no meaning unless put through some Unicode analyzer machine: ?? WAVING BLACK FLAG, a well-supported emoji that means nothing in the context it is used in, followed by a single, featureless tofu. At least a text containing ten different unsupported RI sequences will show you ten distinct images, even if you are completely unaware that those peculiar pairs of colourful letters you?ve just been sent are used to build flag emoji. Heck, if your device has a default font that includes CANCEL TAG (like my phone does, but my laptop doesn?t) and therefore doesn?t render it, then you won?t even be able to see the difference between a regular, generic black flag and an emoji that was meant to represent some region. This could potentially lead to great misunderstandings since a plane black flag is often associated with anarchism and piracy, but rather rarely with England, Scotland or Wales. The waving white flag that was used as the base in earlier drafts at the very least had the benefit of looking like a ?blank slate? of sorts. This is one of the few cases where the terrible web browser of the Nintendo 3DS can actually be considered superior to any modern device because for some bizarre reason it applies modulo 65,536 to all code points on display, resulting in tag characters rendering as visible ASCII. It would have been much more sensible to construct subdivision flags out of new, visible characters just like RI sequences. That way we could have had a fallback rendering that is actually in any way useful. We could also have preserved the original properties of the tag characters. Last time I checked their correct usage for language tagging is still rigorously explained in the standard despite deprecation. But no, we absolutely had to put out this update as soon as possible because peoplez want da emojiz. We had to use existing characters for region sequences because if we had actually given ourselves enough time to properly think this whole endeavour through we couldn't have made the precious Scottish flag available until Unicode 11. (Although that hardly seems to matter anyways seeing how we apparently now release technical reports and data files that rely on certain characters before those characters even exist in the standard.) And we had to use the invisible tag characters from Plane 14 because potatoes, I guess. You know, back when Emoji Modifiers were released I was initially sceptical of them being spacing, visibly rendering pictographs rather than formatting characters. Nowadays I understand that decision. Too bad we were seemingly unable to make the same decision for flags. I eagerly await the return of hair colour tags in Emoji 6. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Mar 29 18:29:22 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 30 Mar 2017 00:29:22 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: Message-ID: <20170330002922.7843ab3c@JRWUBU2> On Thu, 30 Mar 2017 00:52:03 +0200 Charlotte Buff wrote: > And this is where the problem becomes even worse. Because there are no > ?flag tofus? for 3166-2 regions. Unlike Regional Indicator Sequences, > the fallback for all unsupported tag sequences looks exactly the same > and carries absolutely no meaning unless put through some Unicode > analyzer machine: ?? WAVING BLACK FLAG, a well-supported emoji that > means nothing in the context it is used in, followed by a single, > featureless tofu. At least a text containing ten different > unsupported RI sequences will show you ten distinct images, even if > you are completely unaware that those peculiar pairs of colourful > letters you?ve just been sent are used to build flag emoji. I don't see why the tag characters can't be represented by some form of corresponding ASCII characters as a fallback registering. The bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG declares a sequence of 3 to 6 intervening ordinary tags to be a flag emoji, and in an OpenType font a GSUB contextual substitution can easily convert unrecognised sequences to modified ASCII characters. It does not have to explicitly handle each possible combination. Richard. From irgendeinbenutzername at gmail.com Wed Mar 29 18:45:43 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Thu, 30 Mar 2017 01:45:43 +0200 Subject: Unicode Emoji 5.0 characters now final Message-ID: Richard Wordingham wrote: > I don't see why the tag characters can't be represented by some form of > corresponding ASCII characters as a fallback registering. The > bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG > declares a sequence of 3 to 6 intervening ordinary tags to be a flag > emoji, and in an OpenType font a GSUB contextual substitution can > easily convert unrecognised sequences to modified ASCII characters. It > does not have to explicitly handle each possible combination. I suppose this is an adequate solution, but it*?*s also needlessly convoluted in comparison to RIS where good fallback behaviour just happens automatically with only the most bare-bones font feature imaginable, i.e. simply displaying single characters one after another as they would appear anyways. It is also questionable whether most vendors are going to employ such system in the first place. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Mar 29 19:16:26 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 02:16:26 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170330002922.7843ab3c@JRWUBU2> References: <20170330002922.7843ab3c@JRWUBU2> Message-ID: 2017-03-30 1:29 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Thu, 30 Mar 2017 00:52:03 +0200 > Charlotte Buff wrote: > > > And this is where the problem becomes even worse. Because there are no > > ?flag tofus? for 3166-2 regions. Unlike Regional Indicator Sequences, > > the fallback for all unsupported tag sequences looks exactly the same > > and carries absolutely no meaning unless put through some Unicode > > analyzer machine: ?? WAVING BLACK FLAG, a well-supported emoji that > > means nothing in the context it is used in, followed by a single, > > featureless tofu. At least a text containing ten different > > unsupported RI sequences will show you ten distinct images, even if > > you are completely unaware that those peculiar pairs of colourful > > letters you?ve just been sent are used to build flag emoji. > > I don't see why the tag characters can't be represented by some form of > corresponding ASCII characters as a fallback registering. The > bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG > declares a sequence of 3 to 6 intervening ordinary tags to be a flag > emoji, and in an OpenType font a GSUB contextual substitution can > easily convert unrecognised sequences to modified ASCII characters. It > does not have to explicitly handle each possible combination. > I also think so: the unique black flag (even if it is marked on the corner with a ? on a diamond) is the worst solution. You can easily set up a left-side part showing the hoist and the start of the flag, a right part showing the floating end of the flag, and display the letters with top and bottom borders connecting together and with the left-side and right-side part. May be you can also arrange the letters in rows: the first top row for the 2-letter ISO 3166-1 code, the bottom row for the appended 1-to-4 characters code (letters and digits) of the subdivision. You may also improve the display by displaying the last letters on top of the national flag. If subdivision codes are known you may alternatively render a short name of the subdivision above or below the national flag (but here there's a problem of language choice: even if official names are accepted, some subdivisions have several official names in distinct languages, possibly in distinct scripts; and when there's only one, probably many users will have problems reading these labels in a foreign script, such as Arabic or Chinese) My opinion is that renderers should better support the interactive display of hints in the user language of its UI, independantly of the language of the encoded document itself, if the rendering engine is capable of such interactivity, provided that there's no other competing hint such as title attributes which may be used in HTML to explain the flag ven when it is actually rendered. The same will apply for non-graphical rendering such as aural rendering., instead of spelling the code letters (as a last fallback). May be it will be larger than an actual flag, but I see no problem at all if all flags do not have the same ratio (in fact ratios are already not the same for the official flags of recognized countries). There is absolutely no obligation -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Thu Mar 30 02:45:34 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 30 Mar 2017 09:45:34 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: > If I made an open-source emoji font that contained flags for all of the > 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be > considered a > vendor? > Do I need to have to pay 18000(?) dollars a year for full membership > first? (That's peanuts for multi-billion dollar companies, but > unaffordable for > most individuals and many FOSS projects.) > The answer to both of your questions is no. Please see http://unicode.org/emoji/selection.html#timeline for details. What the UTC is looking for is commitments from major vendors. It is not sufficient to join Unicode: we have members who are not major vendors of emoji. And there are some major vendors that are not members. Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Thu Mar 30 03:42:55 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 30 Mar 2017 17:42:55 +0900 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On 2017/03/30 06:17, Christoph P?per wrote: > Mark Davis ?? : >> That isn't really the case. In particular, vendors can propose adding >> additional subdivisions to the recommended list. > > Awesome, "vendors" can do that. (._.m) > > If I made an open-source emoji font that contained flags for all of the 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be considered a > vendor? I don't think so. But if you want to get more flags listed, then creating actual flags, with suitable licenses, and telling others to use them and tell other, and so on, may easily reach vendors sooner or later. > - > - > - > - <- > > > The last one currently already has support for UK countries, US states and > Canadian provinces. Go figure. And most if not all of these flags are from Wikimedia. So that shows that open source has some influence, even without money. Regards, Martin. From christoph.paeper at crissov.de Thu Mar 30 04:48:02 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 30 Mar 2017 11:48:02 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Philippe Verdy hat am 30. M?rz 2017 um 00:40 geschrieben: > There's no ISO 3166-1 code for Europe at the whole (does it exist legally if > we can't clearly define its borders?) `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. CLDR could safely adopt that if needed. No alpha-2 and hence no RIS sequence, though. An Emoji Tag Sequence would be straight-forward, though: U+1F3F4-E0031-E0035-E0030-E007F. From christoph.paeper at crissov.de Thu Mar 30 04:59:21 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 30 Mar 2017 11:59:21 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: Message-ID: <2046545915.16798.1490867961823.JavaMail.open-xchange@app08.ox.hosteurope.de> Charlotte Buff > > > Heck, if your device has a default font that includes CANCEL TAG (...) and > therefore doesn?t render it, > then you won?t even be able to see the difference between a regular, generic > black flag and an emoji that was meant to represent some region. > This could potentially lead to great misunderstandings since a plane black > flag is often associated with anarchism and piracy, > but rather rarely with England, Scotland or Wales. > The waving white flag that was used as the base in earlier drafts at the very > least had the benefit of looking like a ?blank slate? of sorts. White flags are associated with surrender (but also peace). That is at least as bad as a black flag. The checkered flag U+1F3C1 ?? could have been a compromise. It is also readily associated with sports. From mark at macchiato.com Thu Mar 30 06:58:47 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 30 Mar 2017 13:58:47 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Message-ID: > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. CLDR could safely adopt that if needed. No need to "safely adopt". It is already valid: http://www.unicode.org/reports/tr51/proposed.html#flag-emoji-tag-sequences If you follow the links you'll end up at http://unicode.org/repos/cldr/trunk/common/validity/region.xml And find that 150 is already valid. (For the format of that file, see LDML.) ==== Where people have looked at the documentation and their questions are still not answered, that feedback is useful so that the documentation can be improved. But it appears that at least some people haven't bothered to do that, when it could answer a lot of the questions/complaints on this list. Mark On Thu, Mar 30, 2017 at 11:48 AM, Christoph P?per < christoph.paeper at crissov.de> wrote: > Philippe Verdy hat am 30. M?rz 2017 um 00:40 > geschrieben: > > > There's no ISO 3166-1 code for Europe at the whole (does it exist > legally if > > we can't clearly define its borders?) > > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. > CLDR > could safely adopt that if needed. > > No alpha-2 and hence no RIS sequence, though. An Emoji Tag Sequence would > be > straight-forward, though: U+1F3F4-E0031-E0035-E0030-E007F. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Mar 30 09:58:09 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 07:58:09 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> Asmus Freytag wrote: > Recommending to vendors to support a minimal set is one thing. > Recommending to users to only use sequences from that set / or vendors > to not extend coverage beyond the minimum is something else. Both use > the word "recommendation" but the flavor is rather different (which > becomes more obvious when you re-phrase as I suggested). > > That seems to be the source of the disconnect. That seems a fair analysis. Another way of putting this is that marking a particular subset of valid sequences as "recommended" is one thing, while listing sequences in a table with a column "Standard sequence?", with some sequences marked "Yes" and others marked "No," is something else. Equivalently, characterizing a group of valid sequences as "Valid, but not recommended" is something else. If the goal is to tell users that three of the sequences are especially likely to be supported, or to tell vendors that they should prioritize support for these three, then "recommended" and "additional," used as a pair, would be more appropriate. If the goal is to tell users "we don't want you to use the other 5100 sequences" and to tell vendors "we don't want you to offer support for them," then the existing wording is fine. -- Doug Ewell | Thornton, CO, US | ewellic.org From wjgo_10009 at btinternet.com Thu Mar 30 09:03:11 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 30 Mar 2017 15:03:11 +0100 (BST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> > What the UTC is looking for is commitments from major vendors. Well should it be applying such a filter on progress? I opine that assessment should be on merit and that new ideas should be considered on an even-handed basis. Progress should not be on the basis of what major vendors choose to do. Requiring commitments from major vendors could be a barrier to new enterprises developing and a barrier to progress for the benefit of consumers being made. > Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. What does 1B DAUs mean please? William Overington Thursday 30 March 2017 From doug at ewellic.org Thu Mar 30 12:12:04 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 10:12:04 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170330101204.665a7a7059d7ee80bb4d670165c8327d.19670e161a.wbe@email03.godaddy.com> William_J_G Overington wrote: >> Of course, there is some judgment involved as to what constitutes >> "major": at one extreme clearly 1B DAUs qualifies, and at the other >> extreme, 1K doesn't. > > What does 1B DAUs mean please? >From http://acronyms.thefreedictionary.com/DAU I gathered that this might be search-engine industry jargon for "1 billion daily active users" as opposed to 1000 of them. -- Doug Ewell | Thornton, CO, US | ewellic.org From charupdate at orange.fr Thu Mar 30 14:06:39 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 30 Mar 2017 21:06:39 +0200 (CEST) Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> Message-ID: <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> On Thu, 30 Mar 2017 15:03:11 +0100 (BST), William_J_G Overington wrote: > > > What the UTC is looking for is commitments from major vendors. > > Well should it be applying such a filter on progress? > > I opine that assessment should be on merit and that new ideas should be > considered on an even-handed basis. Progress should not be on the basis of > what major vendors choose to do. Requiring commitments from major vendors > could be a barrier to new enterprises developing and a barrier to progress > for the benefit of consumers being made. That?s exactly the point: that the marketplace should be tailored for the benefit of consumers, not for the sole benefit of vendors. Instead, the question seems always to be ?who is paying for it?? Another example has been recently discussed: the use of superscript letters is ?discouraged?, seemingly to prevent a set of consumers from being able to write in an acceptable way a couple of languages in plain text, and to subjugate these customers to the use of a series of rich text software. The problem is not whether to use high-end software or not, but the way how users get their stuff messed up if they don?t. When it was up to encode the first set of superscript Latin letters in Unicode 1.0 ? or were they *too* enforced by Bruce Paterson of ISO/IEC?10646? ? all straightforward people surely were going to follow the pattern of: 2071 SUPERSCRIPT LATIN SMALL LETTER I * functions as a modifier letter # 0069 207F SUPERSCRIPT LATIN SMALL LETTER N * functions as a modifier letter # 006E @ Latin subscript modifier letters 1D62 LATIN SUBSCRIPT SMALL LETTER I # 0069 1D63 LATIN SUBSCRIPT SMALL LETTER R # 0072 1D64 LATIN SUBSCRIPT SMALL LETTER U # 0075 1D65 LATIN SUBSCRIPT SMALL LETTER V # 0076 and name them accordingly. But given the way of finally calling them: @@ 02B0 Spacing Modifier Letters 02FF @ Latin superscript modifier letters x (superscript latin small letter i - 2071) x (superscript latin small letter n - 207F) 02B0 MODIFIER LETTER SMALL H * aspiration # 0068 02B1 MODIFIER LETTER SMALL H WITH HOOK and so on, somebody must have arisen telling ?Wait! if we label them as what they are, folks will use these instead of our software, so let?s disguise them a bit!? As a result, we?ve ended up with every script on earth being writeable in plain text except Latin. That seems to be an abuse of dominant position, to make an unknown amount of more bargain at the expense of a relatively narrow subset of disfavored end-users, as if the usefulness of vendors? software would essentially depend on one single feature: superscript formatting. Regards, Marcel From verdy_p at wanadoo.fr Thu Mar 30 14:13:29 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 21:13:29 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Message-ID: 2017-03-30 11:48 GMT+02:00 Christoph P?per : > Philippe Verdy hat am 30. M?rz 2017 um 00:40 > geschrieben: > > > There's no ISO 3166-1 code for Europe at the whole (does it exist > legally if > > we can't clearly define its borders?) > > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. > CLDR > could safely adopt that if needed. > I have not seen a clear statement that UN M.49 code 150 for Europe (as a whole) was related to the EU assignment in ISO 3166-1 which refers to the European Union (but in fact still refers legally to the European Community the only part legally recognized, even the the European Union attempted to unify the communities this unification was partial, and three separat "pilars" were kept). I've clearly read that EU was assigned in ISO3166 only because of its use in WIPO standards. There are some other assignments made for keeping compatibility with ITU standards, or with the Postal Union. Note the ITU also defines a "European broadcasting region" that covers north Africa and come countries of the Middle East: it is the base of existence of the EBU (Eurovision), the second base being also the Council of Europe one or the other being a requirement for full membership. The ITU definition is appropriate because it matchs with coverage areas by satellites. So I don't think there is any equivalence between code 150 and code EU (which includes parts outside 150, for example some of the French and Dutch overseas dependencies in America, and Africa). After the "Brexit" we don't know if GB will still be part of EU for WIPO standards.. But British domain names registered in the ".eu" ccTLD will remain valid (the TLD is not bound to the same rules as WIPO standards). As far as I have seen GB will keep its existing status in WIPO so it should still be part the "EU" code in ISO3166-1, unless its own membership in WIPO is amended (I have doubt it will ever happen, GB would loose some of their existing IP right protection). -------------- next part -------------- An HTML attachment was scrubbed... URL: From tuvalkin at gmail.com Thu Mar 30 15:17:24 2017 From: tuvalkin at gmail.com (=?UTF-8?Q?Ant=c3=b3nio_Martins-Tuv=c3=a1lkin?=) Date: Thu, 30 Mar 2017 21:17:24 +0100 Subject: Encoding of old compatibility characters In-Reply-To: <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: On 2017.03.29 05:41, Leo Broukhis asked: > Are you still using Windows 7 or RedHat 5, or something equally old? > Newer systems have ? out of the box. I?m using Windows XP and "?" renders perfectly as "??". Maybe fonts can be installed without ?upgrading? the whole operating system? Who knew?! -- ____. Ant?nio MARTINS-Tuv?lkin | ()| |####| PT-1500-239 Lisboa N?o me invejo de quem tem | PT-2695-010 Bobadela LRS carros, parelhas e montes | +351 934 821 700, +351 212 463 477 s? me invejo de quem bebe | facebook.com/profile.php?id=744658416 a ?gua em todas as fontes | --------------------------------------------------------------------- De sable uma fonte e bordadura escaqueada de jalde e goles por timbre bandeira por mote o 1? verso acima e por grito de guerra "Mi rajtas!" --------------------------------------------------------------------- From doug at ewellic.org Thu Mar 30 16:39:18 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 14:39:18 -0700 Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) Message-ID: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> The UN "M49 Standard" (that's how they're styling it now; I guess we should stop writing "M.49") assigns a code element for each "country or area" and groups these into "geographical regions." To find the "countries or areas" included within code element 150 for "Europe," simply visit https://unstats.un.org/unsd/methodology/m49/ , select Geographic Regions from the menu at the left, and expand the entries for Europe and its four subregions. The lists are available in six languages, including French. To find the countries that make up the European Union at any given moment, visit http://europa.eu/european-union/about-eu/countries_fr (or similar for other EU languages). As is well known, this list has changed in the past and will change in the future. The point is that UNSD's definition of Europe and the roster of the European Union are different lists, and no attempt is made by either organization to make these lists identical or to explain or justify differences. -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Thu Mar 30 17:02:13 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 31 Mar 2017 00:02:13 +0200 Subject: Encoding of old compatibility characters In-Reply-To: References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: Probably you've installed the Noto collection on your Windows XP, or installed some software that added fonts to the system (pmossibly with updates to the Uniscribe library, suc has an old version of Office). Anyway I would no longer trust XP for doing correct rendering for many scripts, even with Uniscribe which is not needed for this simple character mapped in the BMP. Now minimal support in XP is essentially by third party software providers. Most have resigned, except Mozilla and some security suites that attempt to fill the gaps abandonned now by Microsoft (but still maintain it... because there are still various banks using it for example in their ATM: you know it when you frequently see the ATM rebooting of sometimes unusable as it has crashed with a "BSOD" displayed). 2017-03-30 22:17 GMT+02:00 Ant?nio Martins-Tuv?lkin : > On 2017.03.29 05:41, Leo Broukhis asked: > > Are you still using Windows 7 or RedHat 5, or something equally old? >> Newer systems have ? out of the box. >> > > I?m using Windows XP and "?" renders perfectly as "??". Maybe fonts can > be installed without ?upgrading? the whole operating system? Who knew?! > > -- ____. > Ant?nio MARTINS-Tuv?lkin | ()| > |####| > PT-1500-239 Lisboa N?o me invejo de quem tem | > PT-2695-010 Bobadela LRS carros, parelhas e montes | > +351 934 821 700, +351 212 463 477 s? me invejo de quem bebe | > facebook.com/profile.php?id=744658416 a ?gua em todas as fontes | > --------------------------------------------------------------------- > De sable uma fonte e bordadura escaqueada de jalde e goles por timbre > bandeira por mote o 1? verso acima e por grito de guerra "Mi rajtas!" > --------------------------------------------------------------------- > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Thu Mar 30 17:16:12 2017 From: c933103 at gmail.com (gfb hjjhjh) Date: Fri, 31 Mar 2017 06:16:12 +0800 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On the topic I am surprised to see the only large Chinese comoany in the member list is Huawei, with none of large Chinese internet company, including Baidu, Alibaba, Tencent, Sina, Netease participating in Unicode. In associate member list there is a company named zhongyi but that link is already 404ed... > > > 2017?3?30? 15:51 ? "Mark Davis ??" ??? >> >> >>> If I made an open-source emoji font that contained flags for all of the 5000ish >>> ISO 3166-2 codes that actually map to one, would I automatically be considered a >>> vendor? >> >> >>> Do I need to have to pay 18000(?) dollars a year for full membership >>> first? (That's peanuts for multi-billion dollar companies, but unaffordable for >>> most individuals and many FOSS projects.) >> >> >> The answer to both of your questions is no. >> >> Please see http://unicode.org/emoji/selection.html#timeline for details. What the UTC is looking for is commitments from major vendors. It is not sufficient to join Unicode: we have members who are not major vendors of emoji. And there are some major vendors that are not members. >> >> Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. >> >> Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Thu Mar 30 19:49:01 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 00:49:01 +0000 Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> Message-ID: The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider Sent: Thursday, March 30, 2017 12:07 PM To: unicode at unicode.org Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) On Thu, 30 Mar 2017 15:03:11 +0100 (BST), William_J_G Overington wrote: > > > What the UTC is looking for is commitments from major vendors. > > Well should it be applying such a filter on progress? > > I opine that assessment should be on merit and that new ideas should > be considered on an even-handed basis. Progress should not be on the > basis of what major vendors choose to do. Requiring commitments from > major vendors could be a barrier to new enterprises developing and a > barrier to progress for the benefit of consumers being made. That?s exactly the point: that the marketplace should be tailored for the benefit of consumers, not for the sole benefit of vendors. Instead, the question seems always to be ?who is paying for it?? Another example has been recently discussed: the use of superscript letters is ?discouraged?, seemingly to prevent a set of consumers from being able to write in an acceptable way a couple of languages in plain text, and to subjugate these customers to the use of a series of rich text software. The problem is not whether to use high-end software or not, but the way how users get their stuff messed up if they don?t. When it was up to encode the first set of superscript Latin letters in Unicode 1.0 ? or were they *too* enforced by Bruce Paterson of ISO/IEC?10646? ? all straightforward people surely were going to follow the pattern of: 2071 SUPERSCRIPT LATIN SMALL LETTER I * functions as a modifier letter # 0069 207F SUPERSCRIPT LATIN SMALL LETTER N * functions as a modifier letter # 006E @ Latin subscript modifier letters 1D62 LATIN SUBSCRIPT SMALL LETTER I # 0069 1D63 LATIN SUBSCRIPT SMALL LETTER R # 0072 1D64 LATIN SUBSCRIPT SMALL LETTER U # 0075 1D65 LATIN SUBSCRIPT SMALL LETTER V # 0076 and name them accordingly. But given the way of finally calling them: @@ 02B0 Spacing Modifier Letters 02FF @ Latin superscript modifier letters x (superscript latin small letter i - 2071) x (superscript latin small letter n - 207F) 02B0 MODIFIER LETTER SMALL H * aspiration # 0068 02B1 MODIFIER LETTER SMALL H WITH HOOK and so on, somebody must have arisen telling ?Wait! if we label them as what they are, folks will use these instead of our software, so let?s disguise them a bit!? As a result, we?ve ended up with every script on earth being writeable in plain text except Latin. That seems to be an abuse of dominant position, to make an unknown amount of more bargain at the expense of a relatively narrow subset of disfavored end-users, as if the usefulness of vendors? software would essentially depend on one single feature: superscript formatting. Regards, Marcel From boldewyn at gmail.com Fri Mar 31 02:10:08 2017 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 31 Mar 2017 09:10:08 +0200 Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> References: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> Message-ID: Maybe I'm missing context, but what is the specific problem of those lists differing? The EU and Europe _are_ two different things. The United States of America similarly do not include the whole of America, despite the name. And Norway and Switzerland and some others (incl. soon England) might not be too happy with either institution to make a forced move to unify those lists. ?Manuel 2017-03-30 23:39 GMT+02:00 Doug Ewell : > The UN "M49 Standard" (that's how they're styling it now; I guess we > should stop writing "M.49") assigns a code element for each "country or > area" and groups these into "geographical regions." > > To find the "countries or areas" included within code element 150 for > "Europe," simply visit https://unstats.un.org/unsd/methodology/m49/ , > select Geographic Regions from the menu at the left, and expand the > entries for Europe and its four subregions. The lists are available in > six languages, including French. > > To find the countries that make up the European Union at any given > moment, visit http://europa.eu/european-union/about-eu/countries_fr (or > similar for other EU languages). As is well known, this list has changed > in the past and will change in the future. > > The point is that UNSD's definition of Europe and the roster of the > European Union are different lists, and no attempt is made by either > organization to make these lists identical or to explain or justify > differences. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eliz at gnu.org Fri Mar 31 02:57:11 2017 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 31 Mar 2017 10:57:11 +0300 Subject: Encoding of old compatibility characters In-Reply-To: (message from Philippe Verdy on Fri, 31 Mar 2017 00:02:13 +0200) References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: <83fuht6fqg.fsf@gnu.org> > From: Philippe Verdy > Date: Fri, 31 Mar 2017 00:02:13 +0200 > Cc: unicode Unicode Discussion > > Probably you've installed the Noto collection on your Windows XP, or installed some software that added fonts > to the system (pmossibly with updates to the Uniscribe library, suc has an old version of Office). Arial Unicode MS supports that character, FWIW. From philip_chastney at yahoo.com Fri Mar 31 03:24:01 2017 From: philip_chastney at yahoo.com (philip chastney) Date: Fri, 31 Mar 2017 08:24:01 +0000 (UTC) Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) References: <1372498955.7962993.1490948641169.ref@mail.yahoo.com> Message-ID: <1372498955.7962993.1490948641169@mail.yahoo.com> ahem -- as I expect you're well aware, it's the United Kingdon that's opting to quit the EU, and England is only a part of the United Kingdom ... and the United Kingdon, in turn, only covers part of the British Isles /phil -------------------------------------------- On Fri, 31/3/17, Manuel Strehl wrote: Subject: Re: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) To: "Unicode Mailing List" Date: Friday, 31 March, 2017, 7:10 AM Maybe I'm missing context, but what is the specific problem of those lists differing? The EU and Europe _are_ two different things. The United States of America similarly do not include the whole of America, despite the name. And Norway and Switzerland and some others (incl. soon England) might not be too happy with either institution to make a forced move to unify those lists. ?Manuel 2017-03-30 23:39 GMT+02:00 Doug Ewell : The UN "M49 Standard" (that's how they're styling it now; I guess we should stop writing "M.49") assigns a code element for each "country or area" and groups these into "geographical regions." To find the "countries or areas" included within code element 150 for "Europe," simply visit https://unstats.un.org/unsd/ methodology/m49/ , select Geographic Regions from the menu at the left, and expand the entries for Europe and its four subregions. The lists are available in six languages, including French. To find the countries that make up the European Union at any given moment, visit http://europa.eu/european- union/about-eu/countries_fr (or similar for other EU languages). As is well known, this list has changed in the past and will change in the future. The point is that UNSD's definition of Europe and the roster of the European Union are different lists, and no attempt is made by either organization to make these lists identical or to explain or justify differences. -- Doug Ewell | Thornton, CO, US | ewellic.org From mark at macchiato.com Fri Mar 31 05:03:14 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 31 Mar 2017 12:03:14 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> References: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> Message-ID: Ken's observation "?approximately backwards?" is exactly right, and that's the same reason why Markus suggested something along the lines of "interoperable". I don't think we've come up with a pithy category name yet, but I tried different wording on the slides on http://unicode.org/emoji/. See what you think, Doug. Mark Mark On Thu, Mar 30, 2017 at 4:58 PM, Doug Ewell wrote: > Asmus Freytag wrote: > > > Recommending to vendors to support a minimal set is one thing. > > Recommending to users to only use sequences from that set / or vendors > > to not extend coverage beyond the minimum is something else. Both use > > the word "recommendation" but the flavor is rather different (which > > becomes more obvious when you re-phrase as I suggested). > > > > That seems to be the source of the disconnect. > > That seems a fair analysis. > > Another way of putting this is that marking a particular subset of valid > sequences as "recommended" is one thing, while listing sequences in a > table with a column "Standard sequence?", with some sequences marked > "Yes" and others marked "No," is something else. > > Equivalently, characterizing a group of valid sequences as "Valid, but > not recommended" is something else. > > If the goal is to tell users that three of the sequences are especially > likely to be supported, or to tell vendors that they should prioritize > support for these three, then "recommended" and "additional," used as a > pair, would be more appropriate. > > If the goal is to tell users "we don't want you to use the other 5100 > sequences" and to tell vendors "we don't want you to offer support for > them," then the existing wording is fine. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Mar 31 10:47:55 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 08:47:55 -0700 Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) Message-ID: <20170331084755.665a7a7059d7ee80bb4d670165c8327d.2259ebbb4d.wbe@email03.godaddy.com> Manuel Strehl wrote: > Maybe I'm missing context, but what is the specific problem of those > lists differing? > > The EU and Europe _are_ two different things. The United States of > America similarly do not include the whole of America, despite the > name. A previous offshoot of the flag thread had veered into discussion of the UN code element for Europe, and the ISO exceptionally reserved code element for the EU, and the lists of countries in each, and something about WIPO and ITU and ccTLDs. I was pointing out what you said, that the lists differ by nature and comparing them is a fruitless exercise. -- Doug Ewell | Thornton, CO, US | ewellic.org From petercon at microsoft.com Fri Mar 31 10:59:26 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 15:59:26 +0000 Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> References: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>, <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> Message-ID: William, you completely miss the point: As long as Unicode is the way to provide emoji to consumers, their needs and desires will not be best or fully met. Unicode as an AND gate is too many AND gates. Peter Sent from my Windows 10 phone From: William_J_G Overington Sent: Friday, March 31, 2017 7:50 AM To: Peter Constable; unicode at unicode.org Subject: Re: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) Peter Constable wrote: > The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. The consumers can only choose from what is available to consumers. So what the Unicode Technical Committee recommends or "not-recommends" may well have a very significant effect upon the choices available to the consumer. > If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. Sally may not know that the Unicode Technical Committee exists. Sally may have bought her computer or mobile telephone and just uses it, choosing from the emoji available in a menu system, perhaps never realizing all of the detailed standards work and implementation work that took place before the device was manufactured. It is not that Sally is having a particular emoji recommended to her as such, yet if the Unicode Technical Committee "not-recommends" implementation of some emoji that are in the standards document, then Sally may never get the opportunity to choose to use those emoji. > So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Well, all of the various standards needed to produce useful products are important. It is not a matter of one being considered before the other. For a particular emoji to become available in a device that is available to a consumer there are various stages. They are like an AND gate where all inputs must be true in order for the result to be true. The Unicode Technical Committee has enormous power and influence to affect the future of information technology. It works both ways. Where an encoding is made there can be progress, yet where an idea is rejected then there is no way forward for an interoperable plain text encoding to become achieved. I submitted a document in 2015. It was determined to be out of scope and was not included in the Document Register and the Unicode Technical Committee did not consider it. I submitted a later version and received no reply about it at all. So I cannot make progress over an interoperable plain text encoding becoming implemented at the present time. Quite a number of UTC meetings have taken place since. Yet the scope of Unicode is a people-made rule, it could change if people with influence want it to change. The UTC could consider my document and hold a Public Review if it chose to do so. So, the Unicode Technical Committee has enormous power and influence to affect the future of information technology. When a "not-recommendation" of what to support takes place the decision to do that "not-recommending" can have significant and long-lasting effects on progress. William Overington Friday 31 March 2017 -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Mar 31 09:50:08 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 31 Mar 2017 15:50:08 +0100 (BST) Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> References: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> Message-ID: <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> Peter Constable wrote: > The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. The consumers can only choose from what is available to consumers. So what the Unicode Technical Committee recommends or "not-recommends" may well have a very significant effect upon the choices available to the consumer. > If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. Sally may not know that the Unicode Technical Committee exists. Sally may have bought her computer or mobile telephone and just uses it, choosing from the emoji available in a menu system, perhaps never realizing all of the detailed standards work and implementation work that took place before the device was manufactured. It is not that Sally is having a particular emoji recommended to her as such, yet if the Unicode Technical Committee "not-recommends" implementation of some emoji that are in the standards document, then Sally may never get the opportunity to choose to use those emoji. > So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Well, all of the various standards needed to produce useful products are important. It is not a matter of one being considered before the other. For a particular emoji to become available in a device that is available to a consumer there are various stages. They are like an AND gate where all inputs must be true in order for the result to be true. The Unicode Technical Committee has enormous power and influence to affect the future of information technology. It works both ways. Where an encoding is made there can be progress, yet where an idea is rejected then there is no way forward for an interoperable plain text encoding to become achieved. I submitted a document in 2015. It was determined to be out of scope and was not included in the Document Register and the Unicode Technical Committee did not consider it. I submitted a later version and received no reply about it at all. So I cannot make progress over an interoperable plain text encoding becoming implemented at the present time. Quite a number of UTC meetings have taken place since. Yet the scope of Unicode is a people-made rule, it could change if people with influence want it to change. The UTC could consider my document and hold a Public Review if it chose to do so. So, the Unicode Technical Committee has enormous power and influence to affect the future of information technology. When a "not-recommendation" of what to support takes place the decision to do that "not-recommending" can have significant and long-lasting effects on progress. William Overington Friday 31 March 2017 From doug at ewellic.org Fri Mar 31 12:38:03 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 10:38:03 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> Mark Davis wrote: > Ken's observation "?approximately backwards?" is exactly right, and > that's the same reason why Markus suggested something along the lines > of "interoperable". If the list was arrived at by members of the Consortium who are vendors responsible for implementing (or not) emoji flags, then it would be good to state this fact rather clearly and visibly. Otherwise it really does look like UTC doing the recommending, and the recommending-against. > I don't think we've come up with a pithy category name yet, but I > tried different wording on the slides on http://unicode.org/emoji/. > See what you think, Doug. Slide 37 (speaker's notes) says: "While at this point only three flags are on the recommended list, implementations can provide other subdivision flags." That's not a problem, except for being buried in speaker's notes. It implies that all valid sequences are fine but some might not be universally supported. That's normal for Unicode. Slide 38 (slide and speaker's notes) says: "Valid (but not recommended for vendors)" Nope. That brings it right back to "Hey, vendors, Unicode recommends that you don't support these." As I said Thursday, if that is the intent, then don't change the wording; it's perfect as is. The wordsmithing -- if that's all it is and not truly a warning-against -- needs to apply primarily to the "not recommended" category. I suggested "additional" to remove the explicit negative of "not recommended" and "Standard? - No." In today's tread-lightly speech, "not recommended" has the strong sense of "recommended against." Eating poison ivy is Not Recommended. -- Doug Ewell | Thornton, CO, US | ewellic.org From petercon at microsoft.com Fri Mar 31 17:06:54 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 22:06:54 +0000 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> References: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> Message-ID: Would "are not very likely to be well-supported in common platforms or applications" work? Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, March 31, 2017 10:38 AM To: Mark Davis ?? Cc: Asmus Freytag ; Unicode Mailing List Subject: RE: Unicode Emoji 5.0 characters now final Mark Davis wrote: > Ken's observation "?approximately backwards?" is exactly right, and > that's the same reason why Markus suggested something along the lines > of "interoperable". If the list was arrived at by members of the Consortium who are vendors responsible for implementing (or not) emoji flags, then it would be good to state this fact rather clearly and visibly. Otherwise it really does look like UTC doing the recommending, and the recommending-against. > I don't think we've come up with a pithy category name yet, but I > tried different wording on the slides on http://unicode.org/emoji/. > See what you think, Doug. Slide 37 (speaker's notes) says: "While at this point only three flags are on the recommended list, implementations can provide other subdivision flags." That's not a problem, except for being buried in speaker's notes. It implies that all valid sequences are fine but some might not be universally supported. That's normal for Unicode. Slide 38 (slide and speaker's notes) says: "Valid (but not recommended for vendors)" Nope. That brings it right back to "Hey, vendors, Unicode recommends that you don't support these." As I said Thursday, if that is the intent, then don't change the wording; it's perfect as is. The wordsmithing -- if that's all it is and not truly a warning-against -- needs to apply primarily to the "not recommended" category. I suggested "additional" to remove the explicit negative of "not recommended" and "Standard? - No." In today's tread-lightly speech, "not recommended" has the strong sense of "recommended against." Eating poison ivy is Not Recommended. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Fri Mar 31 17:38:00 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 15:38:00 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> Peter Constable wrote: > Would "are not very likely to be well-supported in common platforms or > applications" work? No, I think it should be even longer, maybe a paragraph or two, because the concept of "A-list" versus "everything else" is just too complex and unfamiliar to express concisely. What's wrong with "other" or "additional" in contrast to "recommended" or "preferred"? Or is the intent really to say "don't use these"? -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Fri Mar 31 18:43:18 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 31 Mar 2017 16:43:18 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> References: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> Message-ID: An HTML attachment was scrubbed... URL: