From alastair at alastairs-place.net Wed Mar 1 03:43:57 2017
From: alastair at alastairs-place.net (Alastair Houghton)
Date: Wed, 1 Mar 2017 09:43:57 +0000
Subject: Northern Khmer on iPhone
In-Reply-To: <20170228210056.6e56fcf9@JRWUBU2>
References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2>
Message-ID: <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net>

On 28 Feb 2017, at 21:00, Richard Wordingham wrote:
>
> On Tue, 28 Feb 2017 07:37:10 +0000
> Richard Wordingham wrote:
>
>> Does iPhone support the use of Northern Khmer in Thai script? I would
>> count an interface in Thai as support.
>>
>> The reason I ask is that I tried entering the word กฺี <U+0E01 THAI
>> CHARACTER KO KAI, U+0E3A THAI CHARACTER PHINTHU, U+0E35 THAI CHARACTER
>> SARA II> 'he' and got a dotted circle. I also got a dotted circle for
>> the alternative spelling .
>
> It's been suggested to me that this is just a font issue.
> Unfortunately, it seems that one can't change the font without
> jailbreaking the phone.

It's definitely a font issue - the same problem exists on macOS Sierra (if I change the message to Rich Text, such that the font used is Helvetica, I see the same dotted circle problem; the fixed-width font I use, SF Mono, does not have this problem). The best solution here may be to file a bug report at asking for font support, assuming the program you were using is using one of the Apple-supplied fonts.

(Also, FYI, iOS applications can - and some do - install and use their own fonts. It's per-application, though; you can't install them system-wide.)

Kind regards,

Alastair.

--
http://alastairs-place.net

From jean.aurambault at gmail.com Wed Mar 1 14:56:23 2017
From: jean.aurambault at gmail.com (Jean Aurambault)
Date: Wed, 1 Mar 2017 12:56:23 -0800
Subject: Translations of city names
Message-ID:

Hi,

I'm looking for (lightweight) libraries to translate city names, potentially country as well (but I know that's available in CLDR/ICU in some ways).
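One way to serve such translations without a database is to pre-extract localized names into a static tab-separated asset and load it at startup. The sketch below assumes the tab-separated field layout documented for GeoNames' alternateNames dump (alternateNameId, geonameid, isolanguage, name, ...); the sample rows are hand-made for illustration, not real data:

```python
import csv
import io

# Minimal sketch: load localized place names from a static tab-separated
# asset. The field layout follows GeoNames' documented alternateNames
# format (alternateNameId, geonameid, isolanguage, name, ...); the rows
# below are an invented sample, not an actual extract.
SAMPLE = (
    "1\t2618425\tda\tKøbenhavn\t1\t\n"
    "2\t2618425\ten\tCopenhagen\t\t\n"
    "3\t2618425\tfr\tCopenhague\t\t\n"
    "4\t2618425\tlink\thttps://en.wikipedia.org/wiki/Copenhagen\t\t\n"
)

# Pseudo-language codes GeoNames uses for non-name rows (URLs, postal
# codes, airport codes, abbreviations).
NON_LANGUAGES = {"link", "post", "iata", "icao", "abbr", ""}

def load_names(fileobj):
    """Return a {geonameid: {language: name}} mapping."""
    names = {}
    for row in csv.reader(fileobj, delimiter="\t"):
        geonameid, lang, name = row[1], row[2], row[3]
        if lang in NON_LANGUAGES:
            continue
        names.setdefault(geonameid, {})[lang] = name
    return names

names = load_names(io.StringIO(SAMPLE))
print(names["2618425"]["da"])  # København
print(names["2618425"]["en"])  # Copenhagen
```

Real dumps also carry isPreferredName/isShortName flags, which could be used to pick one canonical name per language.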
Ideally it wouldn't need a database but rely on static assets.

I'm wondering if there is any standard that defines a universal city id (similar to country codes).

Wikipedia has lots of information on exonyms in different languages.

I also found things like http://www.geonames.org/ which seems to have a complete dataset with translations in many languages, but the relevant data would need to be extracted.

Right now we use an old version of the Maxmind library to get geolocated data that has no translation. It looks like newer versions have some translations, but not enough languages are supported.

Any recommendation?

Best,
Jean
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Wed Mar 1 15:37:07 2017
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 1 Mar 2017 21:37:07 +0000
Subject: Translations of city names
In-Reply-To:
References:
Message-ID: <20170301213707.733696d4@JRWUBU2>

On Wed, 1 Mar 2017 12:56:23 -0800
Jean Aurambault wrote:

> I'm wondering if there is any standard that defines a universal city
> id (similar to country codes).

ISO 3166-2 defines codes for some cities, but it's uneven. However, what's a city? Does Constantinople exist?

Richard.

From unicode at lindenbergsoftware.com Thu Mar 2 03:06:57 2017
From: unicode at lindenbergsoftware.com (Norbert Lindenberg)
Date: Thu, 2 Mar 2017 18:06:57 +0900
Subject: Northern Khmer on iPhone
In-Reply-To: <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net>
References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net>
Message-ID: <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com>

On iOS, applications can and do install custom fonts for system-wide use, although the installation user experience is pretty bad:
http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html

Norbert

> On Mar 1, 2017, at 18:43, Alastair Houghton wrote:
[…]
> (Also, FYI, iOS applications can - and some do - install and use their own fonts. It's per-application, though; you can't install them system-wide.)

From sisrivas at blueyonder.co.uk Thu Mar 2 04:22:00 2017
From: sisrivas at blueyonder.co.uk (srivas sinnathurai)
Date: Thu, 2 Mar 2017 10:22:00 +0000 (GMT)
Subject: Translations of city names
In-Reply-To: <20170301213707.733696d4@JRWUBU2>
References: <20170301213707.733696d4@JRWUBU2>
Message-ID: <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net>

I think there are telephone area codes throughout the world.

> On 01 March 2017 at 21:37 Richard Wordingham wrote:
>
> On Wed, 1 Mar 2017 12:56:23 -0800
> Jean Aurambault wrote:
>
> > I'm wondering if there is any standard that defines a universal city
> > id (similar to country codes).
>
> ISO 3166-2 defines codes for some cities, but its uneven. However,
> what's a city? Does Constantinople exist?
>
> Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Thu Mar 2 05:20:40 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 2 Mar 2017 12:20:40 +0100
Subject: Translations of city names
In-Reply-To: <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net>
References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net>
Message-ID:

Wrong, many countries have largely relaxed their phone number plans by using a single nationwide plan and allowing portability of numbers. Area codes are no longer needed (there is a single call rate nationwide, and the rate only depends on operators; ranges of numbers are also allocated nationwide for value-added services; long-distance calls are a thing of the past since the widespread adoption of mobile phones, which are also not located by area but only by country).
2017-03-02 11:22 GMT+01:00 srivas sinnathurai :

> I think there is a telephone area code, throughout the world.
>
> On 01 March 2017 at 21:37 Richard Wordingham wrote:
>
> On Wed, 1 Mar 2017 12:56:23 -0800
> Jean Aurambault wrote:
>
> > I'm wondering if there is any standard that defines a universal city
> > id (similar to country codes).
>
> ISO 3166-2 defines codes for some cities, but its uneven. However,
> what's a city? Does Constantinople exist?
>
> Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From sisrivas at blueyonder.co.uk Thu Mar 2 09:19:41 2017
From: sisrivas at blueyonder.co.uk (srivas sinnathurai)
Date: Thu, 2 Mar 2017 15:19:41 +0000 (GMT)
Subject: Translations of city names
In-Reply-To:
References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net>
Message-ID: <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net>

Skype for Business, and others, cover (free global phone!!) for accounts based on area codes.

Microsoft might have a list of this; it apparently adheres to a global standard.

Yes, there are single nationwide plans also available, in addition to area plans.

Sinnathurai

> On 02 March 2017 at 11:20 Philippe Verdy wrote:
>
> Wrong, many countries have largely relaxed their phone number plans by
> using a single nation wide plan and allowed portability of numbers. Area codes
> are no longer needed (single call rate nation wide, the rate only depends on
> operators; and ranges of numbers are allocated also nationwide for value added
> services; long distance calls are things of the past since the very large
> adoption of mobile phones, also not located by area but only by country).
>
> 2017-03-02 11:22 GMT+01:00 srivas sinnathurai <sisrivas at blueyonder.co.uk>:
>
> > I think there is a telephone area code, throughout the world.
> > On 01 March 2017 at 21:37 Richard Wordingham wrote:
> >
> > On Wed, 1 Mar 2017 12:56:23 -0800
> > Jean Aurambault <jean.aurambault at gmail.com> wrote:
> >
> > > I'm wondering if there is any standard that defines a universal city
> > > id (similar to country codes).
> >
> > ISO 3166-2 defines codes for some cities, but its uneven. However,
> > what's a city? Does Constantinople exist?
> >
> > Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at macchiato.com Thu Mar 2 09:20:18 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 2 Mar 2017 16:20:18 +0100
Subject: Northern Khmer on iPhone
In-Reply-To: <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com>
References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com>
Message-ID:

On Thu, Mar 2, 2017 at 10:06 AM, Norbert Lindenberg <unicode at lindenbergsoftware.com> wrote:

> http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html

Thanks for writing that, Norbert. Sounds a tad painful.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From tom at bluesky.org Thu Mar 2 10:01:00 2017
From: tom at bluesky.org (Tom Gewecke)
Date: Thu, 2 Mar 2017 09:01:00 -0700
Subject: Northern Khmer on iPhone
In-Reply-To:
References: <20170228073710.75af64d4@JRWUBU2> <20170228210056.6e56fcf9@JRWUBU2> <7EDE490E-4FB7-4035-A97F-FC88859C7C04@alastairs-place.net> <5381BF62-FB9D-4F71-A854-700000E53F38@lindenbergsoftware.com>
Message-ID: <8F75DFA8-D34D-4D05-92B3-1C40AB0CB175@bluesky.org>

> On Mar 2, 2017, at 8:20 AM, Mark Davis ☕️
wrote:
>
> On Thu, Mar 2, 2017 at 10:06 AM, Norbert Lindenberg wrote:
> http://norbertlindenberg.com/2015/06/installing-fonts-on-ios/index.html
>
> Thanks for writing that, Norbert. Sounds a tad painful.

From the standpoint of the ordinary user, adding fonts to iOS is pretty simple: since iOS 7 there are apps that let you do it for anything you can download or get via email. Of course that is no guarantee that a particular font will work perfectly, and there's also no way to get a downloaded font to substitute for the iOS default font in the many apps where the user is not given any way to choose fonts.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From frederic.grosshans at gmail.com Thu Mar 2 10:22:57 2017
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Thu, 2 Mar 2017 17:22:57 +0100
Subject: Translations of city names
In-Reply-To:
References:
Message-ID: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>

It looks like the community with the expertise to answer such a question is the GIS (Geographic Information Systems) community, more than the Unicode community. Have you tried asking a question on http://gis.stackexchange.com/ ?

Frédéric

Le 01/03/2017 à 21:56, Jean Aurambault a écrit :
> Hi,
>
> I'm looking for (lightweight) libraries to translate city names,
> potentially country as well (but I know that's available in CLDR/ICU
> in some ways). Ideally it wouldn't need a database but rely on static
> assets.
>
> I'm wondering if there is any standard that defines a universal city
> id (similar to country codes).
>
> Wikipedia has lots of information on exonyms in different languages.
>
> I also found thing like http://www.geonames.org/ that seems to have a
> complete dataset with translations in many language but the relevant
> data would need to be extracted.
>
> Right now we use a old version of Maxmind library to get geolocated
> data that has no translation.
Looks like new version have some
> translation but not enough language supported
>
> Any recommandation?
>
> Best,
> Jean

From mheijdra at princeton.edu Thu Mar 2 10:29:17 2017
From: mheijdra at princeton.edu (Martin Heijdra)
Date: Thu, 2 Mar 2017 16:29:17 +0000
Subject: Translations of city names
In-Reply-To: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
Message-ID: <0001012FBBD4FE40857959B0B65DE95B821103B5@CSGMBX212W.pu.win.princeton.edu>

Libraries in the US are required to follow the BGN: https://geonames.usgs.gov/.

Martin Heijdra

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Frédéric Grosshans
Sent: Thursday, March 02, 2017 11:23 AM
To: unicode at unicode.org
Subject: Re: Translations of city names

It looks like the community having the expertise to answer such question is the GIS (Geography Information Systems) community, more than the Unicode community. Have you tried asking a question on http://gis.stackexchange.com/ ?

Frédéric

Le 01/03/2017 à 21:56, Jean Aurambault a écrit :
> Hi,
>
> I'm looking for (lightweight) libraries to translate city names,
> potentially country as well (but I know that's available in CLDR/ICU
> in some ways). Ideally it wouldn't need a database but rely on static
> assets.
>
> I'm wondering if there is any standard that defines a universal city
> id (similar to country codes).
>
> Wikipedia has lots of information on exonyms in different languages.
>
> I also found thing like http://www.geonames.org/ that seems to have a
> complete dataset with translations in many language but the relevant
> data would need to be extracted.
>
> Right now we use a old version of Maxmind library to get geolocated
> data that has no translation. Looks like new version have some
> translation but not enough language supported
>
> Any recommandation?
> > Best,
> > Jean

From kenwhistler at att.net Thu Mar 2 11:47:22 2017
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 2 Mar 2017 09:47:22 -0800
Subject: Translations of city names
In-Reply-To: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
Message-ID:

The UN Group of Experts on Geographical Names (UNGEGN) is also relevant:
https://unstats.un.org/unsd/geoinfo/ungegn/default.html

They keep up a list of searchable geographical names databases in a wide variety of languages:
https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html

--Ken

On 3/2/2017 8:22 AM, Frédéric Grosshans wrote:
> It looks like the community having the expertise to answer such
> question is the GIS (Geography Information Systems) community, more
> than the Unicode community. Have you tried asking a question on
> http://gis.stackexchange.com/ ?

From doug at ewellic.org Thu Mar 2 13:31:45 2017
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 02 Mar 2017 12:31:45 -0700
Subject: Translations of city names
Message-ID: <20170302123145.665a7a7059d7ee80bb4d670165c8327d.7f4470c2bc.wbe@email03.godaddy.com>

Some clarifications...

ISO 3166-2 defines code elements for (normally) first-level country subdivisions (states, provinces, regions, districts, etc.), but these almost never correlate in general to cities. In some countries, the name of a subdivision may be the same as that of its capital or another city, but that leaves out all the other cities within that subdivision, and in any case this convention very seldom applies to Northern America.

Telephone area codes are not relevant in this regard, because they also may not correlate to cities per se, so again the desired granularity is not available. Area codes in Northern America may apply to an entire state or province, hundreds of thousands of square kilometers in size. (Number portability and calling plans are even less relevant to this.)
In addition to the other standards given, there is UN/LOCODE [1], which provides code elements for "trade and transport locations," which may or may not correlate to "cities" depending on your needs.

[1] http://www.unece.org/cefact/locode/welcome.html

--
Doug Ewell | Thornton, CO, US | ewellic.org

From jr at qsm.co.il Thu Mar 2 14:05:59 2017
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Thu, 2 Mar 2017 20:05:59 +0000
Subject: Translations of city names
In-Reply-To:
References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
Message-ID:

FWIW, I looked up Copenhagen in the Danish list and received "Søgning gav ikke noget resultat" which means literally "Search produced no result" (I do know that in Danish it is København).

Best Regards,
Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler
Sent: Thursday, March 02, 2017 7:47 PM
To: Frédéric Grosshans
Cc: unicode at unicode.org
Subject: Re: Translations of city names

The UN Group of Experts on Geographical Names (UNGEGN) is also relevant:
https://unstats.un.org/unsd/geoinfo/ungegn/default.html

They keep up a list of searchable geographical names databases in a wide variety of languages:
https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html

--Ken

On 3/2/2017 8:22 AM, Frédéric Grosshans wrote:
> It looks like the community having the expertise to answer such
> question is the GIS (Geography Information Systems) community, more
> than the Unicode community. Have you tried asking a question on
> http://gis.stackexchange.com/ ?

From jr at qsm.co.il Thu Mar 2 14:12:21 2017
From: jr at qsm.co.il (Jonathan Rosenne)
Date: Thu, 2 Mar 2017 20:12:21 +0000
Subject: Translations of city names
References: <52c2fa1d-1394-d0ab-bca5-479debdc154e@gmail.com>
Message-ID:

P.S. The US database does a good job on Copenhagen.
Best Regards,
Jonathan Rosenne

-----Original Message-----
From: Jonathan Rosenne
Sent: Thursday, March 02, 2017 10:06 PM
To: 'Ken Whistler'; Frédéric Grosshans
Cc: unicode at unicode.org; 'navneforskning at hum.ku.dk'
Subject: RE: Translations of city names

FWIW, I looked up Copenhagen in the Danish list and received "Søgning gav ikke noget resultat" which means literally "Search produced no result" (I do know that in Danish it is København).

Best Regards,
Jonathan Rosenne

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler
Sent: Thursday, March 02, 2017 7:47 PM
To: Frédéric Grosshans
Cc: unicode at unicode.org
Subject: Re: Translations of city names

The UN Group of Experts on Geographical Names (UNGEGN) is also relevant:
https://unstats.un.org/unsd/geoinfo/ungegn/default.html

They keep up a list of searchable geographical names databases in a wide variety of languages:
https://unstats.un.org/unsd/geoinfo/ungegn/geonames.html

--Ken

On 3/2/2017 8:22 AM, Frédéric Grosshans wrote:
> It looks like the community having the expertise to answer such
> question is the GIS (Geography Information Systems) community, more
> than the Unicode community. Have you tried asking a question on
> http://gis.stackexchange.com/ ?

From verdy_p at wanadoo.fr Fri Mar 3 07:01:10 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 3 Mar 2017 14:01:10 +0100
Subject: Translations of city names
In-Reply-To: <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net>
References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net>
Message-ID:

At least in the European Union, portability of numbers is open to all customers. And almost everywhere, local call rates are disappearing for all operators, moving toward a single national rate.
What replaces local call rates are different rates depending on the source and target operators or the kind of service (fixed line or mobile), rather than the actual location of callers and callees.

Garanti sans virus. www.avast.com

2017-03-02 16:19 GMT+01:00 srivas sinnathurai :

> Skype for Business,and others cover (free global phone!!) for accounts
> based on area codes.
>
> Microsoft might have a list of this apparently adheres to a global
> standard.
>
> Yes, there is single nationwide plans also available, as addition to area
> plans.
>
> Sinnathurai
>
> On 02 March 2017 at 11:20 Philippe Verdy wrote:
>
> Wrong, many countries have largely relaxed their phone number plans by
> using a single nation wide plan and allowed portability of numbers. Area
> codes are no longer needed (single call rate nation wide, the rate only
> depends on operators; and ranges of numbers are allocated also nationwide
> for value added services; long distance calls are things of the past since
> the very large adoption of mobile phones, also not located by area but only
> by country).
>
> 2017-03-02 11:22 GMT+01:00 srivas sinnathurai :
>
> > I think there is a telephone area code, throughout the world.
> >
> > On 01 March 2017 at 21:37 Richard Wordingham
> > wrote:
> >
> > On Wed, 1 Mar 2017 12:56:23 -0800
> > Jean Aurambault wrote:
> >
> > > I'm wondering if there is any standard that defines a universal city
> > > id (similar to country codes).
> >
> > ISO 3166-2 defines codes for some cities, but its uneven. However,
> > what's a city? Does Constantinople exist?
> >
> > Richard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From simon at simon-cozens.org Mon Mar 6 16:48:45 2017
From: simon at simon-cozens.org (Simon Cozens)
Date: Tue, 7 Mar 2017 09:48:45 +1100
Subject: Stokoe Notation (sign language)
Message-ID:

Hello,
A few years back, there was a set of questions to the UTC (L2/12-133) asking for direction on encoding Stokoe notation. Did these ever get an answer, and is there anything currently happening with Stokoe encoding?

Simon

From verdy_p at wanadoo.fr Mon Mar 6 19:59:28 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 7 Mar 2017 02:59:28 +0100
Subject: Stokoe Notation (sign language)
In-Reply-To:
References:
Message-ID:

And probably the same question could be asked again for the few other sign language notations (at least those listed in Wikipedia), but I wonder if some of them may just be variants/simplifications of SignWriting that are more usable in handwritten text, or that do not need complex layouts for precise reproduction of gestures (in a way similar to alphabets for spoken languages, which greatly simplify the actual phonetic representation, or even the phonemic one).

It seems that those simplified alphabet-like notations are much easier to encode than the long-awaited complex SignWriting notation. In addition, they could already use existing font techniques without complex development (some of them already have working fonts, usable on various systems, so they should already be interoperable).

2017-03-06 23:48 GMT+01:00 Simon Cozens :

> Hello,
> A few years back, there was a set of questions to the UTC
> (L2/12-133)
> asking for direction on encoding Stokoe notation. Did these ever get an
> answer, and is there anything currently happening with Stokoe encoding?
>
> Simon
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From c933103 at gmail.com Mon Mar 6 22:15:42 2017
From: c933103 at gmail.com (gfb hjjhjh)
Date: Tue, 7 Mar 2017 12:15:42 +0800
Subject: Stokoe Notation (sign language)
In-Reply-To:
References:
Message-ID:

According to Wikipedia, that's exactly what the Stokoe notation is. Quoted below:

The Stokoe notation is mostly restricted to linguists and academics. The notation is arranged linearly on the page and can be written with a typewriter that has the proper font installed. Unlike SignWriting or the Hamburg Notation System, it is based on the Latin alphabet and is phonemic, being restricted to the symbols needed to meet the requirements of ASL (or extended to BSL, etc.) rather than accommodating all possible signs. For example, there is a single symbol for circling movement, regardless of whether the plane of the movement is horizontal or vertical.

*Writing direction*
Stokoe notation is written horizontally left to right like the Latin alphabet (plus limited vertical stacking of movement symbols, and some diacritical marks written above or below other symbols). This contrasts with SignWriting, which is written vertically from top to bottom (plus partially free two-dimensional placement of components within the writing of a single sign).

On 7 Mar 2017 at 10:05, "Philippe Verdy" wrote:

> And probably the same question could be asked again for the few other sign
> languages notations (at least those listed in Wikipedia), but I wonder if
> some of them may just be variants/simplifications of SingWriting, but more
> usable in handwritten text, or not needing complax layouts for precise
> reproduction of gesture (in a way similar to alphabets for spoken languages
> that simplify a lot the actual phonetic representation, or even the
> phonemic one).
>
> It seems that those simplified alphabet-like notations are much easier to
> encode, than the long waited complex SignWriting notation.
In addition they > could already use existing font technics without complex development (and > already some of them already have working fonts, usable on vaerious > systems, so they should already become interoperable). > > > 2017-03-06 23:48 GMT+01:00 Simon Cozens : > >> Hello, >> A few years back, there was a set of questions to the UTC >> (L2/12-133) >> asking for direction on encoding Stokoe notation. Did these ever get an >> answer, and is there anything currently happening with Stokoe encoding? >> >> Simon >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Tue Mar 7 11:04:31 2017 From: kenwhistler at att.net (Ken Whistler) Date: Tue, 7 Mar 2017 09:04:31 -0800 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. 
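To make the scale of that question concrete: even a toy markup for Stokoe's three sign parameters - tab (location), dez (handshape), sig (movement) - already needs structured, per-sign attributes. The fragment below is purely hypothetical; every element and attribute name is invented for illustration, and it is not a proposed specification:

```python
import xml.etree.ElementTree as ET

# Purely hypothetical "Stokoe-ML" fragment: one sign decomposed into
# Stokoe's three parameters. All element and attribute names here are
# invented for illustration only.
SIGN = """\
<sign gloss="he">
  <tab value="neutral-space"/>
  <dez handshape="G"/>
  <sig movement="away" plane="horizontal"/>
</sign>
"""

sign = ET.fromstring(SIGN)
# Collect each parameter element's attributes into a plain dict.
params = {child.tag: dict(child.attrib) for child in sign}
print(sign.get("gloss"))           # he
print(params["dez"]["handshape"])  # G
print(params["sig"]["plane"])      # horizontal
```

Flattening even this small amount of nested structure into a linear plain-text encoding is exactly the "flattening" step the question refers to.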
--Ken From lorna_evans at sil.org Tue Mar 7 11:46:41 2017 From: lorna_evans at sil.org (Lorna Evans) Date: Tue, 7 Mar 2017 11:46:41 -0600 Subject: Stokoe Notation (sign language) In-Reply-To: References: Message-ID: Hi Simon, I did a lot of research on Stokoe Notation between 2010-2012. It has primarily been used in dictionaries. When I presented that document (L2/12-133) to UTC, these are the summary notes I took: We only had about 10-15 minutes to discuss Stokoe Notation in the main UTC but they decided to have an ad-hoc meeting at lunch so we had about 40 minutes over lunch on Stokoe. There was no support for encoding it as a script. They feel it "should not be encoded as a script any more than math or music is a script. " It "doesn't make sense to do it as plain text...would be a serious mistake". So, they want it to use all the existing Latin characters and symbols in the standard and just encode new characters as symbols and then use a higher level protocol for the shaping. All the fancy layout can be expressed in MathML. I think I still disagree (I think it should be encoded as a writing system), but Stokoe isn't high on my list of priorities and I haven't had a chance to do further research. Major dictionaries that have been produced using Stokoe Notation are for British Sign Language (BSL), American Sign Language (ASL), Hong Kong Sign Language (HKSL), Signed Swedish (SS), Italian Sign Language (LIS) and Czech Sign Language. I also reviewed books or documents discussing Dutch Sign Language (DSE) and Australian Aboriginal (ASL). Each and every one of these had differing levels of rendering requirements (from very minor all the way to the need for control codes for positioning) because they had "enhanced" the original ASL Stokoe Notation. I feel it's complex enough that it should definitely have further research done on it. I just don't have the time to put on it at this point. (The L2/12-133 document didn't include a review of Czech Sign Language. 
When I did get a copy of that dictionary I discovered even more enhancements. I never documented those.) Lorna -------- Original Message -------- Subject: Stokoe Notation (sign language) From: Simon Cozens To: unicode Unicode Discussion CC: lorna_evans at sil.org Date: 3/6/2017 4:48 PM > Hello, > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > > Simon From jean.aurambault at gmail.com Tue Mar 7 20:40:13 2017 From: jean.aurambault at gmail.com (Jean Aurambault) Date: Tue, 7 Mar 2017 18:40:13 -0800 Subject: Translations of city names In-Reply-To: References: <20170301213707.733696d4@JRWUBU2> <1886288822.2622752.1488450120241.JavaMail.open-xchange@oxbe5.tb.ukmail.iss.as9143.net> <1805883977.2652251.1488467981135.JavaMail.open-xchange@oxbe15.tb.ukmail.iss.as9143.net> Message-ID: thank you all for your input! Jean On Fri, Mar 3, 2017 at 5:01 AM, Philippe Verdy wrote: > At least in the European Union, portability of numbers is open to every > customers. And almost everywhere local call rates are disappearing for all > operators, going to a situation with a single national rate. > What replaces the local call rates is different rates depending on source > and target operators or the kind of service (fixed line or mobile) rather > than the actual location of callers and callees. > > > Garanti > sans virus. www.avast.com > > <#m_1585433171908523202_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> > > 2017-03-02 16:19 GMT+01:00 srivas sinnathurai : > >> Skype for Business,and others cover (free global phone!!) for accounts >> based on area codes. >> >> Microsoft might have a list of this apparently adheres to a global >> standard. >> >> >> Yes, there is single nationwide plans also available, as addition to area >> plans. 
>> >> >> Sinnathurai >> >> >> >> On 02 March 2017 at 11:20 Philippe Verdy wrote: >> >> Wrong, many countries have largely relaxed their phone number plans by >> using a single nation wide plan and allowed portability of numbers. Area >> codes are no longer needed (single call rate nation wide, the rate only >> depends on operators; and ranges of numbers are allocated also nationwide >> for value added services; long distance calls are things of the past since >> the very large adoption of mobile phones, also not located by area but only >> by country). >> >> 2017-03-02 11:22 GMT+01:00 srivas sinnathurai > >: >> >> I think there is a telephone area code, throughout the world. >> >> >> On 01 March 2017 at 21:37 Richard Wordingham < >> richard.wordingham at ntlworld.com> wrote: >> >> >> On Wed, 1 Mar 2017 12:56:23 -0800 >> Jean Aurambault wrote: >> >> > I'm wondering if there is any standard that defines a universal city >> > id (similar to country codes). >> >> ISO 3166-2 defines codes for some cities, but its uneven. However, >> what's a city? Does Constantinople exist? >> >> Richard. >> >> >> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Mar 8 09:45:05 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 8 Mar 2017 15:45:05 +0000 (GMT) Subject: Stokoe Notation (sign language) In-Reply-To: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> References: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> Message-ID: <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> Ken Whistler asked: > And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Well, I am not quite congruently in that category, but not far off, so I will answer the question anyway. 
> Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? Yes, I would. It seems a very worthwhile project. I am not a linguist, though I am interested in linguistics. I have very little knowledge of sign language. I do not remember knowing of Stokoe Notation before reading this thread. What interests me about this project and where I feel that I could make a contribution to a group effort is that Ken included the following. > .... figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Now that interests me and is the sort of problem that I enjoy trying to solve. Some time ago there was discussion of encoding Ancient Egyptian and I devised an idea for solving the advanced issues of that encoding. At first glance, the encoding of Stokoe Notation seems to have some similarities to what is needed regarding the encoding of the advanced glyph layout of Ancient Egyptian. I published my ideas, in fact including them as a chapter in my novel. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_009.pdf I used the technique of including the idea in a chapter of the novel as it allows a dialogue of discussion about the ideas. The document has been deposited at the British Library. Today Unicode has tag sequences available as a technique and it might be that by using the ideas in Chapter 9 of my novel, in particular of having a Glyph as a type in the object code of a virtual computer so that glyphs could be scaled, moved and added together, that the implementation would be fairly straightforward by using short pieces of software each expressed as a tag sequence to produce a result. Thus implementing the spatial layout of the system by software in a virtual computer rather than by a sort of hardwired encoding. 
Ken also wrote: > Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. Well maybe the implementation of that complexity might make a good student project or a good student group project somewhere. I opine that progress is important. William Overington Wednesday 8 March 2017 ----Original message---- >From : kenwhistler at att.net Date : 07/03/2017 - 17:04 (GMTST) To : simon at simon-cozens.org Cc : unicode at unicode.org Subject : Re: Stokoe Notation (sign language) On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get an > answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. 
--Ken From petercon at microsoft.com Thu Mar 9 10:49:36 2017 From: petercon at microsoft.com (Peter Constable) Date: Thu, 9 Mar 2017 16:49:36 +0000 Subject: Stokoe Notation (sign language) In-Reply-To: <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> References: <0135d0ac-0b9d-bd37-4dee-277f1f90a447@att.net> <22737973.45429.1488987905408.JavaMail.defaultUser@defaultHost> Message-ID: I opine that opining an opinion is periphrastic, circumlocutious, consumptive, wasteful spending of one's own and others' resources. Just say it. "Progress is important." Thank you for that most insightful of generalizations. /S Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of William_J_G Overington Sent: Wednesday, March 8, 2017 7:45 AM To: c933103 at gmail.com; kenwhistler at att.net; verdy_p at wanadoo.fr; simon at simon-cozens.org; lorna_evans at sil.org; unicode at unicode.org Subject: Re: Stokoe Notation (sign language) Ken Whistler asked: > And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Well, I am not quite congruently in that category, but not far off, so I will answer the question anyway. > Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? Yes, I would. It seems a very worthwhile project. I am not a linguist, though I am interested linguistics. I have very little knowledge of sign language. I do not remember knowing of Stokoe Notation before reading this thread. What interests me about this project and where I feel that I could make a contribution to a group effort is that Ken included the following. > .... figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? 
Now that interests me and is the sort of problem that I enjoy trying to solve. Some time ago there was discussion of encoding Ancient Egyptian and I devised an idea for solving the advanced issues of that encoding. At first glance, the encoding of Stokoe Notation seems to have some similarities to what is needed regarding the encoding of the advanced glyph layout of Ancient Egyptian. I published my ideas, in fact including them as a chapter in my novel. http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_009.pdf I used the technique of including the idea in a chapter of the novel as it allows a dialogue of discussion about the ideas. The document has been deposited at the British Library. Today Unicode has tag sequences available as a technique and it might be that by using the ideas in Chapter 9 of my novel, in particular of having a Glyph as a type in the object code of a virtual computer so that glyphs could be scaled, moved and added together, that the implementation would be fairly straightforward by using short pieces of software each expressed as a tag sequence to produce a result. Thus implementing the spatial layout of the system by software in a virtual computer rather than by a sort of hardwired encoding. Ken also wrote: > Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. Well maybe the implementation of that complexity might make a good student project or a good student group project somewhere. I opine that progress is important. 
William Overington Wednesday 8 March 2017 ----Original message---- >From : kenwhistler at att.net Date : 07/03/2017 - 17:04 (GMTST) To : simon at simon-cozens.org Cc : unicode at unicode.org Subject : Re: Stokoe Notation (sign language) On 3/6/2017 2:48 PM, Simon Cozens wrote: > A few years back, there was a set of questions to the UTC (L2/12-133) > asking for direction on encoding Stokoe notation. Did these ever get > an answer, and is there anything currently happening with Stokoe encoding? > The short answer is no. Stokoe notation has a bunch of features that make it a very low priority for UTC attention. And for those who never saw a systematic collection of marks on paper that they didn't think deserved immediate encoding in the Unicode Standard, riddle me this: Would anyone be willing to put in the effort to define a formal markup language (ML) specification that would accurately cover all aspects of the notation and its use? If not, why would you expect the UTC to devote time to figuring out how to "flatten" all that markup complexity and create a text model and plain text encoding for the same notation? Particularly if there is very little indication that implementers of generic rendering systems have the interest, time, or resources to then add that complexity to their text renderers. --Ken From petercon at microsoft.com Thu Mar 9 10:56:42 2017 From: petercon at microsoft.com (Peter Constable) Date: Thu, 9 Mar 2017 16:56:42 +0000 Subject: Northern Khmer on iPhone In-Reply-To: <20170228073710.75af64d4@JRWUBU2> References: <20170228073710.75af64d4@JRWUBU2> Message-ID: Too bad more people didn't use Windows Phones, as your word displays as expected on mine. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Richard Wordingham Sent: Monday, February 27, 2017 11:37 PM To: unicode at unicode.org Subject: Northern Khmer on iPhone Does iPhone support the use of Northern Khmer in Thai script? 
I would count an interface in Thai as support. The reason I ask is that I tried entering the word ??? 'he' and got a dotted circle. I also got a dotted circle for the alternative spelling . This might be an application issue. The application I was using was Line. Richard. From petercon at microsoft.com Fri Mar 10 11:00:55 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 10 Mar 2017 17:00:55 +0000 Subject: "A Programmer's Introduction to Unicode" Message-ID: FYI: http://reedbeta.com/blog/programmers-intro-to-unicode/ The visuals may be the most interesting part. E.g., in the usage heat map, Arabic Presentation Forms-B lights up much more than I would have expected - as much as a lot of emoji. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From khaledhosny at eglug.org Fri Mar 10 11:53:27 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Fri, 10 Mar 2017 19:53:27 +0200 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: Message-ID: <20170310175234.GA8291@macbook> On Fri, Mar 10, 2017 at 05:00:55PM +0000, Peter Constable wrote: > FYI: > > http://reedbeta.com/blog/programmers-intro-to-unicode/ > > The visuals may be the most interesting part. E.g., in the usage heat > map, Arabic Presentation Forms-B lights up much more than I would have > expected I often see U+FEFB and other lam-alef ligatures used on social media (I easily spot it because my default font does not have them so they end up using fallback font). My guess is that might be because some keyboard layouts (Xorg, Android?) use them for the lam-alef keys on the keyboard (I'm guilty of doing this for Xorg keyboard layout because it didn't handle more than one character per key, this was then decomposed back inside XIM input method, but many people don't use XIM and the decomposition does not happen, it was messy overall). 
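The round trip Khaled describes (a presentation-form ligature standing in for the ordinary letters) is visible in the compatibility decompositions carried by the Unicode Character Database; a minimal Python sketch, for illustration only:

```python
import unicodedata

lam_alef = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM
print(unicodedata.name(lam_alef))

# NFC leaves the presentation form alone (its decomposition is only a
# compatibility mapping); NFKC folds it back to LAM + ALEF.
print(unicodedata.normalize("NFC", lam_alef) == lam_alef)
nfkc = unicodedata.normalize("NFKC", lam_alef)
print([f"U+{ord(c):04X}" for c in nfkc])  # ['U+0644', 'U+0627']
```

So text entered via such a keyboard layout survives NFC untouched, which is why the ligature code point can end up verbatim in social-media posts.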
Regards, Khaled From manish at mozilla.com Fri Mar 10 12:55:44 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Fri, 10 Mar 2017 10:55:44 -0800 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: Message-ID: I recently wrote http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ , which sort of addresses the whole hangup programmers have with treating code points as "characters". I also wrote http://manishearth.github.io/blog/2017/01/15/breaking-our-latin-1-assumptions/ that provides a useful list of scripts to check against when figuring out if your design makes sense uniformly across scripts. There's also https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ -Manish On Fri, Mar 10, 2017 at 9:00 AM, Peter Constable wrote: > FYI: > > > > http://reedbeta.com/blog/programmers-intro-to-unicode/ > > > > The visuals may be the most interesting part. E.g., in the usage heat map, > Arabic Presentation Forms-B lights up much more than I would have expected ? > as much as a lot of emoji. > > > > > > > > Peter From jsbien at mimuw.edu.pl Sun Mar 12 00:04:56 2017 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Sun, 12 Mar 2017 07:04:56 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: (Manish Goregaokar's message of "Fri, 10 Mar 2017 10:55:44 -0800") References: Message-ID: <864lyzhvxz.fsf@mimuw.edu.pl> On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes: > I recently wrote > http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ > , which sort of addresses the whole hangup programmers have with > treating code points as "characters". [...] This is just another confirmation that the present Unicode terminology is confusing. Let me remind below a fragment of an old thread about "textels". 
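The "code points as characters" hangup is easy to demonstrate with canonical equivalence; a minimal Python sketch (illustrative, using only the stdlib):

```python
import unicodedata

single = "\u00e9"      # 'é' as one precomposed code point (U+00E9)
combined = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(single), len(combined))  # 1 2 -- one user-perceived character either way

# Canonically equivalent: equal after normalization, which is roughly
# what Swift's Character comparison does, but unequal as raw code points.
print(unicodedata.normalize("NFC", combined) == single)  # True
print(single == combined)                                # False
```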
Best regards Janusz On Thu, Sep 15 2016 at 21:12 CEST, jsbien at mimuw.edu.pl writes: > On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes: > > [...] > >> In the new Swift programming language, which is white-hot in the Apple >> community, Apple is moving toward a model of a transparent, generic >> Unicode that can be "viewed" as UTF-8, UTF-16, or UTF-32 if necessary, >> but in which a "character" contains however many code points it needs >> ("e" with a stacked macron, acute accent, and dieresis is >> algorithmically one "character" in Swift). Moreover, >> e-with-an-acute-accent and e followed by a combining acute accent, for >> example, compare as equal. At present, the underlying code is still >> UTF-16LE. > > For several years I use the name "textel" (text element, in Polish > "tekstel") for such objects. I do it mostly orally in my presentations > for my students, but I used it also in writing e.g. in > http://bc.klf.uw.edu.pl/118/, unfortunately without a proper > definition. A rudimentary definition was provided for me only in my > recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply > (on p. 69) "an elementary text element independently of its Unicode > representation" (meaning in particular composed vs precomposed). I still > hope to formulate sooner or later a more satisfactory definition :-) > > I think Swift confirms that such a notion is really needed. > > Best regards > > Janusz On Wed, Sep 21 2016 at 6:44 CEST, jsbien at mimuw.edu.pl writes: > On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes: >> Janusz Bień wrote: >> >>> For me it means that Swift's characters are equivalence classes of the >>> set of extended grapheme clusters by canonical equivalence relation. >> >> I still hope we can come to some conclusion on the correct Unicode name >> for this concept. I don't think non-Unicode interpretations of terms >> like "grapheme" are grounds for throwing out "grapheme cluster," > > I agree. 
> >> but I can see that the equivalence class itself is lacking a name. > > I'm glad. > >> >> Note that the Swift definition doesn't say that <00E9> and <0065 0301> >> are identical entities, only that the language compares them as equal. > > I'm fully aware of this. > > Best regards > > Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From manish at mozilla.com Sun Mar 12 13:43:22 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Sun, 12 Mar 2017 11:43:22 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <864lyzhvxz.fsf@mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> Message-ID: > This is just another confirmation that the present Unicode terminology is confusing. I find this to be a symptom of our pedagogy around "characters" in programming; most folks get taught that characters are bytes are code points, especially because many languages try to make this the case. The name "grapheme cluster" could be improved upon, but it's not the primary source of this confusion. -Manish On Sat, Mar 11, 2017 at 10:04 PM, Janusz S. Bień wrote: > On Fri, Mar 10 2017 at 19:55 CET, manish at mozilla.com writes: >> I recently wrote >> http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ >> , which sort of addresses the whole hangup programmers have with >> treating code points as "characters". > > [...] 
>> >>> In the new Swift programming language, which is white-hot in the Apple >>> community, Apple is moving toward a model of a transparent, generic >>> Unicode that can be ?viewed? as UTF-8, UTF-16, or UTF-32 if necessary, >>> but in which a ?character? contains however many code points it needs >>> (?e? with a stacked macron, acute accent, and dieresis is >>> algorithmically one ?character? in Swift). Moreover, >>> e-with-an-acute-accent and e followed by a combining acute accent, for >>> example, compare as equal. At present, the underlying code is still >>> UTF-16LE. >> >> For several years I use the name "textel" (text element, in Polish >> "tekstel") for such objects. I do it mostly orally in my presentations >> for my students, but I used it also in writing e.g. in >> http://bc.klf.uw.edu.pl/118/, unfortunately without a proper >> definition. A rudymentary definition was provided for me only in my >> recent paper in Polish: http://bc.klf.uw.edu.pl/480/. It states simply >> (on p. 69) "an elementary text element independently of its Unicode >> representation" (meaning in particular composed vs precomposed). I still >> hope to formulate sooner or later a more satisfactory definition :-) >> >> I think Swift confirms that such a notion is really needed. >> >> Best regards >> >> Janusz > > On Wed, Sep 21 2016 at 6:44 CEST, jsbien at mimuw.edu.pl writes: >> On Tue, Sep 20 2016 at 18:09 CEST, doug at ewellic.org writes: >>> Janusz Bie? wrote: >>> >>>> For me it means that Swift's characters are equivalence classes of the >>>> set of extended grapheme clusters by canonical equivalence relation. >>> >>> I still hope we can come to some conclusion on the correct Unicode name >>> for this concept. I don't think non-Unicode interpretations of terms >>> like "grapheme" are grounds for throwing out "grapheme cluster," >> >> I agree. >> >>> but I can see that the equivalence class itself is lacking a name. >> >> I'glad. 
>> >>> >>> Note that the Swift definition doesn't say that <00E9> and <0065 0301> >>> are identical entities, only that the language compares them as equal. >> >> I'm fully aware of this. >> >> Best regards >> >> Janusz > > > -- > , > Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) > Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > From jsbien at mimuw.edu.pl Sun Mar 12 14:02:28 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sun, 12 Mar 2017 20:02:28 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> Message-ID: <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> Quote/Cytat - Manish Goregaokar (Sun 12 Mar 2017 07:43:22 PM CET): >> This is just another confirmation that the present Unicode terminology > is confusing. > > I find this to be a symptom of our pedagogy around "characters" in > programming; most folks get taught that characters are bytes are code > points, especially because many languages try to make this the case. > The name "grapheme cluster" could be improved upon, but it's not the > primary source of this confusion. I agree that it's not the primary source. However the pedagogy depends on the terminology used. If the basic notion has to be referred in a cumbersome way as "extended grapheme cluster" then it is easier to talk about "Unicode characters" despite the fact that they have a rather loose relation to real-life/user-perceived characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From richard.wordingham at ntlworld.com Sun Mar 12 15:10:22 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 12 Mar 2017 20:10:22 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> Message-ID: <20170312201022.7ec8d858@JRWUBU2> On Sun, 12 Mar 2017 20:02:28 +0100 "Janusz S. Bien" wrote: > If the basic notion has to be referred in a cumbersome way as > "extended grapheme cluster" then it is easier to talk about "Unicode > characters" despite the fact that they have a rather loose relation > to real-life/user-perceived characters. The notion that extended grapheme clusters correspond to user-perceived characters is also rather dodgy. Whereas it may work for French, it is getting very dubious by the time one adds Hebrew cantillation marks or Vedic accentuation. The Thais revolted when their preposed vowels were joined with the following consonant in the same extended grapheme cluster, and Unicode had to revoke that union. Richard. From jsbien at mimuw.edu.pl Mon Mar 13 05:31:28 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 11:31:28 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170312201022.7ec8d858@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> Message-ID: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Quote/Cytat - Richard Wordingham (Sun 12 Mar 2017 09:10:22 PM CET): > On Sun, 12 Mar 2017 20:02:28 +0100 > "Janusz S. 
Bien" wrote: > >> If the basic notion has to be referred in a cumbersome way as >> "extended grapheme cluster" then it is easier to talk about "Unicode >> characters" despite the fact that they have a rather loose relation >> to real-life/user-perceived characters. > > The notion that extended grapheme clusters corresponds to > user-perceived characters is also rather dodgy. The idea is not mine, but it appears from time to time on the list in a more or less explicit way. > Whereas it may work > for French, it is getting very dubious by the time one adds Hebrew > cantillation marks or Vedic accentuation. The Thais revolted when > their preposed vowels were joined with the following consonant in the > same extended grapheme cluster, and Unicode had to revoke that union. Just yet another reason for introducing the notion of textel? Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From jsbien at mimuw.edu.pl Mon Mar 13 06:35:01 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 12:35:01 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> Message-ID: <20170313123501.16162ll9e9d5bqhx@mail.mimuw.edu.pl> Quote/Cytat - William_J_G Overington (Mon 13 Mar 2017 12:24:13 PM CET): > Prof. Janusz S. Bie? wrote: > >> Just yet another reason for introducing the notion of textel? 
> > I opine that it would be a good idea to introduce several new words, > of which textel would be one, with each such new word having a > precisely-defined meaning so that in precise discussions of > programming techniques people could discuss the situation without > needing to use any of the words character, code point, grapheme > cluster. > > How many such new words would be needed? In my paper (in Polish) http://bc.klf.uw.edu.pl/480/ I propose also the term "texton" meaning a code point from a specific subset, not yet fully defined, but including at least the components of composite characters. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From wjgo_10009 at btinternet.com Mon Mar 13 06:24:13 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 13 Mar 2017 11:24:13 +0000 (GMT) Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Message-ID: <4234890.23006.1489404253666.JavaMail.defaultUser@defaultHost> Prof. Janusz S. Bie? wrote: > Just yet another reason for introducing the notion of textel? I opine that it would be a good idea to introduce several new words, of which textel would be one, with each such new word having a precisely-defined meaning so that in precise discussions of programming techniques people could discuss the situation without needing to use any of the words character, code point, grapheme cluster. How many such new words would be needed? I remember how in electronics the introduction of the term Hertz to be used instead of cycles per second helped discussions. 
After the introduction of the term Hertz it became easy to refer to twenty cycles of a fifty Hertz signal without confusion over one's meaning. So introducing several new precisely-defined words now could help lots of discussions in the future. Perhaps, apart from textel, the definitions could be produced first and then people can decide, for each such definition, which new word would be a good word to have that definition. The recent introduction into Unicode of ZWJ sequences for some emoji and the introduction into Unicode of tag sequences applied to a base character could mean that the introduction of such new words becomes of increasing importance due to the programming implications of those recently introduced techniques. William Overington Monday 13 March 2017 From asmusf at ix.netcom.com Mon Mar 13 12:00:08 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Mon, 13 Mar 2017 10:00:08 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> Message-ID: <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Mar 13 12:15:31 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 18:15:31 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> Message-ID: <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 06:00:08 PM CET): [...] 
These (or similar) scenarios indicate the impossibility of coming to a single, universal definition of a "textel" -- the main reason why this term is of lower utility than "pixel". I agree that it is impossible to come to a single, universal definition of text elements, but it seems possible to reach a consensus on a kind of the least common denominator of them and call it "textel" or something else. Best regards Janusz -- Prof. dr hab. Janusz S. Bień - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bień - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From d3ck0r at gmail.com Mon Mar 13 12:55:18 2017 From: d3ck0r at gmail.com (J Decker) Date: Mon, 13 Mar 2017 10:55:44 -0800 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: I liked the Go implementation of character type - a rune type - which is a codepoint - and strings that return runes by index. https://blog.golang.org/strings Doesn't solve the problem for composited codepoints though... texel looks to be defined as a graphic element already. TEXture ELement. On Mon, Mar 13, 2017 at 10:15 AM, Janusz S. Bien wrote: > Quote/Cytat - Asmus Freytag (Mon 13 Mar 2017 > 06:00:08 PM CET): > > [...] > > These (or similar) scenarios indicate the impossibility of coming to a > single, universal definition of a "textel" -- the main reason why this > term is of lower utility than "pixel". 
> > I agree that it is impossible to come to a single, universal definition > of text elements, but it seems possible to reach a consensus on a kind of > the least common denominator of them and call it "textel" or something else. > > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~ > jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Mar 13 13:02:39 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 13 Mar 2017 19:02:39 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: <20170313190239.12215ot2zpq9i2m7@mail.mimuw.edu.pl> Quote/Cytat - J Decker (Mon 13 Mar 2017 06:55:18 PM CET): > texel looks to be defined as a graphic element already. TEXture ELement. I'm aware of it, but homonymy/polysemy is something we have to live with. I think there is no risk of confusing texture elements with text elements, despite the fact that 'texture' and 'text' have similar origin. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From alastair at alastairs-place.net Mon Mar 13 14:18:00 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 13 Mar 2017 19:18:00 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> Message-ID: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> On 13 Mar 2017, at 17:55, J Decker wrote: > > I liked the Go implementation of character type - a rune type - which is a codepoint - and strings that return runes by index. > https://blog.golang.org/strings IMO, returning code points by index is a mistake. It over-emphasises the importance of the code point, which helps to continue the notion in some developers' minds that code points are somehow "characters". It also leads to people unnecessarily using UCS-4 as an internal representation, which seems to have very few advantages in practice over UTF-16. > Doesn't solve the problem for composited codepoints though... > > texel looks to be defined as a graphic element already. TEXture ELement. Yes, but I thought the proposal was "textel", with the extra "t". Re-using "texel" would be quite inappropriate; there are certainly people who work on rendering software who would strongly object to that, for very good reasons. I would caution, however, that there's already a lot of terminology associated with Unicode, perhaps for understandable reasons, but if the word "textel" is going to have a definition that differs from (say) an extended grapheme cluster, I think a great deal of consideration should be given to what exactly that definition should be. 
We already have 'characters', code units, code points, combining sequences, graphemes, grapheme clusters, extended grapheme clusters and probably other things I've missed off that list. Merely adding yet another bit of terminology isn't going to fix the problem of developers misunderstanding or simply not being aware of the correct terminology or of some aspect of Unicode's behaviour. Kind regards, Alastair. -- http://alastairs-place.net From khaledhosny at eglug.org Mon Mar 13 16:10:11 2017 From: khaledhosny at eglug.org (Khaled Hosny) Date: Mon, 13 Mar 2017 23:10:11 +0200 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> Message-ID: <20170313211011.GE1429@macbook> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: > On 13 Mar 2017, at 17:55, J Decker wrote: > > > > I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. > > https://blog.golang.org/strings > > IMO, returning code points by index is a mistake. It over-emphasises > the importance of the code point, which helps to continue the notion > in some developers' minds that code points are somehow 'characters'. > It also leads to people unnecessarily using UCS-4 as an internal > representation, which seems to have very few advantages in practice > over UTF-16. But there are many text operations that require access to Unicode code points. Take for example text layout, as mapping characters to glyphs and back has to operate on code points. The idea that you never need to work with code points is too simplistic.
Regards, Khaled From richard.wordingham at ntlworld.com Mon Mar 13 16:47:04 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 13 Mar 2017 21:47:04 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313211011.GE1429@macbook> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> Message-ID: <20170313214704.55372dfb@JRWUBU2> On Mon, 13 Mar 2017 23:10:11 +0200 Khaled Hosny wrote: > But there are many text operations that require access to Unicode code > points. Take for example text layout, as mapping characters to glyphs > and back has to operate on code points. The idea that you never need > to work with code points is too simplistic. There are advantages to interpreting and operating on text as though it were in form NFD. However, there are still cases where one needs fractions of a character, such as word boundaries in Sanskrit, though I think the locations are liable to be specified in a language-specific form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it in at least 4 ways. Richard. 
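As an aside on Richard's NFD point: the advantage of operating on text as though it were in form NFD can be seen in a few lines of Python using the standard unicodedata module (an illustration added here, not code from any poster).

```python
import unicodedata

composed = "\u00e9"                                   # é as one code point, U+00E9
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + U+0301 COMBINING ACUTE ACCENT

assert len(composed) == 1 and len(decomposed) == 2
assert unicodedata.normalize("NFC", decomposed) == composed

# Working in NFD, a scan for U+0301 finds every acute accent,
# however the input text happened to be composed.
assert "\u0301" in unicodedata.normalize("NFD", "caf\u00e9")
assert "\u0301" not in "caf\u00e9"
```

The same idea underlies Richard's remark: once text is treated as NFD, combining marks are uniformly separate code points, though (as he notes) boundaries that fall *inside* a single code point such as U+093E still need language-specific handling.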
From manish at mozilla.com Mon Mar 13 17:26:00 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Mon, 13 Mar 2017 15:26:00 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313214704.55372dfb@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: Do you have examples of AA being split that way (and further reading)? I think I'm aware of what you're talking about, but would love to read more about it. -Manish On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham wrote: > On Mon, 13 Mar 2017 23:10:11 +0200 > Khaled Hosny wrote: > >> But there are many text operations that require access to Unicode code >> points. Take for example text layout, as mapping characters to glyphs >> and back has to operate on code points. The idea that you never need >> to work with code points is too simplistic. > > There are advantages to interpreting and operating on text as though it > were in form NFD. However, there are still cases where one needs > fractions of a character, such as word boundaries in Sanskrit, though I > think the locations are liable to be specified in a language-specific > form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it > in at least 4 ways. > > Richard. 
From richard.wordingham at ntlworld.com Mon Mar 13 18:48:37 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 13 Mar 2017 23:48:37 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: <20170313234837.5d891338@JRWUBU2> On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard. From mark at kli.org Mon Mar 13 19:20:25 2017 From: mark at kli.org (Mark E. 
Shoulson) Date: Mon, 13 Mar 2017 20:20:25 -0400 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: A word ending in A *or* AA preceding a word beginning in A *or* AA will all coalesce to a single AA in Sanskrit. That's four possibilities, and that doesn't count a word ending in a consonant preceding a word beginning in AA, which would be written the same. My memory is rusty, so I should actually be looking things up, but I think these are valid constructions: ? + ??????? ? ???????? ? + ??????? ? ???????? (and indeed, ??????? is the upasarga ? plus ???????, so there too the A + AA coalesced.) I should probably find you examples for all the other possibilities. Sanskrit external vowel sandhi is comparatively straightforward (compared to consonant sandhi), and it frequently loses information. A *or* AA plus I is E; A *or* AA plus U is O (you need A + O to get AU). ~mark On 03/13/2017 06:26 PM, Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. > -Manish > > > On Mon, Mar 13, 2017 at 2:47 PM, Richard Wordingham > wrote: >> On Mon, 13 Mar 2017 23:10:11 +0200 >> Khaled Hosny wrote: >> >>> But there are many text operations that require access to Unicode code >>> points. Take for example text layout, as mapping characters to glyphs >>> and back has to operate on code points. The idea that you never need >>> to work with code points is too simplistic. 
>> There are advantages to interpreting and operating on text as though it >> were in form NFD. However, there are still cases where one needs >> fractions of a character, such as word boundaries in Sanskrit, though I >> think the locations are liable to be specified in a language-specific >> form. U+093E DEVANAGARI VOWEL SIGN AA can have a word boundary in it >> in at least 4 ways. >> >> Richard. From richard.wordingham at ntlworld.com Mon Mar 13 20:56:23 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 01:56:23 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> Message-ID: <20170314015623.446cb440@JRWUBU2> On Mon, 13 Mar 2017 20:20:25 -0400 "Mark E. Shoulson" wrote: > Sanskrit external vowel sandhi is comparatively > straightforward (compared to consonant sandhi), and it frequently > loses information. A *or* AA plus I is E; A *or* AA plus U is O (you > need A + O to get AU). Indeed, E can not only be A or AA plus I or II: it can also be E + A. In the latter case avagraha is usual, at least in European practice. (Would that generally be locale sa_Deva_GB?) I'd like advice on modern Indian practice, and on the spacing and syllable division. I've seen a claim that avagraha always belongs with the preceding vowel, but I'm not sure that that rule applies in this case. In a similar fashion, O can -AS + A-, an interesting case of visarga sandhi. However, I'm not sure that one would want to *divide* the E or O. Richard. 
From richard.wordingham at ntlworld.com Mon Mar 13 21:03:56 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 02:03:56 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> Message-ID: <20170314020356.26ff5e89@JRWUBU2> On Mon, 13 Mar 2017 19:18:00 +0000 Alastair Houghton wrote: > IMO, returning code points by index is a mistake. It over-emphasises > the importance of the code point, which helps to continue the notion > in some developers' minds that code points are somehow 'characters'. > It also leads to people unnecessarily using UCS-4 as an internal > representation, which seems to have very few advantages in practice > over UTF-16. The problem is that UTF-16 based code can very easily overlook the handling of surrogate pairs, and one can very easily get confused over what string lengths mean. Richard.
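Richard's pitfall is easy to make concrete. The sketch below (Python, added for illustration; none of the posters supplied code) encodes a supplementary-plane character to UTF-16, shows that it occupies two code units, and applies the standard surrogate-pair arithmetic to recover the code point; naive code-unit indexing would split that pair.

```python
def decode_utf16(units):
    """Turn a list of UTF-16 code units into code points, pairing surrogates."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        is_pair = (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                   and 0xDC00 <= units[i + 1] <= 0xDFFF)
        if is_pair:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP code point (an unpaired surrogate would pass through)
            i += 1
    return out

s = "G\U0001D11E"  # 'G' plus MUSICAL SYMBOL G CLEF (U+1D11E)
raw = s.encode("utf-16-le")
units = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

assert units == [0x0047, 0xD834, 0xDD1E]      # 3 code units for 2 code points
assert decode_utf16(units) == [0x47, 0x1D11E]
```

The length mismatch is exactly the confusion Richard describes: the string holds 2 code points but 3 UTF-16 code units, and code that treats `units[1]` as a character on its own has cut the clef in half.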
From manish at mozilla.com Tue Mar 14 00:57:03 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Mon, 13 Mar 2017 22:57:03 -0700 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <20170313214704.55372dfb@JRWUBU2> <20170313234837.5d891338@JRWUBU2> Message-ID: Ah, it was what I thought you were talking about -- I wasn't aware they were considered word boundaries :) Thanks for the links! On Mar 13, 2017 4:54 PM, "Richard Wordingham" < richard.wordingham at ntlworld.com> wrote: On Mon, 13 Mar 2017 15:26:00 -0700 Manish Goregaokar wrote: > Do you have examples of AA being split that way (and further reading)? > I think I'm aware of what you're talking about, but would love to read > more about it. Just googling for the three words 'Sanskrit', 'sandhi' and 'resolution' brings up plenty of papers and discussion, e.g. Hellwig's at http://ltc.amu.edu.pl/book/papers/LRL-1.pdf and a multi-author paper at https://www.aclweb.org/anthology/C/C16/C16-1048.pdf. There are even technical terms for before and after. Unsplit text is 'samhita text', and text split into words is 'pada text'. Richard. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From alastair at alastairs-place.net Tue Mar 14 03:44:01 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Tue, 14 Mar 2017 08:44:01 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170313211011.GE1429@macbook> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> Message-ID: <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> On 13 Mar 2017, at 21:10, Khaled Hosny wrote: > > On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: >> On 13 Mar 2017, at 17:55, J Decker wrote: >>> >>> I liked the Go implementation of character type - a rune type - which is a codepoint. and strings that return runes from by index. >>> https://blog.golang.org/strings >> >> IMO, returning code points by index is a mistake. It over-emphasises >> the importance of the code point, which helps to continue the notion >> in some developers? minds that code points are somehow ?characters?. >> It also leads to people unnecessarily using UCS-4 as an internal >> representation, which seems to have very few advantages in practice >> over UTF-16. > > But there are many text operations that require access to Unicode code > points. Take for example text layout, as mapping characters to glyphs > and back has to operate on code points. The idea that you never need to > work with code points is too simplistic. I didn?t say you never needed to work with code points. What I said is that there?s no advantage to UCS-4 as an encoding, and that there?s no advantage to being able to index a string by code point. 
As it happens, I've written the kind of code you cite as an example, including glyph mapping and OpenType processing, and the fact is that it's no harder to do it with a UTF-16 string than it is with a UCS-4 string. Yes, certainly, surrogate pairs need to be decoded to map to glyphs; but that's a *trivial* matter, particularly as the code point to glyph mapping is not 1:1 or even 1:N - it's N:M, so you already need to cope with being able to map multiple code units in the string to multiple glyphs in the result. Kind regards, Alastair. -- http://alastairs-place.net From alastair at alastairs-place.net Tue Mar 14 03:51:18 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Tue, 14 Mar 2017 08:51:18 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314020356.26ff5e89@JRWUBU2> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170314020356.26ff5e89@JRWUBU2> Message-ID: <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> On 14 Mar 2017, at 02:03, Richard Wordingham wrote: > > On Mon, 13 Mar 2017 19:18:00 +0000 > Alastair Houghton wrote: > >> IMO, returning code points by index is a mistake. It over-emphasises >> the importance of the code point, which helps to continue the notion >> in some developers' minds that code points are somehow 'characters'. >> It also leads to people unnecessarily using UCS-4 as an internal >> representation, which seems to have very few advantages in practice >> over UTF-16. > > The problem is that UTF-16 based code can very easily overlook the > handling of surrogate pairs, and one very easily get confused over what > string lengths mean.
Yet the same problem exists for UCS-4; it could very easily overlook the handling of combining characters. As for string lengths, string lengths in code points are no more meaningful than string lengths in UTF-16 code units. They don't tell you anything about the number of user-visible characters; or anything about the width the string will take up if rendered on the display (even in a fixed-width font); or anything about the number of glyphs that a given string might be transformed into by glyph mapping. The *only* thing a string length of a Unicode string will tell you is the number of code units. Kind regards, Alastair. -- http://alastairs-place.net From steffen at sdaoden.eu Tue Mar 14 07:21:27 2017 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Tue, 14 Mar 2017 13:21:27 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170313211011.GE1429@macbook> <8CD006C0-E500-48AB-9334-4C5F9DE4F2BB@alastairs-place.net> Message-ID: <20170314122127.g2gcS%steffen@sdaoden.eu> Alastair Houghton wrote: |On 13 Mar 2017, at 21:10, Khaled Hosny wrote: |> On Mon, Mar 13, 2017 at 07:18:00PM +0000, Alastair Houghton wrote: |>> On 13 Mar 2017, at 17:55, J Decker wrote: |>>> |>>> I liked the Go implementation of character type - a rune type - \ |>>> which is a codepoint. and strings that return runes from by index. |>>> https://blog.golang.org/strings |>> |>> IMO, returning code points by index is a mistake. It over-emphasises |>> the importance of the code point, which helps to continue the notion |>> in some developers'
minds that code points are somehow 'characters'. |>> It also leads to people unnecessarily using UCS-4 as an internal |>> representation, which seems to have very few advantages in practice |>> over UTF-16. |> |> But there are many text operations that require access to Unicode code |> points. Take for example text layout, as mapping characters to glyphs |> and back has to operate on code points. The idea that you never need to |> work with code points is too simplistic. | |I didn't say you never needed to work with code points. What I said \ |is that there's no advantage to UCS-4 as an encoding, and that there's \ Well, you do have eleven bits for flags per codepoint, for example. |no advantage to being able to index a string by code point. As it \ With UTF-32 you can take the very codepoint and look up Unicode classification tables. |happens, I've written the kind of code you cite as an example, including \ |glyph mapping and OpenType processing, and the fact is that it's no \ |harder to do it with a UTF-16 string than it is with a UCS-4 string. \ | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \ |but that's a *trivial* matter, particularly as the code point to glyph \ |mapping is not 1:1 or even 1:N - it's N:M, so you already need to cope \ |with being able to map multiple code units in the string to multiple \ |glyphs in the result. If you have to iterate over a string to perform some high-level processing then UTF-8 is an almost equally fine choice, for the very same reasons you bring in. And if the usage-pattern "hotness" picture shown at the beginning of this thread is correct, then the size overhead of UTF-8 that the UTF-16 proponents point out turns out to be a flop. But i for one gave up on making a stand against UTF-16 or BOMs.
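Steffen's "eleven bits for flags" remark can be made concrete. A Unicode scalar value needs at most 21 bits (U+10FFFF fits below 2^21), so a 32-bit unit has 11 bits to spare. The sketch below is an illustrative layout only, not anything any poster actually implements: it packs an application-defined flag field above the code point bits.

```python
CP_MASK = 0x1FFFFF  # low 21 bits: enough for any scalar value up to U+10FFFF

def pack(cp: int, flags: int) -> int:
    """Pack an 11-bit application flag field above the 21 code point bits."""
    assert 0 <= cp <= 0x10FFFF and 0 <= flags < (1 << 11)
    return (flags << 21) | cp

def unpack(unit: int):
    """Recover (code point, flags) from a packed 32-bit unit."""
    return unit & CP_MASK, unit >> 21

# e.g. stash a display width of 2 alongside U+1F600 in a single 32-bit unit
unit = pack(0x1F600, flags=2)
assert unit < (1 << 32)
assert unpack(unit) == (0x1F600, 2)
```

This is the sense in which the extra bits are usable for local, in-memory purposes (as Doug later notes, once the bits carry metadata the units are no longer plain UTF-32, so such values must be masked before being treated as code points).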
In fact i have turned to think UTF-16 is a pretty nice in-memory representation, and it is a small step to get from it to the real codepoint that you need to decide what something is, and what has to be done with it. I don't know whether i would really use it for this purpose, though, i am pretty sure that my core Unicode functions will (start to /) continue to use UTF-32, because the codepoint to codepoint(s) is what is described, and onto which anything else can be implemented. I.e., you can store three UTF-32 codepoints in a single uint64_t, and i would shoot myself in the foot if i would make this accessible via an UTF-16 or UTF-8 converter, imho; instead, i (will) make it accessible directly as UTF-32, and that serves equally well all other formats. Of course, if it is clear that you are UTF-16 all-through-the-way then you can save the conversion, but (the) most (widespread) Uni(x|ces) are UTF-8 based and it looks as if that would stay. Yes, yes, you can nonetheless use UTF-16, but it will most likely not safe you something on the database side due to storage alignment requirements, and the necessity to be able to access data somewhere. You can have a single index-lookup array and a dynamically sized database storage which uses two-byte alignment, of course, then i can imagine UTF-16 is for the better. I never looked how ICU does it, but i have been impressed by sheer data facts ^.^ --steffen From doug at ewellic.org Tue Mar 14 10:14:33 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 14 Mar 2017 08:14:33 -0700 Subject: "A Programmer's Introduction to Unicode" Message-ID: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> Steffen Nurpmeso wrote: >> I didn?t say you never needed to work with code points. What I said >> is that there?s no advantage to UCS-4 as an encoding, and that > > Well, you do have eleven bits for flags per codepoint, for example. That's not UCS-4; that's a custom encoding. 
(any UCS-4 code unit) & 0xFFE00000 == 0 -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Tue Mar 14 10:35:48 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 14 Mar 2017 16:35:48 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> References: <20170314081433.665a7a7059d7ee80bb4d670165c8327d.711efe5c84.wbe@email03.godaddy.com> Message-ID: Per definition yes, but UTC-4 is not Unicode. As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which would allow 32 planes instead of just the 17 first ones). I suppose he meant 21 bits, not 11 bits which covers only a small part of the BMP. 2017-03-14 16:14 GMT+01:00 Doug Ewell : > Steffen Nurpmeso wrote: > > >> I didn?t say you never needed to work with code points. What I said > >> is that there?s no advantage to UCS-4 as an encoding, and that > > > > Well, you do have eleven bits for flags per codepoint, for example. > > That's not UCS-4; that's a custom encoding. > > (any UCS-4 code unit) & 0xFFE00000 == 0 > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 14 11:15:38 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 14 Mar 2017 09:15:38 -0700 Subject: "A Programmer's Introduction to Unicode" Message-ID: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> Philippe Verdy wrote: >>> Well, you do have eleven bits for flags per codepoint, for example. >> >> That's not UCS-4; that's a custom encoding. >> >> (any UCS-4 code unit) & 0xFFE00000 == 0 (changing to "UTF-32" per Ken's observation) > Per definition yes, but UTC-4 is not Unicode. I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting held in 1989? 
> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not > Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which > would allow 32 planes instead of just the 17 first ones). I used bitwise arithmetic strictly to address Steffen's premise that the 11 "unused bits" in a UTF-32 code unit were available to store metadata about the code point. Of course UTF-32 does not allow 0x110000 through 0x1FFFFF either. > I suppose he meant 21 bits, not 11 bits which covers only a small part > of the BMP. No, his comment "you do have eleven bits for flags per codepoint" pretty clearly referred to using the "extra" 11 bits beyond what is needed to hold the Unicode scalar value. -- Doug Ewell | Thornton, CO, US | ewellic.org From richard.wordingham at ntlworld.com Tue Mar 14 15:28:33 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 14 Mar 2017 20:28:33 +0000 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> References: <864lyzhvxz.fsf@mimuw.edu.pl> <20170312200228.66175jm4ugv8aedg@mail.mimuw.edu.pl> <20170312201022.7ec8d858@JRWUBU2> <20170313113128.49786e0cyjk3q680@mail.mimuw.edu.pl> <39d39d6a-1d2c-16d6-666c-57e71ce66e5e@ix.netcom.com> <20170313181531.120575ocpe8uockz@mail.mimuw.edu.pl> <48CB7043-2F1B-4953-9A61-3A75EEB20C89@alastairs-place.net> <20170314020356.26ff5e89@JRWUBU2> <3529A80D-304B-4B65-AACA-D3E60348CA6B@alastairs-place.net> Message-ID: <20170314202833.08eb9d55@JRWUBU2> On Tue, 14 Mar 2017 08:51:18 +0000 Alastair Houghton wrote: > On 14 Mar 2017, at 02:03, Richard Wordingham > wrote: > > > > On Mon, 13 Mar 2017 19:18:00 +0000 > > Alastair Houghton wrote: > > The problem is that UTF-16 based code can very easily overlook the > > handling of surrogate pairs, and one very easily get confused over > > what string lengths mean. > > Yet the same problem exists for UCS-4; it could very easily overlook > the handling of combining characters. 
That's a different issue. I presume you mean the issues of canonical equivalence and detecting text boundaries. Again, there is the problem of remembering to consider the whole surrogate pair when using UTF-16. (I suppose this could be largely handled by avoiding the concept of arrays.) Now, the supplementary characters where these issues arise are very infrequently used. An error in UTF-16 code might easily not come to attention, whereas a problem with UCS-4 (or UTF-8) comes to light as soon as one handles Thai or IPA. > As for string lengths, string > lengths in code points are no more meaningful than string lengths in > UTF-16 code units. They don't tell you anything about the number of > user-visible characters; or anything about the width the string will > take up if rendered on the display (even in a fixed-width font); or > anything about the number of glyphs that a given string might be > transformed into by glyph mapping. The *only* thing a string length > of a Unicode string will tell you is the number of code units. A string length in codepoints does have the advantage of being independent of encoding. I'm actually using an index for UTF-16 text (I don't know whether it's denominated in codepoints or code units) to index into the UTF-8 source code. However, the number of code units is the more commonly used quantity, as it tells one how much memory is required for simple array storage. Richard. From steffen at sdaoden.eu Wed Mar 15 05:40:54 2017 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Wed, 15 Mar 2017 11:40:54 +0100 Subject: "A Programmer's Introduction to Unicode" In-Reply-To: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> References: <20170314091538.665a7a7059d7ee80bb4d670165c8327d.b2df3cc5ee.wbe@email03.godaddy.com> Message-ID: <20170315104054.tMouD%steffen@sdaoden.eu> "Doug Ewell" wrote: |Philippe Verdy wrote: |>>> Well, you do have eleven bits for flags per codepoint, for example.
|>> |>> That's not UCS-4; that's a custom encoding. |>> |>> (any UCS-4 code unit) & 0xFFE00000 == 0 | |(changing to "UTF-32" per Ken's observation) | |> Per definition yes, but UTC-4 is not Unicode. | |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting |held in 1989? | |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which |> would allow 32 planes instead of just the 17 first ones). | |I used bitwise arithmetic strictly to address Steffen's premise that the |11 "unused bits" in a UTF-32 code unit were available to store metadata |about the code point. Of course UTF-32 does not allow 0x110000 through |0x1FFFFF either. | |> I suppose he meant 21 bits, not 11 bits which covers only a small part |> of the BMP. | |No, his comment "you do have eleven bits for flags per codepoint" pretty |clearly referred to using the "extra" 11 bits beyond what is needed to |hold the Unicode scalar value. It surely is a weak argument for a general string encoding. But sometimes, and for local use cases, it surely is valid. You could store the wcwidth(3) value plus a grapheme codepoint count in these bits of the first codepoint of a cluster, for example, and then hide that storage detail under an access method interface. --steffen From 637275 at gmail.com Fri Mar 17 11:53:43 2017 From: 637275 at gmail.com (Rebecca T) Date: Fri, 17 Mar 2017 12:53:43 -0400 Subject: Combining solidus above for transcription of poetic meter Message-ID: When transcribing poetic meter (scansion), it is common to use two symbols above the line (usually a breve [U+306 ˘] for stressed syllables and a solidus / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: ˘ / ˘ / ˘ / ˘ / ˘ / When I consider how my light is spent (John Milton, On His Blindness) Other symbols used in place of the breve are a cross / x (U+D8 Ø or U+78 x) or bullet (U+B7 · or U+2022 •).
This approach, however, is problematic; the lack of a combining slash above character means that two lines of text must be used, and any non-monospaced font (or any platform where multiple consecutive spaces are truncated into one by default, such as HTML) makes keeping the annotations properly aligned with the text difficult or impossible – depending on your email client, the above example may be entirely misaligned. Being able to use combining diacritics for scansion would make these problems obsolete and enable a semantic transcription of meter. Would a proposal to add a combining solidus above (and possibly a combining reversed solidus above to support Hamer, Wright, and Trager-Smith notations) be supported? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jcb+unicode at inf.ed.ac.uk Fri Mar 17 12:27:47 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Fri, 17 Mar 2017 17:27:47 GMT Subject: Combining solidus above for transcription of poetic meter References: Message-ID: On 2017-03-17, Rebecca T <637275 at gmail.com> wrote: > When transcribing poetic meter (scansion > >), it is common to use two symbols > above the line (usually a breve [U+306 ˘] for stressed syllables and a > solidus > / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: Other way round, as you illustrate > This approach, however, is problematic; the lack of a combining slash above > character means that two lines of text must be used, and any non-monospaced > font (or any platform where multiple consecutive spaces are truncated into > one It won't help to have a "combining solidus a long way above" (which is what you really want) unless you also have "combining breve a long way above". If you are happy to use a typographically normal combining breve for the unstressed syllables, you should be happy to use a typographically normal acute accent for the stressed syllable.
> by default, such as HTML) makes keeping the annotations properly aligned > with > the text difficult or impossible ? depending on your email client, the > above > example may be entirely misaligned. Being able to use combining diacritics > for > scansion would make these problems obsolete and enable a semantic > transcription > of meter. If you're working in a situation where you don't have either markup control or the facility to use plain monospaced text, then just use normal breves and acutes. It's not clear to me that laying out aligned text (for which there are many other applications than scansion, e.g. interlinear translation) is something best achieved with combining characters! -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From nobody_uses at outlook.com Fri Mar 17 13:46:45 2017 From: nobody_uses at outlook.com (eduardo marin) Date: Fri, 17 Mar 2017 18:46:45 +0000 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: You would need to propose the entire set of symbols, like the caret the reverse solidus and the x above, furthermore you would need to make the solidus small so it doesn't interfere with the line of text above. So go for it. ________________________________ De: Rebecca T <637275 at gmail.com> Enviado: viernes, 17 de marzo de 2017 10:53 a. m. Para: Unicode Public Asunto: Combining solidus above for transcription of poetic meter When transcribing poetic meter (scansion), it is common to use two symbols above the line (usually a breve [U+306 ?] for stressed syllables and a solidus / slash [U+2F /] for unstressed syllables) to indicate stress patterns. Ex: ? / ? / ? / ? / ? / When I consider how my light is spent (John Milton, On His Blindness) Other symbols used in place of the breve are a cross / x (U+D8 ? or U+78 x) or bullet (U+B7 ? or U+2022 ?). 
This approach, however, is problematic; the lack of a combining slash above character means that two lines of text must be used, and any non-monospaced font (or any platform where multiple consecutive spaces are truncated into one by default, such as HTML) makes keeping the annotations properly aligned with the text difficult or impossible - depending on your email client, the above example may be entirely misaligned. Being able to use combining diacritics for scansion would make these problems obsolete and enable a semantic transcription of meter. Would a proposal to add a combining solidus above (and possibly a combining reversed solidus above to support Hamer, Wright, and Trager-Smith notations) be supported? -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:03:12 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:03:12 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: Isn't this a use case for interlinear annotations? What is the current status of interlinear encoding? We were told that the encoded codepoints for these are more or less deprecated (but in HTML there's still interlinear annotation supported by ruby notations). In these annotations, we don't need any diacritics, we could just use base symbols. 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > On 2017-03-17, Rebecca T <637275 at gmail.com> wrote: > > When transcribing poetic meter (scansion > >), it is common to use two symbols > > above the line (usually a breve [U+306 ?] for stressed syllables and a > > solidus > > / slash [U+2F /] for unstressed syllables) to indicate stress patterns.
> Ex: > > Other way round, as you illustrate > > > This approach, however, is problematic; the lack of a combining slash > above > > character means that two lines of text must be used, and any > non-monospaced > > font (or any platform where multiple consecutive spaces are truncated > into > > one > > It won't help to have a "combining solidus a long way above" (which is > what you really want) unless you also have "combining breve a long way > above". > If you are happy to use a typographically normal combining breve for > the unstressed syllables, you should be happy to use a typographically > normal acute accent for the stressed syllable. > > > by default, such as HTML) makes keeping the annotations properly aligned > > with > > the text difficult or impossible ? depending on your email client, the > > above > > example may be entirely misaligned. Being able to use combining > diacritics > > for > > scansion would make these problems obsolete and enable a semantic > > transcription > > of meter. > > If you're working in a situation where you don't have either markup > control or the facility to use plain monospaced text, then just use > normal breves and acutes. > It's not clear to me that laying out aligned text (for which there are > many other applications than scansion, e.g. interlinear translation) > is something best achieved with combining characters! > > > > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Fri Mar 17 14:10:43 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:10:43 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > If you are happy to use a typographically normal combining breve for > the unstressed syllables, you should be happy to use a typographically > normal acute accent for the stressed syllable. > You've understood the reverse! The stressed syllable in those notations uses a breve, the unstressed syllables use a slash/solidus (which may look very similar to an acute accent, but means here exactly the opposite). However, using acute accents that are already used in many languages for vowel distinctions (independently of stress) would cause problems. It would be better to use the IPA stress mark that looks like a vertical tick just before the syllable (i.e. before its leading consonant and not on top of its central vowel): these marks are not combining, they are regular spacing symbols. The proposal discusses *some* specific use where symbols that look like diacritics may be used in a row just above the actual text (in that case it should not be confused with the actual accents). That's why I think this better fits with interlinear annotations (there will be some vertical margin between the notation and the text using its native diacritics, and the interlinear stress marks will align horizontally without colliding with the text whose diacritics would have variable placement, not aligned horizontally but depending on base letters or the presence of other diacritics). -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Fri Mar 17 14:16:17 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:16:17 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: Final note: the HTML ruby syntax (their standard tags) is not supported by MediaWiki, for your example article in English Wikipedia (but there are some templates that could simulate ruby notation, using equivalent CSS to which the ruby notation should have a default mapping, as specified in an annex of the HTML standard suggesting a default CSS stylesheet for standard HTML tags). 2017-03-17 20:10 GMT+01:00 Philippe Verdy : > 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > >> If you are happy to use a typographically normal combining breve for >> the unstressed syllables, you should be happy to use a typographically >> normal acute accent for the stressed syllable. >> > > You've understood the reverse! the stressed syllable in those notation > uses a breve, the unstressed syllables use a slash/solidus (which many look > very similar to an acute accent, but means here exactly the opposite). > However using acute accents that are already used in many langauges for > vowel distinctions (independantly of stress) would cause problems. > > It would be better to use the IPA stress mark that looks like a vertical > tick just before the syllable (i.e. before its leading consonnant and not > on top of its central vowel): these marks are not combining, they are > regular spacing symbols. > > The proposal discusses about *some* specific use where symbols that look > like diacritics may be used in a row just above the actual text (in that > case it should not be confused with the actual accents). 
> > That's why I think this better fits with interlinear annotations (there > will be some vertical margin between the notation and the text using its > native diacritics, and the interlinear stress marks will align horizontally > without colliding with the text whose diacritics would have variable > placement, not aligned horizontally but depending on base letters or the > presence of other diacritics). -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Mar 17 14:23:16 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 17 Mar 2017 20:23:16 +0100 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: An article for you to read that provides some basic guides and a presentation of the concept and its use in HTML: https://en.wikipedia.org/wiki/Ruby_character Then look at CSS 2.0 for specifications. In Unicode, 3 format control characters were encoded for this (U+FFF9...U+FFFB), but they support only a minimalist subset of the ruby feature and are (as far as I know) poorly supported in browsers (almost no one uses them, not even for the common ruby text used in Asian languages, notably in Japanese for the Furigana notations using kanas above sinographic Kanji text, or in Chinese for the Bopomofo or Latin notations above sinographic text found in educational books for children).
www.avast.com <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> 2017-03-17 20:16 GMT+01:00 Philippe Verdy : > Final note: the HTML ruby syntax (their standard tags) is not supported by > MediaWiki, for your example article in English Wikipedia (but there are > some templates that could simulate ruby notation, using equivalent CSS to > which the ruby notation should have a default mapping, as specified in an > annex of the HTML standard suggesting a default CSS stylesheet for standard > HTML tags). > > 2017-03-17 20:10 GMT+01:00 Philippe Verdy : > >> 2017-03-17 18:27 GMT+01:00 Julian Bradfield : >> >>> If you are happy to use a typographically normal combining breve for >>> the unstressed syllables, you should be happy to use a typographically >>> normal acute accent for the stressed syllable. >>> >> >> You've understood the reverse! the stressed syllable in those notation >> uses a breve, the unstressed syllables use a slash/solidus (which many look >> very similar to an acute accent, but means here exactly the opposite). >> However using acute accents that are already used in many langauges for >> vowel distinctions (independantly of stress) would cause problems. >> >> It would be better to use the IPA stress mark that looks like a vertical >> tick just before the syllable (i.e. before its leading consonnant and not >> on top of its central vowel): these marks are not combining, they are >> regular spacing symbols. >> >> The proposal discusses about *some* specific use where symbols that look >> like diacritics may be used in a row just above the actual text (in that >> case it should not be confused with the actual accents). 
>> >> That's why I think this better fits with interlinear annotations (there >> will be some vertical margin between the notation and the text using its >> native diacritics, and the interlinear stress marks will align horizontally >> without colliding wit h the text whose diacritics would have variable >> placement, not aligned horizontally but depending on base letters or the >> presence of other diacritics). >> >> >> >> >> Garanti >> sans virus. www.avast.com >> >> <#m_-5720946395316280878_m_2934369818200883392_DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Fri Mar 17 14:41:29 2017 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 17 Mar 2017 12:41:29 -0700 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: On 3/17/2017 10:27 AM, Julian Bradfield wrote: > If you're working in a situation where you don't have either markup > control or the facility to use plain monospaced text, then just use > normal breves and acutes. > It's not clear to me that laying out aligned text (for which there are > many other applications than scansion, e.g. interlinear translation) > is something best achieved with combining characters! I concur with Julian here. In fact, the very wiki article on scansion cited by Rebecca makes it clear that this is an interlinear type of annotation that in principle can use many *other* symbols, including x's (or multiplication signs), digits, circumflexes, and other symbols. Furthermore, the application of the scansion marks is to *syllables* and not to individual letters, which further enhances the case for interlinear representation. The simplest implementation of that is precisely as done in that wiki: force the interlinear examples into a monospace font. 
For simple transposing of an interlinear scansion into a single-line plain text representation, either combining breves and acutes (and circumflexes and graves, ...) can be used and/or spacing versions of breves (and circumflexes...) plus ordinary slashes and backslashes can be dropped into the syllabified text. --Ken > From jcb+unicode at inf.ed.ac.uk Fri Mar 17 15:36:13 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Fri, 17 Mar 2017 20:36:13 +0000 (GMT) Subject: Combining solidus above for transcription of poetic meter References: Message-ID: On 2017-03-17, Philippe Verdy wrote: > 2017-03-17 18:27 GMT+01:00 Julian Bradfield : > >> If you are happy to use a typographically normal combining breve for >> the unstressed syllables, you should be happy to use a typographically >> normal acute accent for the stressed syllable. >> > > You've understood the reverse! the stressed syllable in those notation uses > a breve, the unstressed syllables use a slash/solidus (which many look very > similar to an acute accent, but means here exactly the opposite). I have understood the situation as it actually is (and indeed as it is described in the Wikipedia article). *As I pointed out*, had you bothered to read what I wrote, the OP accidentally reversed the standard notation, in which / indicates a stressed syllable, and a breve an unstressed. Hence there is no clash with the (e.g.) Spanish use of an acute to indicate stress. -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From boldewyn at gmail.com Fri Mar 17 15:44:15 2017 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 17 Mar 2017 21:44:15 +0100 Subject: New tool unidump Message-ID: Hi, for my work on codepoints.net and Emojipedia I found myself repeatedly in a place, where I needed some tool like hexdump to inspect the content of a string. However, instead of raw bytes I am more interested in the code points that the string is composed of. 
So I wrote this tool. I reasoned, that it might come in handy for other people on this list. It is, conveniently, named unidump and can be installed via pip (pip3, that is, because it needs Python 3): pip3 install unidump The source code is available on Github, https://github.com/Codepoints/unidump, and the tool is MIT licensed. The README on Github also explains some other use cases, like counting code points in a file (as opposed to bytes) or using it as a replacement for strings(1). If you have any comment, feedback, bug report or other questions, I'm glad to answer any of those. Cheers and have a nice weekend, Manuel From everson at evertype.com Fri Mar 17 15:53:24 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 17 Mar 2017 20:53:24 +0000 Subject: Combining solidus above for transcription of poetic meter In-Reply-To: References: Message-ID: <375186EF-815E-4847-987B-03E94A6C1BBB@evertype.com> http://www.brill.com/files/brill.nl/special_scripts_metrical_characters_unicode.pdf From manish at mozilla.com Fri Mar 17 18:43:04 2017 From: manish at mozilla.com (Manish Goregaokar) Date: Fri, 17 Mar 2017 16:43:04 -0700 Subject: New tool unidump In-Reply-To: References: Message-ID: https://r12a.github.io/uniview/ https://r12a.github.io/apps/conversion/ are excellent tools for this, as well, if you're in a situation where you can copy into a web form. This looks useful for commandline stuff, though, thanks! -Manish On Fri, Mar 17, 2017 at 1:44 PM, Manuel Strehl wrote: > Hi, > > for my work on codepoints.net and Emojipedia I found myself repeatedly > in a place, where I needed some tool like hexdump to inspect the content > of a string. However, instead of raw bytes I am more interested in the > code points that the string is composed of. So I wrote this tool. > > I reasoned, that it might come in handy for other people on this list. 
> It is, conveniently, named unidump and can be installed via pip (pip3, > that is, because it needs Python 3): > > pip3 install unidump > > The source code is available on Github, > https://github.com/Codepoints/unidump, and the tool is MIT licensed. The > README on Github also explains some other use cases, like counting code > points in a file (as opposed to bytes) or using it as a replacement for > strings(1). > > If you have any comment, feedback, bug report or other questions, I'm > glad to answer any of those. > > Cheers and have a nice weekend, > Manuel From jsbien at mimuw.edu.pl Sat Mar 18 00:42:05 2017 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Sat, 18 Mar 2017 06:42:05 +0100 Subject: New tool unidump In-Reply-To: References: Message-ID: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> Quote/Cytat - Manuel Strehl (Fri 17 Mar 2017 09:44:15 PM CET): > Hi, > > for my work on codepoints.net and Emojipedia I found myself repeatedly > in a place, where I needed some tool like hexdump to inspect the content > of a string. However, instead of raw bytes I am more interested in the > code points that the string is composed of. So I wrote this tool. Is somebody maintaining a list of such utilities? There is a page http://www.unicode.org/resources/online-tools.html but I remember that earlier a page on the site used to be links to the programs mentioned in 2012 "Tool to convert characters to character names", in particular to Bill Poser's uniutils (http://billposer.org/Software/unidesc.html) and the orphaned unihist by a student of mine (https://bitbucket.org/jsbien/unihistext). I'm unable to find them now. Best regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From 637275 at gmail.com Sun Mar 19 16:46:28 2017 From: 637275 at gmail.com (Rebecca T) Date: Sun, 19 Mar 2017 17:46:28 -0400 Subject: New tool unidump In-Reply-To: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> References: <20170318064205.20014j1l4r4dsuy5@mail.mimuw.edu.pl> Message-ID: I maintain a list of various Unicode tools and resources at unicode.9999yea.rs and always welcome new additions! On Sat, Mar 18, 2017 at 1:42 AM, Janusz S. Bien wrote: > Quote/Cytat - Manuel Strehl (Fri 17 Mar 2017 > 09:44:15 PM CET): > > Hi, >> >> for my work on codepoints.net and Emojipedia I found myself repeatedly >> in a place, where I needed some tool like hexdump to inspect the content >> of a string. However, instead of raw bytes I am more interested in the >> code points that the string is composed of. So I wrote this tool. >> > > Is somebody maintaining a list of such utilities? > > There is a page > > http://www.unicode.org/resources/online-tools.html > > but I remember that earlier a page on the site used to be links to the > programs mentioned in 2012 "Tool to convert characters to character names", > in particular to Bill Poser's uniutils (http://billposer.org/Software > /unidesc.html) and the orphaned unihist by a student of mine ( > https://bitbucket.org/jsbien/unihistext). I'm unable to find them now. > > Best regards > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~ > jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From a.lukyanov at yspu.org Mon Mar 20 04:27:35 2017 From: a.lukyanov at yspu.org (Andrey Lukyanov) Date: Mon, 20 Mar 2017 12:27:35 +0300 Subject: New tool unidump In-Reply-To: References: Message-ID: <3b657ee35ea3e5ff075363d9dd2cc7ea@mail> apropos unidump: It would be nice to add the option of printing not only numbers, but also character names and other info from the NamesList.txt file. I am using a homemade program at my computer: $ typecode fc25 66 63 32 35 $ typecode -l fc25 0066 LATIN SMALL LETTER F 0063 LATIN SMALL LETTER C 0032 DIGIT TWO 0035 DIGIT FIVE $ typecode -f fc25 0066 LATIN SMALL LETTER F 0063 LATIN SMALL LETTER C 0032 DIGIT TWO ~ 0032 FE0E text style ~ 0032 FE0F emoji style 0035 DIGIT FIVE ~ 0035 FE0E text style ~ 0035 FE0F emoji style From c933103 at gmail.com Tue Mar 21 07:12:10 2017 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 21 Mar 2017 20:12:10 +0800 Subject: Standaridized variation sequences for the Desert alphabet? Message-ID: According to the Wikipedia page for the Deseret alphabet, there is criticism that in the Unicode chart some of the letters encoded for the alphabet used the 1855 design instead of the 1859 design of those characters. Would it be a good idea to make standardized variation sequences for those characters so that they can be displayed either way upon users' wish? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 21 12:41:04 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 21 Mar 2017 10:41:04 -0700 Subject: Standaridized variation sequences for the Desert =?UTF-8?Q?alphabet=3F?= Message-ID: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> gfb hjjhjh wrote: > According to the Wikipedia page for the Deseret alphabet, there is > criticism that in the Unicode chart some of the letters encoded for the > alphabet used the 1855 design instead of the 1859 design of those > characters.
Would it be a good idea to make standardized variation > sequences for those characters so that they can be displayed either way > upon users' wish? Almost any letter in any script can have glyph variations that don't represent a change in semantics. A Deseret font could easily, and conformantly, be constructed with whatever set of glyphs the designer wishes to show, just as it could for a Latin-script font. -- Doug Ewell | Thornton, CO, US | ewellic.org From jameskasskrv at gmail.com Tue Mar 21 18:17:11 2017 From: jameskasskrv at gmail.com (James Kass) Date: Tue, 21 Mar 2017 15:17:11 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: https://en.wikipedia.org/wiki/Deseret_alphabet An interesting article. The "Encodings" section illustrates the differences between the older and newer forms of the two letters. Doug Ewell wrote, > A Deseret font could easily, and conformantly, be > constructed with whatever set of glyphs the designer > wishes to show, just as it could for a Latin-script > font.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Wed Mar 22 10:47:30 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 15:47:30 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: The right first thing to do is to examine the letterforms and determine on structural grounds whether there is a case to be made for encoding. Beesley claimed in 2002 that the glyphs used for EW [ju] and OI [??] changed between 1855 and 1859. Well, OK. 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. That?s encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. If there is evidence outside of the Wikipedia for the 1859 letters, they should be encoded as new letters, because their design shows them to be ligatures of different base characters. That means they?re not glyph variants of the currently encoded letters. Michael Everson From wjgo_10009 at btinternet.com Wed Mar 22 10:54:39 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 22 Mar 2017 15:54:39 +0000 (GMT) Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <3780339.48221.1490198079154.JavaMail.defaultUser@defaultHost> >> If the user community needs to preserve the distinction in plain-text, then variation selection is the right approach. > True. However, the user community is tiny, and I suspect that those variation selectors would never get used. I do not use Deseret myself. I opine that encoding the variation selector sequences would be good. My reason for that opinion is because I opine that Unicode should provide for such situations where they are known to exist, even if the usage of the encoding may be very rare. Am I correct in thinking that making use of such a variation selector encoding would be a font issue rather than an operating system issue? Unicode is intended to be a long-lasting standardized system, so hopefully adding the variation selector sequences into The Unicode Standard now would provide support for a very long time. Am I correct in thinking that the cost of adding the variation selector sequences into The Unicode Standard would be very small? William Overington Wednesday 22 March 2017 From jenkins at apple.com Wed Mar 22 11:50:26 2017 From: jenkins at apple.com (John H. Jenkins) Date: Wed, 22 Mar 2017 10:50:26 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. 
There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. From everson at evertype.com Wed Mar 22 12:44:04 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 17:44:04 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> Message-ID: <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> On 22 Mar 2017, at 16:50, John H. Jenkins wrote: > > My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. There?s identity in terms of intended usage (two diphthongs), and identity in terms of the origin of the characters (ligatures from different sources). That kind of etymology is indeed something that we take into account when encoding characters. > In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. I think I have to stand by my glyph analysis > There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. Dunno what you are referring to here. > It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have have discussed the possibility, and we both feel that it's very much on the table. 
I would oppose such a change given the origin of the four characters we have discussed. The old EW and OI and the new EW and OI are clearly *different* letters. Michael From jameskasskrv at gmail.com Wed Mar 22 15:26:31 2017 From: jameskasskrv at gmail.com (James Kass) Date: Wed, 22 Mar 2017 12:26:31 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> Message-ID: Michael Everson wrote, > The old EW and OI and the new EW and OI are > clearly *different* letters. "Different" versus "variant"? Michael's analysis seems correct. If Deseret was not already in the Standard, a new proposal for its encoding including eight characters covering the two dipthongs would not be amiss, would it? An alternative would be to use the ZWJ mechanism to indicate a preference for the desired letters. My opinion that variation selectors would be the right approach was based upon concerns about existing data getting "broken". But, if there isn't any existing data... Best regards, James Kass From everson at evertype.com Wed Mar 22 16:33:39 2017 From: everson at evertype.com (Michael Everson) Date: Wed, 22 Mar 2017 21:33:39 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <02DF9F54-F5C6-4107-8414-C99D38880E49@apple.com> <5812A5F2-DEED-42EE-93BE-C4DB97918687@evertype.com> Message-ID: On 22 Mar 2017, at 20:26, James Kass wrote: > Michael Everson wrote, > >> The old EW and OI and the new EW and OI are clearly *different* letters. > > "Different" versus "variant?? Yes, different. All of them share the SHORT I [?] stroke but the base characters are ?? ?? (1855) and ?? ?? (1859). 
> Michael's analysis seems correct. If Deseret was not already in the Standard, a new proposal for its encoding including eight characters covering the two diphthongs would not be amiss, would it? Capital and small ?? ?? ?? ?? are already encoded. If the other four are required, nothing prevents them from being proposed and added. > An alternative would be to use the ZWJ mechanism to indicate a preference for the desired letters. Joining what? We encoded ?? ?? ?? ?? explicitly, not as ligatures, though they are in origin ligatures. > My opinion that variation selectors would be the right approach was based upon concerns about existing data getting "broken". But, if there isn't any existing data... If ?? is in origin a ligature of ???? and the 1859 one is in origin a ligature of ???? then the 1855 and 1859 letters are **NOT** "variants" of one another. They are *different* letters in origin, regardless of their intended use. The choice to use 1855 EW or 1859 EW is a matter of *spelling*, not glyph substitution. If the later letters are really required, they should be added to the standard. We should not abandon the good precedent we have for character identification just for expedience. That'd be a way to turn the UCS into a glyph registry. :-( Michael Everson From prosfilaes at gmail.com Wed Mar 22 16:39:27 2017 From: prosfilaes at gmail.com (David Starner) Date: Wed, 22 Mar 2017 21:39:27 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On Wed, Mar 22, 2017 at 8:54 AM Michael Everson wrote: > If there is evidence outside of the Wikipedia for the 1859 letters, they > should be encoded as new letters, because their design shows them to be > ligatures of different base characters. That means they're not glyph > variants of the currently encoded letters.
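[Editorial note: the four Deseret diphthong letters said above to be already encoded can be verified against the Unicode Character Database. A minimal sketch, assuming Python 3 with a Unicode 3.1+ database (Deseret was added in Unicode 3.1); the specific code points are taken from the Deseret block, U+10400..U+1044F:]

```python
import unicodedata

# The encoded diphthong letters sit at the end of the capital
# (U+10400-U+10427) and small (U+10428-U+1044F) runs of the Deseret block.
for cp in (0x10426, 0x10427, 0x1044E, 0x1044F):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
```

On a current CPython this lists DESERET CAPITAL LETTER OI, DESERET CAPITAL LETTER EW, and their small-letter counterparts; any newly proposed 1859 letters would need their own code points alongside these.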
> Does "Яussia" require a new Latin letter because the way R was written has a different origin than the normal R? There's huge variation in Latin script including all sorts of different glyphs, and I suspect Яussia is way more common than any use of the Deseret script. There's the same characters here, written in different ways. The glyphs may come from a different origin, but it's encoding the same idea. If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Wed Mar 22 19:03:44 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 00:03:44 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On 22 Mar 2017, at 21:39, David Starner wrote: > > Does "Яussia" require a new Latin letter because the way R was written has a different origin than the normal R? But it doesn't. It's the Latin letter R turned backwards by a designer for a logo. We wouldn't encode that, because it's a logo. > There's huge variation in Latin script including all sorts of different glyphs, and I suspect Яussia is way more common than any use of the Deseret script. In order to represent that logo, people use the Cyrillic letter Я, as you know. > There's the same characters here, written in different ways. No, it's not. It's the same diphthong (a sound) written with different letters. > The glyphs may come from a different origin, but it's encoding the same idea. We don't encode diphthongs. We encode the elements of writing systems. The "idea" here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ??
(1859 OI). Those ligatures are not glyph variants of one another. You might as well say that ? and ? are glyph variants of one another. > If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. I do not agree with that analysis. The ligatures and their constituent parts are distinct and distinctive. In fact, it might have been that the choice for revision was to improve the underlying phonology. In any case, there's no way that the bottom pair in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg can be considered to be "glyph variants" of the top pair. Usage is one thing. Character identity is another. ? is not ?. A ligature of ?? + ?? is not a ligature of ?? + ??. Michael Everson From charupdate at orange.fr Wed Mar 22 19:20:14 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 23 Mar 2017 01:20:14 +0100 (CET) Subject: Flaw on Side View vs Front View Emoji Pairs? Message-ID: <700225421.24225.1490228414657.JavaMail.www@wwinf1f14> Here is an issue that admittedly is insignificant when compared to on-going world events, but I need to work on some documents to be finished these days. Some transport emoji pairs appear to have been encoded at the same time (6.0), but have their glyphs swapped in some current font(s). These include:

U+1F68C BUS
U+1F68D ONCOMING BUS
U+1F692 FIRE ENGINE
U+1F6F1 ONCOMING FIRE ENGINE
U+1F693 POLICE CAR
U+1F694 ONCOMING POLICE CAR
U+1F695 TAXI
U+1F696 ONCOMING TAXI
U+1F697 AUTOMOBILE
U+1F698 ONCOMING AUTOMOBILE

While on cellphones, the first are side views (source: iemoji.com), the latter ones are conformant front views. By contrast, web browsers on Windows use a font or fonts that show the first in front view, while the others are missing. I note that both are "fully conformant"
to the Standard, so far as the name is a mere identifier, not a descriptor, and the glyphs in the charts have little of a prescription. At least, whenever the name is generic as to perspective, any designer of somewhat related glyphs can claim conformance, and Unicode has to endorse the resulting flaw. I note, too, that "oncoming" is often misunderstood as carrying a connotation of dynamics, whereas in reality, many vehicles are more iconic in front view, while others stand out more in side view. Was it imaginable to be precise and call them simply:

U+1F68C BUS SIDE VIEW
U+1F68D BUS FRONT VIEW
U+1F692 FIRE ENGINE SIDE VIEW
U+1F6F1 FIRE ENGINE FRONT VIEW
U+1F693 POLICE CAR SIDE VIEW
U+1F694 POLICE CAR FRONT VIEW
U+1F695 TAXI SIDE VIEW
U+1F696 TAXI FRONT VIEW
U+1F697 AUTOMOBILE SIDE VIEW
U+1F698 AUTOMOBILE FRONT VIEW

Or did the first ones already exist in both views, so that it was desirable to add one more character for each one of them to make sure to get front views? That would imply that fonts with the first in front view don't need to support the second characters, treated as mere glyph variants. In any case we seem to have to choose between data interchange flaws and document rendering flaws. Does the original proposer or anybody else have any clues on how the set was intended, and how to fix the discrepancy? Regards, Marcel From duerst at it.aoyama.ac.jp Thu Mar 23 00:54:03 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 23 Mar 2017 14:54:03 +0900 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Hello Michael, others, [Fixed script name in subject.] On 2017/03/23 09:03, Michael Everson wrote: > On 22 Mar 2017, at 21:39, David Starner wrote: >> There's the same characters here, written in different ways. > > No, it's not.
It's the same diphthong (a sound) written with different letters. I think this may well be the *historically* correct analysis. And that may have some influence on how to encode this, but it shouldn't be dominant. What's most important is (past and) *current use*. If the distinction is an orthographic one (e.g. different words being written with different shapes), then that's definitely a good indication for splitting. On the other hand, if fonts (before/outside Unicode) only include one variant at a time, if people read over the variant without much ado, if people would be surprised to find both corresponding variants in one and the same text (absent font variations), if there are examples where e.g. the variant is adjusted in quotes from texts that used the 'old' variant inside a text with the 'new' variants, and so on, then all these would be good indications that this is, for actual usage purposes, just a font difference, and should therefore best be handled as such. The closest to the current case that I was able to find was the German ß. It has roots in both an ss and an sz (to be precise, an ſs and an ſz) ligature (see https://en.wikipedia.org/wiki/ß). And indeed in some fonts, its right part looks more like an s, and in other fonts more like a z (and in lower case, more often like an s, but in upper case, much more like a (cursive) Z). Nevertheless, there is only one character (or two if you count upper case) encoded, because anything else would be highly confusing to virtually all users. What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. Regards, Martin. >> The glyphs may come from a different origin, but it's encoding the same idea. > We don't encode diphthongs. We encode the elements of writing systems. The "idea" here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ?? (1859 OI).
> > Those ligatures are not glyph variants of one another. You might as well say that ? and ? are glyph variants of one another. > >> If a user community considers them separate, then they should be separated, but I don't see that happening, and from an idealistic perspective, I think they're platonically the same. > > I do not agree with that analysis. The ligatures and their constituent parts are distinct and distinctive. In fact, it might have been that the choice for revision was to improve the underlying phonology. In any case, there?s no way that the bottom pair in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg can be considered to be ?glyph variants? of the top pair. Usage is one thing. Character identity is another. ? is not ?. A ligature of ?? + ?? is not a ligature of ?? + ??. > > Michael Everson > . From prosfilaes at gmail.com Thu Mar 23 01:28:26 2017 From: prosfilaes at gmail.com (David Starner) Date: Thu, 23 Mar 2017 06:28:26 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: On Wed, Mar 22, 2017 at 5:09 PM Michael Everson wrote: > On 22 Mar 2017, at 21:39, David Starner wrote: > > > > Does "?ussia" require a new Latin letter because the way R was written > has a different origin than the normal R? > > But it doesn?t. It?s the Latin letter R turned backwards by a designer for > a logo. We wouldn?t encode that, because it?s a logo. > What logo? I honestly don't know what logo you're talking about, but a quick Google search confirms it's used outside of a logo. I was thinking of http://www.sjgames.com/gurps/books/Russia/img/cover_lg.jpg which actually doesn't use the reversed R, but uses other Cyrillic characters. > We don?t encode diphthongs. We encode the elements of writing systems. The > ?idea? 
here is represented by one ligature of ?? + ?? (1855 EW), one > ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one > ligature of ?? + ?? (1859 OI). > If they're ligatures, they should be encoded as ligatures; if they're indivisible characters, then their glyph forms are of less interest. > Those ligatures are not glyph variants of one another. You might as well > say that ? and ? are glyph variants of one another. > ? and ? have contrasting use; they're used in the same text in distinct ways. Note that n and v? are considered glyph variants of each other, because v? is used in Sutterlin in exactly the places that n is used in typewritten versions of the text. > ? is not ?. > ? is not ? even when they are printed in fonts that make it nearly impossible to tell them apart. It has nothing to do with the glyphs or how those glyphs were created, it's because they're used in different ways. The example of Sutterlin strikes me as quite relevant here; characters get all sorts of weird shapes in handwriting. Sometimes they end up immortalized in printing, and then they usually get encoded. Usually not. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jameskasskrv at gmail.com Thu Mar 23 04:33:39 2017 From: jameskasskrv at gmail.com (James Kass) Date: Thu, 23 Mar 2017 01:33:39 -0800 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: Martin J. D?rst wrote, > What is right for Deseret has to be decided by > and for Deseret users, rather than by script > historians. The Universal Character Set is used by everyone, including script historians. 
While modern day deployment of the script is determined by its users, the proper encoding of the script should be determined by character encoders based upon expert input from all interested parties. Best regards, James Kass From otto.stolz at uni-konstanz.de Thu Mar 23 05:23:27 2017 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Thu, 23 Mar 2017 11:23:27 +0100 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> Hello Michael, others, On 2017/03/23 09:03, Michael Everson wrote: > It's the same diphthong (a sound) written with different > letters. On 23.03.2017 at 06:54, Martin J. Dürst wrote: > I think this may well be the *historically* correct analysis. And that > may have some influence on how to encode this, but it shouldn't be > dominant. > > What's most important is (past and) *current use*. Same issue as with German sharp S: The blackletter ß derives from an ſ-z ligature (thence its German name "Eszet"), whilst the Roman type ß derives from an ſ-s ligature. Still, we encode both variants as identical letters. I've got a print from 1739 with legends in both German (blackletter) and French (Roman italics), comprising both types of ligatures in one single document. Best wishes, Otto From richard.wordingham at ntlworld.com Thu Mar 23 06:21:28 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 23 Mar 2017 11:21:28 +0000 Subject: Standaridized variation sequences for the Deseret alphabet?
In-Reply-To: <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> <7906711a-abc2-8a28-963a-4c6c7f192bd4@uni-konstanz.de> Message-ID: <20170323112128.49075cb0@JRWUBU2> On Thu, 23 Mar 2017 11:23:27 +0100 Otto Stolz wrote: > Same issue as with German sharp S: The blackletter ??? derives from an > ?-z ligature (thence its German name ?Eszet?), whilst the Roman type > ??? derives from an ?-s ligature. Still, we encode both variants as > identical letters. I?ve got a print from 1739 with legends in both > German (blackletter) and French (Roman italics), comprising both types > of ligatures in one single document. There's another, lesser German analogy. If I understand correctly, in some styles the diaeresis and umlaut marks may be distinguished visually. While it is permissible to use CGJ to mark the difference, the TUS claims (TUS 9.0 p833, in Section 23.2) that CGJ does not affect rendering, except for the direct effect of blocking canonical reordering. (This does appear to be in contrast to its seemingly archaic effect in inhibiting line-breaking.) However, combining marks are, by policy, unified more readily than letters. Richard. From verdy_p at wanadoo.fr Thu Mar 23 07:26:43 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 23 Mar 2017 13:26:43 +0100 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: 2017-03-23 6:54 GMT+01:00 Martin J. D?rst : > Hello Michael, others, > > On 2017/03/23 09:03, Michael Everson wrote: > >> On 22 Mar 2017, at 21:39, David Starner wrote: >> > > There's the same characters here, written in different ways. >>> >> >> No, it?s not. 
It's the same diphthong (a sound) written with different >> letters. >> > > The closest to the current case that I was able to find was the German ß. > It has roots in both an ss and an sz (to be precise, an ſs and an ſz) > ligature (see https://en.wikipedia.org/wiki/ß). And indeed in some fonts, > its right part looks more like an s, and in other fonts more like a z (and > in lower case, more often like an s, but in upper case, much more like a > (cursive) Z). Nevertheless, there is only one character (or two if you > count upper case) encoded, because anything else would be highly confusing > to virtually all users. > This is a good case for encoding explicit variants, including for the two German ß forms, to distinguish letter forms in historic (medieval?) texts where ſs and ſz were more distinguished. This does not require disunification, and fonts that have both forms can choose the correct glyph to use for each variant, and take a default form for the unified character depending on the contextual language (if it is detected) or based on the font style itself (if it was initially designed for a specific language, notably in medieval styles). > What is right for Deseret has to be decided by and for Deseret users, > rather than by script historians. > In historic texts it is not clear which letter form is better than the other, and historic Deseret was basically for a single language (but there may have been regional variants preferring one form over the other). I think that the distinction is in fact more recent, where some people will want to distinguish them for new uses with distinctions. Here also a variant encoding would solve these special cases, but we should not disunify the character (and in fact there are not a lot of fonts except for fancy usages, such as trying to mimic handwritten styles of specific authors and how they draw these shapes; however, I've not seen any conclusive case of distinction in typeset texts).
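[Editorial note: explicit variants of the kind described above already exist for some digit shapes; for instance, Unicode 9.0 added a standardized variation sequence U+0030 U+FE00 for a zero with a short diagonal stroke. A sketch of what such a sequence looks like in data, assuming Python 3:]

```python
import unicodedata

base = "\u0030"   # DIGIT ZERO
vs1 = "\uFE00"    # VARIATION SELECTOR-1
seq = base + vs1  # standardized sequence: zero with short diagonal stroke

# The sequence is two code points in the data; a renderer that does not
# support it simply falls back to the plain glyph of the base character.
print(len(seq), unicodedata.name(vs1))
```

The same mechanism is what standardized variation sequences for letters would use: the base character keeps its identity, and the selector only requests a specific appearance.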
In fact we are in a situation similar to the case of shapes for decimal digits like 4 (open or closed), 7 (with an overstruck bar or none), or 0 (with an overstruck slash or dot, or none), 3 (with an angular or circular top part), or letters like g (with a curled leg drawn counterclockwise, or just a bottom foot from right to left: here a distinctive shape was encoded for the IPA symbol). > > Regards, Martin. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Thu Mar 23 08:32:46 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 13:32:46 +0000 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> Message-ID: <540B0347-175E-4B73-9420-2E6410202995@evertype.com> > On 23 Mar 2017, at 05:54, Martin J. Dürst wrote: > > Hello Michael, others, > > [Fixed script name in subject.] > > On 2017/03/23 09:03, Michael Everson wrote: >> On 22 Mar 2017, at 21:39, David Starner wrote: > >>> There's the same characters here, written in different ways. >> >> No, it's not. It's the same diphthong (a sound) written with different letters. > > I think this may well be the *historically* correct analysis. And that may have some influence on how to encode this, but it shouldn't be dominant. Well, Martin, maybe you're comfortable with shifting goalposts, but we have used historically correct analysis to identify characters in the past, and to continue with this precedent is consistent with good practice. > What's most important is (past and) *current use*. If the distinction is an orthographic one (e.g. different words being written with different shapes), then that's definitely a good indication for splitting. It *is* an orthographic one.
For one thing, the 1859 glyphs look NOTHING LIKE the 1855 glyphs. > On the other hand, if fonts (before/outside Unicode) only include one variant at a time, if people read over the variant without much ado, if people would be surprised to find both corresponding variants in one and the same text (absent font variations), if there are examples where e.g. the variant is adjusted in quotes from texts that used the 'old' variant inside a text with the 'new' variants, and so on, then all these would be good indications that this is, for actual usage purposes, just a font difference, and should therefore best be handled as such. Um, yeah. Why have Unicode at all? I mean people in Georgia were happy with ASCII-based font hacks. Lots of people are still using them. Sure, people put up with the unification of Coptic and Greek. Just font differences. Yeah. > The closest to the current case that I was able to find was the German ß. It has roots in both an ss and an sz (to be precise, an ſs and an ſz) ligature (see https://en.wikipedia.org/wiki/ß). And indeed in some fonts, its right part looks more like an s, and in other fonts more like a z (and in lower case, more often like an s, but in upper case, much more like a (cursive) Z). Nevertheless, there is only one character (or two if you count upper case) encoded, because anything else would be highly confusing to virtually all users. The situation of the Deseret diphthong letters isn't anything like German ß. Yes, you can analyse it as something like ſs and ſz, but THOSE LOOK VERY NEARLY ALIKE. Ignoring the stroke of SHORT I which is the same for all the Deseret letters being discussed, we have EW represented by ?? and ?? (which look nothing alike) and OI represented by ?? and ?? (which look nothing alike). A unification of these as "glyph variants" is perverse and not consistent with the way we have encoded things in the past.
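[Editorial note: whichever side of the analogy one takes, the unified ß does show how a single encoded character can carry a double ligature history in the standard's data; a small illustration, assuming Python 3 and its unicodedata module. Note the separate capital, U+1E9E, was only added later, in Unicode 5.1:]

```python
import unicodedata

eszett = "\u00DF"
print(unicodedata.name(eszett))    # LATIN SMALL LETTER SHARP S
print(eszett.upper())              # default uppercasing still yields "SS"
print(unicodedata.name("\u1E9E"))  # LATIN CAPITAL LETTER SHARP S
```

The character properties record none of the ſs-versus-ſz history; that distinction lives entirely in fonts.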
> What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. Odd. That view doesn't seem to be applicable to CJK unification. Michael From everson at evertype.com Thu Mar 23 08:48:37 2017 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Mar 2017 13:48:37 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> Message-ID: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> On 23 Mar 2017, at 06:28, David Starner wrote: > > Does "Яussia" require a new Latin letter because the way R was written has a different origin than the normal R? > > But it doesn't. It's the Latin letter R turned backwards by a designer for a logo. We wouldn't encode that, because it's a logo. > > What logo? Oh, sorry. "Toys Я Us", which is what I saw when I saw your "Яussia". > I honestly don't know what logo you're talking about, but a quick Google search confirms it's used outside of a logo. I was thinking of http://www.sjgames.com/gurps/books/Russia/img/cover_lg.jpg which actually doesn't use the reversed R, but uses other Cyrillic characters. Decorative display type and font play on book covers is a very different thing from the development of the Deseret alphabet we are discussing here. >> We don't encode diphthongs. We encode the elements of writing systems. The "idea" here is represented by one ligature of ?? + ?? (1855 EW), one ligature of ?? + ?? (1859 EW), one ligature of ?? + ?? (1855 OI), and one ligature of ?? + ?? (1859 OI). > If they're ligatures, they should be encoded as ligatures; if they're indivisible characters, then their glyph forms are of less interest. We don't encode ligatures. We encode letters which are historically derived from ligation. That's what the existing EW and OI are, and that's what the 1859 revised letters were. >> Those ligatures are not glyph variants of one another.
You might as well say that ? and ? are glyph variants of one another. > > ? and ? have contrasting use; they're used in the same text in distinct ways. That happens to be the case, but the analogy has to do with the origin of the ligatures. > Note that n and v? are considered glyph variants of each other, because v? is used in Sutterlin in exactly the places that n is used in typewritten versions of the text. It's n and ? in Sütterlin, not n and v?. > ? is not ? even when they are printed in fonts that make it nearly impossible to tell them apart. It has nothing to do with the glyphs or how those glyphs were created, it's because they're used in different ways. It was an analogy about the structural development of the ligated letters. > The example of Sutterlin strikes me as quite relevant here; characters get all sorts of weird shapes in handwriting. Sometimes they end up immortalized in printing, and then they usually get encoded. Usually not. Again: The source of 1855 EW and OI uses *different* letters than the 1859 EW and OI do. This wasn't accidental. It's not hard to puzzle out or to see. This isn't random or even systematic natural development of handwriting styles. It was a principled revision done on the basis of phonetic analysis. English diphthongs EW and OI were first represented by ligatures representing [?u?] and [??], and then later by ligatures representing [??] and [???]. Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. Michael Everson From prosfilaes at gmail.com Thu Mar 23 17:03:02 2017 From: prosfilaes at gmail.com (David Starner) Date: Thu, 23 Mar 2017 22:03:02 +0000 Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Thu, Mar 23, 2017 at 6:54 AM Michael Everson wrote: > Again: The source of 1855 EW and OI uses *different* letters than the 1859 > EW and OI do. This wasn't accidental. It's not hard to puzzle out or to > see. This isn't random or even systematic natural development of > handwriting styles. It was a principled revision done on the basis of > phonetic analysis. English diphthongs EW and OI were first represented by > ligatures representing [?u?] and [??], and then later by ligatures > representing [??] and [???]. > Sütterlin was created by Ludwig Sütterlin in 1915. There's lots of principled revision going on all the time in the world's scripts that doesn't get recorded by Unicode, and this goes double for young constructed scripts, where people are playing around with them. > Indeed I would say to John Jenkins and Ken Beesley that the richness of > the history of the Deseret alphabet would be impoverished by treating the > 1859 letters as identical to the 1855 letters. > And yet the richness of the history of the Latin alphabet is not impoverished by treating https://commons.wikimedia.org/wiki/File:I_littera_in_manuscripto.jpg (a monocase Latin cursive) as identical to part of the modern Latin-script alphabet, which, besides casing, has split the i/j and u/v on the basis of phonetic analysis? -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Fri Mar 24 06:34:41 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 24 Mar 2017 20:34:41 +0900 Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 2017/03/23 22:48, Michael Everson wrote: > Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). It may be that actual users of Deseret read these character variants the same way most of us would read serif vs. sans-serif variants: I.e. unless we are designers or typographers, we don't actually consciously notice the difference. If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none. The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. Regards, Martin. From duerst at it.aoyama.ac.jp Fri Mar 24 06:41:14 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Fri, 24 Mar 2017 20:41:14 +0900 Subject: Standaridized variation sequences for the Deseret alphabet? In-Reply-To: <540B0347-175E-4B73-9420-2E6410202995@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <3c6c18f5-44f2-4ef9-9be9-9426d02fd3c1@it.aoyama.ac.jp> <540B0347-175E-4B73-9420-2E6410202995@evertype.com> Message-ID: On 2017/03/23 22:32, Michael Everson wrote: >> What is right for Deseret has to be decided by and for Deseret users, rather than by script historians. > > Odd. That view doesn?t seem to be applicable to CJK unification. 
Well, it may not seem to you, but actually it is. I have had a lot of discussions with Japanese and others about Han unification (mostly in the '90s), and have studied the history and principles of Han unification in quite some detail. To summarize it, Han unification unifies very much exactly those cases where an average user, in average texts, would consider two forms "the same" (i.e. exchangeable). Exceptions are due to the round trip rule. It also separates very much exactly those cases where an average user, for average texts, may not consider two forms equivalent. If necessary, I can go into further details, but I would have to dig quite deeply for some of the sources. Regards, Martin. From everson at evertype.com Fri Mar 24 09:37:51 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 24 Mar 2017 14:37:51 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 24 Mar 2017, at 11:34, Martin J. Dürst wrote: > > On 2017/03/23 22:48, Michael Everson wrote: > >> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. > > Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). He is. He transcribes texts into Deseret. I've published three of them (Alice, Looking-Glass, and Snark).
I am a designer and typographer, and I've worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. > If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none. Actually neither of the ligature-letters is used in our Carrollian Deseret volumes. > The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. You know, Martin, I *have* been doing this for the last two decades. I'm well aware of what a font is and can do. I'm also aware of what principles we have used for determining character identity. I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. Michael Everson From everson at evertype.com Fri Mar 24 11:11:53 2017 From: everson at evertype.com (Michael Everson) Date: Fri, 24 Mar 2017 16:11:53 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 23 Mar 2017, at 22:03, David Starner wrote: > On Thu, Mar 23, 2017 at 6:54 AM Michael Everson wrote: >> Again: The source of 1855 EW and OI uses *different* letters than the 1859 EW and OI do. This wasn't accidental. It's not hard to puzzle out or to see. This isn't random or even systematic natural development of handwriting styles. It was a principled revision done on the basis of phonetic analysis. English diphthongs EW and OI were first represented by ligatures representing [?u?] and [??], and then later by ligatures representing [??] and [???].
> > Sütterlin was created by Ludwig Sütterlin in 1915. There's lots of principled revision going on all the time in the world's scripts that doesn't get recorded by Unicode, and this goes double for young constructed scripts, where people are playing around with them. What's your point? Sütterlin didn't invent new letters. Both n and u look a lot alike, and so the latter was marked with a breve, but in the 15th-century Cornish manuscript I was working with at the British Library last week both n and u look a lot alike. This has nothing to do with the origin or identity of two sets of letters used for diphthongs in Deseret. >> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. > And yet the richness of the history of the Latin alphabet is not impoverished by treating https://commons.wikimedia.org/wiki/File:I_littera_in_manuscripto.jpg (a monocase Latin cursive) as identical to part of the modern Latin-script alphabet, which besides casing, has split the i/j and u/v on the basis of phonetic analysis? Your question has, again, nothing to do with the matter in hand. While it is true that the shapes of the Latin letters in that manuscript differ from the shapes which we use today, their identity as letters (and their Old Italic and Phoenician forerunners) is not in question. Inscriptional Latin from that same period is still quite familiar to us. That i and j are distinguished in that handwritten text isn't surprising. Centuries later in Europe the j graph was extremely common in numbers (as in xiij '13'). It's true that it wasn't until 1524 that i and j were specifically distinguished *as* separate letters in Italy; this distinction was formally made in English in 1633. But this isn't analogous to the ligature-based letters used for diphthongs in Deseret.
And we *can* distinguish i and j in that Latin text, because we have separate characters encoded for it. And we *have* encoded many other Latin ligature-based letters and sigla of various kinds for the representation of medieval European texts. Indeed, that's just a stronger argument for distinguishing the ligature-based letters for Deseret, I think. Michael Everson From verdy_p at wanadoo.fr Fri Mar 24 12:31:04 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Mar 2017 18:31:04 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: 2017-03-24 17:11 GMT+01:00 Michael Everson : > On 23 Mar 2017, at 22:03, David Starner wrote: > > On Thu, Mar 23, 2017 at 6:54 AM Michael Everson > wrote: > >> Again: The source of 1855 EW and OI uses *different* letters than the > 1859 EW and OI do. This wasn't accidental. It's not hard to puzzle out or > to see. This isn't random or even systematic natural development of > handwriting styles. It was a principled revision done on the basis of > phonetic analysis. English diphthongs EW and OI were first represented by > ligatures representing [?u?] and [??], and then later by ligatures > representing [??] and [???]. > > > > Sütterlin was created by Ludwig Sütterlin in 1915. There's lots of > principled revision going on all the time in the world's scripts that > doesn't get recorded by Unicode, and this goes double for young constructed > scripts, where people are playing around with them. > > What's your point? Sütterlin didn't invent new letters. Both n and u look > a lot alike, and so the latter was marked with a breve, but in the > 15th-century Cornish manuscript I was working with at the British Library > last week both n and u look a lot alike.
This has nothing to do with the > origin or identity of two sets of letters used for diphthongs in Deseret. > There's a counter-example of precedent for the German umlaut, which was unfortunately unified with the diaeresis, even if its origin (and still its current semantic) is that of a combining letter e, and where it does not play the phonetic role of a diaeresis (i.e. the separation of two vowels to avoid creating digrams for a single phoneme represented by pairs of letters). So "ä" in German is cognate to the "ae" digram, similar to the "ai" digram used in French (or to the "æ" ligature used in other languages, sometimes as a distinct letter of their basic alphabet); it contains no phonetic diaeresis as there's a single phoneme, and no diphthong (unlike "aï" in French, where this is a true diaeresis to break the interpretation as the digram "ai"). Same remark for "ö" in German, cognate to the digram "oe" (or the ligatured letter "œ" in other languages, or the variant "ø" in Nordic languages), and "ü", cognate to "ue". But Unicode just preferred to keep the roundtrip compatibility with earlier 8-bit encodings (including existing ISO 8859 and DIN standards) so that "ä" in German and French also have the same canonical decomposition even if the diacritic is a diaeresis in French and an umlaut in German, with different semantics and origins. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Mar 24 13:33:44 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 24 Mar 2017 11:33:44 -0700 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert =?UTF-8?Q?alphabet=3F=29?= Message-ID: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Philippe Verdy wrote: > But Unicode just preferred to keep the roundtrip compatibility with > earlier 8-bit encodings (including existing ISO 8859 and DIN > standards) so that "ä"
in German and French also have the same > canonical decomposition even if the diacritic is a diaeresis in French > and an umlaut in German, with different semantics and origins. Was this only about compatibility, or perhaps also that the two signs look identical and that disunifying them would have caused endless confusion and misuse among users? -- Doug Ewell | Thornton, CO, US | ewellic.org From haberg-1 at telia.com Fri Mar 24 14:23:52 2017 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Fri, 24 Mar 2017 20:23:52 +0100 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: <9C0619FD-9DB9-43C0-AFC7-74564446AC03@telia.com> > On 24 Mar 2017, at 19:33, Doug Ewell wrote: > > Philippe Verdy wrote: > >> But Unicode just preferred to keep the roundtrip compatibility with >> earlier 8-bit encodings (including existing ISO 8859 and DIN >> standards) so that "ä" in German and French also have the same >> canonical decomposition even if the diacritic is a diaeresis in French >> and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? The Swedish letters åäö are simplified ligatures, and not diacritic marks. For äö, in handwritten script style, a tilde is used, the same as for Spanish ñ, which is also a simplified ligature. From verdy_p at wanadoo.fr Fri Mar 24 14:34:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Mar 2017 20:34:53 +0100 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?)
In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: Given the history of characters and the initial desire to be forward compatible with previous ISO standards, I am convinced that there was no other choice than preserving the unification; otherwise it would have been impossible to reliably remap the zillions of documents, databases and applications that were using ISO 8859 and other related Windows, MacOS and IBM codepages for OEMs or for EBCDIC. Add to that the development of the Internet, and the desire in both Unicode and ISO 10646 to leave the first page of code points in the UCS compatible with ISO 8859-1 code for code (and the fact that there was no variant of ISO 8859-1 standardized for Germany, Switzerland, Austria, Belgium and Luxembourg, which did not request it, causing nightmares notably in the last three countries, and a lot of legacy software on Windows and MacOS needing such a bijective mapping; finally, the Unicode Consortium initially was developed separately from the ISO standard and merged later, and at that time Microsoft and IBM were the most active members and did not want to introduce incompatibilities and cause trouble for other vendors). Later there was a clear statement to keep the basic character properties stable, and it became impossible to change the canonical equivalences (after the bad experience found when merging efforts between Unicode and ISO, notably for encoding Hangul, and a strong initial resistance by China, which wanted to develop its own GB standard). Encoding stability is now a rule that will be extremely hard to break. Note: umlauts and diaeresis have not always looked the same; the confusion between the two only started around the middle of the 20th century, with the early development of computing.
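The compatibility point above is directly observable in code: there is only one encoded "ä" (U+00E4), it round-trips byte-for-byte with ISO 8859-1, and its canonical decomposition is the same whether the text is French (diaeresis) or German (umlaut). A minimal Python sketch of these three facts:

```python
import unicodedata

a_umlaut = "\u00e4"  # ä: one code point serves both German umlaut and French diaeresis

# The first Unicode page is code-for-code identical to ISO 8859-1 (Latin-1),
# so the character round-trips through the legacy encoding unchanged.
assert a_umlaut.encode("latin-1") == b"\xe4"
assert b"\xe4".decode("latin-1") == a_umlaut

# The canonical decomposition is language-independent:
# U+0061 LATIN SMALL LETTER A + U+0308 COMBINING DIAERESIS.
assert unicodedata.normalize("NFD", a_umlaut) == "a\u0308"
```

Nothing in the character properties records whether the mark is semantically an umlaut or a diaeresis; that distinction lives only at the orthographic level.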
It would have been impossible to reach a large adoption of the UCS without such compromises (and it took additional years after both projects joined their efforts before ISO finally closed its working group on legacy 8-bit character sets and stopped accepting any new variants; ISO 8859-15 was one of the last failed attempts to standardize a new 8-bit encoding, which finally almost nobody really used as they no longer needed it; China relented as well and finalized the roundtrip mapping of its competing GB 18030 encoding with the UCS, so mappings for GB 18030 no longer need new updates: any new encoding in the UCS is immediately encoded as well in GB without modifying any line of code or data, and any software or document compatible with the UCS should be immediately compatible with the GB 18030 standard required in PR China; I don't know if the Hong Kong authorities made the same statement for their HKSCS standard before Hong Kong reunified with China, or if Taiwan made a similar decision; however, Japan is adding new encodings in its JIS standard, pushed by national vendors, and the UCS still has delays for accepting these additions and not all is accepted, but in this area there's a local subcommittee constantly negotiating with Asian vendors and reporting its efforts to Unicode and ISO). About umlauts and diaeresis, I'm not sure they always looked the same. If we try to encode old German, Hungarian or Czech texts, we may find some discrepancies or ambiguities (but there's still no mechanism to distinguish when an umlaut is really desired and a diaeresis is desired instead, if they don't look the same in historic script variants). We cannot encode these using "variants", but possibly we may be using some combining controls such as CGJ (encoded after the precombined letter or after the base letter+diaeresis; because of canonical equivalences it cannot be in the middle).
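The CGJ suggestion above can be checked with a normalizer. U+034F COMBINING GRAPHEME JOINER has combining class 0, so a sequence containing it is not canonically equivalent to the precomposed letter and is stable under normalization; note that NFC composes a trailing base+diaeresis pair first, which is why a CGJ written after the pair ends up after the precomposed "ä". A small Python sketch, illustrating the mechanics only (no particular umlaut-vs-diaeresis convention is implied):

```python
import unicodedata

precomposed = "\u00e4"         # ä as a single code point
cgj_after   = "a\u0308\u034f"  # a + combining diaeresis + CGJ
cgj_middle  = "a\u034f\u0308"  # a + CGJ + combining diaeresis

# With the CGJ after base+diaeresis, NFC still composes the letter,
# leaving the CGJ after the precomposed "ä".
assert unicodedata.normalize("NFC", cgj_after) == "\u00e4\u034f"

# With the CGJ between base and mark, composition is blocked entirely,
# so the sequence survives NFC unchanged.
assert unicodedata.normalize("NFC", cgj_middle) == cgj_middle

# Either way, the result stays distinct from the plain precomposed letter,
# which is what makes the CGJ usable to carry such a distinction.
assert unicodedata.normalize("NFC", cgj_after) != precomposed
assert unicodedata.normalize("NFC", cgj_middle) != precomposed
```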
Or maybe, only for historic texts, we could add a combining lowercase e as an alternative to the existing diaeresis. 2017-03-24 19:33 GMT+01:00 Doug Ewell : > Philippe Verdy wrote: > > > But Unicode just preferred to keep the roundtrip compatibility with > > earlier 8-bit encodings (including existing ISO 8859 and DIN > > standards) so that "ä" in German and French also have the same > > canonical decomposition even if the diacritic is a diaeresis in French > > and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Sat Mar 25 09:09:10 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 25 Mar 2017 14:09:10 +0000 Subject: Status of Thai Angkhandiao Message-ID: <20170325140910.35f01687@JRWUBU2> Thai has two identical or very similar punctuation-like characters, 'paiyan noi' (ไปยาลน้อย), definitely encoded as ฯ U+0E2F THAI CHARACTER PAIYANNOI, and 'angkhan diao' (often transliterated 'angkhandeaw') (อังคั่นเดี่ยว). Paiyan noi is an abbreviation mark, historically the same in name as ៘ U+17D8 KHMER SIGN BEYYAL, which however corresponds in form and meaning to the Thai sequence 'paiyan yai' - ฯลฯ. Angkhandiao is historically a single danda, contrasting with the double danda U+0E5A THAI CHARACTER ANGKHANKHU. (They are both very little used in modern Thai.) One piece of evidence that paiyannoi and angkhandiao are two separate characters is that ISO 11940 uses different glyphs for them and prescribes different transliterations for them: ǀ U+01C0 LATIN LETTER DENTAL CLICK for angkhandiao ǁ U+01C1 LATIN LETTER LATERAL CLICK for U+0E5A THAI CHARACTER ANGKHANKHU ǂ
U+01C2 LATIN LETTER ALVEOLAR CLICK for U+0E2F THAI CHARACTER PAIYANNOI (I would have said that U+0964 DEVANAGARI DANDA and U+0965 DEVANAGARI DOUBLE DANDA would have been better for the first two, but these are declared (Script_Extensions property) not to be used as part of the Latin script, though I thought they were used for Sanskrit.) Has Unicode ever ruled on whether U+0E2F includes angkhandiao? Richard. From prosfilaes at gmail.com Sat Mar 25 17:15:28 2017 From: prosfilaes at gmail.com (David Starner) Date: Sat, 25 Mar 2017 22:15:28 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Fri, Mar 24, 2017 at 9:17 AM Michael Everson wrote: > And we *can* distinguish i and j in that Latin text, because we have > separate characters encoded for it. And we *have* encoded many other Latin > ligature-based letters and sigla of various kinds for the representation of > medieval European texts. Indeed, that's just a stronger argument for > distinguishing the ligature-based letters for Deseret, I think. > And I'd argue that a good theoretical model of the Latin script makes ä, æ and aͤ the same character, distinguished only by the font. This is complicated by combining characters mostly identified by glyph, and the fact that while ä and aͤ may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Sat Mar 25 21:24:18 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 26 Mar 2017 04:24:18 +0200 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: 2017-03-25 23:15 GMT+01:00 David Starner : > On Fri, Mar 24, 2017 at 9:17 AM Michael Everson > wrote: > >> And we *can* distinguish i and j in that Latin text, because we have >> separate characters encoded for it. And we *have* encoded many other Latin >> ligature-based letters and sigla of various kinds for the representation of >> medieval European texts. Indeed, that's just a stronger argument for >> distinguishing the ligature-based letters for Deseret, I think. >> > > And I'd argue that a good theoretical model of the Latin script makes ä, æ > and aͤ the same character, distinguished only by the font. This is > complicated by combining characters mostly identified by glyph, and the > fact that while ä and aͤ may be the same character across time, there are > people wanting to distinguish them in the same text today, and in both > cases the theoretical falls to the practical. In this case, there are no > combining character issues and there's nobody needing to use the two forms > in the same text. > That's a good point: any disunification requires showing examples of contrasting uses. Now depending on individual publications, authors would use one character or the other according to their choice, and the encoding will respect it. If we need further unification for matching texts in the same language across periods of time or authors, collation (UCA) can provide help: this is already what it does in modern German with the digram "ae" and the letter "ä", which are orthographic variants not distinguished by the language but by authors' preference.
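The collation point above can be sketched without a full UCA implementation. In the German phonebook tailoring, "ä" sorts as if written "ae", so the two spellings match even though they are encoded differently. The helper below is a hypothetical, much-simplified stand-in for that tailoring; a real application would use an ICU collator with the German phonebook locale rather than this hand-rolled fold:

```python
# Hypothetical fold: map umlauts (and ß) to their digraph spellings before
# comparing, roughly imitating the German phonebook collation tailoring.
UMLAUT_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})

def phonebook_key(word: str) -> str:
    """Return a comparison key in which 'ä' and 'ae' are indistinguishable."""
    return word.translate(UMLAUT_FOLD).lower()

# The two orthographic variants now compare equal for matching purposes:
assert phonebook_key("Göthe") == phonebook_key("Goethe")

# And sorting interleaves umlauted and digraph spellings:
names = ["Gustav", "Göthe", "Goos"]
assert sorted(names, key=phonebook_key) == ["Göthe", "Goos", "Gustav"]
```

This is exactly the kind of equivalence that lives in the collation layer, not in the encoding: the code points stay distinct, only the comparison keys coincide.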
-------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Sun Mar 26 03:12:27 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Sun, 26 Mar 2017 17:12:27 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> On 2017/03/26 11:24, Philippe Verdy wrote: > That's a good point: any disunification requires showing examples of > contrasting uses. Fully agreed. We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > Now depending on individual publications, authors would > use one character or the other according to their choice, and the encoding > will respect it. If we need further unification for matching texts in the > same language across periods of time or authors, collation (UCA) can > provide help: this is already what it does in modern German with the digram > "ae" and the letter "ä" which are orthographic variants not distinguished > by the language but by authors' preference. Well, in most cases, but not e.g. for names. Goethe is not spelled Göthe. Regards, Martin. From wl at gnu.org Sun Mar 26 03:17:48 2017 From: wl at gnu.org (Werner LEMBERG) Date: Sun, 26 Mar 2017 10:17:48 +0200 (CEST) Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <20170326.101748.844147132286739377.wl@gnu.org> > Well, in most cases, but not e.g. for names. Goethe is not spelled > Göthe.
Have a look into `Grimmsches Wörterbuch' to see the opposite :-) Werner From eik at iki.fi Sun Mar 26 04:07:15 2017 From: eik at iki.fi (Erkki I Kolehmainen) Date: Sun, 26 Mar 2017 12:07:15 +0300 Subject: VS: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <000301d2a610$5cf1e550$16d5aff0$@fi> I tend to agree with Martin, Philippe and others in questioning the disunification. Sincerely, Erkki I. Kolehmainen -----Original message----- From: Unicode [mailto:unicode-bounces at unicode.org] On behalf of Martin J. Dürst Sent: 26 March 2017 11:12 To: verdy_p at wanadoo.fr; David Starner Cc: Michael Everson; unicode Unicode Discussion Subject: Re: Standaridized variation sequences for the Desert alphabet? On 2017/03/26 11:24, Philippe Verdy wrote: > That's a good point: any disunification requires showing examples of > contrasting uses. Fully agreed. We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > Now depending on individual publications, authors would use one > character or the other according to their choice, and the encoding > will respect it. If we need further unification for matching texts in > the same language across periods of time or authors, collation (UCA) > can provide help: this is already what it does in modern German with > the digram "ae" and the letter "ä" which are orthographic variants not > distinguished by the language but by authors' preference. Well, in most cases, but not e.g. for names. Goethe is not spelled Göthe. Regards, Martin.
From duerst at it.aoyama.ac.jp Sun Mar 26 04:37:49 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Sun, 26 Mar 2017 18:37:49 +0900 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?) In-Reply-To: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: On 2017/03/25 03:33, Doug Ewell wrote: > Philippe Verdy wrote: > >> But Unicode just preferred to keep the roundtrip compatibility with >> earlier 8-bit encodings (including existing ISO 8859 and DIN >> standards) so that "ä" in German and French also have the same >> canonical decomposition even if the diacritic is a diaeresis in French >> and an umlaut in German, with different semantics and origins. > > Was this only about compatibility, or perhaps also that the two signs > look identical and that disunifying them would have caused endless > confusion and misuse among users? I'm not sure to what extent this was explicitly discussed when Unicode was created. The fact that the first 256 code points are identical to those in ISO-8859-1 was used as a big selling point when Unicode was first introduced. It may well have been that for Unicode, there was no discussion at all in this area, because ISO-8859-1 was already so well established. And for ISO-8859-1, space was an important concern. Ideally, both Icelandic and Turkish (and the letters missing for French) would have been covered, but that wasn't possible. Disunifying diaeresis and umlaut would have been an unaffordable luxury. The above reasons mask any inherent reasons for why diaeresis and umlaut would have been unified or not if the decision had been argued purely "on the merit". But having used both German and French, and e.g.
looking at the situation in Switzerland, where it was important to be able to write both French and German on the same typewriter, I would definitely argue that disunifying them would have caused endless confusion and errors among users. Also, it was argued a few mails ago that diaeresis and umlaut don't look exactly the same. I remember well that when Apple introduced its first laser printers, there were widespread complaints that the fonts (was it Helvetica, Times Roman, and Palatino?) unified away the traditional differences in the cuts of these typefaces for different languages. So to quite some extent, in the relevant period (i.e. the 1970s/80s), the differences between diaeresis and umlaut may be due to design differences in the cuts for different languages (e.g. French and German). Nobody would have disunified some basic letters because they may have looked slightly different in cuts for different languages, and so people may also have been just fine with unifying diaeresis and umlaut. (German fonts e.g. may have contained a 'ë' for use e.g. with "Citroën", but the dots on that 'ë' will have been the same shape as the 'ä', 'ö', and 'ü' umlauts for design consistency, and the other way round for French.) Regards, Martin. From everson at evertype.com Sun Mar 26 08:06:41 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:06:41 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On 25 Mar 2017, at 22:15, David Starner wrote: > > And I'd argue that a good theoretical model of the Latin script makes ä, æ and aͤ the same character, distinguished only by the font. Fortunately for the users of our standard, we don't do this. > This is complicated by combining characters mostly identified by glyph, and the fact that while ä and aͤ
may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. I'm fairly sure that a person citing a medieval document using aͤ may very well also need to write this alongside Swedish or German using ä. Michael Everson From everson at evertype.com Sun Mar 26 08:15:03 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:15:03 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> Message-ID: <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> > On 26 Mar 2017, at 09:12, Martin J. Dürst wrote: > >> That's a good point: any disunification requires showing examples of >> contrasting uses. > > Fully agreed. The default position is NOT 'everything is encoded unified until disunified'. The characters in question have different and undisputed origins. We've encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) > We haven't yet heard of any contrasting uses for the letter shapes we are discussing. Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. Please try to remember that. (It's a bit shocking to have to remind people of this.)
Michael Everson From everson at evertype.com Sun Mar 26 08:18:37 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:18:37 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <000301d2a610$5cf1e550$16d5aff0$@fi> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: On 26 Mar 2017, at 10:07, Erkki I Kolehmainen wrote: > > I tend to agree with Martin, Philippe and others in questioning the disunification. You may, but you give no evidence or discussion about it, so... In any case it's not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. The origin of all of the characters as ligatures of other characters isn't questioned. The right thing to do is to add the missing characters, not to invalidate any font that uses the 1855 characters by claiming that the 1855 and 1859 characters are 'the same'. Michael Everson From prosfilaes at gmail.com Sun Mar 26 08:32:07 2017 From: prosfilaes at gmail.com (David Starner) Date: Sun, 26 Mar 2017 13:32:07 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: On Sun, Mar 26, 2017 at 6:12 AM Michael Everson wrote: > On 25 Mar 2017, at 22:15, David Starner wrote: > > > > And I'd argue that a good theoretical model of the Latin script makes ä, > æ and aͤ the same character, distinguished only by the font. > > Fortunately for the users of our standard, we don't do this. > You've yet to come up with users to whom these Deseret letters are relevant.
I'm fairly sure that a person citing a medieval document using aͤ may very > well also need to write this alongside Swedish or German using ä. > I'm fairly sure that a person citing an early 20th century German document may well feel the need to cite it in Fraktur. In both cases, I believe that's going above and beyond the identity of the characters involved, but in your case, people do contrast the aͤ with ä, and the user case has been made. Show me the users who want to use these Deseret letters contrastingly. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Mar 26 08:37:11 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 14:37:11 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <01091990-405C-46B6-B5F6-F89CD42BA820@evertype.com> On 26 Mar 2017, at 14:32, David Starner wrote: >>> And I'd argue that a good theoretical model of the Latin script makes ä, æ and aͤ the same character, distinguished only by the font. >> >> Fortunately for the users of our standard, we don't do this. > > You've yet to come up with users to whom these Deseret letters are relevant. You might imagine it takes time to identify problems and address them. >> I'm fairly sure that a person citing a medieval document using aͤ may very well also need to write this alongside Swedish or German using ä. > > I'm fairly sure that a person citing an early 20th century German document may well feel the need to cite it in Fraktur. Fraktur is a whole-font substitution (modulo the ligatures). This is not the same thing as an editor choosing w or ƿ. Imagine if we had unified those two. After all, they both represent the same sound, right? (Shudder.)
> In both cases, I believe that's going above and beyond the identity of the characters involved, but in your case, people do contrast the aͤ with ä, and the user case has been made. Show me the users who want to use these Deseret letters contrastingly. Do try to be less dismissive. Firstly, *I* have published entire books in Deseret and so I myself have a legitimate interest. In the second, I am in fact beginning discussions with relevant experts. Michael Everson From asmusf at ix.netcom.com Sun Mar 26 10:45:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 08:45:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Mar 26 10:47:51 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 16:47:51 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> Message-ID: <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > > The latter is patent nonsense, because ä and aͤ are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. He refers to the shape of the diacritical marks.
Michael Everson From asmusf at ix.netcom.com Sun Mar 26 10:59:42 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 08:59:42 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> Message-ID: <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> On 3/26/2017 8:47 AM, Michael Everson wrote: >> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >> >> The latter is patent nonsense, because ä and aͤ are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. > He refers to the shape of the diacritical marks. I see the issue: the font selected on my end made the "e" look like an "o", which completely changed my understanding of what he tried to communicate. A./ > > Michael Everson > From asmusf at ix.netcom.com Sun Mar 26 11:02:22 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 09:02:22 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: On 3/26/2017 6:18 AM, Michael Everson wrote: > On 26 Mar 2017, at 10:07, Erkki I Kolehmainen wrote: >> I tend to agree with Martin, Philippe and others in questioning the disunification. > You may, but you give no evidence or discussion about it, so... > > In any case it's not a disunification.
Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. Calling them "characters" is pre-judging the issue, don't you think? We know that these are different shapes, but that they stand for the same text elements. A./ > The origin of all of the characters as ligatures of other characters isn?t questioned. The right thing to do is to add the missing characters, not to invalidate any font that uses the 1855 characters by claiming that the 1855 and 1859 characters are ?the same?. > > Michael Everson > From everson at evertype.com Sun Mar 26 11:20:04 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:20:04 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> Message-ID: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > > The priority in encoding has to be with allowing distinctions in modern texts, or distinctions that matter to modern users of historic writing systems. Beyond that, theoretical analysis of typographical evolution can give some interesting insight, but I would be in the camp that does not accord them a status as primary rationale for encoding decisions. Our rationales are NOT ranked in the way you suggest. A variety of criteria are applied. > Thus, critical need for contrasting use of the glyph distinctions would have to be established before it makes sense to discuss this further. Precedent for such needs is well-established. Consider the Latin Extended-D block. Sometimes it is editorial preference, and that?s not even always universal. 
> I see no principled objection to having a font choice result in a noticeable or structural glyph variation for only a few elements of an alphabet. We have handle-a vs. bowl-a as well as hook-g vs. loop-g in Latin, and fonts routinely select one or the other. Well, Asmus, we encode a and ɑ as well as g and ɡ and ?. And we do not consider ? and ? and ? to be things that ought to be distinguished by variation selectors. (I am of course well aware of IPA usage.) Whole-font switching is well understood. But character origin has always been taken into account. Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) > (It is only for usage outside normal text that the distinction between these forms matters). What's "normal" text? "Normal" text in Latin probably doesn't use the characters from the Latin Extended-D block. > While the Deseret forms are motivated by their pronunciation, I'm not necessarily convinced that the distinction has any practical significance that is in any way different than similar differences in derivation (e.g. for long s-s or long-s-z for German esszett). One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts.
> This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /juː/ or /ɔɪ/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. >> This is complicated by combining characters mostly identified by glyph, and the fact that while ä and aͤ may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. > > huh? He's wrong there, as I pointed out. A text in German may write an older Clavieruͤbung in a citation alongside the normal spelling Klavierübung. The choice of spelling is key. Michael Everson From everson at evertype.com Sun Mar 26 11:20:42 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:20:42 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <7575DD18-B4DC-45F8-B2E2-33EF3515A39B@evertype.com> <6db8bbbb-0d1a-1f4c-24c7-a01409905f04@ix.netcom.com> Message-ID: <3D06831A-7C66-44F3-8113-C8E2612B775F@evertype.com> > On 26 Mar 2017, at 16:59, Asmus Freytag wrote: > > On 3/26/2017 8:47 AM, Michael Everson wrote: >>> On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >>> >>> The latter is patent nonsense, because ä and aͤ
are even less related to each other than "i" and "j"; never mind the fact that their forms are both based on the letter "a". Encoding and font choice should be seen as separate. >> He refers to the shape of the diacritical marks. > > I see the issue: the font selected on my end made the "e" look like an "o", which completely changed my understanding of what he tried to communicate. Ah, yes. M From everson at evertype.com Sun Mar 26 11:23:06 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 17:23:06 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> Message-ID: <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> On 26 Mar 2017, at 17:02, Asmus Freytag wrote: > > On 3/26/2017 6:18 AM, Michael Everson wrote: > >> In any case it's not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. > > Calling them "characters" is pre-judging the issue, don't you think? No, I don't think so. > >> We know that these are different shapes, but that they stand for the same text elements. > No, they don't. Those diphthongs can also be represented in other ways in Deseret. I've never accepted the view that "everything is already encoded and everything new is a disunification" which seems to be a pretty common view. Michael Everson From doug at ewellic.org Sun Mar 26 12:19:08 2017 From: doug at ewellic.org (Doug Ewell) Date: Sun, 26 Mar 2017 11:19:08 -0600 Subject: Diaeresis vs. umlaut (was: Re: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: References: <20170324113344.665a7a7059d7ee80bb4d670165c8327d.77d11fb2be.wbe@email03.godaddy.com> Message-ID: Philippe Verdy wrote: > Or may be, only for historic texts, we could add a combining lowercase > e as an alternative to the existing diaeresis. Something like U+0364 COMBINING LATIN SMALL LETTER E, maybe? -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Sun Mar 26 12:20:27 2017 From: doug at ewellic.org (Doug Ewell) Date: Sun, 26 Mar 2017 11:20:27 -0600 Subject: Standaridized variation sequences for the Desert alphabet? Message-ID: Michael Everson wrote: > One practical consequence of changing the chart glyphs now, for > instance, would be that it would invalidate every existing Deseret > font. Adding new characters would not. I thought the chart glyphs were not normative. -- Doug Ewell | Thornton, CO, US | ewellic.org From everson at evertype.com Sun Mar 26 12:33:00 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 18:33:00 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: Message-ID: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> On 26 Mar 2017, at 18:20, Doug Ewell wrote: > > Michael Everson wrote: > >> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > > I thought the chart glyphs were not normative. Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature instead? Michael. From asmusf at ix.netcom.com Sun Mar 26 15:39:38 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 13:39:38 -0700 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> Message-ID: <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> On 3/26/2017 10:33 AM, Michael Everson wrote: > On 26 Mar 2017, at 18:20, Doug Ewell wrote: >> Michael Everson wrote: >> >>> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. >> I thought the chart glyphs were not normative. > Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don?t we use an OO ligature instead? If there was a tradition of writing W like omega, then switching the chart glyphs to that alternative tradition would be something that is at least not inconceivable -- even if perhaps not advisable. For letters, their primary identity is not given by their shape, but their position / function in the alphabet. That's why making Gaelic style and Fraktur a font switch works at all, even if that is not perfect (viz, ligatures in Fraktur). In the Deseret case, making this alternation a font choice would tend to preserve the content of all documents. Making this an encoding difference would indeed invalidate some documents. Finally, if this was in major, modern use, adding these code points would have grave consequences for security. A./ From richard.wordingham at ntlworld.com Sun Mar 26 15:48:15 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sun, 26 Mar 2017 21:48:15 +0100 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> Message-ID: <20170326214815.5bd7eadb@JRWUBU2> On Sun, 26 Mar 2017 18:33:00 +0100 Michael Everson wrote: > On 26 Mar 2017, at 18:20, Doug Ewell wrote: > > Michael Everson wrote: > >> One practical consequence of changing the chart glyphs now, for > >> instance, would be that it would invalidate every existing Deseret > >> font. Adding new characters would not. > > I thought the chart glyphs were not normative. > Come on, Doug. The letter W is a ligature of V and V. But sure, the > glyphs are only informative, so why don't we use an OO ligature > instead? A script-style font might legitimately use a glyph that looks like a small omega for U+0077 LATIN SMALL LETTER W. Small omega, of course, is an οο ligature. More to the point, a font may legitimately use the same glyphs for U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G. A more serious issue is the multiple forms of U+014A LATIN CAPITAL LETTER ENG, for which the underlying unity comes from their being the capital form of U+014B LATIN SMALL LETTER ENG. Are there not serious divergences with the shapes of the Syriac letters? Richard. From everson at evertype.com Sun Mar 26 15:51:43 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 21:51:43 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> Message-ID: <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> On 26 Mar 2017, at 21:39, Asmus Freytag wrote: >> Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don't we use an OO ligature instead?
> 
> If there was a tradition of writing W like omega, then switching the chart glyphs to that alternative tradition would be something that is at least not inconceivable -- even if perhaps not advisable. You know, Asmus, no analogy is perfect. But mine was a discussion of letters derived from ligatures, and yours is just a random note about shape. > For letters, their primary identity is not given by their shape, but their position / function in the alphabet. This isn't really something you can turn into an axiom, much as you would like to. Position in the alphabet may vary WIDELY from language to language. As can function. The Latin letter c can mean /k s tʃ ts ? ? ?/? > That's why making Gaelic style and Fraktur a font switch works at all, even if that is not perfect (viz, ligatures in Fraktur). Font style isn't the same thing in this context. The historical letters used to make the 1855 ligatures are *different* letters than those used for the 1859 ligatures. > In the Deseret case, making this alternation a font choice would tend to preserve the content of all documents. No, since it's a question of *spelling*. Some documents use a ligature-letter for the diphthong /juː/. Some documents use two separate letters for the same diphthong. So there's no "standardized" spelling that works for all text that would be affected here. (Spelling for English wasn't standardized anyway in historical Deseret texts and there is much variety.) > Making this an encoding difference would indeed invalidate some documents. Right now the 1859 characters aren't representable. Deciding to change the chart glyphs to 1859 glyphs would just destabilize EVERY current Deseret font. That's not something we should do. > Finally, if this was in major, modern use, adding these code points would have grave consequences for security. Why? They're not visually similar to the existing characters. So spoofing wouldn't be an issue.
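The Clavieruͤbung/Klavierübung spelling contrast raised in this thread, and Doug Ewell's pointer to U+0364 COMBINING LATIN SMALL LETTER E, can be checked mechanically: unlike U+0308 COMBINING DIAERESIS, U+0364 has no canonical composition, so the historic u-with-small-e spelling and the modern u-umlaut remain distinct code point sequences under normalization. A minimal Python sketch (standard library only):

```python
import unicodedata

# U+0308 COMBINING DIAERESIS canonically composes with the base letter:
modern = unicodedata.normalize("NFC", "Klavieru\u0308bung")
assert modern == "Klavier\u00FCbung"   # precomposed u-umlaut

# U+0364 COMBINING LATIN SMALL LETTER E has no canonical composition,
# so the historic spelling survives NFC unchanged:
historic = unicodedata.normalize("NFC", "Clavieru\u0364bung")
assert historic == "Clavieru\u0364bung"

# The two spellings therefore stay distinct in plain text:
assert modern != historic
print(modern, historic)
```

In other words, the diaeresis/umlaut vs. superscript-e distinction already works as a plain-text spelling distinction, independent of font.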
Michael Everson From everson at evertype.com Sun Mar 26 15:56:14 2017 From: everson at evertype.com (Michael Everson) Date: Sun, 26 Mar 2017 21:56:14 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <20170326214815.5bd7eadb@JRWUBU2> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <20170326214815.5bd7eadb@JRWUBU2> Message-ID: On 26 Mar 2017, at 21:48, Richard Wordingham wrote: >> Come on, Doug. The letter W is a ligature of V and V. But sure, the glyphs are only informative, so why don't we use an OO ligature instead? > > A script-style font might legitimately use a glyph that looks like a small omega for U+0077 LATIN SMALL LETTER W. As I said to Asmus, my analogy was about ligatures made from underlying letters. Yours doesn't apply because it's just talking about glyph shapes. > Small omega, of course, is an οο ligature. True. :-) Isn't history wonderful? > More to the point, a font may legitimately use the same glyphs for U+0067 LATIN SMALL LETTER G and U+0261 LATIN SMALL LETTER SCRIPT G. A good font will still find a way to distinguish them. :-) > A more serious issue is the multiple forms of U+014A LATIN CAPITAL LETTER ENG, for which the underlying unity comes from their being the capital form of U+014B LATIN SMALL LETTER ENG. We could have, and should have, solved this problem *long ago* by encoding LATIN CAPITAL LETTER AFRICAN ENG and LATIN SMALL LETTER AFRICAN ENG. > Are there not serious divergences with the shapes of the Syriac letters? That is analogous to Roman/Gaelic/Fraktur. That analogy doesn't apply to these Deseret characters; it's not a whole-script gestalt. Michael Everson From asmusf at ix.netcom.com Sun Mar 26 16:16:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:16:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> Message-ID: <5bc936f5-7513-8258-8709-26ee6c41d7ea@ix.netcom.com> On 3/26/2017 9:20 AM, Michael Everson wrote: > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: >> The priority in encoding has to be with allowing distinctions in modern texts, or distinctions that matter to modern users of historic writing systems. Beyond that, theoretical analysis of typographical evolution can give some interesting insight, but I would be in the camp that does not accord them a status as primary rationale for encoding decisions. > Our rationales are NOT ranked in the way you suggest. A variety of criteria are applied. And the way you weigh the criteria? > >> Thus, critical need for contrasting use of the glyph distinctions would have to be established before it makes sense to discuss this further. > Precedent for such needs is well-established. Consider the Latin Extended-D block. Sometimes it is editorial preference, and that?s not even always universal. I think the Latin Extended-D block may have its own problems. However, Latin as a script caters to so many varied levels of users, from ordinary text to scholarly notations that it really cannot be used to settle this issue. > >> I see no principled objection to having a font choice result in a noticeable or structural glyph variation for only a few elements of an alphabet. We have handle-a vs. bowl-a as well as hook-g vs. loop-g in Latin, and fonts routinely select one or the other. > Well, Asmus, we encode a and ? as well as g and ? and ?. And we do that for reasons that are very different from preserving the early and possibly transient history of a minor script. > And we do not consider ? and ? and ? 
to be things that ought to be distinguished by variation selectors. (I am of course well aware of IPA usage.) Yes, and the absence of such usage in the current example makes all the difference. > Whole-font switching is well understood. But character origin has always been taken into account. Consider 2EBC ? CJK RADICAL MEAT and 2E9D ? CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) Apparently not only in the standard, because they show as different in the plaintext view of this message. > >> (It is only for usage outside normal text that the distinction between these forms matters). > What?s ?normal? text? ?Normal? text in Latin probably doesn?t use the characters from the Latin Extended-D block. "ordinary" text, if you like, reflecting standard orthographies. As opposed to notational systems. > >> While the Deseret forms are motivated by their pronunciation, I'm not necessarily convinced that the distinction has any practical significance that is in any way different than similar differences in derivation (e.g. for long s-s or long-s-z for German esszett). > One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. No, if we state that both glyphs are alternates for the same character *and if we decide, to _not_ add variation selectors* the choice is where it belongs: with the font maker. > >> In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). 
> Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. If the underlying text element is the same, font switching can be the correct choice. > >> This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. > Character identity is not defined by any single criterion. Make it the "primary" criterion then. > Moreover, in Deseret, it is not the case that all texts which contain the diphthong /juː/ or /ɔɪ/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. Yes, and those other spellings are not affected. > >>> This is complicated by combining characters mostly identified by glyph, and the fact that while ä and aͤ may be the same character across time, there are people wanting to distinguish them in the same text today, and in both cases the theoretical falls to the practical. In this case, there are no combining character issues and there's nobody needing to use the two forms in the same text. >> huh? > He's wrong there, as I pointed out. A text in German may write an older Clavieruͤbung in a citation alongside the normal spelling Klavierübung. The choice of spelling is key. That would have to be a very specialized text. But to claim that this needs to be possible in German in plaintext for the case of such a quote is more than a stretch. If there is a critical need for such texts *as plain text* in Deseret, that would be a curious fact, but perhaps decisive. A./ From asmusf at ix.netcom.com Sun Mar 26 16:20:26 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:20:26 -0700 Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> References: <7A61ADE5-3D12-453F-9E07-B05A53283515@evertype.com> <502c790d-3064-dbc1-e088-595be9a9dabe@ix.netcom.com> <4897ACE7-807F-42FD-AEC8-7150EC87CA0B@evertype.com> Message-ID: <3a9ea98e-6615-688e-3ce9-b6a41a34ebc6@ix.netcom.com> An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sun Mar 26 16:30:04 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Sun, 26 Mar 2017 14:30:04 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: On 3/26/2017 9:23 AM, Michael Everson wrote: > On 26 Mar 2017, at 17:02, Asmus Freytag wrote: >> On 3/26/2017 6:18 AM, Michael Everson wrote: >> >>> In any case it?s not a disunification. Some characters are encoded; they were used to write diphthongs in 1855. These characters were abandoned by 1859, and other characters were devised. >> Calling them "characters" is pre-judging the issue, don't you think? > No, I don?t think so. I really think it is. > >> We know that these are different shapes, but that they stand for the same text elements. > No, they don?t. Those diphthongs can also be represented in other ways in Deseret. Having alternative ways to represent these doesn't invalidate or affect my argument. > > I?ve never accepted the view that ?everything is already encoded and everything new is a disunification? which seems to be a pretty common view. I would not say I aspire to the view you quote. If you encode a certain shape, it may get used for a range of text elements. This would (de facto) encode these text elements via that shape. 
If it is later felt that the given shape should not be used for the full range of text elements, then you could say that the "implicit" unification based on the usage (or, if you will, "fallback usage") was mistaken and should be better handled by two (or more) shapes. This represents a "de-facto" disunification. However, where I part from your description is the "everything is already encoded". That would not be the case anywhere a range of text elements cannot be represented at all. Your statement also implies a "correctly encoded" or "successfully encoded" which is different from "there's an encoding that some people use as a fallback", which, if disunification should prove proper later on, would be a better way of describing what was the original situation. Perhaps the point is subtle, but it is important. In the current case, you have the opposite, to wit, the text elements are unchanged, but you would like to add alternate code elements to represent what are, ultimately, the same text elements. That's not disunification, but dual encoding. A./ From jameskasskrv at gmail.com Sun Mar 26 23:58:51 2017 From: jameskasskrv at gmail.com (James Kass) Date: Sun, 26 Mar 2017 20:58:51 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: Asmus Freytag wrote, > In the current case, you have the opposite, > to wit, the text elements are unchanged, but > you would like to add alternate code elements > to represent what are, ultimately, the same > text elements. That's not disunification, but > dual encoding. 
If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word. Best regards, James Kass From duerst at it.aoyama.ac.jp Mon Mar 27 00:42:40 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 14:42:40 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> Message-ID: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> On 2017/03/26 22:15, Michael Everson wrote: > >> On 26 Mar 2017, at 09:12, Martin J. Dürst wrote: >> >>> That's a good point: any disunification requires showing examples of >>> contrasting uses. >> >> Fully agreed. > > The default position is NOT "everything is encoded unified until disunified". Neither it's "everything is encoded separately unless it's unified". > The characters in question have different and undisputed origins, undisputed. If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post. > We've encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) Fine.
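The wynn/w and yogh/ezh precedents cited above can be verified from character properties: each pair is separately encoded, and no Unicode normalization form folds one member onto the other. A small Python illustration (standard library only):

```python
import unicodedata

pairs = [
    ("w", "\u01BF"),       # w vs. wynn (U+01BF LATIN LETTER WYNN)
    ("\u0292", "\u021D"),  # ezh (U+0292) vs. yogh (U+021D)
]
for modern, historic in pairs:
    # Separately encoded: all four normalization forms keep them distinct.
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        assert unicodedata.normalize(form, modern) != unicodedata.normalize(form, historic)
    print(unicodedata.name(modern), "|", unicodedata.name(historic))
```

So letters that historically wrote "the same thing" can still be, and in these cases are, distinct plain-text characters.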
I (and others) have also given quite a few analogies, none of them perfect, but most if not all of them pertinent. >> We haven't yet heard of any contrasting uses for the letter shapes we are discussing. > Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. Sorry, but where did I say that it's the only criterion? I don't think it's the only criterion. On the other hand, I also don't think that historical origin is or should be the only criterion. Unfortunately, much of what you wrote gave me the impression that you may think that historical origin is the only criterion, or a criterion that trumps all others. If you don't think so, it would be good if you could confirm this. If you think so, it would be good to know why. > Please try to remember that. (It's a bit shocking to have to remind people of this. You don't have to remind me, at least. I have mentioned "usability for average users in average contexts" and "contrasting use" as criteria, and I have also in earlier mail acknowledged history as a (not the) criterion, and have mentioned legacy/roundtrip issues. I'm sure there are others. Regards, Martin. From duerst at it.aoyama.ac.jp Mon Mar 27 02:05:12 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 16:05:12 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> Message-ID: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> On 2017/03/27 01:20, Michael Everson wrote: > On 26 Mar 2017, at 16:45, Asmus Freytag wrote: > Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs. Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the 月 shape, with two horizontal bars), whereas the H/T/V columns show the ⺼ shape (two downwards slanted bars) for the "MEAT" radical and the 月 shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/... In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference. > One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. Independent of whether the chart glyphs get changed, couldn't we just add a note "also # in some fonts" (where # is the other variant). That would make sure that nobody could claim "this font is wrong" based on the charts. (Even if a general claim that the chart glyphs aren't normative applies to all charts anyway.) >> In fact, it would seem that if a Deseret text was encoded in one of the two systems, changing to a different font would have the attractive property of preserving the content of the text (while not preserving the appearance). > Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past.
We have encoded variant and alternate characters for many scripts. Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly. Because in such cases, the focus may not be on a change to one or a few letter shapes, but the focus may be on a change of the overall style, which induces a change of letter shape in some letters. The roman/italic a/ɑ and g/ɡ distinctions (the latter code points only used to show the distinction in plain text, which could as well be done descriptively), as well as a large number of distinctions in Han fonts, come to my mind. I'm quite sure other scripts have similar phenomena. >> This, in a nutshell, is the criterion for making something a font difference vs. an encoding distinction. > Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /juː/ or /ɔɪ/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. This is interesting information. You are saying that in actual practice, there is a choice between writing ???? (two letters for a diphthong) and writing ??. In the same location, is ???? (the base for the historically later shape variant of ??; please note that this may actually be written ????; there's some inconsistency in order between the above cited sentence and the text below copied from an earlier mail) also used as a spelling variant? Overall, we may have up to four variants, of which three are currently explicitly supported in Unicode. Are all of these used as spelling variants? Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)? And what informs this choice?
If we have any historic metal types, are there examples where a font contains both ligature variants? (Please note that because ??, ??, and ?? are available as individual letters, it's very difficult to think about the two-letter sequences as anything else than spellings, but that doesn't necessarily carry over to the ligatures.) And then the same questions, with parallel (or not parallel) answers, for ??/??/??. Regards, Martin. Text copied from earlier mail by Michael: >>>> 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. That?s encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. >>> From jameskasskrv at gmail.com Mon Mar 27 03:04:44 2017 From: jameskasskrv at gmail.com (James Kass) Date: Mon, 27 Mar 2017 00:04:44 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: Martin J. 
Dürst responded to Michael Everson, > Overall, we may have up to four variants, of which > three are currently explicitly supported in Unicode. Yes. > Are all of these used as spelling variants? Is there another possible use? > Is the choice of variant up to the author (for which > variants), or is it the editor or printer who makes > the choice (for which variants)? The author, see below. > And what informs this choice? Personal preference and/or spelling reform as well as whether the material was machine printed or hand written. > If we have any historic metal types, are there > examples where a font contains both ligature > variants? Apparently not. John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. Best regards, James Kass From jameskasskrv at gmail.com Mon Mar 27 03:23:39 2017 From: jameskasskrv at gmail.com (James Kass) Date: Mon, 27 Mar 2017 00:23:39 -0800 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> Message-ID: Martin J. Dürst responded to Michael Everson, > Unfortunately, much of what you wrote gave me the > impression that you may think that historical origin > is the only criterion, or a criterion that trumps all > others. If you don't think so, it would be good if you > could confirm this. If you think so, it would be good > to know why. Historical origin is always a good starting point. The importance of history cannot be overstated. Without it, the other criteria would not exist.
Historical origin wouldn't override evidence of contrasting use in this case because such evidence would be "icing on the cake". > ... I have mentioned "usability for average users in > average contexts" and "contrasting use" as criteria, > and I have also in earlier mail acknowledged history > as a (not the) criterion, and have mentioned legacy/ > roundtrip issues. I'm sure there are others. Adding a few historic letters should seldom have any effect on "usability for average users in average contexts". Whether it does in this case remains to be seen. Legacy and roundtrip issues are important because backwards-compatibility supports history. Concerns in this case appear to be hypothetical. Best regards, James Kass From duerst at it.aoyama.ac.jp Mon Mar 27 03:29:21 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 27 Mar 2017 17:29:21 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> Message-ID: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> On 2017/03/24 23:37, Michael Everson wrote: > On 24 Mar 2017, at 11:34, Martin J. Dürst wrote: >> >> On 2017/03/23 22:48, Michael Everson wrote: >> >>> Indeed I would say to John Jenkins and Ken Beesley that the richness of the history of the Deseret alphabet would be impoverished by treating the 1859 letters as identical to the 1855 letters. >> >> Well, I might be completely wrong, but John Jenkins may be the person on this list closest to an actual user of Deseret (John, please correct me if I'm wrong one way or another). > > He is. He transcribes texts into Deseret. I've published three of them (Alice, Looking-Glass, and Snark). Great to know. Given that, I'd assume that you'd take his input a bit more seriously. Here's what he wrote: >>>> My own take on this is "absolutely not."
This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials. To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have discussed the possibility, and we both feel that it's very much on the table. >>>> >> It may be that actual users of Deseret read these character variants the same way most of us would read serif vs. sans-serif variants: I.e. unless we are designers or typographers, we don't actually consciously notice the difference. > I am a designer and typographer, and I've worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. That's fine, and not disputed at all. That's exactly why I'm looking for input from other people. As an analogy, assume we had a famous type designer coming to this list and requesting that we encode old-style digits separately from roman digits, e.g. arguing that this might simplify the production of fonts. We would understand this request, but we would still deny it because based on our day-to-day use of digits, we would understand that at large (i.e. for the average user) the convenience of having only one code point for a given digit weighs more heavily than the convenience of separate code points for the type designer. We are looking for similar input from "average users" for Deseret. >> If that's the case, it would be utterly annoying to these actual users to have to make a distinction between two characters where there actually is none.
> > Actually neither of the ligature-letters are used in our Carrollian Deseret volumes. Ok. That means that these don't provide any information on the discussion at hand (whether to unify or disunify the ligature shapes). >> The richness of the history of the Deseret alphabet can still be preserved e.g. with different fonts the same way we have thousands of different fonts for Latin and many other scripts that show a lot of rich history. > You know, Martin, I *have* been doing this for the last two decades. I'm well aware of what a font is and can do. Great. So you know that present-day font technology would allow us to handle the different shapes in at least any of the following ways: 1) Separate characters for separate shapes, both shapes in same font 2) Variant selectors, one or both shapes in same font 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font 4) Font selection, different fonts for different shapes Does that knowledge in any way suggest one particular solution? > I'm also aware of what principles we have used for determining character identity. Which, as we have been working out in other mails, are indeed a collection of principles, one of which is history of shape derivation. > I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. One of the principles of CJK unification is that minor differences are ignored if they are not semantically relevant. For CJK, 'minor' is important, because otherwise, many users wouldn't be able to recognize the shapes as having the same semantics/usage. The qualification 'minor' is less important for an alphabet. In general, the more established and well-known an alphabet is, the wider the variations of glyph shapes that may be tolerated.
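[Editorial note: mechanism 2) in the list above can be illustrated with a variation sequence that is already standardized. A minimal Python sketch follows; U+2764 with the text/emoji presentation selectors merely stands in for the kind of Deseret sequences under discussion, which are not encoded.]

```python
import unicodedata

base = "\u2764"               # HEAVY BLACK HEART
text_form = base + "\uFE0E"   # VARIATION SELECTOR-15 requests text presentation
emoji_form = base + "\uFE0F"  # VARIATION SELECTOR-16 requests emoji presentation

# The base character is unchanged; the selector is a separate code point
# that a conforming renderer may honor (or, lacking glyph support, ignore).
print(len(text_form))                  # 2 code points
print(unicodedata.name(text_form[1]))  # VARIATION SELECTOR-15
print(text_form == emoji_form)         # False: the choice survives in plain text
```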
The question I'm trying to get an answer for, for Deseret, is whether current actual script users see the shape variation as just substitutable glyphs of the same letter, or inherently different letters. The answer to this question is not the *only* criterion for deciding whether to encode further Deseret letters, but I think it's an important criterion. And the answer that John has given seems to point in a very clear direction for this question. Regards, Martin. From tfujiwar at redhat.com Mon Mar 27 04:00:21 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Mar 2017 18:00:21 +0900 Subject: different version of common/annotations/ja.xml Message-ID: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Hi, Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml That file includes Hiragana only but I'd need another file which has the committed strings, like ja_convert.xml. E.g. ? | ?? | ???? instead of ??? | ???? | ???? I think the committed version is useful without input method and it follows other languages. Thanks, Fujiwara From jcb+unicode at inf.ed.ac.uk Mon Mar 27 04:14:17 2017 From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield) Date: Mon, 27 Mar 2017 10:14:17 +0100 (BST) Subject: Standaridized variation sequences for the Desert alphabet? References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: While I hesitate to dive in to this argument, Martin makes one comment where I think a point of principle arises: On 2017-03-27, =?UTF-8?Q?Martin_J._D=c3=bcrst?= wrote: [Michael wrote] >> You know, Martin, I *have* been doing this for the last two decades. I'm well aware of what a font is and can do. > > Great.
So you know that present-day font technology would allow us to > handle the different shapes in at least any of the following ways: > > 1) Separate characters for separate shapes, both shapes in same font > 2) Variant selectors, one or both shapes in same font > 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font > 4) Font selection, different fonts for different shapes > > Does that knowledge in any way suggest one particular solution? As I've observed before, the intention is that we are stuck with Unicode for as long as our civilization endures, be that 5000 years or 50 years. I contend, therefore, that no decision about Unicode should take into account any ephemeral considerations such as this year's electronic font technology, and that therefore it's not even useful to mention them. All you should need to say is "these letters are too insignificant to merit encoding, and those who believe they need to be able to distinguish them in plain text will just have to use other means, such as ZWJ with the components of the ligature". (I'm not saying that's my view, by the way - I'm more of a splitter than a lumper, and on the basis of this thread, I'm probably on the "encode" side.) -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From mark at macchiato.com Mon Mar 27 04:48:25 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 27 Mar 2017 11:48:25 +0200 Subject: different version of common/annotations/ja.xml In-Reply-To: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Message-ID: By "committed strings", you mean the hiragana phonetic reading? Mark On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara wrote: > Hi, > > Do you have any chances to create a different version of ja.xml of the > Japanese emoji annotation? 
> http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > That file includes Hiragana only but I'd need another file which has the > committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it > follows other languages. > > Thanks, > Fujiwara > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Mon Mar 27 05:04:27 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Mar 2017 19:04:27 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> Message-ID: <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> On 03/27/17 18:48, Mark Davis ??-san wrote: > By "committed strings", you mean the hiragana phonetic reading? Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. After users select one of the converted strings, the converted strings are committed on the text. I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. Fujiwara > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > > That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? > > I think the committed version is useful without input method and it follows other languages. 
> > Thanks, > Fujiwara > > From everson at evertype.com Mon Mar 27 06:39:56 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 12:39:56 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <000301d2a610$5cf1e550$16d5aff0$@fi> <6AE9BA94-1298-4D9F-B2C2-57BDEEDF1D6C@evertype.com> Message-ID: On 27 Mar 2017, at 05:58, James Kass wrote: > > Asmus Freytag wrote, > >> In the current case, you have the opposite, to wit, the text elements are unchanged, but you would like to add alternate code elements >> to represent what are, ultimately, the same text elements. That's not disunification, but dual encoding. > > If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same > word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word. Asmus also changes the terms of the discussion by introducing the vague and undefined term "text element". Michael Everson From everson at evertype.com Mon Mar 27 07:07:19 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 13:07:19 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> Message-ID: <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> On 27 Mar 2017, at 06:42, Martin J.
Dürst wrote: >> The default position is NOT "everything is encoded unified until disunified". > Neither it's "everything is encoded separately unless it's unified". These Deseret letters aren't encoded. For my part I wasn't made aware of them in 2004 when they were written about. My view is "Ah, here's something. Is it encoded? No. Is it a glyph variant of something encoded? No." >> The characters in question have different and undisputed origins, undisputed. > If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post. And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word "character" when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it's as though nothing had ever been encoded before. >> We've encoded one pair; evidently this pair was deprecated and another pair was devised. The letters wynn and w are also used for the same thing. They too have different origins and are encoded separately. The letters yogh and ezh have different origins and are encoded separately. (These are not perfect analogies, but they are pertinent.) > > Fine. I (and others) have also given quite a few analogies, none of them perfect, but most if not all of them pertinent. The sharp s analogy wasn't useful because whether ſs or ſz users can't tell either and don't care. No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ſs. And what Antiqua fonts do, well, you get this: https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg And there's nothing unrecognizable about the ?? (< ?? (= ſz)) ligature there. The situation in Deseret is different.
Other analogies had to do with normal shape variation, not shapes derived from underlying ligatures. Analogies are never perfect but I don't think the ones offered were pertinent. Underlying ligature difference is indicative of character identity. Particularly when two resulting ligatures are SO different from one another as to be unrecognizable. And that is the case with EW on the left and OI on the right here: https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg The lower two letterforms are in no way "glyph variants" of the upper two letterforms. Apart from the stroke of the SHORT I ?? they share nothing in common, because they come from different sources and are therefore different characters. >>> We haven't yet heard of any contrasting uses for the letter shapes we are discussing. >> >> Contrasting use is NOT the only criterion we apply when establishing the characterhood of characters. > > Sorry, but where did I say that it's the only criterion? I don't think it's the only criterion. On the other hand, I also don't think that historical origin is or should be the only criterion. Neither do I, but it has been a very clear precedent for many character distinctions and that is useful precedent. > Unfortunately, much of what you wrote gave me the impression that you may think that historical origin is the only criterion, or a criterion that trumps all others. If you don't think so, it would be good if you could confirm this. If you think so, it would be good to know why. Character origin is intimately related to character identity. Even where superficial similarity is concerned; I had to prove character origin for the disunification of YOGH from EZH long long ago and I've done the same over and over again for many characters and even full scripts. Sometimes characters are used and then become disused.
MOST of the Bamum characters we have encoded aren't in modern use today, but they were encoded for historical concerns. >> Please try to remember that. (It's a bit shocking to have to remind people of this. > You don't have to remind me, at least. I have mentioned "usability for average users in average contexts" and "contrasting use" as criteria, and I have also in earlier mail acknowledged history as a (not the) criterion, and have mentioned legacy/roundtrip issues. I'm sure there are others. I don't think that ANY user of Deseret is all that "average". Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on, just as we have medievalists who do the same kind of work. I'm about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters. Michael Everson From everson at evertype.com Mon Mar 27 07:59:40 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 13:59:40 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> On 27 Mar 2017, at 08:05, Martin J. Dürst wrote: >> Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.) > > "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs. I do stipulate not being a CJK expert. But those are indeed different due to their origins, however similar their shapes are. > Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases, the G/J/K columns show no difference (i.e. always the 月 shape, with two horizontal bars), whereas the H/T/V columns show the ⺼ shape (two downwards slanted bars) for the "MEAT" radical and the 月 shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/... They are still always very similar, right? > In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference. That's interesting. >> One practical consequence of changing the chart glyphs now, for instance, would be that it would invalidate every existing Deseret font. Adding new characters would not. > > Independent of whether the chart glyphs get changed, couldn't we just add a note "also # in some fonts" (where # is the other variant). Well, no. First, ALL fonts currently use the 1855 letterforms based on ligatures ???? and ????, so a decree that those code positions would [...]. Second, the letterforms resulting from the ligations are just nothing alike. > That would make sure that nobody could claim "this font is wrong" based on the charts. (Even if a general claim that the chart glyphs aren't normative applies to all charts anyway.) As James Kass said: "If spelling a word with an x+y string versus a z+y string represents two different spellings of the same word, then hand printing the same word with either an x/y ligature versus a z/y ligature also represents two different spellings of the same word."
>> Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts. > Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly. Do come up with examples if you have any. > Because in such cases, the focus may not be on a change to one or a few letter shapes, but the focus may be on a change of the overall style, which induces a change of letter shape in some letters. To be honest I really don't follow this reasoning. https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg isn't just some "glyph variation". They are entirely different glyphs with entirely different origins. I can think of no instance where we have "unified" such wildly different glyphs. > The roman/italic a/ɑ and g/ɡ distinctions (the latter code points only used to show the distinction in plain text, which could as well be done descriptively), Aa and Ɑɑ are used contrastively for different sounds in some languages and in the IPA. Ɡɡ is not, to my knowledge, used contrastively with Gg (except that ɡ can only mean /ɡ/, while orthographic g can mean /ɡ/, /dʒ/, /x/ etc. But g vs ɡ is reasonably analogous to ?? and ???? being used for /juː/. > as well as a large number of distinctions in Han fonts, come to my mind. I'm quite sure other scripts have similar phenomena. Again, spelling of all kinds varies greatly in Deseret texts. I'll try with another example using some Latin glyphs. "Poison" can be written ???????????? POIZ?N in Deseret, or it can be written ?????????? P?Z?N or it can be written ???????? P?Z?N. That's three different spellings, not two. (I used O with a bar to mimic the bar of Deseret SHORT I ??).
>> Character identity is not defined by any single criterion. Moreover, in Deseret, it is not the case that all texts which contain the diphthong /ju?/ or /??/ write it using EW ?? or OI ??. Many write them as Y + U ???? and O + I ????. So the choice is one of *spelling*, and spelling has always been a primary criterion for such decisions. > > This is interesting information. You are saying that in actual practice, there is a choice between writing ???? (two letters for a diphthong) and writing ??. In the same location, is ???? (the base for the historically later shape variant of ??; please note that this may actually be written ????; No, that's not correct. Poison can be written with ???? or it can be written with ?? (in origin a ligature of ????) or it can be written with ????. Unligated, the three spellings would be different: ???????????? /po?z?n/ and ???????????? /p??z?n/ and ???????????? /p???z?n/. Despite this, with the ligatures, the pronunciation would be /po?z?n/ whether ???????????? or ?????????? or ????????. > there's some inconsistency in order between the above cited sentence and the text below copied from an earlier mail) also used as a spelling variant? I don't think so. > Overall, we may have up to four variants, No, we don't. See above. And the same goes for the /ju?/ ligatures. The word tube /tju?b/ can be written TY?B ???????? or ?????? or ????. But unligated, the sequences would be pronounced differently: ???????? /tju?b/ and ???????? /t?u?b/ and ???????? /t??b/. > of which three are currently explicitly supported in Unicode. The characters for the 1859 EW and OI ligatures are not encoded. > Are all of these used as spelling variants? In principle, what I have shown above is accurate. I can't do a corpus search for actual examples. > Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)? In a handwritten manuscript obviously the choice is the author's.
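For reference, the 1855-style ligature letters under discussion already have fixed code points in the Deseret block, while the two-letter spellings use separately encoded letters; the 1859 letterforms have no code points at all. A quick sketch with Python's standard `unicodedata` module (character names from the Unicode Character Database; no Deseret font required):

```python
import unicodedata

# The 1855-style ligature letters are encoded in the Deseret block
# (U+10400..U+1044F); the 1859 letterforms discussed here are not encoded.
OI = chr(0x10426)  # DESERET CAPITAL LETTER OI
EW = chr(0x10427)  # DESERET CAPITAL LETTER EW

for ch in (OI, EW):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The alternative two-letter spellings are built from separately encoded
# letters, e.g. SHORT AH + SHORT I instead of the OI ligature, so the two
# spellings are distinguishable in plain text without any font support.
SHORT_AH = chr(0x10409)  # DESERET CAPITAL LETTER SHORT AH
SHORT_I = chr(0x10406)   # DESERET CAPITAL LETTER SHORT I
two_letter_oi = SHORT_AH + SHORT_I
assert len(two_letter_oi) == 2 and len(OI) == 1  # distinct plain-text spellings
```

This is only an illustration of the encoding status quo that the thread keeps returning to: ligature vs. two-letter sequence is a plain-text spelling distinction today, whereas 1855 vs. 1859 letterform is not.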
As to historical printing, printers may have made the choice. > And what informs this choice? If we have any historic metal types, are there examples where a font contains both ligature variants? Ken Beesley has samples of a metal font (the 1857 St Louis punches) which had both ?? and ????; I don't know what other sorts were in that font. > (Please note that because ??, ??, and ?? are available as individual letters, it's very difficult to think about the two-letter sequences as anything else than spellings, but that doesn't necessarily carry over to the ligatures.) See above. > And then the same questions, with parallel (or not parallel) answers, for ??/??/??. See above. Michael Everson > Regards, Martin. > > > Text copied from earlier mail by Michael: > > >>>> > 1. The 1855 glyph for ?? EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? LONG OO [u?], that is, [?] + [o?] = [?u?], that is, [ju]. > > 2. The 1855 glyph for ?? OI is evidently a ligature of the glyph for ?? SHORT AH [?] and the diagonal stroke of the glyph for ?? SHORT I [?], that is, [?] + [?] = [??], that is, [??]. > > That's encoded. Now evidently, the glyphs for the 1859 substitutions are as follows: > > 1. The 1859 glyph for EW is evidently a ligature of the glyph for the diagonal stroke of the glyph for ?? SHORT I [?] and ?? SHORT OO [?], that is, [?] + [?] = [??], that is, [ju]. > > 2. The 1859 glyph for OI is evidently a ligature of the glyph for ?? LONG AH [??] and the diagonal stroke of the glyph for SHORT I [?], that is, [??] + [?] = [???], that is, [??]. > >>> From everson at evertype.com Mon Mar 27 08:02:00 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 14:02:00 +0100 Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: On 27 Mar 2017, at 09:04, James Kass wrote: > John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. Not quite true. Such detail will be for the proposal. Michael From everson at evertype.com Mon Mar 27 08:49:54 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 14:49:54 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> On 27 Mar 2017, at 09:29, Martin J. Dürst wrote: >> He is. He transcribes texts into Deseret. I've published three of them (Alice, Looking-Glass, and Snark). > > Great to know. Given that, I'd assume that you'd take his input a bit more seriously. I'm discussing it now, offline, with him and Ken. > Here's what he wrote: > > >>>> > My own take on this is "absolutely not." This is a font issue, pure and simple. There is no dispute as to the identity of the characters in question, just their appearance. That begs the whole question of character identity. He's simply saying what you and Asmus also said. But when you dig into it further, there's more to the story, as we have found out. > In any event, these two letters were never part of the "standard" Deseret Alphabet used in printed materials.
To the extent they were used, it was in hand-written material only, where you're going to see a fair amount of variation anyway. There were also two recensions of the DA used in printed materials which are materially different, and those would best be handled via fonts. There was indeed type cut for these. What's not found is a full alphabet chart showing some of the ligated letters, but that's a different question. > It isn't unreasonable to suggest we change the glyphs we use in the Standard. Ken Beesley and I have discussed the possibility, and we both feel that it's very much on the table. > >>>> Now that further research has been done, I'll be discussing this with John and Ken with regard to putting together a proposal which will support the two ligating letterform characters as well as some other historical Deseret characters, some used in an important English-Hopi lexicon which was recently published. (I await my copy of that.) >> I am a designer and typographer, and I've worked rather extensively with a variety of Deseret fonts for my publications. They have been well-received. > > That's fine, and not disputed at all. That's exactly why I'm looking for input from other people. Well, all right, but I didn't use either ?? or ?? in my editions apart from the entry in the chart in the front matter. > As an analogy, assume we had a famous type designer coming to this list and requesting that we encode old-style digits separately from roman digits, e.g. arguing that this might simplify the production of fonts. I don't see how this analogy could possibly apply. Once again the 1859 ligature-characters look nothing at all like the 1855 ones, which speaks to their unique identity as characters. Moreover, encoded digits are used by billions of people daily. > We would understand this request, but we would still deny it because based on our day-to-day use of digits, we would understand that at large (i.e.
for the average user) the convenience of having only one code point for a given digit weighs more strongly than the convenience of separate code points for the type designer. I'm not suggesting encoding characters for "convenience". I'm suggesting that there is a character-identity issue here, based both on the origin of the characters and on their vastly different appearance from other characters encoded in the standard. > We are looking for similar input from "average users" for Deseret. The encoding of historic characters is for "expert users" working with historical material, not necessarily "average users" who might be composing blog entries. >> Actually neither of the ligature-letters are used in our Carrollian Deseret volumes. > > Ok. That means that these don't provide any information on the discussion at hand (whether to unify or disunify the ligature shapes). I didn't even know about the 1859 ligatures until this week. All this proves is that John didn't use any ligatures when he transcribed the texts. >> You know, Martin, I *have* been doing this for the last two decades. I'm well aware of what a font is and can do. > > Great. So you know that present-day font technology would allow us to handle the different shapes in at least any of the following ways: > > 1) Separate characters for separate shapes, both shapes in same font We shouldn't do that for shapes so different and with clearly different origins. > 2) Variant selectors, one or both shapes in same font Pseudo-encoding, useful for subtle variation but not for something as big as this. I am not an enemy of variation selectors. In fact I'm preparing a nice proposal for some standardized sequences. It would not apply here, because the glyph identity of the letters is too distinct. > 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font Font trickery. Not portable. Not supported by most apps.
> 4) Font selection, different fonts for different shapes We really don't do this just for one or two characters in a script. > Does that knowledge in any way suggest one particular solution? None of this discussion has convinced me that these letters are variants of existing characters. >> I'm also aware of what principles we have used for determining character identity. > > Which, as we have been working out in other mails, are indeed a collection of principles, one of which is history of shape derivation. That and spelling. The only counterargument seems to be "they are diphthongs" but we don't encode sounds, we encode the elements of writing systems. The 1859 ligated letterforms are not in any way glyph variants of the 1855 ligated letterforms. They're completely different letterforms, having only the diagonal stroke of the ?? in common. >> I saw your note about CJK. Unification there typically has something to do with character origin and similarity. The Deseret diphthong letters are clearly based on ligatures of *different* characters. > > One of the principles of CJK unification is that minor differences are ignored if they are not semantically relevant. For CJK, 'minor' is important, because otherwise, many users wouldn't be able to recognize the shapes as having the same semantics/usage. These would not be unified according to CJK principles: https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg > The qualification 'minor' is less important for an alphabet. In general, the more established and well-known an alphabet is, the wider the variations of glyph shapes that may be tolerated. The question I'm trying to get an answer to for Deseret is whether current actual script users see the shape variation as just substitutable glyphs of the same letter, or inherently different letters.
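For concreteness, option 2 in Martin's list (variation selectors) would put a default-ignorable selector character after the base letter in plain text. A small sketch; note that the Deseret sequence shown is entirely hypothetical and not a standardized variation sequence:

```python
import unicodedata

# Variation selectors are General_Category Mn (nonspacing mark) characters
# that request a specific glyph form for the immediately preceding base
# character; fonts without the sequence just render the base glyph.
VS1 = "\ufe00"  # VARIATION SELECTOR-1
assert unicodedata.category(VS1) == "Mn"

# A hypothetical (NOT standardized) sequence asking for an 1859-style glyph
# of DESERET CAPITAL LETTER EW would simply be base + selector in the text:
hypothetical_1859_ew = chr(0x10427) + VS1
print([f"U+{ord(c):04X}" for c in hypothetical_1859_ew])
```

The design trade-off the thread is debating is visible here: the sequence survives plain-text interchange (unlike a font feature), but it only selects a glyph of the *same* character, which is exactly what Everson argues the 1859 forms are not.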
> > The answer to this question is not the *only* criterion for deciding whether to encode further Deseret letters, but I think it's an important criterion. And the answer that John has given seems to point in a very clear direction for this question. John's view was a first statement before many questions were asked and before research into the matter had commenced, really. I'll get back to you after working with John and Ken some more. Michael Everson From alastair at alastairs-place.net Mon Mar 27 09:04:17 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 27 Mar 2017 15:04:17 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> On 27 Mar 2017, at 10:14, Julian Bradfield wrote: > > I contend, therefore, that no decision about Unicode should take into account any ephemeral considerations such as this year's electronic font technology, and that therefore it's not even useful to mention them. I'd disagree with that, for two reasons: 1. Unicode has to be usable *today*; it's no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don't have that now. If it isn't usable today for any given purpose, people won't use it for that, and will adopt alternative solutions (like using images to represent text). 2. "This year's electronic font technology" is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future.
There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there's built-in support in the standard text edit control), but again I think it's reasonable to expect support to grow over time, rather than being removed. I don't think it's unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems. All that said, I'd definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That's very much not my field. Kind regards, Alastair. -- http://alastairs-place.net From alastair at alastairs-place.net Mon Mar 27 09:32:55 2017 From: alastair at alastairs-place.net (Alastair Houghton) Date: Mon, 27 Mar 2017 15:32:55 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <7B64B1F5-862B-4659-B9DB-A7454EE714A0@evertype.com> Message-ID: <3F53B8D8-31EB-41EC-B672-1AD03A415C67@alastairs-place.net> On 27 Mar 2017, at 14:49, Michael Everson wrote: >> 3) Font features (e.g. 1855 vs. 1859) to select shapes in the same font > > Font trickery. Not portable. Not supported by most apps. I wouldn't describe it as "trickery" or "not portable". Features like stylistic alternates are part of the OpenType specification, and actually have quite widespread support in Mac software (check out the Typography panel, which you can get to from the system Font Panel).
On Windows and Linux, support is more limited, though software that uses the newer DirectWrite or Pango APIs to render text should find it straightforward enough. I don't know how this bears on the discussion about Deseret (that's outside my area of expertise), but as a software developer I'd certainly *prefer* to see font features used (rather than, say, assigning a new code point or using variation selectors) where the primary difference is in the rendering rather than the meaning. Kind regards, Alastair. -- http://alastairs-place.net From irgendeinbenutzername at gmail.com Mon Mar 27 09:44:59 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Mon, 27 Mar 2017 16:44:59 +0200 Subject: Encoding of old compatibility characters Message-ID: I've recently developed an interest in old legacy text encodings and noticed that there are various characters in several sets that don't have a Unicode equivalent. I had already started research into these encodings to eventually prepare a proposal until I realised I should probably ask on the mailing list first whether it is likely the UTC will be interested in those characters before I waste my time on a project that won't achieve anything in the end. The character sets in question are ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI calculator sets. So far I've only analyzed the ZX80 set in great detail, revealing 32 characters not in the UCS. Most characters are pseudo-graphics, simple pictographs or inverted variants of other characters. Now, one of Unicode's declared goals is to enable round-trip compatibility with legacy encodings. We've accumulated a lot of weird stuff over the years in the pursuit of this goal. So it would be natural to assume that the unencoded characters from the mentioned sets would also be eligible for inclusion in the UCS.
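Round-trip compatibility of the kind described here boils down to a pair of lossless mapping tables between legacy code values and Unicode, with anything that has no Unicode equivalent falling back to the Private Use Area or going unmapped. A toy sketch; the byte values and PUA assignments below are illustrative assumptions, not the real ZX80 layout:

```python
# Toy round-trip codec for a ZX80-style legacy set. The code values and
# the PUA assignment are made up for illustration only.
PUA_BASE = 0xE000

LEGACY_TO_UNICODE = {
    0x00: " ",
    0x0C: "\u00a3",              # POUND SIGN: has a Unicode equivalent
    0x80: "\u2588",              # FULL BLOCK: representable today
    0x87: chr(PUA_BASE + 0x87),  # unencoded pseudo-graphic -> PUA fallback
}
UNICODE_TO_LEGACY = {u: b for b, u in LEGACY_TO_UNICODE.items()}

def decode(data: bytes) -> str:
    """Map legacy bytes to a Unicode string."""
    return "".join(LEGACY_TO_UNICODE[b] for b in data)

def encode(text: str) -> bytes:
    """Map the Unicode string back to legacy bytes."""
    return bytes(UNICODE_TO_LEGACY[c] for c in text)

data = bytes([0x0C, 0x80, 0x87])
assert encode(decode(data)) == data  # round trip survives, PUA and all
```

The PUA fallback preserves the round trip, but, as later replies in this thread note, PUA assignments are private and unstable, which is the crux of the encode-or-not question.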
On the other hand, those encodings are for the most part older than Unicode and so far there seems to have been little interest in them from the UTC or WG2, or any of their contributors. Something tells me that if these character sets were important enough to consider for inclusion, they would have been encoded a long time ago along with all the other stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. Obviously the character sets in question don't receive much use nowadays (and some weren't even that relevant in their time, either), which leads me to wonder whether putting further work into this proposal would be worth it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 09:51:07 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 15:51:07 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <9DAB9816-CC38-4FA0-82A1-B1BF2BFFECDD@evertype.com> On 27 Mar 2017, at 15:44, Charlotte Buff wrote: > > I've recently developed an interest in old legacy text encodings and noticed that there are various characters in several sets that don't have a Unicode equivalent. I had already started research into these encodings to eventually prepare a proposal until I realised I should probably ask on the mailing list first whether it is likely the UTC will be interested in those characters before I waste my time on a project that won't achieve anything in the end. It's hard to say without knowing what the characters are. Michael Everson From irgendeinbenutzername at gmail.com Mon Mar 27 10:48:16 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Mon, 27 Mar 2017 17:48:16 +0200 Subject: Encoding of old compatibility characters Message-ID: > It's hard to say without knowing what the characters are.
For the ZX80, the missing characters include five block elements (top and bottom halves of MEDIUM SHADE, as well as their inverse counterparts), and inverse/negative squared variants of European digits and the following symbols: " £ $ : ? ( ) - + * / = < > ; , . Negative squared digits may be unifiable with negative circled digits. ATASCII includes inverse variants of box drawing characters. I have to check whether some other pictographs are unifiable with existing characters. PETSCII includes some box drawings and vertical scan lines that are probably not unifiable. Atari ST includes two simple pictographs that were used as graphical UI elements. They look like a negative, low diagonal stroke and a negative diamond respectively. It also has six characters that together form logos which I wasn't going to propose. TI calculators include a single character for a superscript minus 1. I don't have a lot of information available about this set at the moment. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jenkins at apple.com Mon Mar 27 10:56:16 2017 From: jenkins at apple.com (John H. Jenkins) Date: Mon, 27 Mar 2017 09:56:16 -0600 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: > On Mar 27, 2017, at 2:04 AM, James Kass wrote: > >> >> If we have any historic metal types, are there >> examples where a font contains both ligature >> variants? > > Apparently not. > > John H. Jenkins mentioned early in this thread that these ligatures > weren't used in printed materials and were not part of the official > Deseret set. They were only used in manuscript. > This is correct.
Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:03:36 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:03:36 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> On 27 Mar 2017, at 16:56, John H. Jenkins wrote: >> John H. Jenkins mentioned early in this thread that these ligatures weren't used in printed materials and were not part of the official Deseret set. They were only used in manuscript. > > This is correct. Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. Michael Everson From jenkins at apple.com Mon Mar 27 11:07:25 2017 From: jenkins at apple.com (John H. Jenkins) Date: Mon, 27 Mar 2017 10:07:25 -0600 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: > On Mar 27, 2017, at 9:56 AM, John H. Jenkins wrote: > > >> On Mar 27, 2017, at 2:04 AM, James Kass > wrote: >> >>> >>> If we have any historic metal types, are there >>> examples where a font contains both ligature >>> variants? >> >> Apparently not. >> >> John H. Jenkins mentioned early in this thread that these ligatures >> weren't used in printed materials and were not part of the official >> Deseret set. They were only used in manuscript. >> > > This is correct. Neither of the nineteenth century metal types included the letters in question. Nor were they included in any electronic fonts that I'm aware of before they were included in Unicode. > This should teach me to double-check before posting. Apparently, the earlier typeface *did* include all forty letters; it just didn't use these two. I don't know what glyphs were used. -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:20:19 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:20:19 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> Message-ID: <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com> On 27 Mar 2017, at 17:07, John H. Jenkins wrote: > This should teach me to double-check before posting. The research is a lot of fun. 
Can't wait till I get Ken's book next week. > Apparently, the earlier typeface *did* include all forty letters; it just didn't use these two. I don't know what glyphs were used. What I understood is that typefaces included the letters but there's no *chart* that contains both 1859 letters. Ken transcribes into modern type a letter by Shelton dated 1859, in which "boy" is written ??, "few" as ??, "truefully" [sic] as ????????????, and "you" as ??. Fascinating stuff. Michael Everson From markus.icu at gmail.com Mon Mar 27 11:49:19 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 09:49:19 -0700 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. In my opinion, this is a good use of the Private Use Area among a very small group of people. See also https://en.wikipedia.org/wiki/ConScript_Unicode_Registry Best regards, markus PS: I had a ZX 81, then a Commodore 64, then an Atari ST, and at school used a Commodore PET... -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 11:49:48 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 17:49:48 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> Message-ID: <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com> On 27 Mar 2017, at 15:04, Alastair Houghton wrote: > 1.
Unicode has to be usable *today*; it's no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don't have that now. If it isn't usable today for any given purpose, people won't use it for that, and will adopt alternative solutions (like using images to represent text). Nothing's easier than representing encoded characters. :-) > 2. "This year's electronic font technology" is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future. There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there's built-in support in the standard text edit control), but again I think it's reasonable to expect support to grow over time, rather than being removed. Sorry, but typographic control of that sort is grand for typesetting, where you can select ranges of text and language-tag it (assuming your program accepts and supports all the language tags you might need (which they don't)) and you can select fonts which have all the trickery baked into them (hardly any do) and then… can you use this in file names? In your plain-text databases? In your text messages? > I don't think it's unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems. They exist. And can be useful for some things. I think the historic origin of the Deseret diphthong letters, and the importance these options have for the study of Deseret orthographic choices throughout the early period of its use, argue for encoding.
> All that said, I'd definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That's very much not my field. Michael Everson From gwalla at gmail.com Mon Mar 27 12:08:33 2017 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 27 Mar 2017 17:08:33 +0000 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: Apple IIs also had inverse-video letters, and some had "MouseText" pseudographics used to simulate a Mac-like GUI in text mode. I know that a couple of fonts from Kreative put these in the PUA and Nishiki-Teki follows their lead. On Mon, Mar 27, 2017 at 9:25 AM Charlotte Buff < irgendeinbenutzername at gmail.com> wrote: > > It's hard to say without knowing what the characters are. > > For the ZX80, the missing characters include five block elements (top and > bottom halves of MEDIUM SHADE, as well as their inverse counterparts), and > inverse/negative squared variants of European digits and the following > symbols: " ? $ : ? ( ) - + * / = < > ; , . > Negative squared digits may be unifiable with negative circled digits. > > ATASCII includes inverse variants of box drawing characters. I have to > check whether some other pictographs are unifiable with existing characters. > > PETSCII includes some box drawings and vertical scan lines that are > probably not unifiable. > > Atari ST includes two simple pictographs that were used as graphical UI > elements. They look like a negative, low diagonal stroke and a negative > diamond respectively. It also has six characters that together form logos > which I wasn't going to propose. > > TI calculators include a single character for a superscript minus 1. I > don't have a lot of information available about this set at the moment. > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From everson at evertype.com Mon Mar 27 12:16:15 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 18:16:15 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <9AA9A227-EE3D-4B06-B3F4-F2CB606351E7@evertype.com> On 27 Mar 2017, at 18:08, Garth Wallace wrote: > > Apple IIs also had inverse-video letters, and some had "MouseText" pseudographics used to simulate a Mac-like GUI in text mode. > > I know that a couple of fonts from Kreative put these in the PUA and Nishiki-Teki follows their lead. I think it's better to be inclusive rather than exclusive. PUA isn't stable, and marginal as this stuff may be, we have encoded stuff that is far more marginal… nothing more frustrating than expecting something and finding it missing. Michael Everson From kenwhistler at att.net Mon Mar 27 12:18:03 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 10:18:03 -0700 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> On 3/27/2017 7:44 AM, Charlotte Buff wrote: > Now, one of Unicode's declared goals is to enable round-trip > compatibility with legacy encodings. We've accumulated a lot of weird > stuff over the years in the pursuit of this goal. So it would be > natural to assume that the unencoded characters from the mentioned > sets [ATASCII, PETSCII, the ZX80 set, the Atari ST set, and the TI > calculator sets] would also be eligible for inclusion in the UCS. Actually, it wouldn't be. The original goal was to ensure round-trip compatibility with *important* legacy character encodings, *for which there was a need to convert legacy data, and/or an ongoing need for representation of text for interchange*. From Unicode 1.0: "The Unicode standard includes the character content of all major International Standards approved and published before December 31, 1990... [long list ensues] ...
and from various industry standards in common use (such as code pages and character sets from Adobe, Apple, IBM, Lotus, Microsoft, WordPerfect, Xerox and others)." Even as long ago as 1990, artifacts such as the Atari ST set were considered obsolete antiquities, and did not rise to the level of the kind of character listings that we considered when pulling together the original repertoire. And there are several observations to be made about the "weird stuff" we have accumulated over the years in the pursuit of compatibility. A lot of stuff that was made up out of whole cloth, rather than being justified by existing, implemented character sets used in information interchange at the time, came from the 1991/1992 merger process between the Unicode Standard and the ISO/IEC 10646 drafts. That's how Unicode acquired blocks full of Arabic ligatures, for example. Other, subsequent additions of small (or even largish) sets of oddball "characters" that don't fit the prototypical sets of characters for scripts and/or well-behaved punctuation and symbols, typically have come in with argued cases for the continued need in current text interchange, for complete coverage. For example, that is how we ended up filling out Zapf dingbats with some glyph pieces that had been omitted in the initial repertoire for that block. More recently, of course, the continued importance of Wingdings and Webdings font encodings on the Windows platform led the UTC to filling out the set of graphical dingbats to cover those sets. And of course, we first started down the emoji track because of the need to interchange text originating from widely deployed Japanese carrier sets implemented as extensions to Shift-JIS. I don't think the early calculator character sets, or sets for the Atari ST and similar early consumer computer electronics fit the bill, precisely because there isn't a real text data interchange case to be made for character encoding. 
Many of the elements you have mentioned, for example, like the inverse/negative squared versions of letters and symbols, are simply idiosyncratic aspects of the UI for the devices, in an era when font generators were hard coded and very primitive indeed. Documenting these early uses, and pointing out parts of the UI and character usage that aren't part of the character repertoire in the Unicode Standard seems an interesting pursuit to me. But absent a true textual data interchange issue for these long-gone, obsolete devices, I don't really see a case to be made for spending time in the UTC defining a bunch of compatibility characters to encode for them. --Ken From everson at evertype.com Mon Mar 27 12:18:34 2017 From: everson at evertype.com (Michael Everson) Date: Mon, 27 Mar 2017 18:18:34 +0100 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <4A85FDE2-C9BF-4C72-B7C5-05FC9A477DC8@evertype.com> On 27 Mar 2017, at 17:49, Markus Scherer wrote: > > I think the interest has been low because very few documents survive in these encodings, and even fewer documents using not-already-encoded symbols. That doesn't mean that the few people who may need the characters now or in the centuries to come shouldn't have them. If we've encoded some characters like these for compatibility, it's only fair to be thorough. > In my opinion, this is a good use of the Private Use Area among a very small group of people. I'd say not, since they'd be using some encoded characters and having to augment it with some PUA characters. > See also https://en.wikipedia.org/wiki/ConScript_Unicode_Registry That's not for this sort of thing at all at all. The UCS is for this sort of thing. Michael Everson > …PS: I had a ZX 81, then a Commodore 64, then an Atari ST, and at school used a Commodore PET... Lucky man.
:-) From kojiishi at gmail.com Mon Mar 27 12:25:36 2017 From: kojiishi at gmail.com (Koji Ishii) Date: Tue, 28 Mar 2017 02:25:36 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> Message-ID: I think he meant Kanji/Han ideographic by "committed string". 2017-03-27 19:04 GMT+09:00 Takao Fujiwara : > On 03/27/17 18:48, Mark Davis ??-san wrote: > >> By "committed strings", you mean the hiragana phonetic reading? >> > > Hiragana is used to the raw text of the phonetic reading by the Japanese > input method before the conversion. > After users select one of the converted strings, the converted strings are > committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the > raw text as the converted result in the input method. > > Fujiwara > > >> Mark >> ////// >> >> On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > > wrote: >> >> Hi, >> >> Do you have any chances to create a different version of ja.xml of >> the Japanese emoji annotation? >> http://unicode.org/cldr/trac/browser/tags/latest/common/anno >> tations/ja.xml >> > otations/ja.xml> >> >> That file includes Hiragana only but I'd need another file which has >> the committed strings, likes ja_convert.xml. >> E.g. >> ? | ?? | ???? >> >> instead of >> >> ??? | ???? | ???? >> >> I think the committed version is useful without input method and it >> follows other languages. >> >> Thanks, >> Fujiwara >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Mar 27 13:44:06 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 20:44:06 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> References: <7bc3c6ed-05bb-f419-7bb1-7e7ed02780b8@att.net> Message-ID: TI calculators are not antique tools, and when I see how most calculators for Android or Windows 10 are now, they are not as usable as the scientific calculators we had in the past. I know at least one excellent calculator that works with Android and Windows and finally has the real look and feel of a true calculator, and that displays correct labels and excellent formulas (with the conventional 2D layout); my favorite is now "HyperCalc" (it has a free version and a paid version). The Android version is a bit more advanced. The paid version has only a few additional features that are not really needed (such as themes). The interface is clear, and there are several input modes for expressions. The default Calculator of Windows 10 has never been worse than it is now (it was much better in Windows 7 or before, even if it had many limitations). Also, entering expressions in Excel is really antique, and many functions have stupid limitations (in addition, spreadsheets are not even portable across versions of Office, don't render the same, and sometimes unexpectedly produce different results). But this is not at all a problem of character encoding: we don't need Unicode at all to create a convenient UI in such applications. Even with a web-based interface, you can do a lot with HTML canvas and SVG and have a scalable UI without having to use dirty text tricks or PUA fonts. 2017-03-27 19:18 GMT+02:00 Ken Whistler : > > On 3/27/2017 7:44 AM, Charlotte Buff wrote: >> Now, one of Unicode's declared goals is to enable round-trip >> compatibility with legacy encodings.
We've accumulated a lot of weird stuff >> over the years in the pursuit of this goal. So it would be natural to >> assume that the unencoded characters from the mentioned sets [ATASCII, >> PETSCII, the ZX80 set, the Atari ST set, and the TI calculator sets] would >> also be eligible for inclusion in the UCS. >> > > Actually, it wouldn't be. > > The original goal was to ensure round-trip compatibility with *important* > legacy character encodings, *for which there was a need to convert legacy > data, and/or an ongoing need for representation of text in interchange*. > > From Unicode 1.0: "The Unicode standard includes the character content of > all major International Standards approved and published before December > 31, 1990... [long list ensues] ... and from various industry standards in > common use (such as code pages and character sets from Adobe, Apple, IBM, > Lotus, Microsoft, WordPerfect, Xerox and others)." > > Even as long ago as 1990, artifacts such as the Atari ST set were > considered obsolete antiquities, and did not rise to the level of the kind > of character listings that we considered when pulling together the original > repertoire. > > And there are several observations to be made about the "weird stuff" we > have accumulated over the years in the pursuit of compatibility. A lot of > stuff that was made up out of whole cloth, rather than being justified by > existing, implemented character sets used in information interchange at the > time, came from the 1991/1992 merger process between the Unicode Standard > and the ISO/IEC 10646 drafts. That's how Unicode acquired blocks full of > Arabic ligatures, for example. > > Other, subsequent additions of small (or even largish) sets of oddball > "characters" that don't fit the prototypical sets of characters for scripts > and/or well-behaved punctuation and symbols, typically have come in with > argued cases for the continued need in current text interchange, for > complete coverage.
For example, that is how we ended up filling out Zapf > dingbats with some glyph pieces that had been omitted in the initial > repertoire for that block. More recently, of course, the continued > importance of Wingdings and Webdings font encodings on the Windows platform > led the UTC to filling out the set of graphical dingbats to cover those > sets. And of course, we first started down the emoji track because of the > need to interchange text originating from widely deployed Japanese carrier > sets implemented as extensions to Shift-JIS. > > I don't think the early calculator character sets, or sets for the Atari > ST and similar early consumer computer electronics fit the bill, precisely > because there isn't a real text data interchange case to be made for > character encoding. Many of the elements you have mentioned, for example, > like the inverse/negative squared versions of letters and symbols, are > simply idiosyncratic aspects of the UI for the devices, in an era when font > generators were hard coded and very primitive indeed. > > Documenting these early uses, and pointing out parts of the UI and > character usage that aren't part of the character repertoire in the Unicode > Standard seems an interesting pursuit to me. But absent a true textual data > interchange issue for these long-gone, obsolete devices, I don't really see > a case to be made for spending time in the UTC defining a bunch of > compatibility characters to encode for them. > > --Ken > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Mar 27 14:17:20 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 12:17:20 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> announcements at Unicode dot org wrote: > … and new regional flags for England, Scotland, and Wales.
It's not clear from this text, nor from the table in Section C.1.1 of the draft, what the status is of flag emoji tag sequences other than the three above. I read the relevant section a couple of times and could not figure out how a "standard sequence" differs from a non-standard one, or how ordinary users are supposed to know the difference. The term "standard sequence" appears nowhere in the draft except as a table header. Vendors always have the option of supporting or not supporting a glyph for any code point or sequence -- note 4 in Section C.1 and the second sentence in C.1.1 both reinforce this long-standing principle -- so there must be something more here. -- Doug Ewell | Thornton, CO, US | ewellic.org ?? From verdy_p at wanadoo.fr Mon Mar 27 15:30:58 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 22:30:58 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: 2017-03-27 21:17 GMT+02:00 Doug Ewell : > announcements at Unicode dot org wrote: > > > … and new regional flags for England, Scotland, and Wales. > > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the codes used do not specify clearly about which region code standard they are referring to. We just see that it's an ISO3166-1 country/territory code followed directly (without separator) by sequences of letter/digits, all of them converted to RIS and surrounded by the same initial emoji code and the DEL from RIS.
The problem is how to choose the codes for the letter/digits in the second part, if they ever come from ISO3166-2 after dropping the hyphen separator (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB) or somewhere else. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 27 15:34:09 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 13:34:09 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On 3/27/2017 12:17 PM, Doug Ewell wrote: > announcements at Unicode dot org wrote: > >> … and new regional flags for England, Scotland, and Wales. > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > > I read the relevant section a couple of times and could not figure out > how a "standard sequence" differs from a non-standard one, or how > ordinary users are supposed to know the difference. The term "standard > sequence" appears nowhere in the draft except as a table header. The terminology is still a bit in flux, which is why the text of UTS #51 is still under review, before being finalized at the UTC meeting in May. But the data for Emoji 5.0 is final, and there are precisely 3 "emoji tag sequences" in the relevant data file: http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt As for how "users" are supposed to know the difference. Well, they don't. What matters is that the data file that the "implementers" will use has these 3 emoji tag sequences in it, so that is quite likely what everybody will see added to their phones. The "users" will just see 3 more flags.
And if they want a flag of California (or whatever), then they need to badger the platform vendors, who will then come back to the Emoji SC, saying, "Help! We need to add a flag of California, or people won't buy our phones!" And if a flag of California (or Pomerania or ...) then gets added to the list of emoji tag sequences in a future version of the data, there is a good chance that the "users" will then see the difference, because that flag will appear on their phones eventually. Anybody could *attempt* to convey a flag of Pomerania (a rather handsome black gryphon on a yellow background, btw) with an emoji tag sequence right now, I suppose. Good luck on any input support or actual interoperability or availability in any font on any standard platform, however. You'd just get fallback display. If conveying flags of Pomerania is in your near term future, I'd advise sticking to images. ;-) --Ken > > Vendors always have the option of supporting or not supporting a glyph > for any code point or sequence -- note 4 in Section C.1 and the second > sentence in C.1.1 both reinforce this long-standing principle -- so > there must be something more here. 
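Ken's point that Emoji 5.0 defines exactly three valid tag sequences can be checked mechanically. A minimal sketch of a decoder/validator follows; the set of valid ids is hard-coded here from the Emoji 5.0 data file rather than parsed from it, and the function name `decode_tag_flag` is mine, not an API from any library:

```python
# Emoji 5.0 emoji-sequences.txt lists exactly three valid emoji tag
# sequences, for these CLDR subdivision ids (hard-coded for illustration):
VALID_IDS_5_0 = {"gbeng", "gbsct", "gbwls"}

def decode_tag_flag(s: str):
    """Return the subdivision id of a well-formed emoji tag sequence, or None.

    Expected shape: U+1F3F4 WAVING BLACK FLAG, then tag characters in
    U+E0020..U+E007E, then U+E007F CANCEL TAG.
    """
    cps = [ord(c) for c in s]
    if len(cps) < 3 or cps[0] != 0x1F3F4 or cps[-1] != 0xE007F:
        return None
    if not all(0xE0020 <= cp <= 0xE007E for cp in cps[1:-1]):
        return None
    # Each tag character is U+E0000 plus the ASCII code of one letter/digit.
    return "".join(chr(cp - 0xE0000) for cp in cps[1:-1])

scotland = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
assert decode_tag_flag(scotland) == "gbsct"
assert decode_tag_flag(scotland) in VALID_IDS_5_0  # well-formed AND valid
```

Well-formedness (the shape above) and validity (membership in the data file's list) are separate checks, which is exactly the distinction Ken draws: a Pomerania sequence would be well-formed but not valid.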
> From verdy_p at wanadoo.fr Mon Mar 27 15:39:10 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 22:39:10 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Note also that ISO3166-2 is far from being stable, and this could contradict Unicode encoding stability: it would then be required to ensure this stability by only allowing sequences that are effectively registered in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt (independently of the registration in ISO 3166-2), and nothing is said if ever ISO3166-2 obsoletes some codes and then some years later decides to reassign these codes to new entities: it should not be possible to do the same thing in Emoji sequences, and specific assignments will need to be made in the Unicode database. Note also that most recently created administrative divisions do not really adopt any flag, but if flags are used they may be reusing flags from older historic entities... or they could adopt only a logo (with legal protection, not really suitable for encoding in the UCS as it won't be possible to define any "representative glyph" without asking for permission to the relevant authorities for displaying some design, possibly simplified) We still lack an encoding standard for vexillologists. And for now only "Flags of the World" proposes some encoding (not based strictly and only on ISO3166). I think that the UTC should try contacting authors of Flags of the World and seek for advice there: we are speaking here about regional flags (we can exclude some graphical variants such as civil vs. navy flags vs honorific flags) 2017-03-27 22:30 GMT+02:00 Philippe Verdy : > > 2017-03-27 21:17 GMT+02:00 Doug Ewell : >> announcements at Unicode dot org wrote: >> >> > … and new regional flags for England, Scotland, and Wales.
>> >> It's not clear from this text, nor from the table in Section C.1.1 of >> the draft, what the status is of flag emoji tag sequences other than the >> three above. >> > > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the > codes used do not specify clearly about which region code standard they are > referring to. We just see that it's an ISO3166-1 country/territory code > followed directly (without separator) by sequences of letter/digits, all of > them converted to RIS and surrounded by the same initial emoji code and > the DEL from RIS. > > The problem is how to choose the codes for the letter/digits in the second > part, if they ever come from ISO3166-2 after dropping the hyphen separator > (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB) > or somewhere else. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Mon Mar 27 16:19:38 2017 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 27 Mar 2017 14:19:38 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <33d67dda-3758-34ef-f5a3-a5ff4f669843@att.net> On 3/27/2017 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to > ensure this stability by only allowing sequences that are effectively > registered in > http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt > (independently of the registration in ISO 3166-2), and nothing is said > if ever ISO3166-2 obsoletes some codes and then some years later > decides to reassign these codes to new entities: it should not be > possible to do the same thing in Emoji sequences, and specific > assignments will need to be made in the Unicode database. > These emoji tag sequences don't derive their stability from ISO 3166-2.
The emoji tag sequences depend on: CLDR Unicode Locale Identifiers, and more specifically, for these subregions, on the unicode_subdivision_id: http://unicode.org/reports/tr35/index.html#unicode_subdivision_id And the data for that is here: http://unicode.org/repos/cldr/tags/latest/common/validity/subdivision.xml The stability for such tags is baked into the CLDR repository, as I understand it. By the way, if anybody is looking, Pomerania is there: "plpm" among the 4925 other valid unicode_subdivision_id values. So: Flag of Pomerania = 1F3F4 E0070 E006C E0070 E006D E007F But alas, that is not a *valid* emoji tag sequence (yet), so no soup for you! --Ken From richard.wordingham at ntlworld.com Mon Mar 27 16:32:25 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Mon, 27 Mar 2017 22:32:25 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <20170327223225.73cb528e@JRWUBU2> On Mon, 27 Mar 2017 13:34:09 -0700 Ken Whistler wrote: > And if a flag of > California (or Pomerania or ...) then gets added to the list of emoji > tag sequences in a future version of the data, there is a good chance > that the "users" will then see the difference, because that flag will > appear on their phones eventually. Indeed, why isn't the flag of Texas there already so as to terminate the abuse of . Technically, at least, it has the justification of being a formerly independent country, though I don't know that they have any national teams. Is anyone working on the issue of flags for the whole of Ireland? Different sports have their own 'national' flags. Pomerania will be a bit tricky, as it isn't any recent administrative division. Richard. 
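Ken's "Flag of Pomerania" sequence above is derived mechanically from the CLDR unicode_subdivision_id: U+1F3F4 WAVING BLACK FLAG, then one tag character per letter/digit of the id (U+E0000 plus its ASCII code point), then U+E007F CANCEL TAG. A minimal sketch of that construction (the helper name `subdivision_flag` is mine, not an API from any library):

```python
def subdivision_flag(subdivision_id: str) -> str:
    """Build an emoji tag sequence for a CLDR unicode_subdivision_id.

    Shape: U+1F3F4 WAVING BLACK FLAG, then one tag character per
    letter/digit (U+E0000 + its ASCII code point), then U+E007F
    CANCEL TAG.
    """
    tags = "".join(chr(0xE0000 + ord(c)) for c in subdivision_id.lower())
    return "\U0001F3F4" + tags + "\U000E007F"

# Ken's example: "plpm" yields 1F3F4 E0070 E006C E0070 E006D E007F
pomerania = subdivision_flag("plpm")
print(" ".join(f"{ord(c):04X}" for c in pomerania))
# 1F3F4 E0070 E006C E0070 E006D E007F
```

The same function produces the three Emoji 5.0 sequences from "gbeng", "gbsct" and "gbwls"; whether a given output is a *valid* sequence is a separate question answered by the emoji-sequences.txt data file, as Ken notes.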
From doug at ewellic.org Mon Mar 27 16:39:53 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 14:39:53 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327143953.665a7a7059d7ee80bb4d670165c8327d.9232a4d4ec.wbe@email03.godaddy.com> Ken Whistler wrote: > As for how "users" are supposed to know the difference. Well, they > don't. What matters is that the data file that the "implementers" will > use has these 3 emoji tag sequences in it, so that is quite likely > what everybody will see added to their phones. The "users" will just > see 3 more flags. So, no provision for a UI like the one I'm building, to let users select a region or subdivision and generate the corresponding sequence? Mmh. Well, anyway. > And if they want a flag of California (or whatever), then they need to > badger the platform vendors, who will then come back to the Emoji SC, > saying, "Help! We need to add a flag of California, or people won't > buy our phones!" The way nobody will buy their phones unless they support all 5 skin tones for all 3 flavors of "vampire" or "elf" or "fairy" or "person in lotus position"? Those are also generative mechanisms, but not limited to just a couple of combinations deemed worthy. If flags have to be added one by one, a lot of them (including the really useful ones, like California and Bavaria) will probably never happen. -- Doug Ewell | Thornton, CO, US | ewellic.org From frederic.grosshans at gmail.com Mon Mar 27 16:46:34 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Mon, 27 Mar 2017 23:46:34 +0200 Subject: Encoding of old compatibility characters In-Reply-To: References: Message-ID: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> An example of a legacy character successfully encoded recently is ⏨ U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. It came from the Soviet standard GOST 10859-64 and the German standard ALCOR.
And was proposed by Leo Broukhis in this proposal http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a discussion on this mailing list here http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where Ken Whistler was already sceptical about the usefulness of this encoding. On 27/03/2017 at 16:44, Charlotte Buff wrote: > I've recently developed an interest in old legacy text encodings and > noticed that there are various characters in several sets that don't > have a Unicode equivalent. I had already started research into these > encodings to eventually prepare a proposal until I realised I should > probably ask on the mailing list first whether it is likely the UTC > will be interested in those characters before I waste my time on a > project that won't achieve anything in the end. > > The character sets in question are ATASCII, PETSCII, the ZX80 set, the > Atari ST set, and the TI calculator sets. So far I've only analyzed > the ZX80 set in great detail, revealing 32 characters not in the UCS. > Most characters are pseudo-graphics, simple pictographs or inverted > variants of other characters. > > Now, one of Unicode's declared goals is to enable round-trip > compatibility with legacy encodings. We've accumulated a lot of weird > stuff over the years in the pursuit of this goal. So it would be > natural to assume that the unencoded characters from the mentioned > sets would also be eligible for inclusion in the UCS. On the other > hand, those encodings are for the most part older than Unicode and so > far there seems to have been little interest in them from the UTC or > WG2, or any of their contributors. Something tells me that if these > character sets were important enough to consider for inclusion, they > would have been encoded a long time ago along with all the other stuff > in Block Elements, Box Drawings, Miscellaneous Symbols etc.
> > Obviously the character sets in question don't receive much use > nowadays (and some weren't even that relevant in their time, either), > which leads me to wonder whether further putting work into this > proposal would be worth it. From doug at ewellic.org Mon Mar 27 16:50:54 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 14:50:54 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> Philippe Verdy wrote: > We still lack an encoding standard for vexillologists. And for now > only "Flags of the World" proposes some encoding (not based strictly > and only on ISO3166). I think that the UTC should try contacting > authors of Flags of the World and seek for advice there: we are > speaking here about regional flags (we can exclude some graphical > variants such as civil vs. navy flags vs honorific flags) As Philippe knows, because he and I had this discussion in 2012 and again in 2013: - I have already contacted FOTW. - They have no such encoding, except 3166-1 for countries and the 2-by-3 information code, and they have never proposed one. - I think such a standard would be a great idea, but - I don't think this is any of UTC's business and I'll bet they agree. -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Mon Mar 27 16:53:13 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 23:53:13 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327223225.73cb528e@JRWUBU2> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <20170327223225.73cb528e@JRWUBU2> Message-ID: And the new region of Normandie still has no formal code, but it reuses a flag that was used by one of the two former regions.
Technically I don't see that as a problem except that people may want to display that flag using the code for the former region, and semantically this is different (and also different from the former Duchy before it was partly annexed by France and left the Channel Islands to the new English Crown in the Middle Ages). Even if we are concerned only with encoding modern entities, once these sequences are encoded there will be nobody to restrict their reuse for past entities (just like Unicode cannot rule against the use of a capital Greek Alpha replacing a Capital Latin A, or the fancy use of Latin for "ASCII art", as Unicode does not encode orthographies or languages). Once a sequence is registered, even if it is intended to represent a modern entity, anyone will be using them as they want. This also gives a hint about why encoding stability will be important. But as we know, regional or national entities change their flags and sometimes reuse former flags from other entities. Sooner or later, there will be confusion. I would suggest that if renderers have the capability of rendering colorful flags and provide a UI, at least they should also render some hints, notably the underlying code or a name if available, using for example mouse-hover events to explain these flags and their intended usage: if a former flag is reused by another entity, that new entity should have its own encoding and the former flags should not be affected (its displayed hint should still indicate a reference to their former meaning). 2017-03-27 23:32 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Mon, 27 Mar 2017 13:34:09 -0700 > Ken Whistler wrote: > > > And if a flag of > > California (or Pomerania or ...) then gets added to the list of emoji > > tag sequences in a future version of the data, there is a good chance > > that the "users" will then see the difference, because that flag will > > appear on their phones eventually.
> > Indeed, why isn't the flag of Texas there already so as to terminate > the abuse of . Technically, at least, it has the > justification of being a formerly independent country, though I don't > know that they have any national teams. > > Is anyone working on the issue of flags for the whole of Ireland? > Different sports have their own 'national' flags. > > Pomerania will be a bit tricky, as it isn't any recent administrative > division. > > Richard. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedberg at apple.com Mon Mar 27 16:54:04 2017 From: pedberg at apple.com (Peter Edberg) Date: Mon, 27 Mar 2017 14:54:04 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <8DB7A85C-3892-4208-A609-86709564D4D8@mac.com> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <8DB7A85C-3892-4208-A609-86709564D4D8@mac.com> Message-ID: <5925AB8F-03F4-4FC1-9571-827CB5555AF2@apple.com> (this time from the correct account) Philippe and others, http://www.unicode.org/reports/tr51/tr51-11.html#valid-emoji-tag-sequences refers to CLDR data for the list of valid subregion sequences, see http://unicode.org/reports/tr35/index.html#Validity CLDR data will maintain stable sequences in the event that ISO 3166-2 data changes. - Peter E > On Mar 27, 2017, at 1:39 PM, Philippe Verdy > wrote: > > Note also that ISO3166-2 is far from being stable, and this could contradict Unicode encoding stability: it would then be required to ensure this stability by only allowing sequences that are effectively registered in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt (independently of the registration in ISO 3166-2), and nothing is said if ever ISO3166-2 obsoletes some codes and then some years later decides to reassign these codes to new entities: it should not be possible to do the same thing in Emoji sequences, and specific assignments will need to be made in the Unicode database.
> > Note also that most recently created administrative divisions do not really adopt any flag, but if flags are used they may be reusing flags from older historic entities... or they could adopt only a logo (with legal protection, not really suitable for encoding in the UCS as it won't be possible to define any "representative glyph" without asking for permission to the relevant authorities for displaying some design, possibly simplified) > > We still lack an encoding standard for vexillologists. And for now only "Flags of the World" proposes some encoding (not based strictly and only on ISO3166). I think that the UTC should try contacting authors of Flags of the World and seek for advice there: we are speaking here about regional flags (we can exclude some graphical variants such as civil vs. navy flags vs honorific flags) > > > 2017-03-27 22:30 GMT+02:00 Philippe Verdy >: > > > 2017-03-27 21:17 GMT+02:00 Doug Ewell >: > announcements at Unicode dot org wrote: > > > … and new regional flags for England, Scotland, and Wales. > > It's not clear from this text, nor from the table in Section C.1.1 of > the draft, what the status is of flag emoji tag sequences other than the > three above. > > Right, we've got them encoded as [GBENG], [GBSCT] and [GBWLS], but the codes used do not specify clearly about which region code standard they are referring to. We just see that it's an ISO3166-1 country/territory code followed directly (without separator) by sequences of letter/digits, all of them converted to RIS and surrounded by the same initial emoji code and the DEL from RIS. > > The problem is how to choose the codes for the letter/digits in the second part, if they ever come from ISO3166-2 after dropping the hyphen separator (this is the case here, see https://en.wikipedia.org/wiki/ISO_3166-2:GB ) or somewhere else. > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From verdy_p at wanadoo.fr Mon Mar 27 16:54:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 27 Mar 2017 23:54:53 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> References: <20170327145054.665a7a7059d7ee80bb4d670165c8327d.2b1ba2ec33.wbe@email03.godaddy.com> Message-ID: So it's up to the UTC to create this encoding: this new release is a start for a new vexillology registry (within encoded sequences) which creates a new standard for them. 2017-03-27 23:50 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > > We still lack an encoding standard for vexillologists. And for now > > only "Flags of the World" proposes some encoding (not based strictly > > and only on ISO3166). I think that the UTC should try contacting > > authors of Flags of the World and seek for advice there: we are > > speaking here about regional flags (we can exclude some graphical > > variants such as civil vs. navy flags vs honorific flags) > > As Philippe knows, because he and I had this discussion in 2012 and > again in 2013: > > - I have already contacted FOTW. > - They have no such encoding, except 3166-1 for countries and the 2-by-3 > information code, and they have never proposed one. > - I think such a standard would be a great idea, but > - I don't think this is any of UTC's business and I'll bet they agree. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From frederic.grosshans at gmail.com Mon Mar 27 17:05:28 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Tue, 28 Mar 2017 00:05:28 +0200 Subject: Encoding of old compatibility characters In-Reply-To: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> Message-ID: <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> Another example, about to be encoded, is the GOUP MARK, used on old IBM computers (proposal: ML threads: http://www.unicode.org/mail-arch/unicode-ml/y2015-m01/0040.html , and http://unicode.org/mail-arch/unicode-ml/y2007-m05/0367.html ) Le 27/03/2017 à 23:46, Frédéric Grosshans a écrit : > An example of a legacy character successfully encoded recently is ⏨ > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. > > > Le 27/03/2017 à 16:44, Charlotte Buff a écrit : >> I’ve recently developed an interest in old legacy text encodings and >> noticed that there are various characters in several sets that don’t >> have a Unicode equivalent. I had already started research into these >> encodings to eventually prepare a proposal until I realised I should >> probably ask on the mailing list first whether it is likely the UTC >> will be interested in those characters before I waste my time on a >> project that won’t achieve anything in the end. >> >> The character sets in question are ATASCII, PETSCII, the ZX80 set, >> the Atari ST set, and the TI calculator sets. So far I’ve only >> analyzed the ZX80 set in great detail, revealing 32 characters not in >> the UCS. 
Most characters are pseudo-graphics, simple pictographs or >> inverted variants of other characters. >> >> Now, one of Unicode’s declared goals is to enable round-trip >> compatibility with legacy encodings. We’ve accumulated a lot of weird >> stuff over the years in the pursuit of this goal. So it would be >> natural to assume that the unencoded characters from the mentioned >> sets would also be eligible for inclusion in the UCS. On the other >> hand, those encodings are for the most part older than Unicode and so >> far there seems to have been little interest in them from the UTC or >> WG2, or any of their contributors. Something tells me that if these >> character sets were important enough to consider for inclusion, they >> would have been encoded a long time ago along with all the other >> stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. >> >> Obviously the character sets in question don’t receive much use >> nowadays (and some weren’t even that relevant in their time, either), >> which leads me to wonder whether further putting work into this >> proposal would be worth it. > > From doug at ewellic.org Mon Mar 27 17:08:27 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 15:08:27 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327150827.665a7a7059d7ee80bb4d670165c8327d.2fd8210960.wbe@email03.godaddy.com> Ken Whistler wrote: > By the way, if anybody is looking, Pomerania is there: "plpm" among > the 4925 other valid unicode_subdivision_id values. So: > > Flag of Pomerania = 1F3F4 E0070 E006C E0070 E006D E007F > > But alas, that is not a *valid* emoji tag sequence (yet), so no soup > for you! This is a major letdown, after almost two years following the progress of flag tag sequences, to find that the arguments that "these three flags are special because they appear in international sports" have won the day and the others are demoted to "non-standard." 
That was never implied in any of the published UTC documents before. I've collected well over 800 subdivision flags, and I'm sure there are hundreds more, each with its own proud constituency. Vendors don't want to bother adding a glyph for Saskatchewan or Neuquén or Yamagata? They don't have to; they never had to. But now they're essentially being told not to. This was the only aspect of emoji I had the slightest interest in. Boo. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Mon Mar 27 17:10:17 2017 From: doug at ewellic.org (Doug Ewell) Date: Mon, 27 Mar 2017 15:10:17 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170327151017.665a7a7059d7ee80bb4d670165c8327d.fb76d7ed31.wbe@email03.godaddy.com> Philippe Verdy wrote: > So it's up to the UTC to create this encoding: this new release is a > start for a new vexillology registry (within encoded sequences) which > creates a new standard for them. Fine. If you think you can persuade UTC that this is within their scope, go ahead. Let us know how that works out. -- Doug Ewell | Thornton, CO, US | ewellic.org From jr at qsm.co.il Mon Mar 27 17:43:17 2017 From: jr at qsm.co.il (Jonathan Rosenne) Date: Mon, 27 Mar 2017 22:43:17 +0000 Subject: Encoding of old compatibility characters In-Reply-To: <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <2ef83d81-509d-e2f4-7a99-158ca63d48c4@gmail.com> Message-ID: GROUP MARK Best Regards, Jonathan Rosenne -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Frédéric Grosshans Sent: Tuesday, March 28, 2017 1:05 AM To: unicode Subject: Re: Encoding of old compatibility characters Another example, about to be encoded, is the GOUP MARK, used on old IBM computers (proposal: ML threads: http://www.unicode.org/mail-arch/unicode-ml/y2015-m01/0040.html , and http://unicode.org/mail-arch/unicode-ml/y2007-m05/0367.html ) Le 27/03/2017 à 
23:46, Frédéric Grosshans a écrit : > An example of a legacy character successfully encoded recently is ⏨ > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. > > > Le 27/03/2017 à 16:44, Charlotte Buff a écrit : >> I’ve recently developed an interest in old legacy text encodings and >> noticed that there are various characters in several sets that don’t >> have a Unicode equivalent. I had already started research into these >> encodings to eventually prepare a proposal until I realised I should >> probably ask on the mailing list first whether it is likely the UTC >> will be interested in those characters before I waste my time on a >> project that won’t achieve anything in the end. >> >> The character sets in question are ATASCII, PETSCII, the ZX80 set, >> the Atari ST set, and the TI calculator sets. So far I’ve only >> analyzed the ZX80 set in great detail, revealing 32 characters not in >> the UCS. Most characters are pseudo-graphics, simple pictographs or >> inverted variants of other characters. >> >> Now, one of Unicode’s declared goals is to enable round-trip >> compatibility with legacy encodings. We’ve accumulated a lot of weird >> stuff over the years in the pursuit of this goal. So it would be >> natural to assume that the unencoded characters from the mentioned >> sets would also be eligible for inclusion in the UCS. On the other >> hand, those encodings are for the most part older than Unicode and so >> far there seems to have been little interest in them from the UTC or >> WG2, or any of their contributors. 
Something tells me that if these >> character sets were important enough to consider for inclusion, they >> would have been encoded a long time ago along with all the other >> stuff in Block Elements, Box Drawings, Miscellaneous Symbols etc. >> >> Obviously the character sets in question don’t receive much use >> nowadays (and some weren’t even that relevant in their time, either), >> which leads me to wonder whether further putting work into this >> proposal would be worth it. > > From markus.icu at gmail.com Mon Mar 27 18:33:43 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 16:33:43 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 1:34 PM, Ken Whistler wrote: > Anybody could *attempt* to convey a flag of Pomerania (a rather handsome > black gryphon on a yellow background, btw) with an emoji tag sequence right > now, I suppose. I suppose not. Since it's bound to ISO 3166 subdivision codes (possibly with CLDR additions), it would have to be "demv" for https://en.wikipedia.org/wiki/Mecklenburg-Vorpommern or codes for adjacent regions in Poland. markus -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From markus.icu at gmail.com Mon Mar 27 18:35:18 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 16:35:18 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy wrote: > Note also that ISO3166-2 is far from being stable, and this could > contradict Unicode encoding stability: it would then be required to ensure > this stability by only allowing sequences that are effectively registered > in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt > (independently of the registration in ISO3166-2), and nothing is said if > ever ISO3166-2 obsoletes some codes and then some years later decide to > reassign these codes to new entities: it should not be possible to do the > same thing in Emoji sequences, and specific assignments will need to be > made in the Unicode database. > The emoji sequences are stable. Please read http://www.unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences and follow the links to the CLDR spec and data. Let SD be the result of mapping each character in the tag_spec to a character in [0-9a-z] by subtracting 0xE0000. 1. SD must then be a specification as per [CLDR ] of either a Unicode subdivision_id ( data ) or a 3-digit unicode_region_subtag ( data ), and 2. SD must have CLDR idStatus equal to "regular" or "deprecated". markus -------------- next part -------------- An HTML attachment was scrubbed... 
the unicode_region_subtag data does not contain anything about the flags for the first 3 regions in GB. 2017-03-28 1:35 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy > wrote: > >> Note also that ISO3166-2 is far from being stable, and this could >> contradict Unicode encoding stability: it would then be required to ensure >> this stability by only allowing sequences that are effectively registered >> in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt >> (independantly of the registration ins ISO3166-2), and nothing is said if >> ever ISO3166-2 obsoletes some codes and then some years later decide to >> reassign these codes to new entities: it should not be possible to do the >> same thing in Emoji sequences, and specific assignments will need to be >> made in the Unicode database. >> > > The emoji sequences are stable. Please read http://www.unicode.org/ > reports/tr51/proposed.html#valid-emoji-tag-sequences and follow the links > to the CLDR spec and data. > > Let SD be the result of mapping each character in the tag_spec to a > character in [0-9a-z] by subtracting 0xE0000. > > > 1. SD must then be a specification as per [CLDR > ] of either > a Unicode subdivision_id > > (data > ) > or a 3-digit unicode_region_subtag > ( > data > ), > and > 2. SD must have CLDR idStatus equal to "regular" or "deprecated". > > > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Mon Mar 27 19:04:06 2017 From: prosfilaes at gmail.com (David Starner) Date: Tue, 28 Mar 2017 00:04:06 +0000 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: On Mon, Mar 27, 2017 at 1:34 AM Martin J. 
Dürst wrote: > The qualification 'minor' is less important for an alphabet. In general, > the more established and well-known an alphabet is, the wider the > variations of glyph shapes that may be tolerated. > My problem with that is that a new script is likely to have wider variation in properties. It invites people to tinker, with the possibility that any new changes have a chance to become popular. And variants that show up in Latin script, like http://www.gutenberg.org/files/20130/20130-h/20130-h.htm , don't tend to get encoded unless they have serious support. When the discussion of the Hopi-English dictionary comes up, I'm reminded that the Siouan alphabet for Latin, https://commons.wikimedia.org/wiki/File:BAE-Siouan_Alphabet.png , was rejected for encoding, at least on this list, because it was only used in one set of publications that were distributed to every major library in the US, unlike the Hopi dictionary that was stuck in an archive somewhere. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Mar 27 19:06:36 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 02:06:36 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Also these yellow statements from the initial proposal are contradicting what is now published in TR51: "UN" and "EU" are accepted even if they are "macroregions", not satisfying the quoted condition 2 in the proposed update. 2017-03-28 1:58 GMT+02:00 Philippe Verdy : > This only describes the sequences encoded with 2 characters, not the newer > longer sequences for flags of subnational regions. the > unicode_region_subtag data does not contain anything about the flags for > the first 3 regions in GB. > > 2017-03-28 1:35 GMT+02:00 Markus Scherer : > >> On Mon, Mar 27, 2017 at 1:39 PM, Philippe Verdy >> wrote: >> >>> Note also that ISO3166-2 is far from being stable, and this could >>> contradict Unicode encoding stability: it would then be required to ensure >>> this stability by only allowing sequences that are effectively registered >>> in http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt >>> (independantly of the registration ins ISO3166-2), and nothing is said if >>> ever ISO3166-2 obsoletes some codes and then some years later decide to >>> reassign these codes to new entities: it should not be possible to do the >>> same thing in Emoji sequences, and specific assignments will need to be >>> made in the Unicode database. >>> >> >> The emoji sequences are stable. Please read >> http://www.unicode.org/reports/tr51/proposed.html#valid- >> emoji-tag-sequences and follow the links to the CLDR spec and data. >> >> Let SD be the result of mapping each character in the tag_spec to a >> character in [0-9a-z] by subtracting 0xE0000. >> >> >> 1. 
SD must then be a specification as per [CLDR >> ] of >> either a Unicode subdivision_id >> >> (data >> ) >> or a 3-digit unicode_region_subtag >> >> (data >> ), >> and >> 2. SD must have CLDR idStatus equal to "regular" or "deprecated". >> >> >> markus >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 19:09:26 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 02:09:26 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I followed the links. Check your links, you are referencing the proposal, and this contradicts the published version 4.0 of TR51. Where is stability ? 2017-03-28 2:06 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 4:58 PM, Philippe Verdy > wrote: > >> This only describes the sequences encoded with 2 characters, not the >> newer longer sequences for flags of subnational regions. the >> unicode_region_subtag data does not contain anything about the flags for >> the first 3 regions in GB. >> > > Please read again what I quoted, and do follow the links. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Mon Mar 27 19:19:06 2017 From: everson at evertype.com (Michael Everson) Date: Tue, 28 Mar 2017 01:19:06 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> Message-ID: I’ll look into whatever you’re on about the other ‘minor’ script, but with regard to what you’ve said below, I’m fairly sure I encoded the missing characters there. I believe it was A7AE and A7B0, capital letters turned K and T used in that orthography. 
There is a problem with turned P and p in that orthography, though, but no one has ever chosen to look at that. But apart from dealing with the turned p, I do not believe it’s correct to say that that alphabet was ‘rejected’. Oh, there is a problem with the turned cedilla above; that seems to be missing too. > On 28 Mar 2017, at 01:04, David Starner wrote: > > When the discussion of the Hopi-English dictionary comes up, I'm reminded that the Siouan alphabet for Latin, https://commons.wikimedia.org/wiki/File:BAE-Siouan_Alphabet.png , was rejected for encoding, at least on this list, because it was only used in one set of publications that were distributed to every major library in the US, unlike the Hopi dictionary that was stuck in an archive somewhere. From mark at kli.org Mon Mar 27 19:22:04 2017 From: mark at kli.org (Mark E. Shoulson) Date: Mon, 27 Mar 2017 20:22:04 -0400 Subject: Encoding of old compatibility characters In-Reply-To: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> Message-ID: <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> On 03/27/2017 05:46 PM, Frédéric Grosshans wrote: > An example of a legacy character successfully encoded recently is ⏨ > U+23E8 DECIMAL EXPONENT SYMBOL, encoded in Unicode 5.2. > It came from the Soviet standard GOST 10859-64 and the German standard > ALCOR. And was proposed by Leo Broukhis in this proposal > http://www.unicode.org/L2/L2008/08030r-subscript10.pdf . It follows a > discussion on this mailing list here > http://www.unicode.org/mail-arch/unicode-ml/y2008-m01/0123.html, where > Ken Whistler was already sceptical about the usefulness of this encoding. Aw, but ⏨ is awesome! It's much cooler-looking and more visually understandable than "e" for exponent notation. In some code I've been playing around with I support it as a valid alternative to "e". 
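To make the idea concrete (a sketch of my own, not the code Mark refers to): a number parser can accept U+23E8 DECIMAL EXPONENT SYMBOL as an exponent marker alongside "e"/"E" simply by normalizing it before handing the string to an ordinary float parser, e.g. in Python:

```python
def parse_number(s: str) -> float:
    """Parse a decimal literal, also accepting U+23E8 DECIMAL EXPONENT
    SYMBOL as an exponent marker, so '1.5\u23e83' means the same as '1.5e3'."""
    # Normalize the legacy exponent symbol to 'e', then reuse float().
    return float(s.replace('\u23e8', 'e'))

print(parse_number('1.5\u23e83'))  # 1500.0
```

The same one-character normalization works in any language whose standard float parser accepts "e" notation.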
~mark From markus.icu at gmail.com Mon Mar 27 19:28:10 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Mon, 27 Mar 2017 17:28:10 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy wrote: > I followed the links. Check your links, you are referencing the proposal, > and this contradicts the published version 4.0 of TR51. Where is stability ? > Of course I am pointing to the proposal. The version of TR 51 under review adds a mechanism that didn't exist before. It's an addition, not a contradiction. Once it's there it will be stable. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Mar 27 20:38:44 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 03:38:44 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I try to summarize the situation for France, There are some missing codes France m?tropolitaine (deprecated: [fx]): D?partements m?tropolitains: [fr01~19 fr2a~b fr21~68 fr70-95] (unchanged) [fr6d] Rh?ne (d?partement) (missing, included in [fr69]?) Statuts particuliers: [fr69] Rh?ne (circonscription d?partementale) [fr6m] M?tropole de Lyon (missing, included in [fr69]?) R?gions m?tropolitaines: [frara] Auvergne-Rh?ne-Alpes (new) - Auvergne (former) (deprecated: [frc]) - Rh?ne-Alpes (former) (deprecated: [frv]) [frbfc] Bourgogne-Franche-Comt? (new) - Bourgogne (former) (deprecated: [frd]) - Franche-Comt? (former) (deprecated: [fri]) [frbre] Bretagne (unchanged) (deprecated: [fre]) [frcor] Corse (collectivit? 
territoriale de) (deprecated: [frh]) [frcvl] Centre-Val de Loire (deprecated: [frf]) [frges] Grand-Est (new) - Alsace (former) (deprecated: [fra]) - Champagne-Ardenne (former) (deprecated: [frg]) - Franche-Comt? (former) (deprecated: [frm]) [frhdf] Hauts-de-France (new) - Nord-Pas-de-Calais (former) (deprecated: [fro]) - Picardie (former) (deprecated: [frs]) [fridf] ?le-de-France (deprecated: [frj]) [frnaq] Nouvelle-Aquitaine (new) - Aquitaine (former) (deprecated: [frb]) - Limousin (former) (deprecated: [frl) - Poitou-Charentes (former) (deprecated: [frt]) [frnor] Normandie (new) - Basse-Normandie (former) (deprecated: [frp]) - Haute-Normandie (former) (deprecated: [frq]) [frocc] Occitanie (new) - Languedoc-Roussillon (former) (deprecated: [frk]) - Midi-Pyr?n?es (former) (deprecated: [frn]) [frpac] Provence-Alpes-Cote d'Azur (deprecated: [fru]) [frpdl] Pays de la Loire (deprecated: [frr]) D?partements/r?gions d'outre-mer (DOM/ROM): [gp] Guadeloupe (d?partement) (deprecated: [frgp]) [frgua] Guadeloupe (r?gion) [mq] Martinique (d?partement) (deprecated: [frmq]) [frmar] Martinique (ancienne r?gion) (missing?) [gf] Guyane (d?partement) (deprecated: [frgf]) [frguy] Guyane (ancienne r?gion) (missing?) [yt] Mayotte (d?partement) (deprecated: [fryt]) [frmay] Mayotte (ancienne collectivit?) [re] La R?union (d?partement) (deprecated: [frre]) [frlre] La R?union (r?gion) Autres outre-mers: Collectivit?s d'outre-mer (COM): [bl] Saint-Barth?lemy (deprecated: [frbl]) [mf] Saint-Martin (partie fran?aise) (deprecated: [frmf]) [pf] Polyn?sie fran?aise (deprecated: [frpf]) [pm] Saint-Pierre-et-Miquelon (deprecated: [frpm]) [tf] Terres australes et antarctiques fran?aises (deprecated: [frtf]) [wf] Wallis-et-Futuna (deprecated: [frwf]) Statuts particuliers: [nc] Nouvelle-Cal?donie (deprecated: [frnc]) [cp] Clipperton (deprecated: [frcp]) 2017-03-28 2:28 GMT+02:00 Markus Scherer : > On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy > wrote: > >> I followed the links. 
Check your links, you are referencing the proposal, >> and this contradicts the published version 4.0 of TR51. Where is stability ? >> > > Of course I am pointing to the proposal. The version of TR 51 under review > adds a mechanism that didn't exist before. It's an addition, not a > contradiction. Once it's there it will be stable. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Tue Mar 28 00:46:59 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Tue, 28 Mar 2017 14:46:59 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> Message-ID: <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> It would be combinations of Hiragana, Katakana, Kanji. On 03/28/17 02:25, Koji Ishii-san wrote: > I think he meant Kanji/Han ideographic by "committed string". > > 2017-03-27 19:04 GMT+09:00 Takao Fujiwara >: > > On 03/27/17 18:48, Mark Davis ??-san wrote: > > By "committed strings", you mean the hiragana phonetic reading? > > > Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. > After users select one of the converted strings, the converted strings are committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. > > Fujiwara > > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara >> wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > > > > That file includes Hiragana only but I'd need another file which has the committed strings, likes ja_convert.xml. > E.g. > ? | ?? | ???? > > instead of > > ??? | ???? | ???? 
> > I think the committed version is useful without input method and it follows other languages. > > Thanks, > Fujiwara > > > > From mark at macchiato.com Tue Mar 28 00:57:52 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 07:57:52 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: To add to what Ken and Markus said: like many other identifiers, there are a number of different categories. 1. *Ill-formed: *"$1" 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence, but is not *valid* according to http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences . 3. *Valid, but not recommended: "usca". *Corresponds to the valid Unicode subdivision code for California according to http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. 4. *Recommended:* "gbsct". Corresponds to the valid Unicode subdivision code for Scotland, and *is* listed in http://unicode.org/Public/emoji/5.0/. As Ken says, the terminology is a little bit in flux for term 'recommended'. TR51 is still open for comment, although we won't make any changes that would invalidate http://unicode.org/Public/emoji/5.0/. ==== I would also encourage people to look at the slides on http://unicode.org/emoji/, together with the speaker notes, since some of those slides present this very issue. I'm sure the people on this list will have some useful comments for improvements. Another item: with Tayfun's help, we updated http://unicode.org/press/emoji.html. If people have any feedback on other articles that should be on that list, please let us know... 
Mark Mark On Tue, Mar 28, 2017 at 2:28 AM, Markus Scherer wrote: > On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy > wrote: > >> I followed the links. Check your links, you are referencing the proposal, >> and this contradicts the published version 4.0 of TR51. Where is stability ? >> > > Of course I am pointing to the proposal. The version of TR 51 under review > adds a mechanism that didn't exist before. It's an addition, not a > contradiction. Once it's there it will be stable. > markus > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 01:12:17 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 08:12:17 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: (I'm sure you know this, Philippe, but a reminder for others: as far as the Unicode projects go, discussions on this list have no effect unless they are turned into a submission (UTC or Emoji proposal, CLDR or ICU ticket).) If you see any problems in the CLDR data, please file a ticket at http://unicode.org/cldr/trac/newticket. Please only include the problem cases. (Note that it is *not* a goal for CLDR to include all ISO subdivisions going back in time; just back to 2015-09. And even there, if an ISO subdivision is introduced after the start of a CLDR version, but retracted before that version releases, it won't be included. If retracted in a later version, it is moved to the deprecated set.) Mark 2017-03-28 3:38 GMT+02:00 Philippe Verdy : > I try to summarize the situation for France, There are some missing codes > > France m?tropolitaine (deprecated: [fx]): > D?partements m?tropolitains: > [fr01~19 fr2a~b fr21~68 fr70-95] (unchanged) > [fr6d] Rh?ne (d?partement) (missing, included > in [fr69]?) 
> Statuts particuliers: > [fr69] Rh?ne (circonscription d?partementale) > [fr6m] M?tropole de Lyon (missing, included > in [fr69]?) > R?gions m?tropolitaines: > [frara] Auvergne-Rh?ne-Alpes (new) > - Auvergne (former) (deprecated: [frc]) > - Rh?ne-Alpes (former) (deprecated: [frv]) > [frbfc] Bourgogne-Franche-Comt? (new) > - Bourgogne (former) (deprecated: [frd]) > - Franche-Comt? (former) (deprecated: [fri]) > [frbre] Bretagne (unchanged) (deprecated: [fre]) > [frcor] Corse (collectivit? territoriale de) (deprecated: [frh]) > [frcvl] Centre-Val de Loire (deprecated: [frf]) > [frges] Grand-Est (new) > - Alsace (former) (deprecated: [fra]) > - Champagne-Ardenne (former) (deprecated: [frg]) > - Franche-Comt? (former) (deprecated: [frm]) > [frhdf] Hauts-de-France (new) > - Nord-Pas-de-Calais (former) (deprecated: [fro]) > - Picardie (former) (deprecated: [frs]) > [fridf] ?le-de-France (deprecated: [frj]) > [frnaq] Nouvelle-Aquitaine (new) > - Aquitaine (former) (deprecated: [frb]) > - Limousin (former) (deprecated: [frl) > - Poitou-Charentes (former) (deprecated: [frt]) > [frnor] Normandie (new) > - Basse-Normandie (former) (deprecated: [frp]) > - Haute-Normandie (former) (deprecated: [frq]) > [frocc] Occitanie (new) > - Languedoc-Roussillon (former) (deprecated: [frk]) > - Midi-Pyr?n?es (former) (deprecated: [frn]) > [frpac] Provence-Alpes-Cote d'Azur (deprecated: [fru]) > [frpdl] Pays de la Loire (deprecated: [frr]) > D?partements/r?gions d'outre-mer (DOM/ROM): > [gp] Guadeloupe (d?partement) (deprecated: [frgp]) > [frgua] Guadeloupe (r?gion) > [mq] Martinique (d?partement) (deprecated: [frmq]) > [frmar] Martinique (ancienne r?gion) (missing?) > [gf] Guyane (d?partement) (deprecated: [frgf]) > [frguy] Guyane (ancienne r?gion) (missing?) > [yt] Mayotte (d?partement) (deprecated: [fryt]) > [frmay] Mayotte (ancienne collectivit?) 
>     [re] La Réunion (département) (deprecated: [frre])
>     [frlre] La Réunion (région)
>   Autres outre-mers:
>     Collectivités d'outre-mer (COM):
>       [bl] Saint-Barthélemy (deprecated: [frbl])
>       [mf] Saint-Martin (partie française) (deprecated: [frmf])
>       [pf] Polynésie française (deprecated: [frpf])
>       [pm] Saint-Pierre-et-Miquelon (deprecated: [frpm])
>       [tf] Terres australes et antarctiques françaises (deprecated: [frtf])
>       [wf] Wallis-et-Futuna (deprecated: [frwf])
>     Statuts particuliers:
>       [nc] Nouvelle-Calédonie (deprecated: [frnc])
>       [cp] Clipperton (deprecated: [frcp])
>
>
> 2017-03-28 2:28 GMT+02:00 Markus Scherer :
>
>> On Mon, Mar 27, 2017 at 5:09 PM, Philippe Verdy
>> wrote:
>>
>>> I followed the links. Check your links, you are referencing the
>>> proposal, and this contradicts the published version 4.0 of TR51. Where is
>>> stability ?
>>>
>>
>> Of course I am pointing to the proposal. The version of TR 51 under
>> review adds a mechanism that didn't exist before. It's an addition, not a
>> contradiction. Once it's there it will be stable.
>> markus
>>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at macchiato.com  Tue Mar 28 01:20:04 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 28 Mar 2017 08:20:04 +0200
Subject: different version of common/annotations/ja.xml
In-Reply-To: <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com>
References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com>
 <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com>
 <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com>
Message-ID: 

Ah, yes. Sorry for my confusion.

One main purpose for the short names is for TTS, and for that I think
people felt that the reading was more useful. However, it would probably be
better for the keywords to have the normal spelling. You might consider
filing a ticket at http://unicode.org/cldr/trac/newticket with a proposal
for change.
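For readers unfamiliar with the file being discussed: a CLDR annotation entry pairs an emoji (the `cp` attribute) with "|"-separated keywords, and a second entry with `type="tts"` carries the short name used for text-to-speech. The sketch below shows the shape of the format only; the dog-face entry and its keywords are illustrative, not a verbatim copy of ja.xml.

```python
# Illustrative sketch of the CLDR annotation format under discussion.
# The cp attribute carries the emoji, the element text the keywords;
# type="tts" marks the short name used for text-to-speech.
import xml.etree.ElementTree as ET

sample = """<ldml><annotations>
  <annotation cp="🐶">いぬ | イヌ | 犬</annotation>
  <annotation cp="🐶" type="tts">いぬ</annotation>
</annotations></ldml>"""

root = ET.fromstring(sample)
# Keywords: the first annotation element for the code point (no type).
keywords = root.find(".//annotation[@cp='🐶']").text.split(" | ")
assert keywords == ["いぬ", "イヌ", "犬"]
```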
Mark On Tue, Mar 28, 2017 at 7:46 AM, Takao Fujiwara wrote: > It would be combinations of Hiragana, Katakana, Kanji. > > On 03/28/17 02:25, Koji Ishii-san wrote: > >> I think he meant Kanji/Han ideographic by "committed string". >> >> 2017-03-27 19:04 GMT+09:00 Takao Fujiwara > tfujiwar at redhat.com>>: >> >> On 03/27/17 18:48, Mark Davis ??-san wrote: >> >> By "committed strings", you mean the hiragana phonetic reading? >> >> >> Hiragana is used to the raw text of the phonetic reading by the >> Japanese input method before the conversion. >> After users select one of the converted strings, the converted >> strings are committed on the text. >> I mean the major conversion of ja.xml is useful instead of >> remembering the raw text as the converted result in the input method. >> >> Fujiwara >> >> >> Mark >> ////// >> >> On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara < >> tfujiwar at redhat.com > tfujiwar at redhat.com >> >> wrote: >> >> Hi, >> >> Do you have any chances to create a different version of >> ja.xml of the Japanese emoji annotation? >> http://unicode.org/cldr/trac/browser/tags/latest/common/anno >> tations/ja.xml >> > otations/ja.xml> >> > otations/ja.xml >> > otations/ja.xml>> >> >> That file includes Hiragana only but I'd need another file >> which has the committed strings, likes ja_convert.xml. >> E.g. >> ? | ?? | ???? >> >> instead of >> >> ??? | ???? | >> ???? >> >> I think the committed version is useful without input method >> and it follows other languages. >> >> Thanks, >> Fujiwara >> >> >> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Mar 28 01:32:03 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 15:32:03 +0900 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> Message-ID: On 2017/03/28 01:03, Michael Everson wrote: > On 27 Mar 2017, at 16:56, John H. Jenkins wrote: > The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. Good to have some actual examples. However, the example at hand does, as far as I understand it, not necessarily support separate encoding. While it mixes 1855 and 1859, it contains only one of the ligature variants each. Indeed, it could be taken as support for the theory that the top and bottom row ligatures in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg were used interchangeably, and that the 1857 St Louis punches just made one particular choice of glyph selection. What would give a strong argument would be the *concurrent* existence of *corresponding* ligatures in the same font, or the concurrent (even better, contrasting) use of corresponding ligatures in the same text. Regards, Martin. What's interesting (weird?) is that the "1859" OI appears in 1857 punches. Time travel? Or is the label "1859" a misnomer or just a convention? 
From tfujiwar at redhat.com Tue Mar 28 02:49:52 2017 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Tue, 28 Mar 2017 16:49:52 +0900 Subject: different version of common/annotations/ja.xml In-Reply-To: References: <59446763-cb73-2400-a37e-92e474ea4c5d@redhat.com> <779695f0-45df-8322-1de8-a8f6928dc022@redhat.com> <57daeb9a-81b0-bd8b-615e-813795070efc@redhat.com> Message-ID: <85acb454-de31-51ce-369c-44c900c9a7c6@redhat.com> Thanks, I will file that ticket. I'd like to have another version of ja.xml for both TTS and non-TTS. Fujiwara On 03/28/17 15:20, Mark Davis ??-san wrote: > Ah, yes. Sorry for my confusion. > > One main purpose for the short names is for TTS, and for that I think people felt that the reading was more useful. However, it would probably be > better for the keywords to have the normal spelling. You might consider filing a ticket at http://unicode.org/cldr/trac/newticket with a proposal for > change. > > Mark > ////// > > On Tue, Mar 28, 2017 at 7:46 AM, Takao Fujiwara > wrote: > > It would be combinations of Hiragana, Katakana, Kanji. > > On 03/28/17 02:25, Koji Ishii-san wrote: > > I think he meant Kanji/Han ideographic by "committed string". > > 2017-03-27 19:04 GMT+09:00 Takao Fujiwara >>: > > On 03/27/17 18:48, Mark Davis ??-san wrote: > > By "committed strings", you mean the hiragana phonetic reading? > > > Hiragana is used to the raw text of the phonetic reading by the Japanese input method before the conversion. > After users select one of the converted strings, the converted strings are committed on the text. > I mean the major conversion of ja.xml is useful instead of remembering the raw text as the converted result in the input method. > > Fujiwara > > > Mark > ////// > > On Mon, Mar 27, 2017 at 11:00 AM, Takao Fujiwara > > >>> wrote: > > Hi, > > Do you have any chances to create a different version of ja.xml of the Japanese emoji annotation? 
> http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml
>
> That file includes Hiragana only but I'd need another file
> which has the committed strings, likes ja_convert.xml.
> E.g.
> ? | ?? | ????
>
> instead of
>
> ??? | ???? | ????
>
> I think the committed version is useful without input method
> and it follows other languages.
>
> Thanks,
> Fujiwara

From frederic.grosshans at gmail.com  Tue Mar 28 04:18:24 2017
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Tue, 28 Mar 2017 11:18:24 +0200
Subject: Encoding of old compatibility characters
In-Reply-To: <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org>
References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com>
 <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org>
Message-ID: <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com>

Le 28/03/2017 à 02:22, Mark E. Shoulson a écrit :
> Aw, but ⏨ is awesome! It's much cooler-looking and more visually
> understandable than "e" for exponent notation. In some code I've been
> playing around with I support it as a valid alternative to "e".

I agree 1⏨3 times with you on this !

    Frédéric

From verdy_p at wanadoo.fr  Tue Mar 28 04:33:41 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 28 Mar 2017 11:33:41 +0200
Subject: Encoding of old compatibility characters
In-Reply-To: <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com>
References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com>
 <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org>
 <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com>
Message-ID: 

Ideally a smart text renderer could as well display that glyph with a
leading multiplication sign (a mathematical middle dot) and implicitly
convert the following digits (and sign) as real superscript/exponent
(using contextual substitution/positioning like for Eastern Arabic/Urdu),
without necessarily writing the 10 base with smaller digits. Without it,
people will want to use 20⏨
to mean it is the decimal number twenty and not hexadecimal number thirty
two.

2017-03-28 11:18 GMT+02:00 Frédéric Grosshans :

> Le 28/03/2017 à 02:22, Mark E. Shoulson a écrit :
>
>> Aw, but ⏨ is awesome! It's much cooler-looking and more visually
>> understandable than "e" for exponent notation. In some code I've been
>> playing around with I support it as a valid alternative to "e".
>>
>
> I agree 1⏨3 times with you on this !
>
> Frédéric
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From joan at montane.cat  Tue Mar 28 04:56:02 2017
From: joan at montane.cat (=?UTF-8?Q?Joan_Montan=C3=A9?=)
Date: Tue, 28 Mar 2017 11:56:02 +0200
Subject: Unicode Emoji 5.0 characters now final
In-Reply-To: 
References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com>
Message-ID: 

2017-03-28 7:57 GMT+02:00 Mark Davis ☕️ :

> To add to what Ken and Markus said: like many other identifiers, there are
> a number of different categories.
>
>    1. *Ill-formed: *"$1"
>    2. *Well-formed, but not valid: *"usx". Is *syntactic* according to
>    http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence,
>    but is not *valid* according to
>    http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences.
>    3. *Valid, but not recommended: "usca". *Corresponds to the valid
>    Unicode subdivision code for California according to
>    http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences
>    and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/.
>    4. *Recommended:* "gbsct". Corresponds to the valid Unicode
>    subdivision code for Scotland, and *is* listed in
>    http://unicode.org/Public/emoji/5.0/.
>
> As Ken says, the terminology is a little bit in flux for term
> 'recommended'. TR51 is still open for comment, although we won't make any
> changes that would invalidate http://unicode.org/Public/emoji/5.0/.
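The tag sequences behind codes like "gbsct" and "usca" are mechanical to construct: U+1F3F4 WAVING BLACK FLAG, then one tag character per letter of the Unicode subdivision code (U+E0000 plus the ASCII code point), then U+E007F CANCEL TAG as terminator. A minimal sketch; the helper name is ours, not part of any Unicode API, and building a sequence says nothing about whether it is valid or recommended:

```python
# Sketch: build an emoji tag sequence for a subdivision flag.
# U+1F3F4 WAVING BLACK FLAG + one TAG character per letter/digit of the
# subdivision code (U+E0000 + ASCII code point) + U+E007F CANCEL TAG.

def subdivision_flag(code: str) -> str:
    base = "\U0001F3F4"  # WAVING BLACK FLAG
    tags = "".join(chr(0xE0000 + ord(c)) for c in code.lower())
    return base + tags + "\U000E007F"  # CANCEL TAG terminator

scotland = subdivision_flag("gbsct")
assert [hex(ord(c)) for c in scotland] == [
    "0x1f3f4", "0xe0067", "0xe0062", "0xe0073", "0xe0063", "0xe0074", "0xe007f"]
```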
> Just two remarks.
>
> 1st one: point 4 (Unicode subdivision codes listed in the emoji Unicode
> site) raises something of a chicken-and-egg problem. Vendors don't easily
> add new subdivision flags (because they aren't recommended), and Unicode
> doesn't recommend new subdivision flags (because they aren't supported by
> vendors).
>
> 2nd one: What about "Adopt a Character" (AKA "Adopt an emoji")? Will
> valid, but not recommended, Unicode subdivision codes be eligible? For
> instance, say, could someone adopt the California, Texas, Pomerania, or
> Catalonia flags?

Regards,
Joan Montané
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From mark at macchiato.com  Tue Mar 28 05:32:55 2017
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Tue, 28 Mar 2017 12:32:55 +0200
Subject: Unicode Emoji 5.0 characters now final
In-Reply-To: 
References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com>
Message-ID: 

Good questions.

On Tue, Mar 28, 2017 at 11:56 AM, Joan Montané wrote:

> 1st one: point 4 (Unicode subdivision codes listed in the emoji Unicode
> site) raises something of a chicken-and-egg problem. Vendors don't easily
> add new subdivision flags (because they aren't recommended), and Unicode
> doesn't recommend new subdivision flags (because they aren't supported by
> vendors).
>

That isn't really the case. In particular, vendors can propose adding
additional subdivisions to the recommended list. The UTC Considerations
would come into play in assessing those proposals. So it is certainly
possible for there to be (say) a flag of Texas or Catalonia appearing in an
Emoji 6.0 release this year. Similarly, Microsoft could propose adding the
ninja cat ZWJ sequences.

> 2nd one: What about "Adopt a Character" (AKA "Adopt an emoji")? Will
> valid, but not recommended, Unicode subdivision codes be eligible? For
> instance, say, could someone adopt the California, Texas, Pomerania, or
> Catalonia flags?
> We only support the recommended list for adoptions.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr  Tue Mar 28 05:36:40 2017
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 28 Mar 2017 12:36:40 +0200
Subject: Unicode Emoji 5.0 characters now final
In-Reply-To: 
References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com>
Message-ID: 

I note this in TR35, *3.2 Unicode Locale Identifier*:

EBNF:
  unicode_locale_id = unicode_language_id
      (transformed_extensions unicode_locale_extensions?
      | unicode_locale_extensions? transformed_extensions?) ;

ABNF:
  unicode_locale_id = unicode_language_id
      ([trasformed_extensions [unicode_locale_extensions]]
      / [unicode_locale_extensions [transformed_extensions]])

* first there's a typo in the ABNF syntax ("trasformed")
* the syntax is not strictly equivalent, or the ABNF is unnecessarily not
context-free

It should rather be:

EBNF:
  unicode_locale_id = unicode_language_id
      (transformed_extensions unicode_locale_extensions?
      | unicode_locale_extensions transformed_extensions?)? ;

ABNF:
  unicode_locale_id = unicode_language_id
      [transformed_extensions [unicode_locale_extensions]
      / unicode_locale_extensions [transformed_extensions]]

2017-03-28 11:56 GMT+02:00 Joan Montané :

> 2017-03-28 7:57 GMT+02:00 Mark Davis ☕️ :
>
>> To add to what Ken and Markus said: like many other identifiers, there
>> are a number of different categories.
>>
>>    1. *Ill-formed: *"$1"
>>    2. *Well-formed, but not valid: *"usx". Is *syntactic* according to
>>    http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence,
>>    but is not *valid* according to
>>    http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences.
>>    3. *Valid, but not recommended: "usca".
*Corresponds to the valid >> Unicode subdivision code for California according to >> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >> g-sequences >> >> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >> subdivision code for Scotland, and *is* listed in >> http://unicode.org/Public/emoji/5.0/ >> . >> >> As Ken says, the terminology is a little bit in flux for term >> 'recommended'. TR51 is still open for comment, although we won't make any >> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >> > > Just two remarks > > 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) > arises something like chicken-egg problem. Vendors don't easily add new > subdivision-flags (because they aren't recommended), and Unicode doesn't > recommend new subdivision flags (because they aren't supported by vendors). > > 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be > valid, but not recommended, Unicode subdivisions codes eligible? For > instances, say, could someone adopt California, Texas, Pomerania, or > Catalonia flags? > > > Regards, > Joan Montan? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Mar 28 05:39:13 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Tue, 28 Mar 2017 19:39:13 +0900 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com>
 <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com>
 <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp>
 <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com>
 <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp>
 <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com>
Message-ID: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp>

Hello Michael, others,

On 2017/03/27 21:07, Michael Everson wrote:
> On 27 Mar 2017, at 06:42, Martin J. Dürst wrote:
>
>>> The characters in question have different and undisputed origins, undisputed.
>>
>> If you change that to the somewhat more neutral "the shapes in question have different and undisputed origins", then I'm with you. I actually have said as much (in different words) in an earlier post.
>
> And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word "character" when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it's as though nothing had ever been encoded before.

I didn't say that you have to change words. I just said that I could agree
to a slightly differently worded phrase.

And as for precedent, the fact that we have encoded a lot of characters in
Unicode doesn't mean that we can encode more characters without checking
each and every single case very carefully, as we are doing in this
discussion.

> The sharp s analogy wasn't useful because whether ſs or ſz users can't tell either and don't care.

Sorry, but that was exactly the point of this analogy. As to "can't tell",
it's easy to ask somebody to look at an actual ß letter and say whether the
right part looks more like an s or like a z.
On the other hand, users of Deseret may or may not ignore the difference
between the 1855 and 1859 shapes when they read. Of course they will easily
see different shapes, but what's important isn't the shapes, it's what they
associate them with. If for them, it's just two shapes for one and the same
40th letter of the Deseret alphabet, then that is a strong suggestion for
not encoding separately, even if the shapes look really different.

> No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ſs. And what Antiqua fonts do, well, you get this:
>
> https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg

Yes. And we are just starting to collect evidence for Deseret fonts.

> And there's nothing unrecognizable about the ?? (< ?? (= ſz)) ligature there.

Well, not to somebody used to it. But non-German users quite often use a
Greek β where they should use a ß, so it's no surprise people don't
distinguish the ſs- and ſz-derived glyphs.

> The situation in Deseret is different.

The graphic difference is definitely bigger, so to an outsider, it's
definitely quite impossible to identify the pairs of shapes. But that does
in no way mean that these have to be seen as different characters (rather
than just different glyphs) by insiders (actual users).

To use another analogy, many people these days (me included) would have
difficulties identifying Fraktur letters, in particular if they show up
just as individual letters. Similar for many fantasy fonts, and for people
not very familiar with the Latin script.

> Underlying ligature difference is indicative of character identity. Particularly when two resulting ligatures are SO different from one another as to be unrecognizable. And that is the case with EW on the left and OI on the right here:
> https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg
>
> The lower two letterforms are in no way "glyph variants" of the upper two letterforms.
> Apart from the stroke of the SHORT I ?? they share nothing in common, because they come from different sources and are therefore different characters.

The range of what can be a glyph variant is quite wide across scripts and
font styles. Just that the shapes differ widely, or that the origin is
different, doesn't make this conclusive.

> Character origin is intimately related to character identity.

In most cases, yes. But it's not a given conclusion.

> I don't think that ANY user of Deseret is all that "average". Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on, just as we have medievalists who do the same kind of work. I'm about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters.

No, your work wouldn't be impossible. It might be quite a bit more
difficult, but not impossible. I have written papers about Han ideographs
and Japanese text processing where I had to create my own fonts (8-bit,
with mostly random assignments of characters because these were one-off
jobs), or fake things with inline bitmap images (trying to get information
on the final printer resolution and how many black pixels wide a stem or
crossbar would have to be to avoid dropouts, and not being very successful).

I have heard the argument that some character variant is needed because of
research, history,... quite a few times. If a character has indeed been
historically used in a contrasting way, this is definitely a good argument
for encoding. But if a character just looked somewhat different a few
(hundreds of) years ago, that doesn't make such a good argument. Otherwise,
somebody may want to propose new codepoints for Bodoni and Helvetica,...

Regards,   Martin.
From mark at macchiato.com Tue Mar 28 05:49:39 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 12:49:39 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: ?Thanks. Probably best as: unicode_locale_id = unicode_language_id ( transformed_extensions unicode_locale_extensions? | unicode_locale_extensions transformed_extensions? )? ;? even clearer would be two steps: unicode_locale_id = unicode_language_id extensions? ; extensions = transformed_extensions unicode_locale_extensions? | unicode_locale_extensions transformed_extensions? ; ?Could you file a CLDR ticket on this? ? Mark On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy wrote: > I note this in TR32 > *3.2 Unicode Locale Identifier > * > > EBNF > ABNF > > unicode_locale_id > = > unicode_language_id > (transformed_extensions > unicode_locale_extensions? > | unicode_locale_extensions? > transformed_extensions?) ; = unicode_language_id > ([trasformed_extensions > [unicode_locale_extensions]] > / [unicode_locale_extensions > [transformed_extensions]]) > > * first there's a typo in the ABNF syntax ("trasformed") > * the syntax is not strictly equivalent, or the ABNF is unnecessarily not > context-free > > It should better be: > > EBNF > ABNF > > unicode_locale_id > = > unicode_language_id > (transformed_extensions > unicode_locale_extensions? > | unicode_locale_extensions > transformed_extensions?)?; = unicode_language_id > [transformed_extensions > [unicode_locale_extensions] > / unicode_locale_extensions > [transformed_extensions]] > > > > 2017-03-28 11:56 GMT+02:00 Joan Montan? : > >> >> >> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >> >>> To add to what Ken and Markus said: like many other identifiers, there >>> are a number of different categories. >>> >>> 1. *Ill-formed: *"$1" >>> 2. *Well-formed, but not valid: *"usx". 
Is *syntactic* according to >>> http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence >>> , >>> but is not *valid* according to http://unicode.org/reports/tr5 >>> 1/proposed.html#valid-emoji-tag-sequences >>> >>> . >>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>> Unicode subdivision code for California according to >>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>> g-sequences >>> >>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>> subdivision code for Scotland, and *is* listed in >>> http://unicode.org/Public/emoji/5.0/ >>> . >>> >>> As Ken says, the terminology is a little bit in flux for term >>> 'recommended'. TR51 is still open for comment, although we won't make any >>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>> >> >> Just two remarks >> >> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) >> arises something like chicken-egg problem. Vendors don't easily add new >> subdivision-flags (because they aren't recommended), and Unicode doesn't >> recommend new subdivision flags (because they aren't supported by vendors). >> >> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >> valid, but not recommended, Unicode subdivisions codes eligible? For >> instances, say, could someone adopt California, Texas, Pomerania, or >> Catalonia flags? >> >> >> Regards, >> Joan Montan? >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Tue Mar 28 05:59:00 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 12:59:00 +0200 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com>
 <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com>
 <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp>
 <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com>
 <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp>
 <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com>
 <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp>
Message-ID: 

On Tue, Mar 28, 2017 at 12:39 PM, Martin J. Dürst wrote:

[....]

> No, your work wouldn't be impossible. It might be quite a bit more
> difficult, but not impossible. I have written papers about Han ideographs
> and Japanese text processing where I had to create my own fonts (8-bit,
> with mostly random assignments of characters because these were one-off
> jobs), or fake things with inline bitmap images (trying to get information
> on the final printer resolution and how many black pixels wide a stem or
> crossbar would have to be to avoid dropouts, and not being very successful).
>
> I have heard the argument that some character variant is needed because of
> research, history,... quite a few times. If a character has indeed been
> historically used in a contrasting way, this is definitely a good argument
> for encoding. But if a character just looked somewhat different a few
> (hundreds of) years ago, that doesn't make such a good argument. Otherwise,
> somebody may want to propose new codepoints for Bodoni and Helvetica,...

I agree with Martin. Moreover, his last paragraphs are getting at the crux
of the matter. Unicode is not a registry of glyphs for letters, nor should
it try to be. Simply because someone used a particular shape at some time
to mean a letter doesn't mean that Unicode should encode a letter for that
shape.
We do not need to capture all of the shapes in
https://upload.wikimedia.org/wikipedia/commons/f/fc/Gebrochene_Schriften.png
simply because somebody is going to "publish a volume full of" those shapes.

Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From ian.clifton at chem.ox.ac.uk  Tue Mar 28 06:00:25 2017
From: ian.clifton at chem.ox.ac.uk (Ian Clifton)
Date: Tue, 28 Mar 2017 12:00:25 +0100
Subject: Encoding of old compatibility characters
In-Reply-To:  (Philippe Verdy's message of "Tue, 28 Mar 2017 11:33:41 +0200")
References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com>
 <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org>
 <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com>
Message-ID: <4q7f39oed2.fsf@chem.ox.ac.uk>

Philippe Verdy writes:

> Ideally a smart text renderer could as well display that glyph with a
> leading multiplication sign (a mathematical middle dot) and implicitly
> convert the following digits (and sign) as real superscript/exponent
> (using contextual substitution/positioning like for Eastern
> Arabic/Urdu), without necessarily writing the 10 base with smaller
> digits.

Actually, I would see this as putting unnecessary clutter back in! I would
say the advantage of the ⏨ notation, introduced with Algol 60, is that it
subsumes and makes implicit the multiplication and exponentiation
operators, resulting in a visually compact denotation of a real number in
"scientific notation", and it does so with a single symbol that hints at
its own meaning. I've used ⏨ a couple of times, without explanation, in my
own emails, without, as far as I'm aware, causing any misunderstanding.

> Without it, people will want to use 20⏨ to mean it is the decimal
> number twenty and not hexadecimal number thirty two.

Yes, this ambiguity is a drawback. Hopefully, the use cases should be
sufficiently different that real confusion would be unlikely (and of
course, normally, U+23E8 should never be used to denote decimal number
base).
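Mark Shoulson's "valid alternative to 'e'" convention mentioned earlier in this thread amounts to a one-line transformation. A toy sketch of that idea, not anyone's actual implementation:

```python
# Sketch: accept U+23E8 (⏨, the Algol 60 decimal-exponent symbol) as an
# alternative spelling of "e" in real-number literals.

def parse_real(text: str) -> float:
    # Map the exponent symbol to "e" and let float() do the rest.
    return float(text.replace("\u23E8", "e"))

assert parse_real("1⏨3") == 1000.0
assert parse_real("6.022⏨23") == 6.022e23
```

A bare "20⏨" with no exponent digits would be rejected by `float()`, which matches the thread's point that the symbol marks an exponent rather than a number base.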
-- Ian Clifton ? ?: +44 1865 275677 Chemistry Research Laboratory ?: +44 1865 285002 Oxford University ??: ian.clifton at chem.ox.ac.uk Mansfield Road Oxford OX1 3TA UK From verdy_p at wanadoo.fr Tue Mar 28 06:01:47 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 13:01:47 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: I just filed the bug in the CLDR contact form. 2017-03-28 12:49 GMT+02:00 Mark Davis ?? : > ?Thanks. Probably best as: > > unicode_locale_id = unicode_language_id > ( transformed_extensions unicode_locale_extensions? > | unicode_locale_extensions transformed_extensions? )? > ;? > > even clearer would be two steps: > > unicode_locale_id = unicode_language_id extensions? ; > > extensions = transformed_extensions unicode_locale_extensions? > | unicode_locale_extensions transformed_extensions? ; > > ?Could you file a CLDR ticket on this? > > ? > Mark > > On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy > wrote: > >> I note this in TR32 >> *3.2 Unicode Locale Identifier >> * >> >> EBNF >> ABNF >> >> unicode_locale_id >> = >> unicode_language_id >> (transformed_extensions >> unicode_locale_extensions? >> | unicode_locale_extensions? >> transformed_extensions?) ; = unicode_language_id >> ([trasformed_extensions >> [unicode_locale_extensions]] >> / [unicode_locale_extensions >> [transformed_extensions]]) >> >> * first there's a typo in the ABNF syntax ("trasformed") >> * the syntax is not strictly equivalent, or the ABNF is unnecessarily not >> context-free >> >> It should better be: >> >> EBNF >> ABNF >> >> unicode_locale_id >> = >> unicode_language_id >> (transformed_extensions >> unicode_locale_extensions? 
>> | unicode_locale_extensions >> transformed_extensions?)?; = unicode_language_id >> [transformed_extensions >> [unicode_locale_extensions] >> / unicode_locale_extensions >> [transformed_extensions]] >> >> >> >> 2017-03-28 11:56 GMT+02:00 Joan Montan? : >> >>> >>> >>> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >>> >>>> To add to what Ken and Markus said: like many other identifiers, there >>>> are a number of different categories. >>>> >>>> 1. *Ill-formed: *"$1" >>>> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according to >>>> http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_sequence >>>> , >>>> but is not *valid* according to http://unicode.org/reports/tr5 >>>> 1/proposed.html#valid-emoji-tag-sequences >>>> >>>> . >>>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>>> Unicode subdivision code for California according to >>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>> g-sequences >>>> >>>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. >>>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>>> subdivision code for Scotland, and *is* listed in >>>> http://unicode.org/Public/emoji/5.0/ >>>> . >>>> >>>> As Ken says, the terminology is a little bit in flux for term >>>> 'recommended'. TR51 is still open for comment, although we won't make any >>>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>>> >>> >>> Just two remarks >>> >>> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode >>> site) arises something like chicken-egg problem. Vendors don't easily add >>> new subdivision-flags (because they aren't recommended), and Unicode >>> doesn't recommend new subdivision flags (because they aren't supported by >>> vendors). >>> >>> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >>> valid, but not recommended, Unicode subdivisions codes eligible? 
>>> For instance, say, could someone adopt the California, Texas, Pomerania, or Catalonia flags?
>>>
>>> Regards,
>>> Joan Montané

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From duerst at it.aoyama.ac.jp  Tue Mar 28 06:26:38 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 28 Mar 2017 20:26:38 +0900
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net>
Message-ID: <06a3292e-a275-8b0d-d03d-5c76e0870977@it.aoyama.ac.jp>

I agree with Alastair. The list of font technology options was mostly to show that there are already a lot of options (some might even say too many), so font technology doesn't really limit our choices.

Regards,    Martin.

On 2017/03/27 23:04, Alastair Houghton wrote:
> On 27 Mar 2017, at 10:14, Julian Bradfield wrote:
>>
>> I contend, therefore, that no decision about Unicode should take into account any ephemeral considerations such as this year's electronic font technology, and that therefore it's not even useful to mention them.
>
> I'd disagree with that, for two reasons:
>
> 1. Unicode has to be usable *today*; it's no good designing for some kind of hyper-intelligent AI-based font technology a thousand years hence, because we don't have that now. If it isn't usable today for any given purpose, people won't use it for that, and will adopt alternative solutions (like using images to represent text).
>
> 2. "This year's electronic font technology" is actually quite powerful, and is unlikely to be supplanted by something *less* powerful in future.
> There is an argument about exactly how widespread support for it is (for instance, simple text editors are clearly lacking in support for stylistic alternates, except possibly on the Mac where there's built-in support in the standard text edit control), but again I think it's reasonable to expect support to grow over time, rather than being removed.
>
> I don't think it's unreasonable, then, to point out that mechanisms like stylistic or contextual alternates exist, or indeed for that knowledge to affect a decision about whether or not a character should be encoded, *bearing in mind* the likely direction of travel of font and text rendering support in widely available operating systems.
>
> All that said, I'd definitely defer to others on the subject of whether or not Unicode needs the Deseret characters being discussed here. That's very much not my field.
>
> Kind regards,
>
> Alastair.
>
> --
> http://alastairs-place.net

From duerst at it.aoyama.ac.jp  Tue Mar 28 06:33:25 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 28 Mar 2017 20:33:25 +0900
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <4136dea9-dfb0-2c95-ab51-fdc0a4a4a758@it.aoyama.ac.jp> <3F2F367B-0F95-4E58-8795-E89BA417D46C@alastairs-place.net> <34932545-09D9-4692-8FE3-4196EB8BA07B@evertype.com>
Message-ID:

On 2017/03/28 01:49, Michael Everson wrote:
> Sorry, but typographic control of that sort is grand for typesetting, where you can select ranges of text and language-tag it (assuming your program accepts and supports all the language tags you might need (which they don't)) and you can select fonts which have all the trickery baked into them (hardly any do) and then… can you use this in file names? In your plain-text databases? In your text messages?

Do you think that the 1855/1859 distinction is needed in file names? In text messages? It may help in some kinds of databases, but it may also be possible to just tag each piece of text in the database with "1855" or "1859" if that distinction is important (e.g. for historical documents).

As far as I understand, we are still looking for actual texts that use both shapes of the same ligature concurrently.

Regards,    Martin.

From duerst at it.aoyama.ac.jp  Tue Mar 28 06:38:22 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 28 Mar 2017 20:38:22 +0900
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <460682BA-84D0-4804-8E45-12C8802C963B@evertype.com>
Message-ID: <40e15c1e-b175-6dc3-d0df-3eb07e5f0eb8@it.aoyama.ac.jp>

On 2017/03/28 01:20, Michael Everson wrote:
> Ken transcribes into modern type a letter by Shelton dated 1859, in which "boy" is written ??, "few" as ??, "truefully" [sic] as ????????????, and "you" as ??.

These are all 1859 variants, yes? That would just show that these variants existed (which I think nobody in this discussion has doubted), but not that there was contrasting use. And is that letter hand-written or printed?

Regards,    Martin.

From duerst at it.aoyama.ac.jp  Tue Mar 28 07:10:58 2017
From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=)
Date: Tue, 28 Mar 2017 21:10:58 +0900
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com>
Message-ID:

On 2017/03/27 21:59, Michael Everson wrote:
> On 27 Mar 2017, at 08:05, Martin J. Dürst wrote:
>
>>> Consider 2EBC ⺼ CJK RADICAL MEAT and 2E9D ⺝ CJK RADICAL MOON which are apparently really supposed to have identical glyphs, though we use an old-fashioned style in the charts for the former. (Yes, I am of course aware that there are other reasons for distinguishing these, but as far as glyphs go, even our standard distinguishes them artificially.)
>>
>> "apparently", maybe. Let's for a moment leave aside the radicals themselves, which are to a large extent artificial constructs.
>
> I do stipulate not being a CJK expert. But those are indeed different due to their origins, however similar their shapes are.

Except for the radicals themselves, I haven't found a contrasting pair. What I think we would need to find to influence the current argumentation (except for general "history is important", on which I think we all agree) is a case of a character that originally existed both with a MEAT radical and a MOON radical, but has only a single usage. Then whether there were one or two code points would provide an analog for the situation we have at hand.

Also note that there is a difference in meaning. The characters with MEAT radicals mostly refer to body parts and organs. The characters with MOON radicals are mostly time-related.

>> Let's look at the actual characters with these radicals (e.g. U+6709,... for MOON and U+808A,... for MEAT), in the multi-column code charts of ISO 10646. There are some exceptions, but in most cases the G/J/K columns show no difference (i.e. always the 月 shape, with two horizontal bars), whereas the H/T/V columns show the ⺼ shape (two downwards slanted bars) for the MEAT radical and the 月 shape for the moon radical. So whether these radicals have identical glyphs depends on typographic tradition/font/…
>
> They are still always very similar, right?

Similarity is in the eye of the beholder (or the script). Sometimes a little dot or hook is irrelevant; sometimes it's the single difference that makes it a totally different character.

>> In Japan, many people may be rather unaware of the difference, whereas in Taiwan, it may be that school children get drilled on the difference.
>
> That's interesting.

Not necessarily for the poor Taiwanese students, and not necessarily for the Japanese who try to find a character in a dictionary ordered by radical :-(.

>>> Changing to a different font in order to change one or two glyphs is a mechanism that we have actually rejected many times in the past. We have encoded variant and alternate characters for many scripts.
>>
>> Well, yes, rejected many times in cases where that was appropriate. But also accepted many times, in cases that we may not even remember, because they may not even have been made explicitly.
>
> Do come up with examples if you have any.

I had the following in mind:

>> The roman/italic a/ɑ and g/ɡ distinctions (the later code points only used to show the distinction in plain text, which could as well be done descriptively),
>
> Aa and Ɑɑ are used contrastively for different sounds in some languages and in the IPA. Ɡɡ is not, to my knowledge, used contrastively with Gg (except that ɡ can only mean /ɡ/, while orthographic g can mean /ɡ/, /dʒ/, /x/ etc.). But g vs ɡ is reasonably analogous to ?? and ???? being used for /juː/.

The contrastive use *in some languages or notations* (IPA) is the reason these are separately encoded. The fact that these are not contrastively used in most major languages is responsible for the fact that they don't use different code points when used in these languages. It would be a real hassle to have to change from g to ɡ when switching e.g. from Times Roman to Times Italic. In Deseret, we are still missing any contrastive usage, so that suggests being careful with encoding.

>> as well as a large number of distinctions in Han fonts, come to my mind. It's difficult to show these distinctions, because they are NOT separately encoded, but the three-stroke and four-stroke grass radical is the most well known.
>
> And the same goes for the /juː/ ligatures. The word tube /tjuːb/ can be written TY?B ???????? or ?????? or ????. But the unligated sequences would be pronounced differently: ???????? /tjuːb/ and ???????? /t?uːb/ and ???????? /t??b/.

Ah, I see. So we seem to have five different ways (counting the two ligature variants) of writing the same word, with three different pronunciations. The important question is whether the two ligatures do imply any difference in pronunciation (as opposed to time of writing or author/printer preference), i.e. whether the ligated sequences ?????? or ???? are pronounced differently (not by a phonologist but by an average user).

>> Is the choice of variant up to the author (for which variants), or is it the editor or printer who makes the choice (for which variants)?
>
> In a handwritten manuscript obviously the choice is the author's. As to historical printing, printers may have

Did you want to write something more here?

>> And what informs this choice? If we have any historic metal types, are there examples where a font contains both ligature variants?
>
> Ken Beesley has samples of a metal font (the 1857 St Louis punches) which had both ?? and ????; I don't know what other sorts were in that font.

As I explained in another post, that may just be an 1855/1859 hybrid.

Regards,    Martin.
From mark at macchiato.com Tue Mar 28 07:22:36 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 28 Mar 2017 14:22:36 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: Thanks Mark On Tue, Mar 28, 2017 at 1:01 PM, Philippe Verdy wrote: > I just filed the bug in the CLDR contact form. > > 2017-03-28 12:49 GMT+02:00 Mark Davis ?? : > >> ?Thanks. Probably best as: >> >> unicode_locale_id = unicode_language_id >> ( transformed_extensions unicode_locale_extensions? >> | unicode_locale_extensions transformed_extensions? >> )? ;? >> >> even clearer would be two steps: >> >> unicode_locale_id = unicode_language_id extensions? ; >> >> extensions = transformed_extensions unicode_locale_extensions? >> | unicode_locale_extensions transformed_extensions? ; >> >> ?Could you file a CLDR ticket on this? >> >> ? >> Mark >> >> On Tue, Mar 28, 2017 at 12:36 PM, Philippe Verdy >> wrote: >> >>> I note this in TR32 >>> *3.2 Unicode Locale Identifier >>> * >>> >>> EBNF >>> ABNF >>> >>> unicode_locale_id >>> = >>> unicode_language_id >>> (transformed_extensions >>> unicode_locale_extensions? >>> | unicode_locale_extensions? >>> transformed_extensions?) ; = unicode_language_id >>> ([trasformed_extensions >>> [unicode_locale_extensions]] >>> / [unicode_locale_extensions >>> [transformed_extensions]]) >>> >>> * first there's a typo in the ABNF syntax ("trasformed") >>> * the syntax is not strictly equivalent, or the ABNF is unnecessarily >>> not context-free >>> >>> It should better be: >>> >>> EBNF >>> ABNF >>> >>> unicode_locale_id >>> = >>> unicode_language_id >>> (transformed_extensions >>> unicode_locale_extensions? 
>>> | unicode_locale_extensions >>> transformed_extensions?)?; = unicode_language_id >>> [transformed_extensions >>> [unicode_locale_extensions] >>> / unicode_locale_extensions >>> [transformed_extensions]] >>> >>> >>> >>> 2017-03-28 11:56 GMT+02:00 Joan Montan? : >>> >>>> >>>> >>>> 2017-03-28 7:57 GMT+02:00 Mark Davis ?? : >>>> >>>>> To add to what Ken and Markus said: like many other identifiers, there >>>>> are a number of different categories. >>>>> >>>>> 1. *Ill-formed: *"$1" >>>>> 2. *Well-formed, but not valid: *"usx". Is *syntactic* according >>>>> to http://unicode.org/reports/tr51/proposed.html#def_emoji_tag_ >>>>> sequence, but is not *valid* according to >>>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>>> g-sequences >>>>> >>>>> . >>>>> 3. *Valid, but not recommended: "usca". *Corresponds to the valid >>>>> Unicode subdivision code for California according to >>>>> http://unicode.org/reports/tr51/proposed.html#valid-emoji-ta >>>>> g-sequences >>>>> >>>>> and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/ >>>>> . >>>>> 4. *Recommended:* "gbsct". Corresponds to the valid Unicode >>>>> subdivision code for Scotland, and *is* listed in >>>>> http://unicode.org/Public/emoji/5.0/ >>>>> . >>>>> >>>>> As Ken says, the terminology is a little bit in flux for term >>>>> 'recommended'. TR51 is still open for comment, although we won't make any >>>>> changes that would invalidate http://unicode.org/Public/emoji/5.0/. >>>>> >>>> >>>> Just two remarks >>>> >>>> 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode >>>> site) arises something like chicken-egg problem. Vendors don't easily add >>>> new subdivision-flags (because they aren't recommended), and Unicode >>>> doesn't recommend new subdivision flags (because they aren't supported by >>>> vendors). >>>> >>>> 2n one: What about "Adopt a Character" (AKA "Adopt an emoji"). Will be >>>> valid, but not recommended, Unicode subdivisions codes eligible? 
For >>>> instances, say, could someone adopt California, Texas, Pomerania, or >>>> Catalonia flags? >>>> >>>> >>>> Regards, >>>> Joan Montan? >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Tue Mar 28 07:38:38 2017 From: everson at evertype.com (Michael Everson) Date: Tue, 28 Mar 2017 13:38:38 +0100 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <7A0EFBD9-9352-4C35-BD9E-FE5284A49F1C@evertype.com> Message-ID: <852E1F82-A015-4616-BA59-8AABBF4FABC2@evertype.com> On 28 Mar 2017, at 07:32, Martin J. D?rst wrote: > On 2017/03/28 01:03, Michael Everson wrote: >> On 27 Mar 2017, at 16:56, John H. Jenkins wrote: > >> The 1857 St Louis punches definitely included both the 1855 EW ?? and the 1859 OI . Ken Beesley shows them in smoke proofs in his 2004 paper on Metafont. > > Good to have some actual examples. However, the example at hand does, as far as I understand it, not necessarily support separate encoding. Of course it does. > While it mixes 1855 and 1859, it contains only one of the ligature variants each. It?s a smoke proof taken from some metal sorts. It shows that at least these two characters were in that font. > Indeed, it could be taken as support for the theory that the top and bottom row ligatures in https://en.wikipedia.org/wiki/Deseret_alphabet#/media/File:Deseret_glyphs_ew_and_oi_transformation_from_1855_to_1859.svg were used interchangeably, and that the 1857 St Louis punches just made one particular choice of glyph selection. "Letters to represent the same diphthong? does not mean ?letters used interchangeably?. 
These letters have entirely different histories. They are not similar to one another. They are not "glyph variants" of one another by ANY measure of character identity that I have learned in two decades of this work, where I have examined and successfully proposed a great many characters.

Martin, your scepticism just doesn't convince. It seems like it's scepticism for its own sake. You only have to, you know, use your EYES to see that 1855 EW looks NOTHING LIKE 1859 EW. Doesn't matter if they're used to represent the same sound. That doesn't mean they're in free variation. In fact, what it looks like is that early texts may use some letters, later texts may use other letters, and a few texts

This is a matter of SPELLING. Of the choice the author makes. It may be important for dating a manuscript. Representing texts as they are written is as important for early Deseret as it is for medieval Latin, to researchers who care to represent the text as it was without normalizing it to one thing or another.

> What would give a strong argument would be the *concurrent* existence of *corresponding* ligatures in the same font, or the concurrent (even better, contrasting) use of corresponding ligatures in the same text.

Well, ain't it just too bad that the accident of history has not left us complete print shops with all the fonts that were ever used for Deseret. The origin of these four letters as ligatures of four distinct letters with SHORT I is the right argument for character identity. Recognizability is also a strong argument. We used that when we encoded Phoenician, though some people argued that Semitic studies would collapse if we didn't treat Phoenician as a font variant of Hebrew. Maybe those of you who don't have to face the ever-moving bar of encoding criteria over and over again don't remember that stuff.

> What's interesting (weird?) is that the "1859" OI appears in 1857 punches. Time travel? Or is the label "1859" a misnomer or just a convention?
I think 1859 refers to a particular publication.

Michael Everson

From asmusf at ix.netcom.com  Tue Mar 28 08:09:00 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Tue, 28 Mar 2017 06:09:00 -0700
Subject: Encoding of old compatibility characters
In-Reply-To: <4q7f39oed2.fsf@chem.ox.ac.uk>
References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk>
Message-ID: <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com>

An HTML attachment was scrubbed...
URL: 

From everson at evertype.com  Tue Mar 28 08:56:28 2017
From: everson at evertype.com (Michael Everson)
Date: Tue, 28 Mar 2017 14:56:28 +0100
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To: <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp>
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp>
Message-ID: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com>

On 28 Mar 2017, at 11:39, Martin J. Dürst wrote:

>> And what would the value of this be? Why should I (who have been doing this for two decades) not be able to use the word "character" when I believe it correct? Sometimes you people who have been here for a long time behave as though we had no precedent, as though every time a character were proposed for encoding it's as though nothing had ever been encoded before.
>
> I didn't say that you have to change words. I just said that I could agree to a slightly differently worded phrase.

An æ ligature is a ligature of a and of e. It is not some sort of pretzel.
What Deseret has is this:

10426 DESERET CAPITAL LETTER LONG OO WITH STROKE
* officially named "ew" in the code chart
* used for ew in earlier texts

10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE
* officially named "oi" in the code chart
* used for oi in earlier texts

1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE
* used for oi in later texts

1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE
* used for ew in later texts

Don't go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character. Don't go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character. To do so is to show no understanding of the history of writing systems at all. You're smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here.

> And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion.

The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don't think we haven't observed this.

>> The sharp s analogy wasn't useful because whether ſs or ſz users can't tell either and don't care.
>
> Sorry, but that was exactly the point of this analogy. As to "can't tell", it's easy to ask somebody to look at an actual ß letter and say whether the right part looks more like an s or like a z.

By "can't tell" I mean "recognize as essentially the same letterform". The streetsigns in some German cities use a very ſz if you look at it and know anything about typography. Most people probably don't notice. They see ß and that's precisely because ſs and ſz look very much alike.

> On the other hand, users of Deseret may or may not ignore the difference between the 1855 and 1859 shapes when they read.

The people who wrote the manuscripts are dead.
Most readers and writers of Deseret today use the shapes that are in their fonts, which are those in the Unicode charts, and most texts published today don?t use the EW and OI ligatures at all, because that?s John Jenkins? editorial practice. The need to distinguish these letters (which are distinguished because of their history as letterforms, not because of the diphthong) is no different from the reason we encoded these ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?. Scholars required those. Manuscripts may contain them side by side. Or their usage may be separated by hundreds of kilometres or hundreds of years. There is no difference. There were pages of discussion as to WHY scholars needed the medievalist characters. The counter argument was ?Why not normalize?? We had similar pages of discussion as to WHY Uralicists needed the great many characters we encoded for them. Why is it that you people can encode BROCCOLI on the basis of nothing but ?people might like it? but we cannot use sound existing precedent to encode characters which (while similar in use to other characters) are an index of orthographic change in a historical script and orthography? There are plenty of ?glyph variations? in early Deseret texts vis ? vis which I?d ignore. This isn?t one of them. > Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different. Martin, there is no answer to this unless you can read the minds of people who are dead a century or more. Therefore it is not a useful criterion, and the other criteria (letter origin, spelling choice) are the indices which must guide our understanding. The result of those criteria is that there are four characters here, not two. 
> No Fraktur fonts, for instance, offer a shape for U+00DF that looks like an ?s. And what Antiiqua fonts do, well, you get this: >> >> https://en.wikipedia.org/wiki/%C3%9F#/media/File:Sz_modern.svg > > Yes. And we are just starting to collect evidence for Deseret fonts. Well you aren?t going to get full repertoires from the 19th-century lead type because they don?t exist. We have what we have of them, and we have the manuscripts. As to modern digital typefaces, there are NONE which support the 1859 letters. And I?ve seen most of them. >> And there?s nothing unrecognizable about the ?? (< ?? (= ?z)) ligature there. > > Well, not to somebody used to it. But non-German users quite often use a Greek ? where they should use a ?, so it's no surprise people don't distinguish the ?s and ?z derived glyphs. I?ve received German texts which used Greek ?. But that?s not the point. People don?t distinguish the ?s and ?? glyphs because they look pretty much the same AND there?s no reason to distinguish them. A world of difference between that and the Deseret LETTERs WITH STROKE. >> The situation in Deseret is different. > > The graphic difference is definitely bigger, For pity?s sake, Martin. ?? ?? look NOTHING ALIKE. And ?? and ?? look NOTHING ALIKE. This isn?t anything like ?s and ?? and ?z and ?. > so to an outsider, it's definitely quite impossible to identify the pairs of shapes. But that does in no way mean that these have to be seen as different characters (rather than just different glyphs) by insiders (actual users). They had a script reform and they cut new type. The did this on purpose. Note that in their ligatures they shifted from SHORT AH and LONG OO to LONG AH and SHORT OO. > To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters. I do not believe you. If this were true menus in restaurants and public signage on shops wouldn?t have Fraktur at all. 
It?s true that sometimes the orthography on such things is bad, as where they don?t use ligatures correctly or the ? at all. I?ll stipulate that few Germans can read S?tterlin or similar hands. :-) > Similar for many fantasy fonts, and for people not very familiar with the Latin script. What?s a fantasy font? And what does this have to do with supporting the encoding in plain text of historical documents in the Deseret script? >> The lower two letterforms are in no way ?glyph variants? of the upper two letterforms. Apart from the stroke of the SHORT I ?? they share nothing in common ? because they come from different sources and are therefore different characters. > > The range of what can be a glyph variant is quite wide across scripts and font styles. Just that the shapes differ widely, or that the origin is different, doesn't make this conclusive. LONG OO WITH STROKE is not a glyph variant of SHORT OO WITH STROKE. LONG AH WITH STROKE is not a glyph variant of SHORT AH WITH STROKE. >> I don?t think that ANY user of Deseret is all that ?average?. Certainly some users of Deseret are experts interested in the script origin, dating, variation, and so on ? just as we have medievalists who do the same kind of work. I?m about to publish a volume full of characters from Latin Extended-D. My work would have been impossible had we not encoded those characters. > > No, your work wouldn't be impossible. It might be quite a bit more difficult, but not impossible. No. Wrong. Wrong, wrong, wrong. No, Martin. We encoded the Latin characters on the basis of good arguments. You do NOT get to invalidate that, or to pretend that the encoding of those characters was a mistake, or anything like it. Many scholars ? including myself ? use these characters, and that is what the Universal Character Set is for. Also, apparently, it is for pictures of BROCCOLI. 
> I have written papers about Han ideographs and Japanese text processing where I had to create my own fonts (8-bit, with mostly random assignments of characters because these were one-off jobs), or fake things with inline bitmap images (trying to get information on the final printer resolution and how many black pixels wide a stem or crossbar would have to be to avoid dropouts, and not being very successful). All of use make use of nonce glyphs for examples. That?s not the same as making an edition of a medieval Cornish text, or of a Mormon diary. We do NOT want to have to use font trickery > I have heard the argument that some character variant is needed because of research, history,... quite a few times. If a character has indeed been historically used in a contrasting way, Contrast may be geographical or temporal. > this is definitely a good argument for encoding. But if a character just looked somewhat different a few (hundreds of) years ago, Also, LATIN LETTER D WITH STROKE is a different letter from LATIN LETTER T WITH STROKE. Why? Because the underlying letters are different. And it?s no different for Deseret. Your suggestion that LONG AH WITH STROKE and SHORT AH WITH STROKE are the same character is unsupportable. > that doesn't make such a good argument. Otherwise, somebody may want to propose new codepoints for Bodoni and Helvetica,? This suggestion is nonsense. On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: > ?I agree with Martin. > > Moreover, his last paragraphs are getting at the crux of the matter. Unicode is not a registry of glyphs for letters, nor should try to be. DESERET LETTER LONG AH WITH STROKE is not a glyph variant of DESERET LETTER SHORT AH WITH STROKE. > Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape. 
Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding?s sake. > We do not need to capture all of the shapes in https://upload.wikimedia.org/wikipedia/commons/f/fc/Gebrochene_Schriften.png simply because somebody is going to "publish a volume full of" those shapes. That analogy has nothing to do with the discussion about the Deseret letters. On 28 Mar 2017, at 12:33, Martin J. D?rst wrote: > Do you think that the 1855/1859 distinction is needed in file names? In text messages? It may help in some kinds of databases, but it may also be possible to just tag each piece of text in the database with "1855" or "1859" if that distinction is important (e.g. for historical documents). As far as I understand, we are still looking for actual texts that use both shapes of the same ligature concurrently. I think that this is the sort of distinction that should be made in plain text, yes. The 1859 letters are not "glyph variants? of the 1855 letters by any criterion in the history of writing systems that I recognize. On 2017/03/28 01:20, Michael Everson wrote: >> Ken transcribes into modern type a letter by Shelton dated 1859, in which ?boy? is written ??, ?few? as ??, ?truefully? [sic] as ????????????, and ?you? as ??. > > These are all 1859 variants, yes? Yes, it was one letter written by one person at one sitting and he used one orthography and he didn?t mix it with the other orthography. > That would just show that these variants existed (which I think nobody in this discussion has doubted), but not that there was contrasting use. And is that letter hand-written or printed? They had a script reform. At first Mormons used the letter SHORT AH WITH STROKE [??] for /??/ and then later they used LONG AH WITH STROKE [???] for /??/. And at first Mormons used the letter LONG OO WITH STROKE [?u?] for /ju?/ and then later they used SHORT OO WITH STROKE [??] for /ju?/. 
And some Mormons didn't use either; they just wrote the diphthongs with digraphs of other letters. On 28 Mar 2017, at 13:10, Martin J. Dürst wrote: >> And the same goes for the /juː/ ligatures. The word tube /tjuːb/ can be written TY?B ???????? or ?????? or ????. But the unligated sequences would be pronounced differently: ???????? /tjuːb/ and ???????? /t?uːb/ and ???????? /t??b/. > > Ah, I see. So we seem to have five different ways (counting the two ligature variants) of writing the same word, That's called spelling. > with three different pronunciations. No, that's wrong. I give those transcriptions to show the usual meanings of the Deseret letters. So if you were going to write 'tube' /tjuːb/ you would write ???????? or ?????? or ????. In the second sentence I show that while the ligated letters ?? and can be used for /juː/ the unligated sequences ???? and ???? would in principle be pronounced /?uː/ and /??/ respectively. Obviously the pronunciation of the word 'tube' would not have changed for speakers of English in Mormon territories in the middle of the 19th century. (Of course many dialects of English in North America now have /tuːb/ rather than /tjuːb/, but that is not relevant here.) > The important question is whether the two ligatures do imply any difference in pronunciation (as opposed to time of writing or author/printer preference), i.e. whether the ligated sequences ?????? or ???? are pronounced differently (not by a phonologist but by an average user). No, it's spelling. 
Michael Everson From richard.wordingham at ntlworld.com Tue Mar 28 11:14:35 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 28 Mar 2017 17:14:35 +0100 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> Message-ID: <20170328171435.39d8bd40@JRWUBU2> On Tue, 28 Mar 2017 21:10:58 +0900 "Martin J. Dürst" wrote: (in Re: Standaridized variation sequences for the Desert alphabet?) > On 2017/03/27 21:59, Michael Everson wrote: > > Aa and Ɑɑ are used contrastively for different sounds in some > > languages and in the IPA. Ɡɡ is not, to my knowledge, used > > contrastively with Gg (except that ɡ can only mean /ɡ/, while > > orthographic g can mean /ɡ/, /dʒ/, /x/ etc. But g vs ɡ is > > reasonably analogous to ?? and ???? being used for /juː/. > The contrastive use *in some languages or notations* (IPA) is the > reason these are separately encoded. I thought that reason is that at the time, the IPA proscribed the use of the two-storey 'g' in phonetic notation. They have since relented. This was disunification on the basis that one form simply looks wrong. Which writing system contrasts the two? Richard. From asmusf at ix.netcom.com Tue Mar 28 11:30:15 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 28 Mar 2017 09:30:15 -0700 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> On 3/28/2017 6:56 AM, Michael Everson wrote: > An ? ligature is a ligature of a and of e. It is not some sort of pretzel. We need a pretzel emoji. A./ From frederic.grosshans at gmail.com Tue Mar 28 11:35:41 2017 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Tue, 28 Mar 2017 18:35:41 +0200 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: <20170328171435.39d8bd40@JRWUBU2> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: Le 28/03/2017 ? 18:14, Richard Wordingham a ?crit : > On Tue, 28 Mar 2017 21:10:58 +0900 > "Martin J. D?rst" wrote: > (in Re: Standaridized variation sequences for the Desert alphabet?) > >> On 2017/03/27 21:59, Michael Everson wrote: >>> Aa and ?? are used contrastively for different sounds in some >>> languages and in the IPA. ?? is not, to my knowledge, used >>> contrastively with Gg (except that ? can only mean /?/, while >>> orthographic g can mean /?/, /d?/, /x/ etc. But g vs ? is >>> reasonably analogous to ?? and ???? being used for /ju?/. 
>> The contrastive use *in some languages or notations* (IPA) is the >> reason these are separately encoded. > [...] > Which writing system contrasts the two? I had found in 2013 a G? contrast in mathematical notations of an old (1952) physics book (see http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0092.html) Frédéric From verdy_p at wanadoo.fr Tue Mar 28 11:47:53 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 28 Mar 2017 18:47:53 +0200 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: 2017-03-28 18:30 GMT+02:00 Asmus Freytag : > On 3/28/2017 6:56 AM, Michael Everson wrote: > >> An æ ligature is a ligature of a and of e. It is not some sort of pretzel. >> > We need a pretzel emoji. We need a broken tooth emoji too ! -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jknappen at web.de Tue Mar 28 11:52:24 2017 From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=) Date: Tue, 28 Mar 2017 18:52:24 +0200 Subject: Aw: Re: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2>, Message-ID: An HTML attachment was scrubbed... URL: From frederic.grosshans at gmail.com Tue Mar 28 12:26:16 2017 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Tue, 28 Mar 2017 17:26:16 +0000 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: I don't think it is a script capital G, but I admit it is arguable. One of the reasons is that the related variables s and ? are not script capital. If you're interested, I could check in the book if script capital are used in this book for other notations. Le mar. 28 mars 2017 ? 18:52, "J?rg Knappen" a ?crit : This is a script capital G or, in TeX notation, {\cal G}. It reflects the use of multiple styles of the same underlying alhabet in mathematics and sciences. It is not a capital script g (note the different ordering of capital and script). --J?rg Knappen I had found in 2013 a G? 
contrast in mathematical notations of an old (1952) physics book (see http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0092.html) Fr?d?ric -------------- next part -------------- An HTML attachment was scrubbed... URL: From pedberg at apple.com Tue Mar 28 12:30:04 2017 From: pedberg at apple.com (Peter Edberg) Date: Tue, 28 Mar 2017 10:30:04 -0700 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> Message-ID: <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> > On Mar 28, 2017, at 9:30 AM, Asmus Freytag wrote: > > On 3/28/2017 6:56 AM, Michael Everson wrote: >> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. > We need a pretzel emoji. Already in Unicode 10 / emoji 5.0: http://www.unicode.org/emoji/charts/emoji-released.html#1f968 > A./ > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Tue Mar 28 13:41:38 2017 From: doug at ewellic.org (Doug Ewell) Date: Tue, 28 Mar 2017 11:41:38 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Mark Davis wrote: > 3. Valid, but not recommended: "usca". 
Corresponds to the valid > Unicode subdivision code for California according to > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. "Not recommended" is no better and no less disappointing than "not standard." Both phrases imply strongly that the sequence, while syntactically valid, should not be used. Burying a disclaimer that "implementations can support them, but they may not interoperate well" in the speaker's notes of slide 38 of a 53-page presentation does nothing to change this perception. "Even though it is possible to support the US states, or any subset of them, implementations don't have to." Well, of course they don't. Implementations don't have to support the three British flags either if they don't want to, or any national flags or other emoji, or any particular character for that matter. The superfluous statement is easily reduced to "Don't do this." Joan Montané's return to the list to comment on this issue was interesting because of a post from February 2015, in which Andrea Giammarchi reported [1] on Joan's request [2] for Twitter to support flags for specific "active online communities" that happened to have a TLD, by stringing three or more Regional Indicator Symbols together: > [S][C][O][T] --> it shows Scottish flag > [C][Y][M][R][U] --> it shows a Welsh flag > [B][Z][H] --> it shows a Breton flag > [C][A][T] --> it shows Catalan flag > [E][U][S] --> it shows a Basque flag > [G][A][L] --> it shows a Galician flag [1] http://www.unicode.org/mail-arch/unicode-ml/y2015-m02/0039.html [2] https://github.com/twitter/twemoji/issues/40 Of course this approach was incompatible with conformant use of RIS; visit [2] with an RIS-conformant browser to see the inadvertently displayed flags of Seychelles, Cyprus, Belize, Canada, etc. 
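The breakage described here is mechanical: a conformant renderer consumes Regional Indicator Symbols strictly in pairs, so a three- or five-letter string falls apart into unrelated two-letter flags plus leftovers. A minimal sketch (Python; the function names are illustrative, not from any cited implementation):

```python
def to_ris(text):
    """Map ASCII letters A-Z to Regional Indicator Symbols (U+1F1E6..U+1F1FF)."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in text.upper())

def ris_pairs(text):
    """Group regional indicators two at a time, as a conformant renderer does."""
    ris = to_ris(text)
    return [ris[i:i + 2] for i in range(0, len(ris), 2)]

# "SCOT" decomposes into SC (Seychelles) plus OT (no such country, bare letters),
# which is why the non-conformant hack shows the Seychelles flag first.
print(ris_pairs("SCOT"))
print(ris_pairs("CAT"))  # CA (Canada) plus a dangling T
```

Running this shows exactly the Seychelles/Canada behavior visible in an RIS-conformant browser.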
I don't know if the ensuing thread helped inspire ESC to pursue the present mechanism involving sequences of Plane 14 tags -- the earliest mention I can find is PRI #299, just a few months later -- but the intent seemed straightforward and sensible: provide an official, conformant mechanism to support a recognized user need, with a suitable fallback strategy, rather than encouraging users via inaction to adopt a non-conformant and broken solution. Unfortunately, the follow-up turned out to be "... and then discourage THAT mechanism as well, except in a couple of selected cases, and tell people to use stickers instead." If this story sounds vaguely familiar to old-timers, it's exactly the path that was followed the last time Plane 14 tag characters were under discussion, between 1998 and 2000: someone wrote an RFC to embed language tags in plain text using invalid UTF-8 sequences; Unicode responded by introducing a proper, conformant mechanism to use Plane 14 characters instead; then the conformant replacement mechanism itself was deprecated and users were told to use out-of-band tagging, exactly what the original RFC sought to avoid. "Not recommended," "not standard," "not interoperable," or any other term ESC settles on for the 5000+ valid flag sequences that are not England, Scotland, and Wales is just a short, easy step away from deprecation for these as well. -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Tue Mar 28 15:09:19 2017 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Tue, 28 Mar 2017 13:09:19 -0700 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix. netcom.com> <4ADB0C3C-8560-49C1-81DD-90AA8B15A336@apple.com> Message-ID: <62deac46-7dbd-6e35-8c28-cb2628799848@ix.netcom.com> On 3/28/2017 10:30 AM, Peter Edberg wrote: > >> On Mar 28, 2017, at 9:30 AM, Asmus Freytag > > wrote: >> >> On 3/28/2017 6:56 AM, Michael Everson wrote: >>> An ? ligature is a ligature of a and of e. It is not some sort of >>> pretzel. >> We need a pretzel emoji. > > Already in Unicode 10 / emoji 5.0: > http://www.unicode.org/emoji/charts/emoji-released.html#1f968 No, like the ae, so a half eaten one. :) > >> A./ >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Tue Mar 28 15:17:43 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Tue, 28 Mar 2017 13:17:43 -0700 Subject: U+0261 LATIN SMALL LETTER SCRIPT G In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <11409ab9-8f8c-dfdc-334a-2f2053545030@ix.netcom.com> <589AC6B0-9AEA-4AAC-B40E-327B69C9196D@evertype.com> <2565096e-4db4-aa39-cfc0-257b6a0716a5@it.aoyama.ac.jp> <06FDD62B-5156-45B3-856B-20FACCC4A633@evertype.com> <20170328171435.39d8bd40@JRWUBU2> Message-ID: An HTML attachment was scrubbed... 
URL: From richard.wordingham at ntlworld.com Tue Mar 28 16:29:44 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 28 Mar 2017 22:29:44 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <20170328222944.3c53914c@JRWUBU2> On Tue, 28 Mar 2017 11:41:38 -0700 "Doug Ewell" wrote: > "Not recommended," "not standard," "not interoperable," or any other > term ESC settles on for the 5000+ valid flag sequences that are not > England, Scotland, and Wales is just a short, easy step away from > deprecation for these as well. It's certainly on the cards that the sequence for the Scottish flag will be deprecated in favour of an RI sequence. Richard. From markus.icu at gmail.com Tue Mar 28 18:52:04 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 28 Mar 2017 16:52:04 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: On Tue, Mar 28, 2017 at 11:41 AM, Doug Ewell wrote: > Mark Davis wrote: > > > 3. Valid, but not recommended: "usca". Corresponds to the valid > > Unicode subdivision code for California according to > > http://unicode.org/reports/tr51/proposed.html#valid-emoji-tag-sequences > > and CLDR, but is not listed in http://unicode.org/Public/emoji/5.0/. > > "Not recommended" is no better and no less disappointing than "not > standard." Both phrases imply strongly that the sequence, while > syntactically valid, should not be used. 
> I think the distinction between "valid" and "recommended" is confusing terminology-wise, but it does make sense to have a distinction between "valid" and "we know that one or more vendors are motivated to show these sequences as single glyphs". "valid" is clearly defined, and then there is a subset of valid that's listed in a catalog. Just like anyone is free to string some characters together with intervening ZWJ, but it is useful to have a catalog of sequences that are, or are going to be, in actual use, so that it is known which sequences are likely to work more or less the same on some set of devices. Right now is the right time to propose better wording in the spec so that implementers like you don't feel like they may get the rug pulled from under them down the road. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Mar 28 20:02:24 2017 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 28 Mar 2017 21:02:24 -0400 Subject: Encoding of old compatibility characters In-Reply-To: References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> Message-ID: <2e4a5a86-8b45-acd9-4e80-1f4a31f55805@kli.org> I don't think I want my text renderer to be *that* smart. If I want ⏨, I'll put ⏨. If I want a multiplication sign or something, I'll put that. Without the multiplication sign, it's still quite understandable, more so than just "e". It is valid for a text rendering engine to render "g" with one loop or two. I don't think it's valid for it to render "g" as "xg" or "-g" or anything else. The ⏨ character looks like it does. You don't get to add multiplication signs to it because you THINK you know what I'm saying with it. And using 20⏨ to mean "twenty base ten" sounds perfectly reasonable to me also. 
~mark On 03/28/2017 05:33 AM, Philippe Verdy wrote: > Ideally a smart text renderer could as well display that glyph with a > leading multiplication sign (a mathematical middle dot) and implicitly > convert the following digits (and sign) as real superscript/exponent > (using contextual substitution/positioning like for Eastern > Arabic/Urdu), without necessarily writing the 10 base with smaller > digits. > Without it, people will want to use 20⏨ to mean it is the decimal > number twenty and not the hexadecimal number thirty-two. > > 2017-03-28 11:18 GMT+02:00 Frédéric Grosshans > >: > > On 28/03/2017 at 02:22, Mark E. Shoulson wrote: > > Aw, but ⏨ is awesome! It's much cooler-looking and more > visually understandable than "e" for exponent notation. In > some code I've been playing around with I support it as a > valid alternative to "e". > > > I Agree 1⏨3 times with you on this ! > > Frédéric > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From mark at kli.org Tue Mar 28 20:31:39 2017 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 28 Mar 2017 21:31:39 -0400 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <550b86a8-111e-bfd9-bc82-40c9a3584b1e@kli.org> Kind of have to agree with Doug here. Either support the mechanism or don't. Saying "wellllllll, you CAN do this if you WANT to" always implies a "...but you probably shouldn't." Why even bother making it a possibility? On 03/28/2017 02:41 PM, Doug Ewell wrote: > "Even though it is possible to support the US states, or any subset of > them, implementations don?t have to." Well, of course they don't. > Implementations don't have to support the three British flags either if > they don't want to, or any national flags or other emoji, or any > particular character for that matter. The superfluous statement is > easily reduced to "Don't do this." That's a pretty good re-statement. 
~mark From duerst at it.aoyama.ac.jp Tue Mar 28 21:32:37 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 11:32:37 +0900 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> Message-ID: <3b8ac649-cf66-9670-dfb3-41af15e0dde0@it.aoyama.ac.jp> Hello Doug, On 2017/03/29 03:41, Doug Ewell wrote: > If this story sounds vaguely familiar to old-timers, it's exactly the > path that was followed the last time Plane 14 tag characters were under > discussion, between 1998 and 2000: someone wrote an RFC to embed > language tags in plain text using invalid UTF-8 sequences; Unicode > responded by introducing a proper, conformant mechanism to use Plane 14 > characters instead; then the conformant replacement mechanism itself was > deprecated and users were told to use out-of-band tagging, exactly what > the original RFC sought to avoid. I think there is some missing information here. First, the original proposal that used invalid UTF-8 sequences never was an RFC, only an Internet Draft. But what's more important, the protocol that motivated all this work (ACAP) never went anywhere. Nor did any other use of the plane 14 language tag characters get any kind of significant traction. That led to deprecation, because it would have been a bad idea to let people think that the information in these taggings would actually be used. For some people (including me), that was always seen as the likely outcome; the language tag characters were mostly introduced as a defensive mechanism (way better than invalid UTF-8) rather than something we hoped everybody would jump on. Putting them on plane 14 (which meant that it would be four bytes for each character, and therefore quite a lot of bytes for each tag) was part of that message. 
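That four-bytes-per-tag overhead is easy to verify. A sketch (Python) building the emoji tag sequence that reuses the same Plane 14 block, the flag of Scotland (WAVING BLACK FLAG, then tag characters for "gbsct", then CANCEL TAG):

```python
def tag_chars(ascii_text):
    """Shift printable ASCII into the Plane 14 tag block (U+E0020..U+E007E)."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)

CANCEL_TAG = "\U000E007F"

# Emoji tag sequence for the flag of Scotland: base flag + "gbsct" tags + cancel.
scotland = "\U0001F3F4" + tag_chars("gbsct") + CANCEL_TAG

# Every tag character lies beyond U+FFFF, hence four bytes each in UTF-8 --
# the per-tag cost mentioned above.
print(len(scotland), "code points,", len(scotland.encode("utf-8")), "UTF-8 bytes")
```

Seven code points, 28 bytes in UTF-8: a language tag spelled this way would have carried the same weight, which is part of why the mechanism was designed to discourage casual use.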
> "Not recommended," "not standard," "not interoperable," or any other > term ESC settles on for the 5000+ valid flag sequences that are not > England, Scotland, and Wales is just a short, easy step away from > deprecation for these as well. I think the situation is vastly different here. First, the Consortium never officially 'activated' any subdivision flags, so it would be impossible to deprecate them. Second, we already see some pressure (on this list) to 'recommend' more of these, and I guess the vendors and the Consortium will give in to this pressure, even if slowly and to some extent quite reluctantly. It's anyone's bet in what time frame and order e.g. the flags of California and Texas will be 'recommended'. But I have personally no doubt that these (and quite a few others) will eventually make it, even if I have mixed feelings about that. Regards, Martin. From duerst at it.aoyama.ac.jp Tue Mar 28 21:38:52 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 11:38:52 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> <6ce45184-f8b6-f1dd-3769-0f1b1fe00d81@ix.netcom.com> Message-ID: On 2017/03/29 01:47, Philippe Verdy wrote: > 2017-03-28 18:30 GMT+02:00 Asmus Freytag : > >> On 3/28/2017 6:56 AM, Michael Everson wrote: >> >>> An ? ligature is a ligature of a and of e. It is not some sort of pretzel. >>> >> We need a pretzel emoji. > > We need a broken tooth emoji too ! I prefer soft pretzels! Regards, Martin. 
From leob at mailcom.com Tue Mar 28 23:41:51 2017 From: leob at mailcom.com (Leo Broukhis) Date: Tue, 28 Mar 2017 21:41:51 -0700 Subject: Encoding of old compatibility characters In-Reply-To: <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> Message-ID: On Tue, Mar 28, 2017 at 6:09 AM, Asmus Freytag wrote: > On 3/28/2017 4:00 AM, Ian Clifton wrote: > > I?ve used ? a couple of times, without explanation, in my own > emails?without, as far as I?m aware, causing any misunderstanding. > > Works especially well, whenever it renders as a box with 23E8 inscribed! > Are you still using Windows 7 or RedHat 5, or something equally old? Newer systems have ? out of the box. Leo -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Mar 29 04:59:58 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 29 Mar 2017 10:59:58 +0100 (BST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> References: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> Message-ID: <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> Mark E. Shoulson wrote: > Kind of have to agree with Doug here. Either support the mechanism or don't. Saying "wellllllll, you CAN do this if you WANT to" always implies a "...but you probably shouldn't." Why even bother making it a possibility? Mark's use of wellllllll made me smile and brightened my day, because it resonated with my use, in a different context, of wolllll near the end of the last page of Chapter 16 of my novel. 
http://www.users.globalnet.co.uk/~ngo/localizable_sentences_the_novel_chapter_016.pdf A PDF document of size 31.01 kilobytes. Returning to what Doug and Mark wrote. When I read things like "not recommended" I imagine a situation where someone who is employed by a large information technology company being the person who actually sits down with the specification documents and makes a decision as to what to encode. That person is probably not one of the people who is in charge of running the company. So the person may well have an annual review meeting with people several steps up the hierarchy of the company, people who can promote, grudgingly continue to employ, or sack the employee. So I imagine the possibility of, at that meeting, the question of "Why did you implement all of those flags in our product?" being asked. The employee then explains his or her thinking, a desire to help end users and to have compatibility with communication with devices made by other manufacturers and for it all to be colourful and fun. The employee is then asked if he or she knew that implementation was not recommended. Did he or she know of that and went the other way thinking he or she knew better or had he or she not read that part of the documentation. So maybe the employee takes such a possible scenario into account when deciding whether to implement the flags in the first place. Relying on "not recommended" is safer. If the people higher up get letters from consumers asking for implementation and they ask for it to be done, then good, that would be enjoyable, but why be the one who could be criticised. 
I also imagine a scenario that instead of the "not recommended" that the advice might have been that it would be great and progressive if lots of flags were implemented in lots of products and it would be great if it could be done as soon as possible, by this summer if possible, ready for displaying at the conference in the autumn and to help that along here are some links to some free-to-use open source artwork that Unicode Inc. is making available in case you want to use it and here are some links to some free-to-use open source OpenType font glyph substitution code that Unicode Inc. is making available in case you want to use it. Well, why not? :-) William Overington Wednesday 29 March 2017 From duerst at it.aoyama.ac.jp Wed Mar 29 05:12:19 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Wed, 29 Mar 2017 19:12:19 +0900 Subject: Standaridized variation sequences for the Desert alphabet? In-Reply-To: <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: Hello everybody, Let me start with a short summary of where I think we are at, and how we got there. - The discussion started out with two letters, with two letter forms each. There is explicit talk of the 40-letter alphabet and glyphs in the Wikipedia page, not of two different letters. 
- That suggests that IF this script is in current use, and the shapes for these diphthongs are interchangeable (for those who use the script day-to-day, not for meta-purposes such as historic and typographic texts), keeping things unified is preferable. - As far as we have heard (in the course of the discussion, after questioning claims made without such information), it seems that: - There may not be enough information to understand how the creators and early users of the script saw this issue, on a scale that may range between "everybody knows these are the same, and nobody cares too much who uses which, even if individual people may have their preferences in their handwriting" to something like "these are different choices, and people wouldn't want their texts be changed in any way when published". - Similarly, there seem to be not enough modern practitioners of the script using the ligatures that could shed any light on the question asked in the previous item in a historical context, first apparently because there are not that many modern practitioners at all, and second because modern practitioners seem to prefer spelling with individual letters rather than using the ligatures. - IF the above is true, then it may be that these ligatures are mostly used for historic purposes only, in which case it wouldn't do any harm to present-day users if they were separated. If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners, not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice. In that light, some more comments lower down. On 2017/03/28 22:56, Michael Everson wrote: > On 28 Mar 2017, at 11:39, Martin J. D?rst wrote: > An ? ligature is a ligature of a and of e. It is not some sort of pretzel. 
Yes. But it's important that we know that because we have been faced with many cases where "æ" and "ae" were used interchangeably. For somebody not knowing the (extended) Latin alphabet and its usages, they might easily see more of a pretzel and less of 'a' and 'e'. I might try some experiments with some of my students (although I'm using "formulæ" in my lecture notes, and so they might already be too familiar with the "æ").

Also, if it were the case that shapes like "æ" and "œ" were used interchangeably across all uses of the Latin alphabet, I'm quite sure we would encode it with one code point rather than two, even if some researchers might claim that the latter was derived from an "o" rather than an "æ", or even if we knew it was derived from an "o" (as we know for the œ).

> What Deseret has is this:
>
> 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE
> * officially named 'ew' in the code chart
> * used for ew in earlier texts
> 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE
> * officially named 'oi' in the code chart
> * used for oi in earlier texts
> 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE
> * used for oi in later texts
> 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE
> * used for ew in later texts

Currently, it has this:

10426 𐐦 DESERET CAPITAL LETTER OI
10427 𐐧 DESERET CAPITAL LETTER EW

My personal opinion is that names are mostly hints, and not too much should be read into them, but if anything, the names in the current charts would suggest that the encoding is for the 39th/40th letter of the Deseret alphabet, whatever its shape, not for some particular shape.

And you know as well as I do that we can't change names. So if we split, we might end up with something like:

10426 𐐦 DESERET CAPITAL LETTER OI
10427 𐐧 DESERET CAPITAL LETTER EW
1xxxx DESERET CAPITAL LETTER VARIANT OI
1xxxx DESERET CAPITAL LETTER VARIANT EW

> Don't go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character.
>
> Don't go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character.

We have just established that there are no characters with such names in the standard. It's not the names or the history that I'm arguing.

> To do so is to show no understanding of the history of writing systems at all.

What I'd agree to is that cases where shapes with different historical origins merge and get treated as one and the same character are quite a lot rarer than cases where they don't merge. But we have seen cases where such a merge happens. ß is one of them. There are quite a few in Han (not surprising because there are tons of ideographs there to begin with).

But that experience doesn't mean that we have to rush to a conclusion without examining as much of the evidence as we can get hold of.

> You're smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here.

Skepticism, when presented with options without background facts, is a virtue in my opinion.

>> And as for precedent, the fact that we have encoded a lot of characters in Unicode doesn't mean that we can encode more characters without checking each and every single case very carefully, as we are doing in this discussion.
>
> The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don't think we haven't observed this.

As for BROCCOLI that you mention later and other emoji, first I would like to make clear that I don't use emoji personally nor do I push for their encoding.

But what's important for the discussion at hand is that when it comes to emoji, the question of whether we should unify or disunify BROCCOLI and CAULIFLOWER (just a hypothetical example) isn't as important. That's because there is no preexisting user community that would be seriously inconvenienced the way it would happen if we suddenly disunified the ſs/ſz ligature, or suddenly unified "æ" and "œ".
Emoji are a hopeless hodgepodge, where users click on what they see, and hope that it shows close enough to what they meant at the other end or after a few years.

>> Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different.
>
> Martin, there is no answer to this unless you can read the minds of people who are dead a century or more.

Thanks for telling us, finally.

>> To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters.
>
> I do not believe you.

It's true. When younger, I tried to read some old books written in Fraktur. It was hard work. Most of the lowercase letters were okay, but the ſ and the f were easy to confuse, and the k is also confusing. A lot of guessing was needed for upper case. I'm quite sure most people these days couldn't easily identify upper case letters in isolation. Of course, context helps a lot.

> If this were true menus in restaurants and public signage on shops wouldn't have Fraktur at all. It's true that sometimes the orthography on such things is bad, as where they don't use ligatures correctly or the ſ at all.

Shops and newspapers (e.g. NYT) and the like rely a lot on a logo effect. And the situation may be slightly different in Germany and in Switzerland.

> I'll stipulate that few Germans can read Sütterlin or similar hands. :-)

Definitely agreed!

> On 28 Mar 2017, at 11:59, Mark Davis ☕️ wrote:
>
>> I agree with Martin.

>> Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape.
>
> Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding's sake.

And coming to a discussion like this out of a concern for modern practitioners of the script (even if it seems, after a lot of discussion, that there aren't that many of these, and the issue at hand may indeed not concern them that much) is not some sort of attempt to unify things for unification's sake.

Regards,   Martin.

From everson at evertype.com  Wed Mar 29 08:08:59 2017
From: everson at evertype.com (Michael Everson)
Date: Wed, 29 Mar 2017 14:08:59 +0100
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To:
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com>
Message-ID:

Martin,

It's as though you'd not participated in this work for many years, really.

> On 29 Mar 2017, at 11:12, Martin J. Dürst wrote:
>
> Hello everybody,
>
> Let me start with a short summary of where I think we are at, and how we got there.
>
> - The discussion started out with two letters, with two letter forms each. There is explicit talk of the 40-letter alphabet and glyphs in the Wikipedia page, not of two different letters.

SO WHAT? Alphabets have 'letters' in them. 'Letters' are not 'characters'. In Welsh, 'ch' and 'dd' and 'll' are 'letters'.

> - That suggests that IF this script is in current use,

You don't even know? You're kidding, right?

> and the shapes for these diphthongs are interchangeable

It does NOT 'suggest' that at all.
> (for those who use the script day-to-day, not for meta-purposes such as historic and typographic texts), keeping things unified is preferable.

Deseret was a spelling reform replacement alphabet used for a period of time by the Mormons in what is now Utah. It is structurally very similar to Pitman's Phonotypic alphabets. Alphabets. There were many revisions of those. Some of them used letterforms we have encoded today, for IPA for instance. Some used letterforms we'd hardly recognize, and we'd never, ever consider them to be glyph variants of the IPA letters.

> - As far as we have heard (in the course of the discussion, after questioning claims made without such information), it seems that:

Yeah, it doesn't 'seem' anything but a whole lot of special pleading to bolster your rigid view that the glyphs in question can be interchangeable because of the sounds they may represent.

> - There may not be enough information to understand how the creators and early users of the script saw this issue,

Um, yeah. As if there were for Phoenician, or Luwian hieroglyphs, right?

> on a scale that may range between "everybody knows these are the same, and nobody cares too much who uses which, even if individual people may have their preferences in their handwriting" to something like "these are different choices, and people wouldn't want their texts to be changed in any way when published".

We know what the diphthongs were. We know that the script had a spelling reform where some characters were abandoned in favour of other characters. There was at least one font wh And there is lots of handwriting in which people write what they want to write, in the non-Latin alphabet they learned. As far as your guessing what people had in their minds about what they were writing, and as to your speculation about what the very few printers who had Deseret type might have done with such manuscripts, well, it is all reine Phantasie on your part.

Oh! Look! There was a spelling reform.
I should write 'Fantasie', shouldn't I? Wait! I can have spell-check dictionaries suit my preference! Wow! That's amazing!

> - Similarly, there seem to be not enough modern practitioners of the script using the ligatures that could shed any light on the question asked in the previous item in a historical context,

Completely irrelevant. Nobody worried about the number of modern users of the Insular letters we encoded. Why put such a constraint on users of Deseret? ?? ?? ?? ? ?? ?? ??.

> first apparently because there are not that many modern practitioners at all, and second because modern practitioners seem to prefer spelling with individual letters rather than using the ligatures.

This is equally ridiculous. John Jenkins chooses not to write the digraphs in the works which he transcribed, because that's what *he* chooses. He doesn't speak for anyone else who may choose to write in Deseret, and your assumption that 'modern practitioners' do this is groundless. It also ignores the fact that the script had a reform and that the value of separate encodings for the various characters is of value to those studying the provenance and orthographic practices of those who wrote Deseret when it was in active use.

This is exactly the same thing as the medievalist Latin abbreviation and other characters we encoded. There is neither sense nor logic nor utility in trying to argue for why editors of Deseret documents shouldn't have the same kinds of tools that medievalists have. And as far as medievalist concerns go, many of the characters are used by relatively few researchers. Some of the characters we encoded are used all over Europe at many times. Some are used only by Nordicists, some by Celticists, and some by subsets within the Nordicist and Celticist communities.

> - IF the above is true, then it may be that these ligatures are mostly used for historic purposes only, in which case it wouldn't do any harm to present-day users if they were separated.

Harm? What harm?
Recently the UTC looked at a proposal for capital letters for ? and ?. Evidence for their existence was shown. One person on the call to the UTC said he didn't think anyone needed them. Two of us do need them. I needed them last weekend and I had to use awkward workarounds. They weren't accepted. There wasn't any good rationale for the rejection. I mean, the letters exist. Case is a normal function of the script. But they weren't accepted. For the guy who didn't think he needed them, well, so what? If they're encoded, he doesn't have to use them.

Harm to present-day users? I agree with you. Any modern-day user creating new texts who doesn't like to use the diphthong letters doesn't have to use them. Any modern-day user trying to represent historic texts accurately, however, can't, because not all the letters are encoded.

> If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners,

People who use Deseret use it for historical purposes and for cultural reasons. Everybody in Utah reads English in standard Latin orthography.

> not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice.

I didn't 'come up' with separate historical derivations for the four characters in question. It is entirely obvious that LONG AH, SHORT AH, LONG OO, and SHORT OO are variously combined with the stroke of SHORT I. Entirely obvious. There is no other interpretation.

> In that light, some more comments lower down.
>
> On 2017/03/28 22:56, Michael Everson wrote:
>> On 28 Mar 2017, at 11:39, Martin J. Dürst wrote:
>
>> An æ ligature is a ligature of a and of e. It is not some sort of pretzel.
>
> Yes. But it's important that we know that because we have been faced with many cases where "æ"
> and "ae" were used interchangeably.

Irrelevant. This is just spelling. It's no different than colour/color or maximize/maximise or aluminium/aluminum.

> For somebody not knowing the (extended) Latin alphabet and its usages, they might easily see more of a pretzel and less of 'a' and 'e'. I might try some experiments with some of my students (although I'm using "formulæ" in my lecture notes, and so they might already be too familiar with the "æ").

You have missed the point fabulously. The point was that the æ ligature can be easily identified as being made of A and of E. And the four Deseret characters can easily be identified as being made of LONG AH, SHORT AH, LONG OO, and SHORT OO with the stroke of SHORT I.

> Also, if it were the case that shapes like "æ" and "œ" were used interchangeably across all uses of the Latin alphabet, I'm quite sure we would encode it with one code point rather than two, even if some researchers might claim that the latter was derived from an "o" rather than an "æ", or even if we knew it was derived from an "o" (as we know for the œ).

I don't agree, and there are hundreds of

>> What Deseret has is this:
>>
>> 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE
>> * officially named 'ew' in the code chart
>> * used for ew in earlier texts
>> 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE
>> * officially named 'oi' in the code chart
>> * used for oi in earlier texts
>> 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE
>> * used for oi in later texts
>> 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE
>> * used for ew in later texts
>
> Currently, it has this:
>
> 10426 𐐦 DESERET CAPITAL LETTER OI
>
> 10427 𐐧 DESERET CAPITAL LETTER EW

You are being deliberately obtuse. Note that I stated clearly 'officially named "ew/oi" in the code chart'.

> My personal opinion is that names are mostly hints, and not too much should be read into them,

I do not share this opinion.
> but if anything, the names in the current charts would suggest that the encoding is for the 39th/40th letter of the Deseret alphabet, whatever its shape, not for some particular shape.

You make too much of these numbers, but then there are charts of the 38-letter alphabet and charts of the 40-letter alphabet, but those numbers have to do with the number of English phonemes represented in Phonotypy and in Deseret, and with the augmentation of that by the addition of letters which represent phonemes.

> And you know as well as I do that we can't change names. So if we split, we might end up with something like:
>
> 10426 𐐦 DESERET CAPITAL LETTER OI
>
> 10427 𐐧 DESERET CAPITAL LETTER EW
>
> 1xxxx DESERET CAPITAL LETTER VARIANT OI
>
> 1xxxx DESERET CAPITAL LETTER VARIANT EW

I'm pretty sure we will propose the names LONG AH WITH STROKE and SHORT OO WITH STROKE. The two un-encoded characters are used for the *diphthongs* oi and ew but they are not 'variants' of the other letters. We do not require matching names here. Compare LATIN LETTER YR and LATIN LETTER SMALL CAPITAL R. Compare LATIN CAPITAL LETTER HWAIR and LATIN SMALL LETTER HV.

>> Don't go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH STROKE are glyph variants of the same character.
>>
>> Don't go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH STROKE are glyph variants of the same character.
>
> We have just established that there are no characters with such names in the standard. It's not the names or the history that I'm arguing.

You're being obtuse again. Fine. Don't go trying to tell me that EW and SHORT OO WITH STROKE are glyph variants of the same character. Don't go trying to tell me that LONG AH WITH STROKE and OI are glyph variants of the same character. They're not. The origin of all those letterforms is obvious, and we do not encode sounds, we encode the elements of writing systems.

>> To do so is to show no understanding of the history of writing systems at all.
>
> What I'd agree to is that cases where shapes with different historical origins merge and get treated as one and the same character are quite a lot rarer than cases where they don't merge.

They didn't merge in Deseret. They had a reform, removing some characters and adding some other characters.

> But we have seen cases where such a merge happens. ß is one of them.

That's even arguable because ſz only really occurs in the whole-font Fraktur style. It's pretty rare to see it in Antiqua. Of course it must be attested there, but it's by no means common.

> There are quite a few in Han (not surprising because there are tons of ideographs there to begin with).
>
> But that experience doesn't mean that we have to rush to a conclusion without examining as much of the evidence as we can get hold of.

I haven't rushed to a conclusion. I've made a thorough analysis.

>> You're smarter than that. So are Asmus and Mark and Erkki and any of the other sceptics who have chimed in here.
>
> Skepticism, when presented with options without background facts, is a virtue in my opinion.

Your argument seemed to be based solely on the use of the letters for the sounds, ignoring the historical derivation and the facts of the spelling reform in Deseret.

>> The UTC encodes a great many characters without checking them at all, or even offering documentation on them to SC2. Don't think we haven't observed this.
>
> As for BROCCOLI that you mention later and other emoji, first I would like to make clear that I don't use emoji personally nor do I push for their encoding.

I *do* use emoji and I have devised many emoji which are now in current use. I do find that the process for adding symbols to the UCS (which is not the same thing as giving symbols the emoji property) is not functioning particularly well at present.
> But what's important for the discussion at hand is that when it comes to emoji, the question of whether we should unify or disunify BROCCOLI and CAULIFLOWER (just a hypothetical example) isn't as important.

Eventually we will have CABBAGE, and then some people will need to use ZWJ to join CABBAGE and KNIFE so that sauerkraut can be represented, and then other people will need to use ZWJ to join CABBAGE and HOT PEPPER for kimchi, and in Ireland we've got bacon and cabbage of course, and... Heh.

> That's because there is no preexisting user community that would be seriously inconvenienced the way it would happen if we suddenly disunified the ſs/ſz ligature, or suddenly unified "æ" and "œ". Emoji are a hopeless hodgepodge, where users click on what they see, and hope that it shows close enough to what they meant at the other end or after a few years.

No one using Deseret will be inconvenienced by adding additional historical characters for the already historical script. Anyone using modern Deseret fonts *would* be inconvenienced by unifying the LONG-AH-WITH-STROKE and SHORT-AH-WITH-STROKE characters and the LONG-OO-WITH-STROKE and SHORT-OO-WITH-STROKE characters, I think. No current fonts that I know of have the 1859 glyphs, apart from private fonts Ken Beesley used for his own work.

>>> Of course they will easily see different shapes, but what's important isn't the shapes, it's what they associate it with. If for them, it's just two shapes for one and the same 40th letter of the Deseret alphabet, then that is a strong suggestion for not encoding separately, even if the shapes look really different.
>>
>> Martin, there is no answer to this unless you can read the minds of people who are dead a century or more.
>
> Thanks for telling us, finally.

What on earth do you mean? I have withheld no secrets. I've objected to your wilful unification of characters with obviously different origins.
>>> To use another analogy, many people these days (me included) would have difficulties identifying Fraktur letters, in particular if they show up just as individual letters.
>>
>> I do not believe you.
>
> It's true. When younger, I tried to read some old books written in Fraktur. It was hard work. Most of the lowercase letters were okay, but the ſ and the f were easy to confuse, and the k is also confusing. A lot of guessing was needed for upper case. I'm quite sure most people these days couldn't easily identify upper case letters in isolation. Of course, context helps a lot.

It's not the easiest thing but it does not take all that much to accustom oneself to it.

>> If this were true menus in restaurants and public signage on shops wouldn't have Fraktur at all. It's true that sometimes the orthography on such things is bad, as where they don't use ligatures correctly or the ſ at all.
>
> Shops and newspapers (e.g. NYT) and the like rely a lot on a logo effect. And the situation may be slightly different in Germany and in Switzerland.

People can read the menus and the public signage nevertheless. Fraktur is not so unbelievably different that it's entirely opaque.

>> I'll stipulate that few Germans can read Sütterlin or similar hands. :-)
>
> Definitely agreed!

I learned to write Sütterlin. Going back and reading something written takes work too...

>
>> On 28 Mar 2017, at 11:59, Mark Davis ☕️ wrote:
>>
>>> I agree with Martin.
>
>>> Simply because someone used a particular shape at some time to mean a letter doesn't mean that Unicode should encode a letter for that shape.
>>
>> Coming to a forum like this out of a concern for the corpus of Deseret literature is not some sort of attempt to encode things for encoding's sake.
>
> And coming to a discussion like this out of a concern for modern practitioners of the script (even if it seems, after a lot of discussion, that there aren't that many of these, and the issue at hand may indeed not concern them that much) is not some sort of attempt to unify things for unification's sake.

I think you made a lot of assumptions about 'modern practitioners' which you didn't disclose.

A proposal will be forthcoming. I want to thank several people who have written to me privately supporting my position with regard to this topic on this list. I can only say that supporting me in public is more useful than supporting me in private.

Michael

From asmusf at ix.netcom.com  Wed Mar 29 09:04:21 2017
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Wed, 29 Mar 2017 07:04:21 -0700
Subject: Standaridized variation sequences for the Desert alphabet?
In-Reply-To:
References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com>
Message-ID: <700f1d2c-daf2-470a-b75c-166812805887@ix.netcom.com>

Martin,

thanks for the careful summary.

As in all these cases it is possible to argue from different premises, so I would, unfortunately, not expect that this discussion will reach the consensus of all parties.

In the end, Unicode is made for the modern user, whether they are native users of a script, or modern users archiving or discussing historic texts. The specific principles used in each encoding decision matter, but only insofar as the result works for the modern (and future!) users of the standard.
A./ PS: as to modern use of Fraktur -- many fonts for black-letter logos are modified to help modern readers recognize the words. On 3/29/2017 3:12 AM, Martin J. D?rst wrote: > Hello everybody, > > Let me start with a short summary of where I think we are at, and how > we got there. > > - The discussion started out with two letters, > with two letter forms each. There is explicit talk of the > 40-letter alphabet and glyphs in the Wikipedia page, not > of two different letters. > - That suggests that IF this script is in current use, and the > shapes for these diphthongs are interchangeable (for those > who use the script day-to-day, not for meta-purposes such > as historic and typographic texts), keeping things unified > is preferable. > - As far as we have heard (in the course of the discussion, > after questioning claims made without such information), > it seems that: > - There may not be enough information to understand how the > creators and early users of the script saw this issue, > on a scale that may range between "everybody knows these > are the same, and nobody cares too much who uses which, > even if individual people may have their preferences in > their handwriting" to something like "these are different > choices, and people wouldn't want their texts be changed > in any way when published". > - Similarly, there seem to be not enough modern practitioners > of the script using the ligatures that could shed any > light on the question asked in the previous item in a > historical context, first apparently because there are not > that many modern practitioners at all, and second because > modern practitioners seem to prefer spelling with > individual letters rather than using the ligatures. > - IF the above is true, then it may be that these ligatures > are mostly used for historic purposes only, in which case > it wouldn't do any harm to present-day users if they were separated. 
> > If the above is roughly correct, then it's important that we reached > that conclusion after explicitly considering the potential of a split > to create inconvenience and confusion for modern practitioners, not > after just looking at the shapes only, coming up with separate > historical derivations for each of them, and deciding to split because > history is way more important than modern practice. > > In that light, some more comments lower down. > > On 2017/03/28 22:56, Michael Everson wrote: >> On 28 Mar 2017, at 11:39, Martin J. D?rst >> wrote: > >> An ? ligature is a ligature of a and of e. It is not some sort of >> pretzel. > > Yes. But it's important that we know that because we have been faced > with many cases where "?" and "ae" were used interchangeably. For > somebody not knowing the (extended) Latin alphabet and its usages, > they might easily see more of a pretzel and less of 'a' and 'e'. I > might try some experiments with some of my students (although I'm > using "formul?" in my lecture notes, and so they might already be too > familiar with the "?"). > > Also, if it were the case that shapes like "?" and "?" were used > interchangeably across all uses of the Latin alphabet, I'm quite sure > we would encode it with one code point rather than two, even if some > researchers might claim that the later was derived from an "o" rather > than an "?", or even if we knew it was derived from an "o" (as we know > for the ?). > > >> What Deseret has is this: >> >> 10426 DESERET CAPITAL LETTER LONG OO WITH STROKE >> * officially named ?ew? in the code chart >> * used for ew in earlier texts >> 10427 DESERET CAPITAL LETTER SHORT AH WITH STROKE >> * officially named ?oi? in the code chart >> * used for oi in earlier texts >> 1xxxx DESERET CAPITAL LETTER LONG AH WITH STROKE >> * used for oi in later texts >> 1xxxx DESERET CAPITAL LETTER SHORT OO WITH STROKE >> * used for ew in later texts > > Currently, it has this: > > 10426 ?? 
DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW > > My personal opinion is that names are mostly hints, and not too much > should be read into them, but if anything, the names in the current > charts would suggest that the encoding is for the 39th/40th letter of > the Deseret alphabet, whatever its shape, not for some particular shape. > > And you know as well as I do that we can't change names. So if we > split, we might end up with something like: > > 10426 ?? DESERET CAPITAL LETTER OI > > 10427 ?? DESERET CAPITAL LETTER EW > > 1xxxx DESERET CAPITAL LETTER VARIANT OI > > 1xxxx DESERET CAPITAL LETTER VARIANT EW > > >> Don?t go trying to tell me that LONG OO WITH STROKE and SHORT OO WITH >> STROKE are glyph variants of the same character. >> >> Don?t go trying to tell me that LONG AH WITH STROKE and SHORT AH WITH >> STROKE are glyph variants of the same character. > > We have just established that there are no characters with such names > in the standard. It's not the names or the history that I'm arguing. > > >> To do so is to show no understanding of the history of writing >> systems at all. > > What I'd agree to is that cases where shapes with different historical > origins merge and get treated as one and the same character are quite > a lot rarer than cases where they don't merge. But we have seen cases > where such a merge happens. ? is one of them. There are quite a few in > Han (not surprising because there are tons of ideographs there to > begin with). > > But that experience doesn't mean that we have to rush to a conclusion > without examining as much of the evidence as we can get hold of. > > >> You?re smarter than that. So are Asmus and Mark and Erkki and any of >> the other sceptics who have chimed in here. > > Skepticism is when presented with options without background facts is > a virtue in my opinion. 
> > >>> And as for precedent, the fact that we have encoded a lot of >>> characters in Unicode doesn't mean that we can encode more >>> characters without checking each and every single case very >>> carefully, as we are doing in this discussion. >> >> The UTC encodes a great many characters without checking them at all, >> or even offering documentation on them to SC2. Don?t think we haven?t >> observed this. > > As for BROCCOLI that you mention later and other emoji, first I would > like to make clear that I don't use emoji personally nor do I push for > their encoding. > > But what's important for the discussion at hand is that when it comes > to emoji, the question of whether we should unify or disunify BROCCOLI > and CAULIFLOWER (just a hypothetical example) isn't as important. > That's because there is no preexisting user community that would be > seriously inconvenienced the way it would happen if we suddenly > disunified the ?s/?z ligature, or suddenly unified "?" and "?". Emoji > are a hopeless hodgepodge, where users click on what they see, and > hope that it shows close enough to what they meant at the other end or > after a few years. > > >>> Of course they will easily see different shapes, but what's >>> important isn't the shapes, it's what they associate it with. If for >>> them, it's just two shapes for one and the same 40th letter of the >>> Deseret alphabet, then that is a strong suggestion for not encoding >>> separately, even if the shapes look really different. >> >> Martin, there is no answer to this unless you can read the minds of >> people who are dead a century or more. > > Thanks for telling us, finally. > > >>> To use another analogy, many people these days (me included) would >>> have difficulties identifying Fraktur letters, in particular if they >>> show up just as individual letters. >> >> I do not believe you. > > It's true. When younger, I tried to read some old books written in > Fraktur. It was hard work. 
Most of the lower letters were okay, but > the ſ and the f were easy to confuse, and the k is also confusing. A > lot of guessing was needed for upper case. I'm quite sure most people > these days couldn't easily identify upper case letters in isolation. > Of course, context helps a lot. > >> If this were true menus in restaurants and public signage on shops >> wouldn't have Fraktur at all. It's true that sometimes the >> orthography on such things is bad, as where they don't use ligatures >> correctly or the ß at all. > > Shops and newspapers (e.g. NYT) and the like rely a lot on a logo > effect. And the situation may be slightly different in Germany and in > Switzerland. > >> I'll stipulate that few Germans can read Sütterlin or similar hands. :-) > > Definitely agreed! > > >> On 28 Mar 2017, at 11:59, Mark Davis ?? wrote: >> >>> I agree with Martin. > >>> Simply because someone used a particular shape at some time to mean >>> a letter doesn't mean that Unicode should encode a letter for that >>> shape. >> >> Coming to a forum like this out of a concern for the corpus of >> Deseret literature is not some sort of attempt to encode things for >> encoding's sake. > > And coming to a discussion like this out of a concern for modern > practitioners of the script (even if it seems, after a lot of > discussion, that there aren't that many of these, and the issue at > hand may indeed not concern them that much) is not some sort of > attempt to unify things for unification's sake. > > > Regards, Martin. 
> From markus.icu at gmail.com Wed Mar 29 11:09:30 2017 From: markus.icu at gmail.com (Markus Scherer) Date: Wed, 29 Mar 2017 09:09:30 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> References: <26038998.13875.1490781057693.JavaMail.root@webmail29.bt.ext.cpcloud.co.uk> <32709367.14617.1490781598791.JavaMail.defaultUser@defaultHost> Message-ID: I think "recommended" could be renamed to "(expected to be) widely implemented". markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Mar 29 15:09:25 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 13:09:25 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> Markus Scherer wrote: > I think "recommended" could be renamed to "(expected to be) widely > implemented". That's a modest improvement; it shifts from an advisory health warning not to use certain sequences to what it is, speculation that some sequences will be far better supported in the field than others. I still don't see why this distinction is necessary. It's not made for other emoji or non-emoji. I have no fonts for Tai Tham,? which has been in Unicode since 2009, but I don't see any warnings against using Tai Tham because someone like me might not have a font for it. ? No, I'm not looking for one; that isn't the point. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Wed Mar 29 15:12:11 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 13:12:11 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Martin J. Dürst wrote: > I think there is some missing information here. 
First, the original > proposal that used invalid UTF-8 sequences never was an RFC, only an > Internet Draft. Yes, you're right. I realized that a minute after "Send" but didn't think it changed the story enough to justify a correction. For the curious, the I-D is at https://www.ietf.org/archive/id/draft-ietf-acap-mlsf-01.txt . > But what's more important, the protocol that motivated all this work > (ACAP) never went anywhere. Nor did any other use of the plane 14 > language tag characters get any kind of significant traction. That > led to deprecation, because it would have been a bad idea to let > people think that the information in these taggings would actually be > used. Is that common practice in Unicode, that if something doesn't gain significant traction in the comparatively short term, it becomes a candidate for deprecation? > For some people (including me), that was always seen as the likely > outcome; the language tag characters were mostly introduced as a > defensive mechanism (way better than invalid UTF-8) rather than > something we hoped everybody would jump on. Putting them on plane 14 > (which meant that it would be four bytes for each character, and > therefore quite a lot of bytes for each tag) was part of that message. I understand the "defensive" aspect of trying to prevent people from abusing Unicode, especially in the 1997–1998 time frame when UTF-8 was still new and people didn't realize the cost of tampering with it. But if you're going to build a mechanism at all, it seems peculiar to define it in full but then discourage its intended use at the outset, or to build it in such a way that users will find it difficult or unpalatable to use. > I think the situation is vastly different here. First, the Consortium > never officially 'activated' any subdivision flags, so it would be > impossible to deprecate them. The Emoji 5.0 mechanism of using tag sequences for three subdivision flags was announced earlier this week. 
The specification grudgingly allows, but non-recommends, use of the mechanism for any other flags. It is that grudging allowance that could be deprecated, not any of the specific flags. > Second, we already see some pressure (on this list) to 'recommend' > more of these, and I guess the vendors and the Consortium will give in > to this pressure, even if slowly and to some extent quite reluctantly. > It's anyone's bet in what time frame and order e.g. the flags of > California and Texas will be 'recommended'. But I have personally no > doubt that these (and quite a few others) will eventually make it, > even if I have mixed feelings about that. Then what was the benefit of "not recommending" them in the first place? Why is it a problem if vendors look at the list of 5100 or so subdivisions, or even the small subset that actually have flags, and think, "OMG, look at all those new flags we'll be forced to support"? Is this any different from when a new CJK extension or other large block of characters is added? I would think vendors could make their own business decisions about what flags to support. "Hmm, yeah, definitely Texas, maybe Lombardy, not so sure about Colorado, probably not Guna Yala." I don't see why they had to be essentially told what to support and what not to. -- Doug Ewell | Thornton, CO, US | ewellic.org From jenkins at apple.com Wed Mar 29 15:45:27 2017 From: jenkins at apple.com (John H. Jenkins) Date: Wed, 29 Mar 2017 14:45:27 -0600 Subject: Standaridized variation sequences for the Desert alphabet? 
In-Reply-To: References: <20170321104104.665a7a7059d7ee80bb4d670165c8327d.c6e0d0ee2d.wbe@email03.godaddy.com> <84F24B3C-9884-432C-B71F-B8D9D283DE9B@evertype.com> <24975108-52a4-cda4-737d-6a41ff1b5c14@it.aoyama.ac.jp> <17D37ACB-9269-4537-AE60-71BB6CA42366@evertype.com> <7686cee6-1b4f-d1a6-8cd7-09859757465c@it.aoyama.ac.jp> <587FFDFA-CAAE-4F81-B60D-94EB9C550151@evertype.com> <2f05d26e-9d3f-4670-f667-1daf1cd53063@it.aoyama.ac.jp> <6C843948-F554-4C52-B103-36508595C4FB@evertype.com> Message-ID: > On Mar 29, 2017, at 4:12 AM, Martin J. Dürst wrote: > > Let me start with a short summary of where I think we are at, and how we got there. > > - The discussion started out with two letters, > with two letter forms each. There is explicit talk of the > 40-letter alphabet and glyphs in the Wikipedia page, not > of two different letters. > - That suggests that IF this script is in current use, and the > shapes for these diphthongs are interchangeable (for those > who use the script day-to-day, not for meta-purposes such > as historic and typographic texts), keeping things unified > is preferable. > - As far as we have heard (in the course of the discussion, > after questioning claims made without such information), > it seems that: > - There may not be enough information to understand how the > creators and early users of the script saw this issue, > on a scale that may range between "everybody knows these > are the same, and nobody cares too much who uses which, > even if individual people may have their preferences in > their handwriting" to something like "these are different > choices, and people wouldn't want their texts be changed > in any way when published". I see this part of the problem as more one of proper transcription of existing materials, and less one of how the original authors saw the issues. 
Handwritten material is very important in the study of 19th century LDS history, and although the materials actually in the DA are scant (at best), the peculiarities of the spelling can be instructive. As such, I certainly agree that being able to transcribe material "faithfully" is important. I'm not an expert in this area, though, so I can't speak for myself whether this separate encoding or variation selectors or some other mechanism is the best way to provide support for this. I'm more than happy to defer to Michael and other people who *are* experts. If paleographers think separate encoding is best, then I'm for separate encoding. > - Similarly, there seem to be not enough modern practitioners > of the script using the ligatures that could shed any > light on the question asked in the previous item in a > historical context, first apparently because there are not > that many modern practitioners at all, and second because > modern practitioners seem to prefer spelling with > individual letters rather than using the ligatures. Well, as one of the people in this camp, and as Michael has pointed out, I eschew use of these letters altogether. I restrict myself to the 1869 version of the alphabet, which is used in virtually all of the printed materials and has only thirty-eight letters. > - IF the above is true, then it may be that these ligatures > are mostly used for historic purposes only, in which case > it wouldn't do any harm to present-day users if they were separated. > > If the above is roughly correct, then it's important that we reached that conclusion after explicitly considering the potential of a split to create inconvenience and confusion for modern practitioners, not after just looking at the shapes only, coming up with separate historical derivations for each of them, and deciding to split because history is way more important than modern practice. 
Fortunately, since the existing Deseret block is full, any separately encoded entities will have to be put somewhere else, making it easier to document the nature and purpose of the symbols involved. Not that we can be confident that it will help. (http://www.deseretalphabet.info/XKCD/1726.html ) -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Wed Mar 29 15:55:49 2017 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 29 Mar 2017 13:55:49 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> References: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Message-ID: On 3/29/2017 1:12 PM, Doug Ewell wrote: > I would think vendors could make their own business decisions about what > flags to support. "Hmm, yeah, definitely Texas, maybe Lombardy, not so > sure about Colorado, probably not Guna Yala." I don't see why they had > to be essentially told what to support and what not to. I think you have it approximately backwards. It isn't the UTC telling the vendors "what to support and what not to" -- it was the vendors saying "this is what we need to support, and we'd like to not do it in a haphazard way, so let's tell the UTC what we want them to document in the data for UTS #51." You are correct that the vendors can make their own business decisions. And apparently as of now, Microsoft, for whatever reason, has made its business decision not to support flag emoji *at all* on its phones. O.k., that is their decision. So no Texas, no Lombardy, no Colorado, no Guna Yala, but also no Japan, no Great Britain, no Scotland... Other vendors have decided *to* support flag emoji on their phone platforms. O.k., that is their decision. 
*But*, the ones who do have flags on their phones don't want to be in the situation where the iPhone has a flag of Scotland which then shows up as a flag tofu on an Android phone, but an Android phone has a flag of Texas which then shows up as a flag tofu on an iPhone, etc., etc. That way leads to customer complaint madness, with 1000's (hundreds of 1000's?) of complaints: "My phone is screwed up, fix it!" Or maybe you want the job on the consumer complaint line about that topic. ;-) --Ken From andrewcwest at gmail.com Wed Mar 29 16:00:59 2017 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 29 Mar 2017 22:00:59 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> References: <20170329130925.665a7a7059d7ee80bb4d670165c8327d.347737d545.wbe@email03.godaddy.com> Message-ID: On 29 March 2017 at 21:09, Doug Ewell wrote: > >> I think "recommended" could be renamed to "(expected to be) widely >> implemented". > > That's a modest improvement; it shifts from an advisory health warning > not to use certain sequences to what it is, speculation that some > sequences will be far better supported in the field than others. I don't think that would work. http://www.unicode.org/Public/emoji/5.0/emoji-sequences.txt explicitly lists just the three subdivision flags for England, Scotland and Wales under Emoji Tag Sequences, which indicates that they are special in an undefined way that none of the thousands of other potential subdivision flag tag sequences are. There must be a higher bar for inclusion in the Emoji data files than simply that they are expected to be widely implemented. Their inclusion in the Emoji data files and the Emoji charts (http://www.unicode.org/emoji/charts/emoji-ordering.html) must indicate that only these three tag sequences are recommended or sanctioned by the UTC. 
(In case anyone thinks I support the current situation, let me state that I disagree vehemently with the UTC decision to only "recommend" these three particular subdivision flag tag sequences.) Andrew From doug at ewellic.org Wed Mar 29 16:07:20 2017 From: doug at ewellic.org (Doug Ewell) Date: Wed, 29 Mar 2017 14:07:20 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> Ken Whistler wrote: > *But*, the ones who do have flags on their phones don't want to be in > the situation where the iPhone has a flag of Scotland which then shows > up as a flag tofu on an Android phone, but an Android phone has a flag > of Texas which then shows up as a flag tofu on an iPhone, etc., etc. > That way leads to customer complaint madness, with 1000's (hundreds of > 1000's?) of complaints: "My phone is screwed up, fix it!" Doesn't this same problem exist for other emoji, or non-emoji, that are supported on some phones but not others? What's the customer service resolution in those cases? -- Doug Ewell | Thornton, CO, US | ewellic.org From christoph.paeper at crissov.de Wed Mar 29 16:17:58 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 29 Mar 2017 23:17:58 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> Message-ID: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Mark Davis ?? : > On Tue, Mar 28, 2017 at 11:56 AM, Joan Montané wrote: > > > 1st one: point 4 (Unicode subdivision codes listed in emoji Unicode site) > > arises something like chicken-egg problem. Vendors don't easily add new > > subdivision-flags (because they aren't recommended), and Unicode doesn't > > recommend new subdivision flags (because they aren't supported by vendors). > ? > That isn't really the case. 
In particular, vendors can propose adding > additional subdivisions to the recommended list. Awesome, "vendors" can do that. (._.m) If I made an open-source emoji font that contained flags for all of the 5000ish ISO 3166-2 codes that actually map to one, would I automatically be considered a vendor? Do I need to have to pay 18000(?) dollars a year for full membership first? (That's peanuts for multi-billion dollar companies, but unaffordable for most individuals and many FOSS projects.) Someone could try to push such an edit onto Emojione, Twemoji or Noto Emoji, but something tells me none of the maintainers would accept flag PRs by random users unless UTR/UTS#51 already recommended them. - - - - <- The last one currently already has support for UK countries, US states and Canadian provinces. Go figure. > The UTC Consideration?s ... would come into play in assessing those proposals. >? So it is certainly possible for there to be (say) a flag of Texas or >Catalonia > appearing in an Emoji 6.0 release this year. Those are desired, for sure, but so are emoji flags for Kurdistan, Confederated States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, Communism, Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian Aboriginals, and many more. Of these, only the Kurdish and the Sami flag *may* be covered by Unicode Emoji 5.0+ (possibly with multiple codes) until yet another (Tag-based) scheme is adopted. > Similarly, Microsoft could propose adding the ninja cat ZWJ sequences. 
I still fail to see how it is a good or smart thing to have to maintain Emoji Tag Sequences *and* Emoji ZWJ Sequences, when adopting the latter for flags would have had at least the following advantages: - actually useful fallback - application beyond ISO 3166 restrictions From kenwhistler at att.net Wed Mar 29 16:22:32 2017 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 29 Mar 2017 14:22:32 -0700 Subject: Traction and Deprecation (was: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> References: <20170329131211.665a7a7059d7ee80bb4d670165c8327d.0215c3a290.wbe@email03.godaddy.com> Message-ID: <82f78fdc-42e7-0ad4-a8fc-717f732b05a9@att.net> On 3/29/2017 1:12 PM, Doug Ewell wrote: > Is that common practice in Unicode, that if something doesn't gain > significant traction in the comparatively short term, it becomes a > candidate for deprecation? If a mechanism was dodgy in the first place and was dubious as a part of plain text, then yes. If a mechanism is clearly a necessary part of the text model, but takes a while to catch on, because it is inherently complicated to implement and roll out, then no. Remember, it took a good part of a decade for significant support of combining marks to start appearing in Unicode implementations. Even longer for fairly good support of the Indic rendering models. If you are worried about the emoji tag sequence mechanism, then I'd say no. Once the use of regional indicator symbols caught on to represent flag emoji, that basically settled the question of whether pictographic symbols for flags were a part of plain text. Once the emoji tag sequences are rolled out for the regional flags (a process I can surmise is happening even now as we debate this), there will be no going back. 
You can be guaranteed, given the current attention to Brexit, that the tag sequence for the Scotland flag, at least, will leap up the emoji frequency list almost immediately. And that data will end up being supported essentially forever. --Ken From christoph.paeper at crissov.de Wed Mar 29 16:34:18 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 29 Mar 2017 23:34:18 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170328222944.3c53914c@JRWUBU2> References: <20170328114138.665a7a7059d7ee80bb4d670165c8327d.d674d514e8.wbe@email03.godaddy.com> <20170328222944.3c53914c@JRWUBU2> Message-ID: <1002744966.74666.1490823258995.JavaMail.open-xchange@app09.ox.hosteurope.de> Richard Wordingham : > "Doug Ewell" wrote: > > > "Not recommended," "not standard," "not interoperable," or any other > > term ESC settles on for the 5000+ valid flag sequences that are not > > England, Scotland, and Wales is just a short, easy step away from > > deprecation for these as well. *Sigh* Instead of 26 RIS characters and all the TAGs, Unicode should have added a single new character: U+2065 Flag Code Joiner. > It's certainly on the cards that the sequence for the Scottish flag will > be deprecated in favour of an RI sequence. Which would very likely be U+1F1E6-1F1E7 ???? 'AB' for Alba, because all other intuitive alpha-2 code elements are either reserved or already assigned. 
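An RI sequence like the U+1F1E6-1F1E7 pairing above is purely mechanical: each ASCII letter A-Z maps to the regional indicator symbol at U+1F1E6 plus the letter's offset into the alphabet. A minimal Python sketch of that mapping (illustrative only; the `ri_sequence` helper is an invented name, not part of any specification):

```python
# Build a regional-indicator (RI) flag sequence from a two-letter code.
# Each ASCII letter A-Z maps to U+1F1E6 + (letter - 'A'), so "AB" becomes
# U+1F1E6 U+1F1E7, the hypothetical 'AB' pairing mentioned above.
RI_BASE = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def ri_sequence(code: str) -> str:
    if len(code) != 2 or not code.isalpha() or not code.isascii():
        raise ValueError("expected a two-letter ASCII code")
    return "".join(chr(RI_BASE + ord(c) - ord("A")) for c in code.upper())

print(" ".join(f"U+{ord(ch):04X}" for ch in ri_sequence("AB")))  # U+1F1E6 U+1F1E7
```

Whether a given pair displays as a flag or as two letter symbols is then entirely up to the receiving font, which is exactly the fallback behaviour discussed in this thread.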
From beckiergb at gmail.com Wed Mar 29 16:52:15 2017 From: beckiergb at gmail.com (Rebecca Bettencourt) Date: Wed, 29 Mar 2017 14:52:15 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On Wed, Mar 29, 2017 at 2:17 PM, Christoph Päper < christoph.paeper at crissov.de> wrote: > If I made an open-source emoji font that contained flags for all of the > 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be > considered a > vendor? Do I need to have to pay 18000(?) dollars a year for full > membership > first? (That's peanuts for multi-billion dollar companies, but > unaffordable for > most individuals and many FOSS projects.) > ... Those are desired, for sure, but so are emoji flags for Kurdistan, > Confederated > States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, > Communism, > Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian Aboriginals, > and > many more. Of these, only the Kurdish and the Sami flag *may* be covered by > Unicode Emoji 5.0+ (possibly with multiple codes) until yet another > (Tag-based) > scheme is adopted. > Heh, I actually started an open-source emoji font that kinda does this: https://github.com/kreativekorp/vexillo It encodes not only some subdivision flags using sequences like [usca], [ustx], and [caqc], but a whole lot of nowhere-near-standardized-for-encoding flags under the XX code, such as [xxcascadia], [xxconlangesperanto], [xxpridebisexual], [xxpridetrans], etc. And hey, it works already in OS X 10.8+ and Firefox, even if it makes text selection a little dodgy. :) -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmusf at ix.netcom.com Wed Mar 29 17:31:41 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 29 Mar 2017 15:31:41 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> References: <20170329140720.665a7a7059d7ee80bb4d670165c8327d.f4b2c3d4e4.wbe@email03.godaddy.com> Message-ID: <90cdaf94-281c-0bd5-7c0e-e56f0041ae9e@ix.netcom.com> On 3/29/2017 2:07 PM, Doug Ewell wrote: > Ken Whistler wrote: > >> *But*, the ones who do have flags on their phones don't want to be in >> the situation where the iPhone has a flag of Scotland which then shows >> up as a flag tofu on an Android phone, but an Android phone has a flag >> of Texas which then shows up as a flag tofu on on iPhone, etc., etc. >> That way leads to customer complaint madness, with 1000's (hundreds of >> 1000's?) of complaints: "My phone is screwed up, fix it!" > Doesn't this same problem exist for other emoji, or non-emoji, that are > supported on some phones but not others? What's the customer service > resolution in those cases? > Sure, let them go form a consortium and agree on which ones are in the recommended set. But why form a new consortium if you have one already where they are all members? Agreeing on recommended level of support in the sense of "best practice" is something that is done for many of the specifications, for example some of the algorithms. A useful guide in evaluating whether it's appropriate to "recommend" something is to treat it as if it was mandatory, but with a costly override option: if you decide to go against the recommendation you'd better have a really solid reason. Recommending to vendors to support a minimal set is one thing. Recommending to users to only use sequences from that set / or vendors to not extend coverage beyond the minimum is something else. 
Both use the word "recommendation" but the flavor is rather different (which becomes more obvious when you re-phrase as I suggested). That seems to be the source of the disconnect. A./ From verdy_p at wanadoo.fr Wed Mar 29 17:40:19 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 00:40:19 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: Note: in your collection you say that the EU flag is the flag of the European Union; actually it is a flag for Europe as a whole, created and proposed long ago by the CoE, the Council of Europe (not the European Union, which did not yet exist, and not even the EEC or the CECA, which were also created after the Council of Europe). The European Union displays the CoE flag **under permission** permanently granted by the Council of Europe. The non-EU members that are CoE members, or that were invited by the CoE, have a legal right to display it (so it includes as well Turkey, a founding member of the CoE, also Russia, Belarus even if its seat in the CoE is suspended, Ukraine, Kazakhstan, Morocco, Vatican, Andorra, Iceland, Switzerland, Liechtenstein, Norway...). When the CECA was created (and later the European Communities) it initially had no flag, but it rapidly started to reuse the European flag proposed by the CoE, because every member of the European Community was also a member of the CoE. In ISO 3166-1 however the "EU" code was granted to the European Union (for legal reasons related to some WIPO standards with specific rules enforced throughout the EU, plus optionally some volunteer countries in the EEA). It usually displays the flag adopted by the CoE. There's no ISO 3166-1 code for Europe as a whole (does it exist legally if we can't clearly define its borders?) 
or the CoE itself (which has a logo derived now from the European flag, but distinctive and reserved as a logo, and not encodable). Note that there's also a flag for a wider region with 56 countries covered by the EBU (European Broadcasting Union), including for example Israel, Palestine, Armenia, Georgia, Syria, Lebanon, Morocco, Algeria, Tunisia, Libya and Egypt (not to be confused with the logos used by the Eurovision song contest: these logos are not flags). However the EBU still does not include Kazakhstan. The EBU however is a private organization, and its "flag", looking like a blue "(O)" on white, is in fact a logo and not encodable. Another logo was used in the past that looked similar to the European flag with stars on a circle (this old logo, initially monochromatic using white stars on grey, slightly modernized, is still visible along with some video test patterns at start of some Eurovision broadcasts). 2017-03-29 23:52 GMT+02:00 Rebecca Bettencourt : > On Wed, Mar 29, 2017 at 2:17 PM, Christoph Päper < > christoph.paeper at crissov.de> wrote: > >> If I made an open-source emoji font that contained flags for all of the >> 5000ish >> ISO 3166-2 codes that actually map to one, would I automatically be >> considered a >> vendor? Do I need to have to pay 18000(?) dollars a year for full >> membership >> first? (That's peanuts for multi-billion dollar companies, but >> unaffordable for >> most individuals and many FOSS projects.) >> > > ... > > Those are desired, for sure, but so are emoji flags for Kurdistan, >> Confederated >> States of America, Romani, Oromo, South Vietnam, Esperanto, Anarchy, >> Communism, >> Bisexuality, Transgenderism, Sami, Pan-Africanism, Australian >> Aboriginals, and >> many more. Of these, only the Kurdish and the Sami flag *may* be covered >> by >> Unicode Emoji 5.0+ (possibly with multiple codes) until yet another >> (Tag-based) >> scheme is adopted. 
>> > > Heh, I actually started an open-source emoji font that kinda does this: > > https://github.com/kreativekorp/vexillo > > It encodes not only some subdivision flags using sequences like [usca], > [ustx], and [caqc], but a whole lot of nowhere-near-standardized-for-encoding > flags under the XX code, such as [xxcascadia], [xxconlangesperanto], > [xxpridebisexual], [xxpridetrans], etc. > > And hey, it works already in OS X 10.8+ and Firefox, even if it makes text > selection a little dodgy. :) > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From irgendeinbenutzername at gmail.com Wed Mar 29 17:52:03 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Thu, 30 Mar 2017 00:52:03 +0200 Subject: Unicode Emoji 5.0 characters now final Message-ID: Ken Whistler wrote: > *But*, the ones who do have flags on their > phones don't want to be in the situation where the iPhone has a flag of > Scotland which then shows up as a flag tofu on an Android phone, but an > Android phone has a flag of Texas which then shows up as a flag tofu on > an iPhone, etc., etc. That way leads to customer complaint madness, with > 1000's (hundreds of 1000's?) of complaints: "My phone is screwed up, fix > it!" And this is where the problem becomes even worse. Because there are no "flag tofus" for 3166-2 regions. Unlike Regional Indicator Sequences, the fallback for all unsupported tag sequences looks exactly the same and carries absolutely no meaning unless put through some Unicode analyzer machine: ?? WAVING BLACK FLAG, a well-supported emoji that means nothing in the context it is used in, followed by a single, featureless tofu. At least a text containing ten different unsupported RI sequences will show you ten distinct images, even if you are completely unaware that those peculiar pairs of colourful letters you've just been sent are used to build flag emoji. 
Heck, if your device has a default font that includes CANCEL TAG (like my phone does, but my laptop doesn't) and therefore doesn't render it, then you won't even be able to see the difference between a regular, generic black flag and an emoji that was meant to represent some region. This could potentially lead to great misunderstandings since a plain black flag is often associated with anarchism and piracy, but rather rarely with England, Scotland or Wales. The waving white flag that was used as the base in earlier drafts at the very least had the benefit of looking like a "blank slate" of sorts. This is one of the few cases where the terrible web browser of the Nintendo 3DS can actually be considered superior to any modern device because for some bizarre reason it applies modulo 65,536 to all code points on display, resulting in tag characters rendering as visible ASCII. It would have been much more sensible to construct subdivision flags out of new, visible characters just like RI sequences. That way we could have had a fallback rendering that is actually in any way useful. We could also have preserved the original properties of the tag characters. Last time I checked their correct usage for language tagging is still rigorously explained in the standard despite deprecation. But no, we absolutely had to put out this update as soon as possible because peoplez want da emojiz. We had to use existing characters for region sequences because if we had actually given ourselves enough time to properly think this whole endeavour through we couldn't have made the precious Scottish flag available until Unicode 11. (Although that hardly seems to matter anyways seeing how we apparently now release technical reports and data files that rely on certain characters before those characters even exist in the standard.) And we had to use the invisible tag characters from Plane 14 because potatoes, I guess. 
You know, back when Emoji Modifiers were released I was initially sceptical of them being spacing, visibly rendering pictographs rather than formatting characters. Nowadays I understand that decision. Too bad we were seemingly unable to make the same decision for flags. I eagerly await the return of hair colour tags in Emoji 6. -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Wed Mar 29 18:29:22 2017 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 30 Mar 2017 00:29:22 +0100 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: Message-ID: <20170330002922.7843ab3c@JRWUBU2> On Thu, 30 Mar 2017 00:52:03 +0200 Charlotte Buff wrote: > And this is where the problem becomes even worse. Because there are no > “flag tofus” for 3166-2 regions. Unlike Regional Indicator Sequences, > the fallback for all unsupported tag sequences looks exactly the same > and carries absolutely no meaning unless put through some Unicode > analyzer machine: 🏴 WAVING BLACK FLAG, a well-supported emoji that > means nothing in the context it is used in, followed by a single, > featureless tofu. At least a text containing ten different > unsupported RI sequences will show you ten distinct images, even if > you are completely unaware that those peculiar pairs of colourful > letters you’ve just been sent are used to build flag emoji. I don't see why the tag characters can't be represented by some form of corresponding ASCII characters as a fallback rendering. The bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG declares a sequence of 3 to 6 intervening ordinary tags to be a flag emoji, and in an OpenType font a GSUB contextual substitution can easily convert unrecognised sequences to modified ASCII characters. It does not have to explicitly handle each possible combination. Richard.
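[Editorial note: the fallback Richard describes works because tag characters are just ASCII shifted into Plane 14 — U+E0020..U+E007E mirror 0x20..0x7E at an offset of 0xE0000, with U+E007F as CANCEL TAG. A minimal Python sketch of that mapping; the function names are illustrative, not from any standard API:]

```python
# An emoji tag sequence (ETS) for a flag is:
#   U+1F3F4 WAVING BLACK FLAG, 3..6 tag characters, U+E007F CANCEL TAG.
# Tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E (offset 0xE0000).

BLACK_FLAG = '\U0001F3F4'
CANCEL_TAG = '\U000E007F'
TAG_OFFSET = 0xE0000

def encode_flag(code: str) -> str:
    """Build the ETS for a subdivision code such as 'gbsct' (Scotland)."""
    tags = ''.join(chr(ord(c) + TAG_OFFSET) for c in code.lower())
    return BLACK_FLAG + tags + CANCEL_TAG

def decode_flag(seq: str):
    """Return the ASCII tag string of a flag ETS, or None if malformed.

    This is the same trivial mapping a GSUB fallback would expose
    visually: drop the offset and read the body as ASCII.
    """
    if not (seq.startswith(BLACK_FLAG) and seq.endswith(CANCEL_TAG)):
        return None
    body = seq[len(BLACK_FLAG):-1]
    if not (3 <= len(body) <= 6):
        return None
    if any(not 0xE0020 <= ord(c) <= 0xE007E for c in body):
        return None
    return ''.join(chr(ord(c) - TAG_OFFSET) for c in body)

print(decode_flag(encode_flag('gbsct')))  # gbsct
```

[The Europe example given later in the thread, U+1F3F4-E0031-E0035-E0030-E007F, is exactly `encode_flag('150')` under this mapping.]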
From irgendeinbenutzername at gmail.com Wed Mar 29 18:45:43 2017 From: irgendeinbenutzername at gmail.com (Charlotte Buff) Date: Thu, 30 Mar 2017 01:45:43 +0200 Subject: Unicode Emoji 5.0 characters now final Message-ID: Richard Wordingham wrote: > I don't see why the tag characters can't be represented by some form of > corresponding ASCII characters as a fallback rendering. The > bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG > declares a sequence of 3 to 6 intervening ordinary tags to be a flag > emoji, and in an OpenType font a GSUB contextual substitution can > easily convert unrecognised sequences to modified ASCII characters. It > does not have to explicitly handle each possible combination. I suppose this is an adequate solution, but it’s also needlessly convoluted in comparison to RIS where good fallback behaviour just happens automatically with only the most bare-bones font feature imaginable, i.e. simply displaying single characters one after another as they would appear anyways. It is also questionable whether most vendors are going to employ such a system in the first place. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Mar 29 19:16:26 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 02:16:26 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170330002922.7843ab3c@JRWUBU2> References: <20170330002922.7843ab3c@JRWUBU2> Message-ID: 2017-03-30 1:29 GMT+02:00 Richard Wordingham < richard.wordingham at ntlworld.com>: > On Thu, 30 Mar 2017 00:52:03 +0200 > Charlotte Buff wrote: > > > And this is where the problem becomes even worse. Because there are no > > “flag tofus” for 3166-2 regions. Unlike Regional Indicator Sequences, > > the fallback for all unsupported tag sequences looks exactly the same > > and carries absolutely no meaning unless put through some Unicode > > analyzer machine: 🏴
WAVING BLACK FLAG, a well-supported emoji that > > means nothing in the context it is used in, followed by a single, > > featureless tofu. At least a text containing ten different > > unsupported RI sequences will show you ten distinct images, even if > > you are completely unaware that those peculiar pairs of colourful > > letters you’ve just been sent are used to build flag emoji. > > I don't see why the tag characters can't be represented by some form of > corresponding ASCII characters as a fallback rendering. The > bracketing pair U+1F3F4 WAVING BLACK FLAG .. U+E007F CANCEL TAG > declares a sequence of 3 to 6 intervening ordinary tags to be a flag > emoji, and in an OpenType font a GSUB contextual substitution can > easily convert unrecognised sequences to modified ASCII characters. It > does not have to explicitly handle each possible combination. > I also think so: the unique black flag (even if it is marked on the corner with a ? on a diamond) is the worst solution. You can easily set up a left-side part showing the hoist and the start of the flag, a right part showing the floating end of the flag, and display the letters with top and bottom borders connecting together and with the left-side and right-side part. Maybe you can also arrange the letters in rows: the first top row for the 2-letter ISO 3166-1 code, the bottom row for the appended 1-to-4-character code (letters and digits) of the subdivision. You may also improve the display by displaying the last letters on top of the national flag.
If subdivision codes are known you may alternatively render a short name of the subdivision above or below the national flag (but here there's a problem of language choice: even if official names are accepted, some subdivisions have several official names in distinct languages, possibly in distinct scripts; and when there's only one, probably many users will have problems reading these labels in a foreign script, such as Arabic or Chinese). My opinion is that renderers should better support the interactive display of hints in the user language of its UI, independently of the language of the encoded document itself, if the rendering engine is capable of such interactivity, provided that there's no other competing hint such as title attributes which may be used in HTML to explain the flag even when it is actually rendered. The same will apply for non-graphical rendering such as aural rendering, instead of spelling the code letters (as a last fallback). Maybe it will be larger than an actual flag, but I see no problem at all if all flags do not have the same ratio (in fact ratios are already not the same for the official flags of recognized countries). There is absolutely no obligation -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Thu Mar 30 02:45:34 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 30 Mar 2017 09:45:34 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: > If I made an open-source emoji font that contained flags for all of the > 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be > considered a > vendor? > Do I need to have to pay 18000(?)
dollars a year for full membership > first? (That's peanuts for multi-billion dollar companies, but > unaffordable for > most individuals and many FOSS projects.) > The answer to both of your questions is no. Please see http://unicode.org/emoji/selection.html#timeline for details. What the UTC is looking for is commitments from major vendors. It is not sufficient to join Unicode: we have members who are not major vendors of emoji. And there are some major vendors that are not members. Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Thu Mar 30 03:42:55 2017 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Thu, 30 Mar 2017 17:42:55 +0900 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On 2017/03/30 06:17, Christoph Päper wrote: > Mark Davis ☕️ : >> That isn't really the case. In particular, vendors can propose adding >> additional subdivisions to the recommended list. > > Awesome, "vendors" can do that. (._.m) > > If I made an open-source emoji font that contained flags for all of the 5000ish > ISO 3166-2 codes that actually map to one, would I automatically be considered a > vendor? I don't think so. But if you want to get more flags listed, then creating actual flags, with suitable licenses, and telling others to use them and tell others, and so on, may easily reach vendors sooner or later. > - > - > - > - <- > > > The last one currently already has support for UK countries, US states and > Canadian provinces. Go figure.
And most if not all of these flags are from Wikimedia. So that shows that open source has some influence, even without money. Regards, Martin. From christoph.paeper at crissov.de Thu Mar 30 04:48:02 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 30 Mar 2017 11:48:02 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Philippe Verdy hat am 30. März 2017 um 00:40 geschrieben: > There's no ISO 3166-1 code for Europe at the whole (does it exist legally if > we can't clearly define its borders?) `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. CLDR could safely adopt that if needed. No alpha-2 and hence no RIS sequence, though. An Emoji Tag Sequence would be straight-forward, though: U+1F3F4-E0031-E0035-E0030-E007F. From christoph.paeper at crissov.de Thu Mar 30 04:59:21 2017 From: christoph.paeper at crissov.de (=?UTF-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 30 Mar 2017 11:59:21 +0200 (CEST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: Message-ID: <2046545915.16798.1490867961823.JavaMail.open-xchange@app08.ox.hosteurope.de> Charlotte Buff > > > Heck, if your device has a default font that includes CANCEL TAG (...) and > therefore doesn’t render it, > then you won’t even be able to see the difference between a regular, generic > black flag and an emoji that was meant to represent some region. > This could potentially lead to great misunderstandings since a plain black > flag is often associated with anarchism and piracy, > but rather rarely with England, Scotland or Wales.
> The waving white flag that was used as the base in earlier drafts at the very > least had the benefit of looking like a ?blank slate? of sorts. White flags are associated with surrender (but also peace). That is at least as bad as a black flag. The checkered flag U+1F3C1 ?? could have been a compromise. It is also readily associated with sports. From mark at macchiato.com Thu Mar 30 06:58:47 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 30 Mar 2017 13:58:47 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Message-ID: > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. CLDR could safely adopt that if needed. No need to "safely adopt". It is already valid: http://www.unicode.org/reports/tr51/proposed.html#flag-emoji-tag-sequences If you follow the links you'll end up at http://unicode.org/repos/cldr/trunk/common/validity/region.xml And find that 150 is already valid. (For the format of that file, see LDML.) ==== Where people have looked at the documentation and their questions are still not answered, that feedback is useful so that the documentation can be improved. But it appears that at least some people haven't bothered to do that, when it could answer a lot of the questions/complaints on this list. Mark On Thu, Mar 30, 2017 at 11:48 AM, Christoph P?per < christoph.paeper at crissov.de> wrote: > Philippe Verdy hat am 30. M?rz 2017 um 00:40 > geschrieben: > > > There's no ISO 3166-1 code for Europe at the whole (does it exist > legally if > > we can't clearly define its borders?) > > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. 
> CLDR > could safely adopt that if needed. > > No alpha-2 and hence no RIS sequence, though. An Emoji Tag Sequence would > be > straight-forward, though: U+1F3F4-E0031-E0035-E0030-E007F. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Thu Mar 30 09:58:09 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 07:58:09 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> Asmus Freytag wrote: > Recommending to vendors to support a minimal set is one thing. > Recommending to users to only use sequences from that set / or vendors > to not extend coverage beyond the minimum is something else. Both use > the word "recommendation" but the flavor is rather different (which > becomes more obvious when you re-phrase as I suggested). > > That seems to be the source of the disconnect. That seems a fair analysis. Another way of putting this is that marking a particular subset of valid sequences as "recommended" is one thing, while listing sequences in a table with a column "Standard sequence?", with some sequences marked "Yes" and others marked "No," is something else. Equivalently, characterizing a group of valid sequences as "Valid, but not recommended" is something else. If the goal is to tell users that three of the sequences are especially likely to be supported, or to tell vendors that they should prioritize support for these three, then "recommended" and "additional," used as a pair, would be more appropriate. If the goal is to tell users "we don't want you to use the other 5100 sequences" and to tell vendors "we don't want you to offer support for them," then the existing wording is fine. 
-- Doug Ewell | Thornton, CO, US | ewellic.org From wjgo_10009 at btinternet.com Thu Mar 30 09:03:11 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 30 Mar 2017 15:03:11 +0100 (BST) Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> > What the UTC is looking for is commitments from major vendors. Well should it be applying such a filter on progress? I opine that assessment should be on merit and that new ideas should be considered on an even-handed basis. Progress should not be on the basis of what major vendors choose to do. Requiring commitments from major vendors could be a barrier to new enterprises developing and a barrier to progress for the benefit of consumers being made. > Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. What does 1B DAUs mean please? William Overington Thursday 30 March 2017 From doug at ewellic.org Thu Mar 30 12:12:04 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 10:12:04 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170330101204.665a7a7059d7ee80bb4d670165c8327d.19670e161a.wbe@email03.godaddy.com> William_J_G Overington wrote: >> Of course, there is some judgment involved as to what constitutes >> "major": at one extreme clearly 1B DAUs qualifies, and at the other >> extreme, 1K doesn't. > > What does 1B DAUs mean please? >From http://acronyms.thefreedictionary.com/DAU I gathered that this might be search-engine industry jargon for "1 billion daily active users" as opposed to 1000 of them. 
-- Doug Ewell | Thornton, CO, US | ewellic.org From charupdate at orange.fr Thu Mar 30 14:06:39 2017 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 30 Mar 2017 21:06:39 +0200 (CEST) Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> Message-ID: <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> On Thu, 30 Mar 2017 15:03:11 +0100 (BST), William_J_G Overington wrote: > > > What the UTC is looking for is commitments from major vendors. > > Well should it be applying such a filter on progress? > > I opine that assessment should be on merit and that new ideas should be > considered on an even-handed basis. Progress should not be on the basis of > what major vendors choose to do. Requiring commitments from major vendors > could be a barrier to new enterprises developing and a barrier to progress > for the benefit of consumers being made. That’s exactly the point: that the marketplace should be tailored for the benefit of consumers, not for the sole benefit of vendors. Instead, the question seems always to be “who is paying for it?” Another example has been recently discussed: the use of superscript letters is “discouraged”, seemingly to prevent a set of consumers from being able to write in an acceptable way a couple of languages in plain text, and to subjugate these customers to the use of a series of rich text software. The problem is not whether to use high-end software or not, but the way how users get their stuff messed up if they don’t. When it was up to encode the first set of superscript Latin letters in Unicode 1.0 – or were they *too* enforced by Bruce Paterson of ISO/IEC 10646? –
all straightforward people surely were going to follow the pattern of:

2071 SUPERSCRIPT LATIN SMALL LETTER I
 * functions as a modifier letter
 # 0069
207F SUPERSCRIPT LATIN SMALL LETTER N
 * functions as a modifier letter
 # 006E
@ Latin subscript modifier letters
1D62 LATIN SUBSCRIPT SMALL LETTER I
 # 0069
1D63 LATIN SUBSCRIPT SMALL LETTER R
 # 0072
1D64 LATIN SUBSCRIPT SMALL LETTER U
 # 0075
1D65 LATIN SUBSCRIPT SMALL LETTER V
 # 0076

and name them accordingly. But given the way of finally calling them:

@@ 02B0 Spacing Modifier Letters 02FF
@ Latin superscript modifier letters
x (superscript latin small letter i - 2071)
x (superscript latin small letter n - 207F)
02B0 MODIFIER LETTER SMALL H
 * aspiration
 # 0068
02B1 MODIFIER LETTER SMALL H WITH HOOK

and so on, somebody must have arisen telling “Wait! if we label them as what they are, folks will use these instead of our software, so let’s disguise them a bit!” As a result, we’ve ended up with every script on earth being writeable in plain text except Latin. That seems to be an abuse of dominant position, to make an unknown amount of more bargain at the expense of a relatively narrow subset of disfavored end-users, as if the usefulness of vendors’ software would essentially depend on one single feature: superscript formatting. Regards, Marcel From verdy_p at wanadoo.fr Thu Mar 30 14:13:29 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 30 Mar 2017 21:13:29 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <169175535.16733.1490867282854.JavaMail.open-xchange@app08.ox.hosteurope.de> Message-ID: 2017-03-30 11:48 GMT+02:00 Christoph Päper : > Philippe Verdy hat am 30.
März 2017 um 00:40 > geschrieben: > > There's no ISO 3166-1 code for Europe at the whole (does it exist > legally if > > we can't clearly define its borders?) > > `150` in UN M.49 which ISO 3166-1 was derived from and is compatible with. > CLDR > could safely adopt that if needed. > I have not seen a clear statement that UN M.49 code 150 for Europe (as a whole) was related to the EU assignment in ISO 3166-1 which refers to the European Union (but in fact still refers legally to the European Community, the only part legally recognized; even if the European Union attempted to unify the communities, this unification was partial, and three separate "pillars" were kept). I've clearly read that EU was assigned in ISO3166 only because of its use in WIPO standards. There are some other assignments made for keeping compatibility with ITU standards, or with the Postal Union. Note the ITU also defines a "European broadcasting region" that covers north Africa and some countries of the Middle East: it is the base of existence of the EBU (Eurovision), the second base being also the Council of Europe, one or the other being a requirement for full membership. The ITU definition is appropriate because it matches with coverage areas by satellites.
URL: From tuvalkin at gmail.com Thu Mar 30 15:17:24 2017 From: tuvalkin at gmail.com (=?UTF-8?Q?Ant=c3=b3nio_Martins-Tuv=c3=a1lkin?=) Date: Thu, 30 Mar 2017 21:17:24 +0100 Subject: Encoding of old compatibility characters In-Reply-To: <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: On 2017.03.29 05:41, Leo Broukhis asked: > Are you still using Windows 7 or RedHat 5, or something equally old? > Newer systems have ? out of the box. I?m using Windows XP and "?" renders perfectly as "??". Maybe fonts can be installed without ?upgrading? the whole operating system? Who knew?! -- ____. Ant?nio MARTINS-Tuv?lkin | ()| |####| PT-1500-239 Lisboa N?o me invejo de quem tem | PT-2695-010 Bobadela LRS carros, parelhas e montes | +351 934 821 700, +351 212 463 477 s? me invejo de quem bebe | facebook.com/profile.php?id=744658416 a ?gua em todas as fontes | --------------------------------------------------------------------- De sable uma fonte e bordadura escaqueada de jalde e goles por timbre bandeira por mote o 1? verso acima e por grito de guerra "Mi rajtas!" --------------------------------------------------------------------- From doug at ewellic.org Thu Mar 30 16:39:18 2017 From: doug at ewellic.org (Doug Ewell) Date: Thu, 30 Mar 2017 14:39:18 -0700 Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) Message-ID: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> The UN "M49 Standard" (that's how they're styling it now; I guess we should stop writing "M.49") assigns a code element for each "country or area" and groups these into "geographical regions." 
To find the "countries or areas" included within code element 150 for "Europe," simply visit https://unstats.un.org/unsd/methodology/m49/ , select Geographic Regions from the menu at the left, and expand the entries for Europe and its four subregions. The lists are available in six languages, including French. To find the countries that make up the European Union at any given moment, visit http://europa.eu/european-union/about-eu/countries_fr (or similar for other EU languages). As is well known, this list has changed in the past and will change in the future. The point is that UNSD's definition of Europe and the roster of the European Union are different lists, and no attempt is made by either organization to make these lists identical or to explain or justify differences. -- Doug Ewell | Thornton, CO, US | ewellic.org From verdy_p at wanadoo.fr Thu Mar 30 17:02:13 2017 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 31 Mar 2017 00:02:13 +0200 Subject: Encoding of old compatibility characters In-Reply-To: References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: Probably you've installed the Noto collection on your Windows XP, or installed some software that added fonts to the system (possibly with updates to the Uniscribe library, such as an old version of Office). Anyway I would no longer trust XP for doing correct rendering for many scripts, even with Uniscribe, which is not needed for this simple character mapped in the BMP. Now minimal support in XP is essentially by third party software providers. Most have resigned, except Mozilla and some security suites that attempt to fill the gaps abandoned now by Microsoft (but still maintain it...
because there are still various banks using it, for example in their ATMs: you know it when you frequently see the ATM rebooting or sometimes unusable as it has crashed with a "BSOD" displayed). 2017-03-30 22:17 GMT+02:00 António Martins-Tuválkin : > On 2017.03.29 05:41, Leo Broukhis asked: > > Are you still using Windows 7 or RedHat 5, or something equally old? >> Newer systems have ? out of the box. >> > > I’m using Windows XP and "?" renders perfectly as "??". Maybe fonts can > be installed without “upgrading” the whole operating system? Who knew?! > > -- ____. > António MARTINS-Tuválkin | ()| > |####| > PT-1500-239 Lisboa Não me invejo de quem tem | > PT-2695-010 Bobadela LRS carros, parelhas e montes | > +351 934 821 700, +351 212 463 477 só me invejo de quem bebe | > facebook.com/profile.php?id=744658416 a água em todas as fontes | > --------------------------------------------------------------------- > De sable uma fonte e bordadura escaqueada de jalde e goles por timbre > bandeira por mote o 1º verso acima e por grito de guerra "Mi rajtas!" > --------------------------------------------------------------------- > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Thu Mar 30 17:16:12 2017 From: c933103 at gmail.com (gfb hjjhjh) Date: Fri, 31 Mar 2017 06:16:12 +0800 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> Message-ID: On the topic, I am surprised to see the only large Chinese company in the member list is Huawei, with none of the large Chinese internet companies, including Baidu, Alibaba, Tencent, Sina, or Netease, participating in Unicode. In the associate member list there is a company named zhongyi but that link is already 404ed... > > > On 30 March 2017 at 15:51, "Mark Davis ☕️" wrote:
>> >> >>> If I made an open-source emoji font that contained flags for all of the 5000ish >>> ISO 3166-2 codes that actually map to one, would I automatically be considered a >>> vendor? >> >> >>> Do I need to have to pay 18000(?) dollars a year for full membership >>> first? (That's peanuts for multi-billion dollar companies, but unaffordable for >>> most individuals and many FOSS projects.) >> >> >> The answer to both of your questions is no. >> >> Please see http://unicode.org/emoji/selection.html#timeline for details. What the UTC is looking for is commitments from major vendors. It is not sufficient to join Unicode: we have members who are not major vendors of emoji. And there are some major vendors that are not members. >> >> Of course, there is some judgment involved as to what constitutes "major": at one extreme clearly 1B DAUs qualifies, and at the other extreme, 1K doesn't. >> >> Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Thu Mar 30 19:49:01 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 00:49:01 +0000 Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> References: <20170327121720.665a7a7059d7ee80bb4d670165c8327d.a35f901708.wbe@email03.godaddy.com> <671216829.74493.1490822278345.JavaMail.open-xchange@app09.ox.hosteurope.de> <26795525.35284.1490882591522.JavaMail.defaultUser@defaultHost> <1001013024.10243.1490900799142.JavaMail.www@wwinf2209> Message-ID: The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. 
If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think we should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider Sent: Thursday, March 30, 2017 12:07 PM To: unicode at unicode.org Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) On Thu, 30 Mar 2017 15:03:11 +0100 (BST), William_J_G Overington wrote: > > > What the UTC is looking for is commitments from major vendors. > > Well should it be applying such a filter on progress? > > I opine that assessment should be on merit and that new ideas should > be considered on an even-handed basis. Progress should not be on the > basis of what major vendors choose to do. Requiring commitments from > major vendors could be a barrier to new enterprises developing and a > barrier to progress for the benefit of consumers being made. That’s exactly the point: that the marketplace should be tailored for the benefit of consumers, not for the sole benefit of vendors. Instead, the question seems always to be “who is paying for it?” Another example has been recently discussed: the use of superscript letters is “discouraged”, seemingly to prevent a set of consumers from being able to write in an acceptable way a couple of languages in plain text, and to subjugate these customers to the use of a series of rich text software. The problem is not whether to use high-end software or not, but the way how users get their stuff messed up if they don’t. When it was up to encode the first set of superscript Latin letters in Unicode 1.0 –
or were they *too* enforced by Bruce Paterson of ISO/IEC 10646? — all straightforward people surely were going to follow the pattern of:

2071  SUPERSCRIPT LATIN SMALL LETTER I
      * functions as a modifier letter
      # 0069
207F  SUPERSCRIPT LATIN SMALL LETTER N
      * functions as a modifier letter
      # 006E

@     Latin subscript modifier letters

1D62  LATIN SUBSCRIPT SMALL LETTER I
      # 0069
1D63  LATIN SUBSCRIPT SMALL LETTER R
      # 0072
1D64  LATIN SUBSCRIPT SMALL LETTER U
      # 0075
1D65  LATIN SUBSCRIPT SMALL LETTER V
      # 0076

and name them accordingly. But given the way of finally calling them:

@@    02B0  Spacing Modifier Letters  02FF

@     Latin superscript modifier letters

x (superscript latin small letter i - 2071)
x (superscript latin small letter n - 207F)

02B0  MODIFIER LETTER SMALL H
      * aspiration
      # 0068
02B1  MODIFIER LETTER SMALL H WITH HOOK

and so on, somebody must have arisen, saying “Wait! If we label them as what they are, folks will use these instead of our software, so let’s disguise them a bit!” As a result, we’ve ended up with every script on earth being writeable in plain text except Latin. That seems to be an abuse of dominant position, to make an unknown amount of extra profit at the expense of a relatively narrow subset of disfavored end-users, as if the usefulness of vendors’ software would essentially depend on one single feature: superscript formatting. Regards, Marcel From boldewyn at gmail.com Fri Mar 31 02:10:08 2017 From: boldewyn at gmail.com (Manuel Strehl) Date: Fri, 31 Mar 2017 09:10:08 +0200 Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> References: <20170330143918.665a7a7059d7ee80bb4d670165c8327d.6938ac6022.wbe@email03.godaddy.com> Message-ID: Maybe I'm missing context, but what is the specific problem of those lists differing? The EU and Europe _are_ two different things.
The United States of America similarly do not include the whole of America, despite the name. And Norway and Switzerland and some others (incl. soon England) might not be too happy with either institution to make a forced move to unify those lists. —Manuel 2017-03-30 23:39 GMT+02:00 Doug Ewell : > The UN "M49 Standard" (that's how they're styling it now; I guess we > should stop writing "M.49") assigns a code element for each "country or > area" and groups these into "geographical regions." > > To find the "countries or areas" included within code element 150 for > "Europe," simply visit https://unstats.un.org/unsd/methodology/m49/ , > select Geographic Regions from the menu at the left, and expand the > entries for Europe and its four subregions. The lists are available in > six languages, including French. > > To find the countries that make up the European Union at any given > moment, visit http://europa.eu/european-union/about-eu/countries_fr (or > similar for other EU languages). As is well known, this list has changed > in the past and will change in the future. > > The point is that UNSD's definition of Europe and the roster of the > European Union are different lists, and no attempt is made by either > organization to make these lists identical or to explain or justify > differences. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From eliz at gnu.org Fri Mar 31 02:57:11 2017 From: eliz at gnu.org (Eli Zaretskii) Date: Fri, 31 Mar 2017 10:57:11 +0300 Subject: Encoding of old compatibility characters In-Reply-To: (message from Philippe Verdy on Fri, 31 Mar 2017 00:02:13 +0200) References: <92ba6970-86e1-5d80-e3c9-239283a384b0@gmail.com> <41b2170a-6efb-518d-8c02-3881fbb09bae@kli.org> <2ba990ce-9d57-4e8b-b4dd-e9f1a821cd3b@gmail.com> <4q7f39oed2.fsf@chem.ox.ac.uk> <2d2b2a87-f4d8-7f28-59de-f6cf7437c9c5@ix.netcom.com> <7e7af7d6-dfc4-159a-832f-e60f24136b0f@gmail.com> Message-ID: <83fuht6fqg.fsf@gnu.org> > From: Philippe Verdy > Date: Fri, 31 Mar 2017 00:02:13 +0200 > Cc: unicode Unicode Discussion > > Probably you've installed the Noto collection on your Windows XP, or installed some software that added fonts > to the system (possibly with updates to the Uniscribe library, such as an old version of Office). Arial Unicode MS supports that character, FWIW. From philip_chastney at yahoo.com Fri Mar 31 03:24:01 2017 From: philip_chastney at yahoo.com (philip chastney) Date: Fri, 31 Mar 2017 08:24:01 +0000 (UTC) Subject: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) References: <1372498955.7962993.1490948641169.ref@mail.yahoo.com> Message-ID: <1372498955.7962993.1490948641169@mail.yahoo.com> ahem -- as I expect you're well aware, it's the United Kingdom that's opting to quit the EU, and England is only a part of the United Kingdom ... and the United Kingdom, in turn, only covers part of the British Isles /phil -------------------------------------------- On Fri, 31/3/17, Manuel Strehl wrote: Subject: Re: [OT] Europe vs. European Union (was: Re: Unicode Emoji 5.0 characters now final) To: "Unicode Mailing List" Date: Friday, 31 March, 2017, 7:10 AM Maybe I'm missing context, but what is the specific problem of those lists differing? The EU and Europe _are_ two different things.
The United States of America similarly do not include the whole of America, despite the name. And Norway and Switzerland and some others (incl. soon England) might not be too happy with either institution to make a forced move to unify those lists. —Manuel 2017-03-30 23:39 GMT+02:00 Doug Ewell : The UN "M49 Standard" (that's how they're styling it now; I guess we should stop writing "M.49") assigns a code element for each "country or area" and groups these into "geographical regions." To find the "countries or areas" included within code element 150 for "Europe," simply visit https://unstats.un.org/unsd/methodology/m49/ , select Geographic Regions from the menu at the left, and expand the entries for Europe and its four subregions. The lists are available in six languages, including French. To find the countries that make up the European Union at any given moment, visit http://europa.eu/european-union/about-eu/countries_fr (or similar for other EU languages). As is well known, this list has changed in the past and will change in the future. The point is that UNSD's definition of Europe and the roster of the European Union are different lists, and no attempt is made by either organization to make these lists identical or to explain or justify differences. -- Doug Ewell | Thornton, CO, US | ewellic.org From mark at macchiato.com Fri Mar 31 05:03:14 2017 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 31 Mar 2017 12:03:14 +0200 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> References: <20170330075809.665a7a7059d7ee80bb4d670165c8327d.549ef3cc50.wbe@email03.godaddy.com> Message-ID: Ken's observation "approximately backwards" is exactly right, and that's the same reason why Markus suggested something along the lines of "interoperable".
I don't think we've come up with a pithy category name yet, but I tried different wording on the slides on http://unicode.org/emoji/. See what you think, Doug. Mark Mark On Thu, Mar 30, 2017 at 4:58 PM, Doug Ewell wrote: > Asmus Freytag wrote: > > > Recommending to vendors to support a minimal set is one thing. > > Recommending to users to only use sequences from that set / or vendors > > to not extend coverage beyond the minimum is something else. Both use > > the word "recommendation" but the flavor is rather different (which > > becomes more obvious when you re-phrase as I suggested). > > > > That seems to be the source of the disconnect. > > That seems a fair analysis. > > Another way of putting this is that marking a particular subset of valid > sequences as "recommended" is one thing, while listing sequences in a > table with a column "Standard sequence?", with some sequences marked > "Yes" and others marked "No," is something else. > > Equivalently, characterizing a group of valid sequences as "Valid, but > not recommended" is something else. > > If the goal is to tell users that three of the sequences are especially > likely to be supported, or to tell vendors that they should prioritize > support for these three, then "recommended" and "additional," used as a > pair, would be more appropriate. > > If the goal is to tell users "we don't want you to use the other 5100 > sequences" and to tell vendors "we don't want you to offer support for > them," then the existing wording is fine. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Mar 31 10:47:55 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 08:47:55 -0700 Subject: [OT] Europe vs. 
European Union (was: Re: Unicode Emoji 5.0 characters now final) Message-ID: <20170331084755.665a7a7059d7ee80bb4d670165c8327d.2259ebbb4d.wbe@email03.godaddy.com> Manuel Strehl wrote: > Maybe I'm missing context, but what is the specific problem of those > lists differing? > > The EU and Europe _are_ two different things. The United States of > America similarly do not include the whole of America, despite the > name. A previous offshoot of the flag thread had veered into discussion of the UN code element for Europe, and the ISO exceptionally reserved code element for the EU, and the lists of countries in each, and something about WIPO and ITU and ccTLDs. I was pointing out what you said, that the lists differ by nature and comparing them is a fruitless exercise. -- Doug Ewell | Thornton, CO, US | ewellic.org From petercon at microsoft.com Fri Mar 31 10:59:26 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 15:59:26 +0000 Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> References: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk>, <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> Message-ID: William, you completely miss the point: As long as Unicode is the way to provide emoji to consumers, their needs and desires will not be best or fully met. Unicode as an AND gate is too many AND gates. 
Peter Sent from my Windows 10 phone From: William_J_G Overington Sent: Friday, March 31, 2017 7:50 AM To: Peter Constable; unicode at unicode.org Subject: Re: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) Peter Constable wrote: > The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. The consumers can only choose from what is available to consumers. So what the Unicode Technical Committee recommends or "not-recommends" may well have a very significant effect upon the choices available to the consumer. > If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. Sally may not know that the Unicode Technical Committee exists. Sally may have bought her computer or mobile telephone and just uses it, choosing from the emoji available in a menu system, perhaps never realizing all of the detailed standards work and implementation work that took place before the device was manufactured. It is not that Sally is having a particular emoji recommended to her as such, yet if the Unicode Technical Committee "not-recommends" implementation of some emoji that are in the standards document, then Sally may never get the opportunity to choose to use those emoji. > So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Well, all of the various standards needed to produce useful products are important. It is not a matter of one being considered before the other. 
For a particular emoji to become available in a device that is available to a consumer there are various stages. They are like an AND gate where all inputs must be true in order for the result to be true. The Unicode Technical Committee has enormous power and influence to affect the future of information technology. It works both ways. Where an encoding is made there can be progress, yet where an idea is rejected then there is no way forward for an interoperable plain text encoding to become achieved. I submitted a document in 2015. It was determined to be out of scope and was not included in the Document Register and the Unicode Technical Committee did not consider it. I submitted a later version and received no reply about it at all. So I cannot make progress over an interoperable plain text encoding becoming implemented at the present time. Quite a number of UTC meetings have taken place since. Yet the scope of Unicode is a people-made rule, it could change if people with influence want it to change. The UTC could consider my document and hold a Public Review if it chose to do so. So, the Unicode Technical Committee has enormous power and influence to affect the future of information technology. When a "not-recommendation" of what to support takes place the decision to do that "not-recommending" can have significant and long-lasting effects on progress. William Overington Friday 31 March 2017 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Fri Mar 31 09:50:08 2017 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 31 Mar 2017 15:50:08 +0100 (BST) Subject: Tailoring the Marketplace (is: Re: Unicode Emoji 5.0 characters now final) In-Reply-To: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> References: <7118436.33420.1490966745072.JavaMail.root@webmail24.bt.ext.cpcloud.co.uk> Message-ID: <12873136.45822.1490971808757.JavaMail.defaultUser@defaultHost> Peter Constable wrote: > The interest of consumers, in regard to emoji, will never be best met by Unicode-encoded emoji, no matter what process there is for determining what should be "recommended", because consumers inevitably want emoji they recommend for themselves, not what anybody else recommends. The consumers can only choose from what is available to consumers. So what the Unicode Technical Committee recommends or "not-recommends" may well have a very significant effect upon the choices available to the consumer. > If Sally wants an emoji to convey her thoughts on her grandson's school play, or on the latest tweet from a politician, or whatever, she wants it _now_, and she doesn't particularly care if you or I would recommend that emoji to her or not. Sally may not know that the Unicode Technical Committee exists. Sally may have bought her computer or mobile telephone and just uses it, choosing from the emoji available in a menu system, perhaps never realizing all of the detailed standards work and implementation work that took place before the device was manufactured. It is not that Sally is having a particular emoji recommended to her as such, yet if the Unicode Technical Committee "not-recommends" implementation of some emoji that are in the standards document, then Sally may never get the opportunity to choose to use those emoji. 
> So, before we go talking about whether _Unicode_ is accommodating the benefit of consumers, I think we should be asking whether _all the popular communications protocols_ are accommodating the benefit of consumers. Well, all of the various standards needed to produce useful products are important. It is not a matter of one being considered before the other. For a particular emoji to become available in a device that is available to a consumer there are various stages. They are like an AND gate where all inputs must be true in order for the result to be true. The Unicode Technical Committee has enormous power and influence to affect the future of information technology. It works both ways. Where an encoding is made there can be progress, yet where an idea is rejected then there is no way forward for an interoperable plain text encoding to become achieved. I submitted a document in 2015. It was determined to be out of scope and was not included in the Document Register and the Unicode Technical Committee did not consider it. I submitted a later version and received no reply about it at all. So I cannot make progress over an interoperable plain text encoding becoming implemented at the present time. Quite a number of UTC meetings have taken place since. Yet the scope of Unicode is a people-made rule, it could change if people with influence want it to change. The UTC could consider my document and hold a Public Review if it chose to do so. So, the Unicode Technical Committee has enormous power and influence to affect the future of information technology. When a "not-recommendation" of what to support takes place the decision to do that "not-recommending" can have significant and long-lasting effects on progress.
William Overington Friday 31 March 2017 From doug at ewellic.org Fri Mar 31 12:38:03 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 10:38:03 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> Mark Davis wrote: > Ken's observation "approximately backwards" is exactly right, and > that's the same reason why Markus suggested something along the lines > of "interoperable". If the list was arrived at by members of the Consortium who are vendors responsible for implementing (or not) emoji flags, then it would be good to state this fact rather clearly and visibly. Otherwise it really does look like UTC doing the recommending, and the recommending-against. > I don't think we've come up with a pithy category name yet, but I > tried different wording on the slides on http://unicode.org/emoji/. > See what you think, Doug. Slide 37 (speaker's notes) says: "While at this point only three flags are on the recommended list, implementations can provide other subdivision flags." That's not a problem, except for being buried in speaker's notes. It implies that all valid sequences are fine but some might not be universally supported. That's normal for Unicode. Slide 38 (slide and speaker's notes) says: "Valid (but not recommended for vendors)" Nope. That brings it right back to "Hey, vendors, Unicode recommends that you don't support these." As I said Thursday, if that is the intent, then don't change the wording; it's perfect as is. The wordsmithing -- if that's all it is and not truly a warning-against -- needs to apply primarily to the "not recommended" category. I suggested "additional" to remove the explicit negative of "not recommended" and "Standard? - No." In today's tread-lightly speech, "not recommended" has the strong sense of "recommended against." Eating poison ivy is Not Recommended.
-- Doug Ewell | Thornton, CO, US | ewellic.org From petercon at microsoft.com Fri Mar 31 17:06:54 2017 From: petercon at microsoft.com (Peter Constable) Date: Fri, 31 Mar 2017 22:06:54 +0000 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> References: <20170331103803.665a7a7059d7ee80bb4d670165c8327d.3a4c15067a.wbe@email03.godaddy.com> Message-ID: Would "are not very likely to be well-supported in common platforms or applications" work? Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, March 31, 2017 10:38 AM To: Mark Davis ☕️ Cc: Asmus Freytag ; Unicode Mailing List Subject: RE: Unicode Emoji 5.0 characters now final Mark Davis wrote: > Ken's observation "approximately backwards" is exactly right, and > that's the same reason why Markus suggested something along the lines > of "interoperable". If the list was arrived at by members of the Consortium who are vendors responsible for implementing (or not) emoji flags, then it would be good to state this fact rather clearly and visibly. Otherwise it really does look like UTC doing the recommending, and the recommending-against. > I don't think we've come up with a pithy category name yet, but I > tried different wording on the slides on http://unicode.org/emoji/. > See what you think, Doug. Slide 37 (speaker's notes) says: "While at this point only three flags are on the recommended list, implementations can provide other subdivision flags." That's not a problem, except for being buried in speaker's notes. It implies that all valid sequences are fine but some might not be universally supported. That's normal for Unicode. Slide 38 (slide and speaker's notes) says: "Valid (but not recommended for vendors)" Nope. That brings it right back to "Hey, vendors, Unicode recommends that you don't support these."
As I said Thursday, if that is the intent, then don't change the wording; it's perfect as is. The wordsmithing -- if that's all it is and not truly a warning-against -- needs to apply primarily to the "not recommended" category. I suggested "additional" to remove the explicit negative of "not recommended" and "Standard? - No." In today's tread-lightly speech, "not recommended" has the strong sense of "recommended against." Eating poison ivy is Not Recommended. -- Doug Ewell | Thornton, CO, US | ewellic.org From doug at ewellic.org Fri Mar 31 17:38:00 2017 From: doug at ewellic.org (Doug Ewell) Date: Fri, 31 Mar 2017 15:38:00 -0700 Subject: Unicode Emoji 5.0 characters now final Message-ID: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> Peter Constable wrote: > Would "are not very likely to be well-supported in common platforms or > applications" work? No, I think it should be even longer, maybe a paragraph or two, because the concept of "A-list" versus "everything else" is just too complex and unfamiliar to express concisely. What's wrong with "other" or "additional" in contrast to "recommended" or "preferred"? Or is the intent really to say "don't use these"? -- Doug Ewell | Thornton, CO, US | ewellic.org From asmusf at ix.netcom.com Fri Mar 31 18:43:18 2017 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Fri, 31 Mar 2017 16:43:18 -0700 Subject: Unicode Emoji 5.0 characters now final In-Reply-To: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> References: <20170331153800.665a7a7059d7ee80bb4d670165c8327d.3727a49ed6.wbe@email03.godaddy.com> Message-ID: An HTML attachment was scrubbed... URL:
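As background to the "recommended vs. valid" flag discussion running through this thread: under UTS #51, a subdivision flag is an emoji tag sequence built from U+1F3F4 WAVING BLACK FLAG, TAG characters (U+E0020..U+E007E) spelling the lowercase subdivision code, and U+E007F CANCEL TAG as terminator. The three sequences on the recommended list at the time of these messages are gbeng, gbsct, and gbwls. A minimal Python sketch of the mechanism (illustrative only; it shows how any such sequence is assembled, not which ones any vendor supports):

```python
# Build an emoji tag sequence for a subdivision flag, per UTS #51.
# Each ASCII character of the subdivision code maps to the TAG
# character at U+E0000 + its code point; the sequence is terminated
# by U+E007F CANCEL TAG.

def subdivision_flag(code: str) -> str:
    base = "\U0001F3F4"  # U+1F3F4 WAVING BLACK FLAG
    tags = "".join(chr(0xE0000 + ord(c)) for c in code.lower())
    return base + tags + "\U000E007F"  # U+E007F CANCEL TAG

# The three sequences on the 2017 "recommended" list:
for code in ("gbeng", "gbsct", "gbwls"):
    seq = subdivision_flag(code)
    print(code, [f"U+{ord(ch):04X}" for ch in seq])
```

Whether such a sequence renders as a flag or as a black flag followed by nothing visible depends entirely on the fonts and platforms discussed above; the sequence itself is valid either way.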