From doug at ewellic.org  Sun May  1 14:32:19 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 1 May 2016 13:32:19 -0600
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <mailman.1.1462122002.20638.unicode@unicode.org>
References: <mailman.1.1462122002.20638.unicode@unicode.org>
Message-ID: <1FD941D6110E4641A78751CC2BB73CCB@DougEwell>

Don Osborn wrote:

> Substituting characters such that the key for an otherwise unused
> character yields a hooked letter or a tone-marked vowel may be seen as
> sufficient for their purposes and easier than switching to Unicode and
> sorting out a new keyboard system.

The myth is that switching to Unicode requires switching to a new and 
{ unfamiliar, complex, hard to adopt } keyboard layout. Even when the 
"new" part is true, the rest need not be.

Assuming they are currently using a Windows U.S. English layout, someone 
could easily provide them with a layout that either:

1. puts the non-ASCII letters on the keys corresponding to the ASCII 
symbols currently repurposed by their font (for example, pressing q 
yields ?), or

2. puts them on AltGr combinations (for example, pressing AltGr+e yields 
?).

In the first case, there would be no apparent change for the user, but 
the mapping from q to ? would be moved out of the font and into the 
input process.

The second case would allow access to both English and (e.g.) Bambara 
characters, but would require a change for the user typing Bambara, so 
would probably meet with more resistance.

Tools could be easily written to convert existing text like "tqgq" to 
the real spelling, so compatibility with the hacked fonts would become 
less of a concern.
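
Such a conversion tool could be sketched in Python roughly as follows. The
mapping here is an assumption made purely for illustration: it supposes that
the hacked font draws the open o (U+0254) in the slot of "q", so that "tqgq"
would stand for "tɔgɔ". A real table would have to be read off the font in
question, and the names HACKED_TO_UNICODE and to_real_spelling are
hypothetical.

```python
# Hypothetical mapping from repurposed ASCII keys to the intended letters.
# Assumption: the legacy font shows q/Q as open o; check the actual font.
HACKED_TO_UNICODE = str.maketrans({
    "q": "\u0254",  # LATIN SMALL LETTER OPEN O
    "Q": "\u0186",  # LATIN CAPITAL LETTER OPEN O
})


def to_real_spelling(text: str) -> str:
    """Replace each repurposed ASCII character with its Unicode letter."""
    return text.translate(HACKED_TO_UNICODE)


if __name__ == "__main__":
    print(to_real_spelling("tqgq"))
```

A table-driven approach like this keeps the font-specific knowledge in one
place, so supporting another hacked font would only mean supplying another
table.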

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸


From duerst at it.aoyama.ac.jp  Mon May  2 02:34:08 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 2 May 2016 16:34:08 +0900
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
Message-ID: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>

Hello Don,

I agree with Doug that creating a good keyboard layout is a good thing 
to do. Among the people on this list, you probably have the best 
contacts, and can help create some test layouts and see how people react.

Also, creating fonts that have the necessary coverage but are encoded in 
Unicode may help, depending on how well the necessary characters are 
supported out of the box in the OS version in use on the ground (which 
may be quite old).

Also, a conversion program will help. It shouldn't be too difficult, 
because as far as I understand, it's essentially just a few characters 
that need conversion, and it's 1 byte to multibyte. Even in a low-level 
language such as C, that's just a few lines, and any of the students in 
my programming course could write that (they just wrote something 
similar as an exercise last week).
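
The 1-byte-to-multibyte step might look like the following sketch (in
Python rather than C, for brevity). It assumes that the legacy file is in an
8-bit encoding such as Windows-1252 and that the byte for "q" stands for
U+0254; both are illustrative assumptions, not facts about any particular
font, and the names used are hypothetical.

```python
# Hypothetical table: legacy byte value -> intended Unicode character.
LEGACY_BYTE_TO_CHAR = {
    0x71: "\u0254",  # 'q' slot assumed to hold LATIN SMALL LETTER OPEN O
    0x51: "\u0186",  # 'Q' slot assumed to hold LATIN CAPITAL LETTER OPEN O
}


def legacy_to_utf8(data: bytes, encoding: str = "cp1252") -> bytes:
    """Convert 8-bit legacy text to UTF-8, remapping the repurposed bytes."""
    out = []
    for byte in data:
        if byte in LEGACY_BYTE_TO_CHAR:
            out.append(LEGACY_BYTE_TO_CHAR[byte])
        else:
            # Bytes undefined in the assumed encoding become U+FFFD.
            out.append(bytes([byte]).decode(encoding, errors="replace"))
    return "".join(out).encode("utf-8")


if __name__ == "__main__":
    converted = legacy_to_utf8(b"tqgq")
    print(len(b"tqgq"), "bytes in,", len(converted), "bytes out")
```

With this table, the four input bytes of "tqgq" become six bytes of UTF-8,
because each remapped letter takes two bytes.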

On 2016/05/01 02:27, Don Osborn wrote:
> Last October I posted about persistence of old modified/hacked 8-bit
> fonts, with an example from Mali. This is a quick follow up, with
> belated thanks to those who responded to that post on and off list, and
> a set of examples from China and Nigeria. I conclude below with some
> thoughts about what this says about dissemination of information about
> Unicode.

I'm not familiar with the actual situation on the ground, which may vary 
in each place, but in general, what will convince people is not 
theoretical information, but practical tools and examples about what 
works better with Unicode (e.g.: if you do it this way, it will show 
correctly in the Web browser on your new smart phone, or if you do it 
this way, even your relative in Europe can read it without installing a 
special font,...).

Even in the developed world, where most people these days are using 
Unicode, most don't know what it is, and that's just fine, because it 
just works.

Regards,   Martin.

From ed.trager at gmail.com  Mon May  2 11:03:58 2016
From: ed.trager at gmail.com (Ed Trager)
Date: Mon, 2 May 2016 12:03:58 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
 <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
Message-ID: <CAP6tU+m7cvUywm36OsuE1Ufm7JyJTYNGL5FUjuDSXcrf_p_Nig@mail.gmail.com>

In addition to creating platform-specific keyboard layouts as Doug
suggested, I would also like to point out that it is now also possible (and
possibly even easier) to create web-based keyboard and input method engines
that may allow a greater degree of cross-platform support, reducing
platform-specific work.

Also with web applications the "software installation" issue is eliminated.
Remember that while it is easy for technologically savvy folks like members
of this mailing list to install keyboard drivers on any platform we like,
this process is somewhat beyond the reach of many people I know, even when
they are otherwise fairly comfortable using computers.

As an example, see http://unifont.org/keycurry/, a JavaScript/jQuery-based
web app that I wrote and use myself all of the time.

One limitation of keycurry is that currently almost all of the keyboard
maps assume an American QWERTY layout. But honestly it would not be very
difficult to generate variant maps for AZERTY or whatever else one wants. I
just have not bothered myself to do that extra work because I bought my
laptop in the U.S. and the default QWERTY layout works fine for me,
especially now that I can write new keyboard maps for most scripts and
languages in a matter of a few minutes (unifont.org/keycurry now uses
JSON-based keyboard maps with UTF-8, in addition to an older format based
on Yudit; obviously IMEs for scripts like Korean or Chinese take a lot
longer to write, but simple keymaps for Latin and many other scripts are
super easy to make).

In fact, with web-based solutions, users don't even have to download or
install the fonts, as obviously we can just use web fonts to supply
Unicode-based fonts to the web app. (In fact this is exactly what I do for
the Tai Tham keyboards in keycurry, inter alia).

Best - Ed


From oren.watson at gmail.com  Mon May  2 11:31:36 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Mon, 2 May 2016 12:31:36 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
 <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
Message-ID: <CAKs2F=p2FLcypw9J+00Y4iXWmDq83nFjM3C8snRMJBcYdBq4EA@mail.gmail.com>

Hm... I don't think that simply search-and-replacing the ASCII characters
with the characters the font uses them for will work, except on .txt files.
Microsoft Word documents, HTML files, and any other non-plaintext files
will almost certainly be corrupted by such a program, because the tags
might contain those letters. (In addition, unlike .docx files, .doc files
from Windows XP contain binary data which could have arbitrary bytes.)

Probably, in practical terms, a good solution is to make a Microsoft Word
macro to do the replacement, and post instructions for copy-pasting it.
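
For HTML at least, the corruption can be avoided by converting only the
text between tags. The following Python sketch uses the standard library's
html.parser for that; the q-to-open-o mapping and the names
TextOnlyConverter and convert_html are assumptions for illustration only.

```python
from html.parser import HTMLParser

# Hypothetical mapping; the real one depends on the hacked font.
LEGACY_TO_UNICODE = str.maketrans({"q": "\u0254", "Q": "\u0186"})


class TextOnlyConverter(HTMLParser):
    """Re-emit HTML unchanged except for text nodes, which are remapped."""

    def __init__(self):
        # Keep entities such as &amp; as they are instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # tag copied verbatim

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.translate(LEGACY_TO_UNICODE))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def convert_html(html: str) -> str:
    parser = TextOnlyConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = '<blockquote class="q">tqgq</blockquote>'
    print(sample.translate(LEGACY_TO_UNICODE))  # naive: tag is damaged too
    print(convert_html(sample))                 # only the text is changed
```

This is still only a sketch: it would also remap text inside script or style
elements, and it converts all text rather than only the runs that are
actually set in the hacked font, so a practical tool would need to check the
font of each run first.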


From jcb+unicode at inf.ed.ac.uk  Wed May  4 01:54:48 2016
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Wed,  4 May 2016 07:54:48 +0100 (BST)
Subject: non-breaking snakes
Message-ID: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>

See
http://xkcd.com/1676/
(making sure to look at the mouse-over text)

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


From mark at macchiato.com  Wed May  4 02:07:19 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 4 May 2016 09:07:19 +0200
Subject: non-breaking snakes
In-Reply-To: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
Message-ID: <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>

Very nice!

Mark

On Wed, May 4, 2016 at 8:54 AM, Julian Bradfield <jcb+unicode at inf.ed.ac.uk>
wrote:

> See
> http://xkcd.com/1676/
> (making sure to look at the mouse-over text)
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>

From samjnaa at gmail.com  Wed May  4 02:14:00 2016
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Wed, 4 May 2016 12:44:00 +0530
Subject: non-breaking snakes
In-Reply-To: <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
Message-ID: <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>

Isn't there some Japanese orthography feature that already does
something like this?

-- 
Shriramana Sharma


From mark at macchiato.com  Wed May  4 02:23:04 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 4 May 2016 09:23:04 +0200
Subject: non-breaking snakes
In-Reply-To: <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
 <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>
Message-ID: <CAJ2xs_GW6GcX8GWjrSuAb5vHkv2wojn8uewyhG8fxJM00F=Vww@mail.gmail.com>

Arabic has tatweel/kashida for justification; rather similar in principle.

https://en.wikipedia.org/wiki/Kashida

Mark

On Wed, May 4, 2016 at 9:14 AM, Shriramana Sharma <samjnaa at gmail.com> wrote:

> Isn't there some Japanese orthography feature that already does
> something like this?
>
> --
> Shriramana Sharma
>

From textexin at xencraft.com  Wed May  4 02:23:54 2016
From: textexin at xencraft.com (Tex Texin)
Date: Wed, 4 May 2016 00:23:54 -0700
Subject: non-breaking snakes
In-Reply-To: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
Message-ID: <00e401d1a5d5$ea92a070$bfb7e150$@xencraft.com>

Non-breaking snake is English for Kashida, right?

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Julian Bradfield
Sent: Tuesday, May 03, 2016 11:55 PM
To: unicode at unicode.org
Subject: non-breaking snakes

See
http://xkcd.com/1676/
(making sure to look at the mouse-over text)

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.


From richard.wordingham at ntlworld.com  Wed May  4 02:27:55 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 4 May 2016 08:27:55 +0100
Subject: non-breaking snakes
In-Reply-To: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
Message-ID: <20160504082755.6b3b9f9d@JRWUBU2>

On Wed,  4 May 2016 07:54:48 +0100 (BST)
Julian Bradfield <jcb+unicode at inf.ed.ac.uk> wrote:

> See
> http://xkcd.com/1676/
> (making sure to look at the mouse-over text)

I thought kashida (TATWEEL) was a precedent not to be followed.  The
issue, of course, is that chained snakes do not reflow well, just as
filler text doesn't.

Richard.

From khaledhosny at eglug.org  Wed May  4 05:46:58 2016
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Wed, 4 May 2016 12:46:58 +0200
Subject: non-breaking snakes
In-Reply-To: <CAJ2xs_GW6GcX8GWjrSuAb5vHkv2wojn8uewyhG8fxJM00F=Vww@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
 <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>
 <CAJ2xs_GW6GcX8GWjrSuAb5vHkv2wojn8uewyhG8fxJM00F=Vww@mail.gmail.com>
Message-ID: <20160504104658.GA24870@macbook>

That sounds more like traditional Tibetan justification than kashida:
http://rishida.net/scripts/tibetan/#justification

On Wed, May 04, 2016 at 09:23:04AM +0200, Mark Davis ☕️ wrote:
> Arabic has tatweel/kashida for justification; rather similar in principle.
> 
> https://en.wikipedia.org/wiki/Kashida
> 
> Mark
> 
> On Wed, May 4, 2016 at 9:14 AM, Shriramana Sharma <samjnaa at gmail.com> wrote:
> 
> > Isn't there some Japanese orthography feature that already does
> > something like this?
> >
> > --
> > Shriramana Sharma
> >

From simon at simon-cozens.org  Wed May  4 06:07:05 2016
From: simon at simon-cozens.org (Simon Cozens)
Date: Wed, 4 May 2016 21:07:05 +1000
Subject: non-breaking snakes
In-Reply-To: <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
Message-ID: <5729D7D9.7020201@simon-cozens.org>

On 04/05/2016 17:07, Mark Davis ☕️ wrote:
> Very nice!

The SILE typesetting engine now implements full support for this new
justification strategy. Please see http://www.sile-typesetter.org/

From verdy_p at wanadoo.fr  Wed May  4 06:15:08 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 May 2016 13:15:08 +0200
Subject: non-breaking snakes
In-Reply-To: <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
Message-ID: <CAGa7JC1W4RqtCb-oOxb1j59t1zqb45Bv=DJ2CQCjBWGLYmL3Dg@mail.gmail.com>

Those "snakes" do exist in Arabic for justification purposes (they are
formatting controls insertable between pairs of joined letters and possibly
used as base holders for diacritics).

Otherwise they are just normal "filler" (punctuation-like symbols like
leader dots, otherwise "crap text").

The Arabic tatweel is very smart (better than extending only the spacing
between words, and better than breaking words with interletter spacing,
changing the shape of letters, or packing letters to remove their normal
spacing gap and creating collisions).

Technically such a "tatweel" also exists in Latin in its cursive form (with
joined letters), and possibly as well in cursive forms of Greek and
Cyrillic. But they are still not encoded at all (as formatting controls),
even if they could also be used as base holders for some left-side or
right-side diacritics.

2016-05-04 9:07 GMT+02:00 Mark Davis ☕️ <mark at macchiato.com>:

> Very nice!
>
> Mark
>
> On Wed, May 4, 2016 at 8:54 AM, Julian Bradfield <jcb+unicode at inf.ed.ac.uk
> > wrote:
>
>> See
>> http://xkcd.com/1676/
>> (making sure to look at the mouse-over text)
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>

From leoboiko at namakajiri.net  Wed May  4 07:59:04 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Wed, 4 May 2016 09:59:04 -0300
Subject: non-breaking snakes
In-Reply-To: <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <CAJ2xs_FLMB+seFmHPpRb0=skwQ30mjHipoyD_wvLFGmoKCo-CA@mail.gmail.com>
 <CAH-HCWWSVsYxS=iYsQLrmH0Qh2t6Cv3D6k5p-XHp+Z2RW+RVcQ@mail.gmail.com>
Message-ID: <CAJ6uix7bBggRn5kZ4-21uW2c73M2TuZHb84UL9abTs3viDRFbA@mail.gmail.com>

2016-05-04 4:14 GMT-03:00 Shriramana Sharma <samjnaa at gmail.com>:
> Isn't there some Japanese orthography feature that already does
> something like this?

Japanese (and Chinese) vertical calligraphy can do arbitrary-length
stretching of lines (like the Arabic kashida under discussion, and
like most cursive scripts in the world, I guess). Notice e.g. the long
lines here: https://www.instagram.com/seiichirou_uemura/ . The
hiragana letter ?? in particular, often becomes a long vertical line.

However, traditionally this is used for æsthetic rhythm, not for
justification.  In fact, most kinds of Japanese calligraphy prize
variation in line length, not uniformity. And when uniformity is
sought (e.g. certain sutras), they don't use stretched lines, but just
fill a grid with non-cursive, block (kaisho) characters.

I'm not aware of similar features for typography. Because the script
doesn't separate words, justification is comparatively simple: you just
break lines mid-word, mostly wherever (with a few restrictions to
avoid hanging punctuation and so on).


From doug at ewellic.org  Wed May  4 09:29:20 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 May 2016 07:29:20 -0700
Subject: non-breaking snakes
Message-ID: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>

1F40D FE0F

The VS just makes extra, extra sure that it's emoji.
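
For anyone who wants to inspect that sequence, here is a small Python
sketch that builds it and lists its code points. The names SNAKE_EMOJI and
describe are hypothetical, and whether a given renderer honours the
variation selector is outside the scope of the sketch.

```python
import unicodedata

# U+1F40D followed by U+FE0F (VS16), as given above.
SNAKE_EMOJI = "\U0001F40D\uFE0F"


def describe(s: str) -> list[str]:
    """Return 'U+XXXX NAME' for each code point in s."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in s]


if __name__ == "__main__":
    for line in describe(SNAKE_EMOJI):
        print(line)
```

Run as a script, this prints "U+1F40D SNAKE" and then
"U+FE0F VARIATION SELECTOR-16".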

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸


From charupdate at orange.fr  Thu May  5 21:35:59 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 May 2016 04:35:59 +0200 (CEST)
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
Message-ID: <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>

On Sat, 30 Apr 2016 13:27:02 -0400, Don Osborn  wrote:

> If the latter be the case, that would seem to have implications
> regarding dissemination of information about Unicode. "If you
> standardize it, they will adopt" certainly holds for industry and
> well-informed user communities (such as in open source software), but
> not necessarily for more localized initiatives. This is not to seek to
> assign blame in any way, but rather to point out what seems to be a
> persistent issue with long term costs in terms of usability of text in
> writing systems as diverse as Bambara, Hausa boko, and Chinese pinyin.

The situation Don describes is challenging the work that is already done and on-going in Mali, with several keyboard layouts at hand. If widening the range is really suitable, one might wish to test a couple of other solutions than already mentioned, that roughly fall into two subsets:

1) Letters on the digits row. Thanks to a kindly shared resource, I'm able to tell that over a dozen Windows layouts (mainly French, as used in Mali, but also Lithuanian, Czech, Slovak, and Vietnamese) have the digits in the Shift or AltGr shift states. The latter is the only useful way of mapping letters on digit keys, and it becomes handy if the Kana toggle is added, either alone or in synergy with the Kana modifier instead of AltGr. With all bracketing characters in group 2 level 1 on the home row and so on, there is enough room to have all characters for Bambara and French directly accessible.

2) Letters through dead keys. This is the ISO/IEC 9995 way of making more characters available in additional groups with dead key group selectors (referred to as remnant modifiers but actually implemented as dead keys). This is also one way SIL/Tavultesoft's layouts work for African and notably for Malian languages. IME-based keyboarding software may additionally offer a transparent input experience.


On Mon, 2 May 2016 12:03:58 -0400, Ed Trager  wrote:

> Also with web applications the "software installation" issue is eliminated.
> Remember that while it is easy for technologically savvy folks like members
> of this mailing list to install keyboard drivers on any platform we like,
> this process is somewhat beyond the reach of many people I know, even when
> they are otherwise fairly comfortable using computers. 

I can't easily believe that people who are comfortable with computers have trouble using the largely automated keyboard layout installation feature. From my own experience, and from observing other people I know, there is rather some kind of reluctance, based on the belief (call it a myth or an urban legend) that Windows plus preinstalled software plus MS Office comes with everything any user may need until the next update. Informing people about Microsoft's help for customizing the keyboard is more complicated, though, in that the display is part of the hardware and the functioning behind it is more of a black box.


As I am currently working on such a project for the fr-FR locale, I've already got some ideas for Bambara. I hope it can soon be online.

Kind regards,

Marcel


From charupdate at orange.fr  Fri May  6 10:21:28 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 May 2016 17:21:28 +0200 (CEST)
Subject: non-breaking snakes
In-Reply-To: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>
References: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>
Message-ID: <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>

On Wed, 4 May 2016 08:27:55 +0100, Richard Wordingham  wrote:

> On Wed, 4 May 2016 07:54:48 +0100 (BST)
> Julian Bradfield  wrote:
> 
> > See
> > http://xkcd.com/1676/
> > (making sure to look at the mouse-over text)
> 
> I thought kashida (TATWEEL) was a precedent not to be followed. The
> issue, of course, is that chained snakes do not reflow well, just as
> filler text doesn't.


On Wed, 4 May 2016 13:15:08 +0200, Philippe Verdy  wrote:

> Those "snakes" do exist in Arabic for justification purpose (they are
> formatting controls insertable between pairs of joined letters and possibly
> used as base holders for diacritics).
> 
> […]


On Wed, 4 May 2016 09:59:04 -0300, Leonardo Boiko  wrote:

> 2016-05-04 4:14 GMT-03:00 Shriramana Sharma :
> > Isn't there some Japanese orthography feature that already does
> > something like this? 
> 
> […] In fact, most kinds of Japanese calligraphy prize
> variation in line length, not uniformity. […]


On Wed, 04 May 2016 07:29:20 -0700, Doug Ewell  wrote:

> 1F40D FE0F
> 
> The VS just makes extra, extra sure that it's emoji.


Hmm… I guess the principle of diversity should then
allow for other long animals too: various caterpillars,
a squirrel running on a branch…

More seriously, if animal pictographs are downgraded
to mere line-fillers, I'm not sure whether the text style
variation selector U+FE0E would not be a good choice.

Why not tackle it the other way around: standardize
sequences of U+2012..U+2015, U+2E3A with some of
the other ~250 variation selectors to make them look
like extensible vegetal or animal ornaments. Or simply
chain the VSes with repeated U+002D.

If there were a vote, I'd prefer word-breaking in scripts
that allow for it, in case justification is really required
(to make a hieratic look); or, in scripts that cannot break
words, such as Hebrew, using the letter extension mechanisms.

As for letter spacing, abusing it for justification purposes
is common in some languages but is not semantically neutral
(as TUS recalls) in others that may be very close geographically.
What helps make a proper layout on one side of the Rhine
is yelling on the other.

So yes, abusing emoji is then the lesser evil :)

Marcel


From steve at swales.us  Fri May  6 10:49:09 2016
From: steve at swales.us (Steve Swales)
Date: Fri, 6 May 2016 08:49:09 -0700
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
Message-ID: <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>

This discussion seems to have fizzled out, but I'm concerned that there's a real world problem here which is at least partially the concern of the consortium, so let me stir the pot and see if there's still any meat left.

On the current release of MacOS (including the developer beta, for your reference, Peter), if you use Calibri font, for example, in any app (e.g. notes), to write words with "ti" (like internationalization), then press "Print" and "Open PDF in Preview", you get a PDF document with the joined "ti".  Subsequently cutting and pasting produces mojibake, and searching the document for words with "ti" doesn't work, as previously noted.

I suppose we can look on this as purely a font handling/MacOS bug, but I'm wondering if we should be providing accommodations or conveniences in Unicode for it to work as desired.

-steve


> On Mar 21, 2016, at 1:40 AM, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> Are those PDFs supposed to be searchable inside? For archival purposes, the PDFs are stored in their final form, and search is performed by creating a database of descriptive metadata. Each time one wants formal details, they have to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents which originally were not even in digital plain text; they were printed or typewritten, and frequently they include graphics, handwritten signatures, stamped seals...)
> 
> Being able to search plain-text inside a PDF is not the main objective (and not the priority). The archival however is a top priority (and there's no money to finance a numerisation and no human resource available to redo this old work, if needed other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that fall in the public domain).
> 
> PDF/A-1a is meant only for creating new documents from a original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken, if there's the need to include handwritten or supplementary elements (signatures, seals...) whose source is not the original electronic document but the printed paper over which the annotations were made: it is this paper document, not the electronic document which is the official final source (we've got some important legal paper whose original has other marks including traces of beer or coffee, or partly burnt, the paper itself has several alterations, but it is the original "as is", and for legal purpose the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, not meant to represent originals accurately).
> 
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke <tom at bluesky.org>:
> 
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
> >
> > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document.
> 
> My understanding is that PDF/A-1a is supposed to be searchable.
> 
> 
> 
> 


From verdy_p at wanadoo.fr  Fri May  6 12:24:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 6 May 2016 19:24:12 +0200
Subject: non-breaking snakes
In-Reply-To: <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>
References: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>
 <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>
Message-ID: <CAGa7JC2_KaBtjGZj+VWVHUecGGPDHVBLjEpdW6+dDvTp9F9fHA@mail.gmail.com>

My opinion is that the choice of graphics for these fillers is just a matter
of style. A single filler (format control) would be enough to encode
(simplifying later text handling, which could ignore it for plain-text
searches or collation). These fillers are only made for specific text
layouts with specific fonts at specific sizes, the number of actual
symbols/graphics you would need is unpredictable in all other cases.

The format control would only be used to mark where these fillers are
safely insertable automatically (just like SHY marks).

The situation however would be different if these marks are also used as
bases for holding diacritics (this is the case of the Arabic Tatweel). But
using CGJ (or some other control with combining class 0) is generally
enough to mark their separation from the base letter to which they would
normally attach. The diacritic will be positioned relative to this
zero-width CGJ, above or below.

But CGJ itself is not freely "extensible" in width for line justification.
So the encoding would be <CGJ, diacritics, FILLER> if you want all
diacritics to remain attached at the start side of the filler. If
the diacritics should come at the end side of the filler, they would be
encoded as <FILLER, diacritics>. In summary, that FILLER would be just
another form of CGJ, except that it is extensible like whitespace for
line justification purposes. Also, the FILLER would not necessarily hold
diacritics and could be used alone, even without letters on either side of
it.

The Arabic Tatweel behaves mostly like CGJ (diacritics are normally
rendered on the start side of the filler), but there are some cases where
the Arabic diacritics are centered on the filler: it then behaves more like a
normal letter for rendering, even if it is ignorable for plain-text
searches, and it may not be rendered at all if there is no need to justify
lines or if the diacritics still fit around the base letter before it, or even
in their normal position with that base letter.
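
As a rough illustration of the "ignorable for plain-text searches" point, here
is a minimal Python sketch. The hypothetical FILLER control discussed above does
not exist in Unicode, so the sketch only uses the two real characters mentioned,
CGJ (U+034F) and Tatweel (U+0640). The helper name strip_fillers is an
assumption made for this example, not an API from any library.

```python
import unicodedata

CGJ = "\u034F"      # COMBINING GRAPHEME JOINER
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def strip_fillers(text: str) -> str:
    """Drop Tatweel and CGJ so stretched and plain spellings compare equal.

    Hypothetical helper: a real search tool would more likely rely on
    collation rules or the Default_Ignorable_Code_Point property.
    """
    return "".join(ch for ch in text if ch not in (CGJ, TATWEEL))


# Both characters have canonical combining class 0, so they are never
# reordered relative to the diacritics around them during normalization.
assert unicodedata.combining(CGJ) == 0
assert unicodedata.combining(TATWEEL) == 0

plain = "\u0633\u0644\u0627\u0645"                  # an Arabic word, unstretched
stretched = "\u0633\u0640\u0640\u0644\u0627\u0645"  # same word with two Tatweels
assert strip_fillers(stretched) == plain
```

Stripping before comparison is only one possible design; a tailored collation
that treats these characters as ignorable would avoid modifying the text.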


2016-05-06 17:21 GMT+02:00 Marcel Schneider <charupdate at orange.fr>:

> On Wed, 4 May 2016 08:27:55 +0100, Richard Wordingham  wrote:
>
> > On Wed, 4 May 2016 07:54:48 +0100 (BST)
> > Julian Bradfield  wrote:
> >
> > > See
> > > http://xkcd.com/1676/
> > > (making sure to look at the mouse-over text)
> >
> > I thought kashida (TATWEEL) was a precedent not to be followed. The
> > issue of course, is that chained snakes do not reflow well, just as
> > filler text doesn't.
>
>
> On Wed, 4 May 2016 13:15:08 +0200, Philippe Verdy  wrote:
>
> > Those "snakes" do exist in Arabic for justification purpose (they are
> > formatting controls insertable between pairs of joined letters and
> > possibly used as base holders for diacritics).
> >
> > [...]
>
>
> On Wed, 4 May 2016 09:59:04 -0300, Leonardo Boiko  wrote:
>
> > 2016-05-04 4:14 GMT-03:00 Shriramana Sharma :
> > > Isn't there some Japanese orthography feature that already does
> > > something like this?
> >
> > [...] In fact, most kinds of Japanese calligraphy prize
> > variation in line length, not uniformity. [...]
>
>
> On Wed, 04 May 2016 07:29:20 -0700, Doug Ewell  wrote:
>
> > 1F40D FE0F
> >
> > The VS just makes extra, extra sure that it's emoji.
>
>
> Hmm... I guess the principle of diversity should then
> allow for other long animals too: various caterpillars,
> squirrel running on a branch...
>
> More seriously, if animal pictographs are downgraded
> to mere line-fillers, I'm not sure whether the text style
> variation selector U+FE0E would not be a good choice.
>
> Why not tackle it the other way around: standardize
> sequences of U+2012..U+2015, U+2E3A with some of
> the other ~250 variation selectors to make them look
> like extensible vegetal or animal ornaments. Or simply
> chain the VSes with repeated U+002D.
>
> If there were a vote, I'd prefer word-break in scripts
> that allow for it, in case justification is really required
> (to make a hieratic look); or in scripts that cannot break
> words, as Hebrew, using the letter extension mechanisms.
>
> As for letter spacing, abusing it for justification purposes
> is current in some languages but is not semantically neutral
> (as TUS recalls) in others that may be very close geographically.
> What helps making a proper layout on one side of the Rhine,
> is yelling on the other.
>
> So yes, then abusing emoji is the lesser evil :)
>
> Marcel
>
>

From lang.support at gmail.com  Fri May  6 16:22:16 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 7 May 2016 07:22:16 +1000
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
 <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
Message-ID: <CAGJ7U-U07EpMpFbuOZXxs7QPBhyoUaxyOBjCEJPFv30a2otFNw@mail.gmail.com>

My understanding is that searchability comes down to two factors:

1) the ToUnicode mapping, which maps glyphs in the font or subsetted
font to Unicode codepoints. Mappings take the form of one glyph to one
codepoint or one glyph to two or more codepoints.

Obviously any glyph that doesn't resolve by default to a codepoint isn't in
the mapping, nor does the mapping handle glyphs that have been visually
reordered during rendering.

2) the next step is to tag the PDF, then use the ActualText label of each
tag.

So for some languages with the right fonts, step one is all that is needed.
And this is fairly standard in PDF generation tools. The font itself can
have an impact on this, of course.

But for other languages you need to go to the second step.

With the languages I work with, I might have some PDFs that just require the
visible text layer; others will have a visible text layer plus ActualText.
For the PDF to be searchable, the search tools not only need to be able to
handle the text layer but also ActualText attributes when necessary.

And that all comes down to decisions the tool developer has taken on how to
handle searching when both visible text layers and ActualText labels are
present.

I have been told on accessibility lists that the PDF specs leave that
implementation detail to the developer, based on their requirements.

So in some cases you need to go the extra step and add ActualText. But you
also need to evaluate your search tools to ensure they do what you expect.
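
To make the two factors concrete, here is a minimal Python sketch of how a
search tool might build its searchable text. Everything in it is an assumption
for illustration: the Span class, the glyph IDs and the extract_text function
are hypothetical stand-ins, not part of any real PDF library, and real
ToUnicode CMaps and tagged-PDF structures are considerably more involved.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

REPLACEMENT = "\uFFFD"  # shown when a glyph has no Unicode mapping


@dataclass
class Span:
    """A run of glyphs, optionally carrying replacement text (hypothetical)."""
    glyph_ids: List[int]
    actual_text: Optional[str] = None  # stand-in for an ActualText entry


def extract_text(spans: List[Span], to_unicode: Dict[int, str],
                 use_actual_text: bool = True) -> str:
    """Build searchable text from glyph runs.

    Factor 1: each glyph ID is looked up in a glyph-to-string map
    (one glyph may map to one or several code points).
    Factor 2: if the tool honours it, a span's ActualText overrides
    whatever the glyph lookup would have produced.
    """
    out = []
    for span in spans:
        if use_actual_text and span.actual_text is not None:
            out.append(span.actual_text)
            continue
        for gid in span.glyph_ids:
            out.append(to_unicode.get(gid, REPLACEMENT))
    return "".join(out)


# Glyph 3 is a ligature mapped to two code points; glyph 5 is unmapped.
to_unicode = {1: "n", 2: "a", 3: "ti", 4: "o"}
good = [Span([1, 2]), Span([3]), Span([4, 1])]
bad = [Span([1, 2]), Span([5], actual_text="ti"), Span([4, 1])]

print(extract_text(good, to_unicode))                        # mapping alone suffices
print(extract_text(bad, to_unicode))                         # rescued by ActualText
print(extract_text(bad, to_unicode, use_actual_text=False))  # tool ignores ActualText
```

The use_actual_text flag models the developer decision mentioned above: the
same file can be searchable in one tool and not in another.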

Andrew


On Saturday, 7 May 2016, Steve Swales <steve at swales.us> wrote:
> This discussion seems to have fizzled out, but I'm concerned that there's
> a real world problem here which is at least partially the concern of the
> consortium, so let me stir the pot and see if there's still any meat left.
>
> On the current release of MacOS (including the developer beta, for your
> reference, Peter), if you use Calibri font, for example, in any app (e.g.
> notes), to write words with "ti" (like internationalization), then press
> "Print" and "Open PDF in Preview", you get a PDF document with the joined
> "ti". Subsequently cutting and pasting produces mojibake, and searching
> the document for words with "ti" doesn't work, as previously noted.
>
> I suppose we can look on this as purely a font handling/MacOS bug, but
> I'm wondering if we should be providing accommodations or conveniences in
> Unicode for it to work as desired.
>
> -steve

-- 
Andrew Cunningham
lang.support at gmail.com

From tuvalkin at gmail.com  Fri May  6 23:54:57 2016
From: tuvalkin at gmail.com (=?UTF-8?Q?Ant=c3=b3nio_Martins-Tuv=c3=a1lkin?=)
Date: Sat, 7 May 2016 05:54:57 +0100
Subject: non-breaking snakes
In-Reply-To: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
Message-ID: <572D7521.9090203@gmail.com>

On 2016.05.04 07:54, Julian Bradfield wrote:

> See http://xkcd.com/1676/
> (making sure to look at the mouse-over text)

The new snake character needs to have in its remarks field see-also 
links to these:

U+115F HANGUL CHOSEONG FILLER
U+1160 HANGUL JUNGSEONG FILLER
U+3164 HANGUL FILLER : chaeum
U+A8F9 DEVANAGARI GAP FILLER
U+FFA0 HALFWIDTH HANGUL FILLER (decomp.: U+3164)
U+10AF6 MANICHAEAN PUNCTUATION LINE FILLER
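
For anyone who wants to check the list above, a minimal Python sketch using the
standard unicodedata module follows. The function name describe_fillers is made
up for this example, and the output depends on the Unicode version bundled with
the interpreter (U+10AF6 needs Unicode 7.0 data or later).

```python
import unicodedata

# Code points taken from the list above.
FILLERS = [0x115F, 0x1160, 0x3164, 0xA8F9, 0xFFA0, 0x10AF6]


def describe_fillers(code_points=FILLERS):
    """Return (code point label, name, decomposition) for each filler."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((f"U+{cp:04X}",
                     unicodedata.name(ch),
                     unicodedata.decomposition(ch)))
    return rows


for label, name, decomp in describe_fillers():
    # The decomposition field is empty for characters without one.
    print(label, name, decomp or "(no decomposition)")
```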

--                                                              ____.
António MARTINS-Tuválkin                                       |  ()|
<tuvalkin at gmail.com>                                           |####|
PT-1500-124 Lisboa                    Não me invejo de quem tem     |
PT-2695-010 Bobadela LRS              carros, parelhas e montes     |
+351 934 821 700, +351 212 463 477    só me invejo de quem bebe     |
facebook.com/profile.php?id=744658416 a água em todas as fontes     |
---------------------------------------------------------------------
De sable uma fonte e bordadura escaqueada de jalde e goles por timbre
bandeira por mote o 1º verso acima e por grito de guerra "Mi rajtas!"
---------------------------------------------------------------------

From leob at mailcom.com  Sat May  7 00:35:34 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Fri, 6 May 2016 22:35:34 -0700
Subject: non-breaking snakes
In-Reply-To: <572D7521.9090203@gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <572D7521.9090203@gmail.com>
Message-ID: <CAFmvRscbFiQpGuW05Sra5rKe8z1YUkK1-eZh5Q-Q1PQn+u88-g@mail.gmail.com>

Also, or rather foremost, to U+2766 ❦ FLORAL HEART

????? - what does the (almost) connecting vine remind me of? Hmmm...

Leo


2016-05-06 21:54 GMT-07:00 António Martins-Tuválkin <tuvalkin at gmail.com>:

> On 2016.05.04 07:54, Julian Bradfield wrote:
>
> See http://xkcd.com/1676/
>> (making sure to look at the mouse-over text)
>>
>
> The new snake character needs to have in its remarks field see-also links
> to these:
>
> U+115F HANGUL CHOSEONG FILLER
> U+1160 HANGUL JUNGSEONG FILLER
> U+3164 HANGUL FILLER : chaeum
> U+A8F9 DEVANAGARI GAP FILLER
> U+FFA0 HALFWIDTH HANGUL FILLER (decomp.: U+3164)
> U+10AF6 MANICHAEAN PUNCTUATION LINE FILLER

From verdy_p at wanadoo.fr  Sat May  7 08:05:24 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 7 May 2016 15:05:24 +0200
Subject: non-breaking snakes
In-Reply-To: <CAFmvRscbFiQpGuW05Sra5rKe8z1YUkK1-eZh5Q-Q1PQn+u88-g@mail.gmail.com>
References: <slrnnij75n.6h9.jcb@home.stevens-bradfield.com>
 <572D7521.9090203@gmail.com>
 <CAFmvRscbFiQpGuW05Sra5rKe8z1YUkK1-eZh5Q-Q1PQn+u88-g@mail.gmail.com>
Message-ID: <CAGa7JC3YcEGtnFn832Ak4QmaSRP6hfqVMjVNf6RSQ8_r=HXqTg@mail.gmail.com>

This is the same thing as:
____________
.....................
:::::::::::::::::::::
############
*****************
===========
/////////////////////
---------------------
TTTTTTTTTTT

You can use any characters (punctuation, symbols, even letters) or graphics
aligned in a row to create such fillers. But taken in isolation, these
characters have their own meaning, independently of their "snake" usage. The
vine symbol is not special. The same usage also covers the "leader" dots used
in TOCs or input forms.

That's why I suggest treating this usage as only a matter of style for
graphically representing (with known fonts and layouts) a "snake", which
may still be represented by a format control at the places where insertion
is authorized for line justification, instead of just whitespace.

Then a stylesheet, specific to a page layout, will do the rest, specifying
the graphics or characters to use for these insertions, without requiring the
document to specify a specific number of signs. In a plain-text format with
unspecified layout, it should not even be visible.
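
A minimal Python sketch of that division of labour, assuming a fixed-width
layout: the text carries only the two ends of the line, and the layout step
decides how many filler signs to insert. The function fill_line and its
parameters are hypothetical names chosen for this example.

```python
def fill_line(left: str, right: str, width: int, filler: str = ".") -> str:
    """Join left and right with a repeated filler pattern to reach width.

    The filler pattern is a style choice (dots, dashes, a vine symbol);
    the number of repetitions is computed by the layout, not stored in
    the document. Widths are counted in code points, which is only
    adequate for monospaced, non-combining text.
    """
    if not filler:
        raise ValueError("filler pattern must not be empty")
    gap = width - len(left) - len(right)
    if gap < 0:
        raise ValueError("content is wider than the line")
    # Repeat the pattern and truncate so the line is exactly `width` long.
    return left + (filler * gap)[:gap] + right


print(fill_line("Chapter 1", "12", 20))        # dotted leader, as in a TOC
print(fill_line("Chapter 1", "12", 30, "=-"))  # same text, other style and width
```

Changing the width or the filler pattern changes the rendering without touching
the underlying text, which is the property argued for above.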

2016-05-07 7:35 GMT+02:00 Leo Broukhis <leob at mailcom.com>:

> Also, or rather foremost, to U+2766 ❦ FLORAL HEART
>
> ????? - what does the (almost) connecting vine remind me of? Hmmm...
>
> Leo
>

From hospes02 at scholarsfonts.net  Sat May  7 12:00:31 2016
From: hospes02 at scholarsfonts.net (David Perry)
Date: Sat, 07 May 2016 13:00:31 -0400
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
 <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
Message-ID: <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>

I agree that it's a real-world problem -- PDFs really should be 
searchable -- but I do not see that it's a Unicode issue.  Unicode 
defines the basic building blocks of LATIN SMALL LETTER T and LATIN 
SMALL LETTER I; that's its job. Unicode is not responsible for font 
construction or creating PDF software.  Furthermore, even if Unicode did 
want to do something about it, I can't imagine what that could be -- 
aside perhaps from using its bully pulpit to urge PDF creators and font 
creators to do their jobs better.

The fact that some PDF apps do not search and copy/paste text correctly 
when unencoded characters are given PUA values has been known for many 
years.  In the case of Calibri, I looked at the font (version installed 
on my Win7 system) and found that the 'ti' ligature is named t_i, which 
follows good naming practices, and it does not have a PUA assignment. 
Given this, any well-constructed PDF app should be able to decode the 
ligature correctly.
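
To show why a name like t_i is decodable, here is a simplified Python sketch of
the glyph-name conventions from the Adobe Glyph List specification: a suffix
after the first period is dropped, components are separated by underscores, and
"uniXXXX" or "uXXXX" components carry code points directly. The AGL_SUBSET
table is a tiny illustrative subset and glyph_name_to_text is a made-up name;
a real implementation would use the full glyph list and reject surrogates.

```python
import re

# Tiny illustrative subset of glyph-name-to-character entries (assumption).
AGL_SUBSET = {"t": "t", "i": "i", "f": "f", "l": "l", "space": " "}


def glyph_name_to_text(name: str) -> str:
    """Derive a Unicode string from a glyph name (simplified sketch)."""
    base = name.split(".", 1)[0]  # "t_i.liga" -> "t_i"
    out = []
    for comp in base.split("_"):  # ligature components
        m = re.fullmatch(r"uni((?:[0-9A-F]{4})+)", comp)
        if m:
            digits = m.group(1)
            out.append("".join(chr(int(digits[i:i + 4], 16))
                               for i in range(0, len(digits), 4)))
            continue
        m = re.fullmatch(r"u([0-9A-F]{4,6})", comp)
        if m and int(m.group(1), 16) <= 0x10FFFF:
            out.append(chr(int(m.group(1), 16)))
            continue
        # Unknown components contribute nothing.
        out.append(AGL_SUBSET.get(comp, ""))
    return "".join(out)


print(glyph_name_to_text("t_i"))          # ligature name decodes to its letters
print(glyph_name_to_text("uni00740069"))  # same letters via code points
```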

David

On 5/6/2016 11:49 AM, Steve Swales wrote:
> This discussion seems to have fizzled out, but I'm concerned that
> there's a real world problem here which is at least partially the
> concern of the consortium, so let me stir the pot and see if there's
> still any meat left.
>
> On the current release of MacOS (including the developer beta, for
> your reference, Peter), if you use Calibri font, for example, in any
> app (e.g. notes), to write words with "ti" (like
> internationalization), then press "Print" and "Open PDF in Preview",
> you get a PDF document with the joined "ti".  Subsequently cutting and
> pasting produces mojibake, and searching the document for words
> with "ti" doesn't work, as previously noted.
>
> I suppose we can look on this as purely a font handling/MacOS bug, but
> I'm wondering if we should be providing accommodations or conveniences
> in Unicode for it to work as desired.
>
> -steve
>

From lang.support at gmail.com  Sun May  8 03:13:48 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sun, 8 May 2016 18:13:48 +1000
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
 <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
 <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
Message-ID: <CAGJ7U-XOpZ3+S+BbMgNDGdzjmx=Vhv620bETZnovUbqzwXEPAA@mail.gmail.com>

The t_i instance will depend on the quality of the font. If it's a standard
ligature there should be a glyph to codepoints assignment in the cmap table
or the ToUnicode mapping in the PDF file.

As David indicates, it isn't a Unicode issue.

It is an issue with the font used and/or the tools used.

PDFs have always been problematic. That isn't going to change anytime soon.
Particularly for archivable or accessible PDFs, the person generating the
PDFs should select the best tools for the job and test the PDF. Then fix any
problems.

Andrew

On Sunday, 8 May 2016, David Perry <hospes02 at scholarsfonts.net> wrote:
> I agree that it's a real-world problem -- PDFs really should be
> searchable -- but I do not see that it's a Unicode issue.  Unicode defines
> the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I;
> that's its job. Unicode is not responsible for font construction or
> creating PDF software.  Furthermore, even if Unicode did want to do
> something about it, I can't imagine what that could be -- aside perhaps
> from using its bully pulpit to urge PDF creators and font creators to do
> their jobs better.
>
> The fact that some PDF apps do not search and copy/paste text correctly
> when unencoded characters are given PUA values has been known for many
> years.  In the case of Calibri, I looked at the font (version installed on
> my Win7 system) and found that the 'ti' ligature is named t_i, which
> follows good naming practices, and it does not have a PUA assignment. Given
> this, any well-constructed PDF app should be able to decode the ligature
> correctly.
>
> David

-- 
Andrew Cunningham
lang.support at gmail.com

From dzo at bisharat.net  Sun May  8 07:42:13 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 08:42:13 -0400
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <CAGJ7U-XOpZ3+S+BbMgNDGdzjmx=Vhv620bETZnovUbqzwXEPAA@mail.gmail.com>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
 <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
 <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
 <CAGJ7U-XOpZ3+S+BbMgNDGdzjmx=Vhv620bETZnovUbqzwXEPAA@mail.gmail.com>
Message-ID: <f13db731-ca65-7580-b8bb-e759880f06a5@bisharat.net>

Could it be said that a PDF conversion app generating unusual coding of 
characters, and doing so without advising users, is an instance of 
"Unicode malpractice"? (per David's mention of using the "bully pulpit")

Some earlier posts in this thread made the observation that PDF is for 
presentation not archiving. However, since the format makes it possible 
to search text instead of having just an image of the pages, it seems 
that distinction is at least somewhat blurred. PDFs are archived and 
searched, and people expect to use those functions. So yes this 
font/coding issue in PDFs is a real world problem, but of the sort that 
Unicode was created to relegate to the past.

An analogy that comes to mind is continued use of old hacked 8-bit 
fonts, which were created before Unicode was widely adopted, for 
printing and limited sharing ("you need to install this font to view 
correctly"). Documents produced with them, however, are shared as PDFs 
(such as some Chinese-Hausa learning materials up to at least 2010, 
which of course look and print fine, but which run into the same search 
and re-use issues), and even escape into the wild as text (with unhappy 
results like a Bambara translation of a handwashing poster during the 
Ebola crisis).

Any digital text these days can't be treated as just producing something 
visually correct.

By the way, the "?" in the original title changed to "O" somewhere back 
in the thread. A luta continua.

Don


On 5/8/2016 4:13 AM, Andrew Cunningham wrote:
> The t_i instance will depend on the quality of the font. If it's a
> standard ligature there should be a glyph to codepoints assignment in
> the cmap table or the ToUnicode mapping in the PDF file.
>
> As David indicates, it isn't a Unicode issue.
>
> It is an issue with the font used and/or the tools used.
>
> PDFs have always been problematic. That isn't going to change anytime
> soon. Particularly for archivable or accessible PDFs, the person generating
> the PDFs should select the best tools for the job and test the PDF.
> Then fix any problems.
>
> Andrew


From verdy_p at wanadoo.fr  Sun May  8 08:35:15 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 8 May 2016 15:35:15 +0200
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <f13db731-ca65-7580-b8bb-e759880f06a5@bisharat.net>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net>
 <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us>
 <56EB1723.7030301@bisharat.net>
 <CAGJ7U-Ww3wdqcSEcMZ1zLX2dk5iUYpGqDUmtET=RsiR2e0tHHg@mail.gmail.com>
 <56ED91DE.5080700@bisharat.net>
 <CAGJ7U-WGQN5NoVr1PjcMz1yC1ZbmwuuCLybFcvjN8cb0m0Y5Ng@mail.gmail.com>
 <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl>
 <56EEF8DD.2090808@ix.netcom.com>
 <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
 <CAGa7JC2b4AWeCdGix84wqZBtYifHz+R4N0N41CPxaqR-m06CEA@mail.gmail.com>
 <DB70551D-B714-4E05-8E91-C57034427CAD@swales.us>
 <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
 <CAGJ7U-XOpZ3+S+BbMgNDGdzjmx=Vhv620bETZnovUbqzwXEPAA@mail.gmail.com>
 <f13db731-ca65-7580-b8bb-e759880f06a5@bisharat.net>
Message-ID: <CAGa7JC2U7=T5ym5oGH_+=Skyu1MQf_97wi-xOp=iY64iv_AWEQ@mail.gmail.com>

2016-05-08 14:42 GMT+02:00 Don Osborn <dzo at bisharat.net>:

> Some earlier posts in this thread made the observation that PDF is for
> presentation not archiving.
>
I tend to disagree. PDF are hugely used for archiving and for that purpose
it does not matter how it was generated, it is only meant to be a
facsimile, possibly with equal value as the original (printed) paper. The
initial numeric format is just a working draft with no legal value in most
cases.

That's why PDF files can contain a digistal signature, to give them the
same value as the original paper. The initial numeric draft has no value,
even if it's easier to search in it.

Many (many!) laws and treaties in the world are kept only as PDF, not all
of them being searchable in plain text, unless there's been some OCR (and
often correction to this process). The original papers (which have legal
value) are kept in museums or official national libraries and are no longer
freely accessible to the public, which is why facsimile PDFs are created
to make them accessible (possibly digitally signed by the official library
or some national authority).

Lots of organisations archive their legal papers only as PDFs and recycle
the original paper. This is authorized by national laws, provided they
insert a verifiable signature certifying the date. No alteration of the
content is then authorized, as these PDFs become the new originals (except
for adding new digital signatures, or possibly dropping some of them other
than the initial dated one, whose security may have weakened over time and
which the legitimate right holder then needs to supplement with new,
stronger signatures; the history of signatures will be kept).

Being able to search in a PDF is a distinct goal, not meant directly for
archiving but for using PDFs in isolation as *working* documents. For
archives, however, the ability to search in them may be provided by separate
data (without legal force) stored in the archive index, along with the
unaltered (and legally valid) PDF.
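
As a rough illustration of that separation, here is a minimal Python sketch
of a sidecar index. Everything in it is an assumption made for illustration:
the function names, the record layout and the sample file name are
hypothetical, and the SHA-256 digest only stands in for "the PDF was not
altered"; it is not a legally binding digital signature.

```python
import hashlib


def make_index_entry(name: str, pdf_bytes: bytes, searchable_text: str) -> dict:
    """Build a sidecar record; the PDF bytes themselves are never modified."""
    return {
        "name": name,
        # Digest of the archived file, so later alteration can be detected.
        "sha256": hashlib.sha256(pdf_bytes).hexdigest(),
        # Searchable text (e.g. from OCR) lives only in the index.
        "text": searchable_text,
    }


def is_unaltered(entry: dict, pdf_bytes: bytes) -> bool:
    """Check that the archived bytes still match the recorded digest."""
    return hashlib.sha256(pdf_bytes).hexdigest() == entry["sha256"]


def search(index: list, term: str) -> list:
    """Return the names of archived files whose index text contains term."""
    needle = term.casefold()
    return [e["name"] for e in index if needle in e["text"].casefold()]


# Placeholder bytes standing in for a real scanned document.
archived = b"%PDF-1.4 placeholder bytes standing in for a scanned treaty"
index = [make_index_entry("treaty-0001.pdf", archived,
                          "Treaty of friendship and commerce")]
hits = search(index, "commerce")
```

The point of the design is only that searching touches the index text while
the archived file is checked, never rewritten.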

PDFs are not meant to be used for presentation (there are much better
ways to present the content and *adapt* it to the audience or presentation
medium). But presentation is also a different goal from being able to search
in it. A PDF is just a collection of rendered pages (possibly with a
limited resolution, where rendered characters may be a bit fuzzy or some
non-meaningful color distinctions may be deliberately lost) to be used "as
is" and meant to be read by human eyes (even being able to produce an
accurate OCR is not a goal of this format).

When producing the PDF, the human editor may choose to reduce the
resolution, reduce the colorspace and so on, if this helps reduce the
digital storage size and helps archiving, or helps protect the author's
rights.

E.g. there are different PDF versions for free online editions of
newspapers, where text may be too fuzzy to be read. But there are versions
for subscribers with much better quality (but possibly fewer ads), kept
in archives if needed, but still not really meant to be searchable in plain
text; in fact the producer may want to limit the searchability so that
readers will have to look at the pages directly, and see the embedded
advertising boxes even if they are not related directly to what is being
searched for; the producer may provide only a limited plain-text index for
some headings, but not for the content itself: readers have to scan it
visually so that they cannot completely ignore the surrounding context.

The producer of the PDF then has the choice between the different options,
depending on its goals for the document. For legal use, there are some
goals to follow, but these do not (most often) include the need to perform
plain-text search in them. This may mean that some OCR or human work will be
needed later in order to index it, but this operation may be limited by
author's rights, and the user assumes his own responsibility if he makes
a false interpretation when using only automated tools. PDFs are meant to
be read and interpreted by humans, not machines.

From dzo at bisharat.net  Sun May  8 09:19:54 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 10:19:54 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
 <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
Message-ID: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>

Thanks all for the replies on this matter. Concerning the keyboard side
of the issue, there has been a lot of discussion about unified standards
over the years, but what we end up with is maybe another case of "The
nice thing about standards is that there are so many to choose from."
Within that, there seem to be two main questions addressed by keyboard
creation: production and popular use. Many keyboards are made with
production in one or maybe a couple of languages in mind - this is in
line with the thinking behind creation of old 8-bit modified fonts. On
the other hand, there is the need for keyboard layouts that can be accessed
broadly without the users having to learn new key assignments at each
new device. In terms of philosophy, I'd see common keyboards as more in
line with the intent of Unicode.

In the ideal world, there would be no distinction between keyboards 
created with limited/focused production in mind (limited in the sense of 
one language in a multilingual society and/or focused on a particular 
production need), and keyboards intended to facilitate broad usage. Like 
a QWERTY+ or AZERTY+ perhaps? That has not been easy - kind of another 
theory of everything problem.

The flexibility of touchpad keyboards in theory gets beyond the
limitations of the physical keyboards - has anyone tried adding a row to
say a QWERTY layout, which includes additional characters, rather than
sweating the issues about shoehorning them in other levels or key
sequences? Is that even possible? Still would be helpful to have
standards, but where something is visible, it is easy to use.

On the font side, my impression (a bit dated) is that there is/was a 
policy dimension or gap. Back when Unicode was becoming more widely 
adopted, there were new computers marketed in Africa without the then 
limited repertoire of fonts with extended Latin. Even when these were 
included, there are some instances where it is possible that 8-bit fonts 
with extended characters were created on machines that already had one 
or two Unicode fonts - evidently unbeknownst to the user. So there was,
and always has been, a public education side to this that none of us in
a position to do so, or with an interest in doing so, has been able to
address.

In the background one should bring in the issue of whether computer 
science students and IT experts in Africa had any introduction to 
Unicode. That could be a big missing piece in the equation.

The case of the Chinese publications using modified 8-bit fonts for both 
Hausa boko and Chinese pinyin is a specialized one. Given the small 
number of people working on both those languages it may be just the 
chance outcome of their not being aware that Unicode already had their 
needs covered. A specialized keyboard for production of text including 
hooked consonants and tone-marked vowels, plus awareness of Unicode 
would probably set them on a new course.

Marcel, I would be very interested to know more about what you are 
working on wrt Bambara - perhaps offline.

Don


On 5/5/2016 10:35 PM, Marcel Schneider wrote:
> On Sat, 30 Apr 2016 13:27:02 -0400, Don Osborn  wrote:
>
>> If the latter be the case, that would seem to have implications
>> regarding dissemination of information about Unicode. "If you
>> standardize it, they will adopt" certainly holds for industry and
>> well-informed user communities (such as in open source software), but
>> not necessarily for more localized initiatives. This is not to seek to
>> assign blame in any way, but rather to point out what seems to be a
>> persistent issue with long term costs in terms of usability of text in
>> writing systems as diverse as Bambara, Hausa boko, and Chinese pinyin.
> The situation Don describes is challenging the work that is already done and on-going in Mali, with several keyboard layouts at hand. If widening the range is really suitable, one might wish to test a couple of other solutions than already mentioned, that roughly fall into two subsets:
>
> 1) Letters on the digits row. Thanks to a kindly shared resource, I'm able to tell that over one dozen Windows layouts (mainly French, as used in Mali, but also Lithuanian, Czech, Slovak, and Vietnamese) have the digits in the Shift or AltGr shift states. The latter is the only useful way of mapping letters on digit keys and becomes handy if the Kana toggle is added, either alone or in synergy with the Kana modifier instead of AltGr. With all bracketing characters in group 2 level 1 on the home row and so on, there is enough place to have all characters for Bambara and French directly accessed.
>
> 2) Letters through dead keys. This is the ISO/IEC 9995 way of making more characters available in additional groups with dead key group selectors (referred to as remnant modifiers but actually implemented as dead keys). This is also one way SIL/Tavultesoft's layouts work for African and notably for Malian languages. IME-based keyboarding software may additionally offer a transparent input experience.
>
>
> On Mon, 2 May 2016 12:03:58 -0400, Ed Trager  wrote:
>
>> Also with web applications the "software installation" issue is eliminated.
>> Remember that while it is easy for technologically savvy folks like members
>> of this mailing list to install keyboard drivers on any platform we like,
>> this process is somewhat beyond the reach of many people I know, even when
>> they are otherwise fairly comfortable using computers.
> I can't easily believe that people who are comfortable with computers may have trouble using the widely automated keyboard layout installation feature, because I have both experienced myself and had the opportunity to observe in other people I know that there is in fact some kind of reluctance, based on the belief (call it a myth or an urban legend) that Windows plus preinstalled software plus MS Office come with everything any user may need until the next update. Though informing about Microsoft's help to customize the keyboard is more complicated in that the display is part of the hardware, and the functioning behind has more of a blackbox.
>
>
> Currently working on such a project for the fr-FR locale, I've already got some ideas for Bambara. I hope it can soon be on-line.
>
> Kind regards,
>
> Marcel
>


From verdy_p at wanadoo.fr  Sun May  8 10:24:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 8 May 2016 17:24:17 +0200
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
 <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
 <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
Message-ID: <CAGa7JC3duudMwKJ9eAok=+pAa5Fuy-CL43RzLVcj_QYaG=PG-g@mail.gmail.com>

2016-05-08 16:19 GMT+02:00 Don Osborn <dzo at bisharat.net>:

> The flexibility of touchpad keyboards in theory gets beyond the
> limitations of the physical keyboards - has anyone tried adding a row to
> say a QWERTY layout, which includes additional characters, rather than
> sweating the issues about shoehorning them in other levels or key
> sequences? Is that even possible? Still would be helpful to have standards,
> but where something is visible, it is easy to use.
>

It is technically possible, but the problem is to add distinctive hardware
"scan codes" to keys in this row.

See this table:
https://msdn.microsoft.com/en-us/library/aa299374(v=vs.60).aspx

You'll note that almost all scancodes in the 7-bit range are used. So you'd
need "extended scancodes", i.e. prefixing the special virtual scancode 00
on Windows (or the hardware scancode E0) before the extended scancode for
the actual key. (The special scancode "00" turns the 7-bit table into an
equivalent 8-bit table, but note that keyboards use 7-bit scancodes only,
as the 8th bit is used for the press/release flag.)
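
To make the E0 prefix and the press/release bit concrete, here is a small
Python sketch that decodes a byte stream in the classic PC (set 1) style.
It is only an illustration under stated assumptions: the names KeyEvent and
decode_scancodes are made up, the E1 prefix is deliberately not handled, and
the "E0 02" pair in the sample stands for a hypothetical extra key reusing
the scancode of the digit 1 key, not for any assigned key.

```python
from typing import Iterable, Iterator, NamedTuple

EXTENDED_PREFIX = 0xE0  # hardware prefix announcing an extended scancode
RELEASE_FLAG = 0x80     # 8th bit: set when the key is released
CODE_MASK = 0x7F        # low 7 bits: the scancode proper


class KeyEvent(NamedTuple):
    code: int       # 7-bit scancode
    extended: bool  # True if the byte was preceded by E0
    pressed: bool   # True for press, False for release


def decode_scancodes(stream: Iterable[int]) -> Iterator[KeyEvent]:
    """Turn raw bytes into key events (E1 sequences are not handled)."""
    extended = False
    for byte in stream:
        if byte == EXTENDED_PREFIX:
            extended = True  # remember the prefix for the next byte
            continue
        yield KeyEvent(
            code=byte & CODE_MASK,
            extended=extended,
            pressed=not (byte & RELEASE_FLAG),
        )
        extended = False


# Digit 9 pressed and released (0A, 8A), then a hypothetical extended key
# reusing the scancode of digit 1 (E0 02, E0 82).
events = list(decode_scancodes([0x0A, 0x8A, 0xE0, 0x02, 0xE0, 0x82]))
```

The decoder keeps plain and extended keys apart only by the extended flag,
which is why an extra row could reuse the 7-bit codes of the digits row.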

For that, you could then reuse the scancodes of the first row (those for
digits). Note that the scancodes for the row of "standard" function keys
(F1..F12) are already extended this way (for additional function keys).

But note also this table:
https://www.win.tue.nl/~aeb/linux/kbd/scancodes-10.html

You'll see that the hardware scancodes E0-0A and E0-0B are already assigned
on PC for special functions, and so cannot be used to "extend" the keys for
digits 9 and 0 on the first row (whose scancodes are 0A and 0B
respectively). This is not so critical: you can perfectly well have
additional keys assigned for a row using non-contiguous hardware scancodes
(after all, the alphabetic part of the keyboard is already using multiple
ranges of hardware and virtual scancodes).

But you'd need a new keyboard driver (and an extension to MSKLC on Windows)
to allow mapping this supplementary row, and an industry agreement to assign
new extended keys in non-conflicting ways. These days, it is the Microsoft
hardware labs that centralize the extensions used on PC-compatible
hardware; Apple used to have its own registry for its own keyboards, but
now Macs are PCs and can use the same keyboards, not necessarily built by
Apple (e.g. by Logitech). The connectors are compatible with the same USB
interface.

There are some differences in the hardware scancodes used on the USB
interface. Windows internally translates hardware scancodes for some
interfaces into the same virtual scancodes before sending them to upper
keyboard drivers and applications: this is where scancode E0 on the old
PC-keyboard interface, the newer PS/2 interface, the USB interface, or the
old BIOS interface is remapped into the same virtual scancode 00 for
Windows drivers and apps.

There's also an additional hardware extension code E1 for a few function
keys (it is used for a few functions encoded on 3 bytes, for upward
compatibility reasons, such as the "Pause" key).

Various other vendors have used specific hardware scancodes, but today
almost everyone agrees to the same PC standard.

From doug at ewellic.org  Sun May  8 11:50:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 8 May 2016 10:50:29 -0600
Subject: Non-standard 8-bit fonts still in use
Message-ID: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>

Don Osborn wrote:

> Concerning the keyboard side of the issue, there has been a lot of
> discussion about unified standards over the years, but what we end up
> with is maybe another case of "The nice thing about standards is that
> there are so many to choose from."

There are a zillion keyboard layouts, not because of too many 
conflicting standards per se, but primarily because people don't want to 
change away from the layout they're familiar with, and secondarily 
because different languages have different needs.

--
Doug Ewell | http://ewellic.org | Thornton, CO


From dzo at bisharat.net  Sun May  8 13:11:20 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 14:11:20 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
Message-ID: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>

Thanks Doug. You're right as far as that goes, but I'd suggest there's 
more to it.

Languages (by which of course we mean their written forms) have 
requirements, and for cross-border languages, requirements may be 
defined differently by the different countries where they are spoken. 
And users have needs and experience.

In the multilingual settings I'm most interested in, the language 
requirements often overlap, sometimes considerably (thinking here of 
extended Latin alphabets). This is because many languages use
characters that are part of the African Reference Alphabet. So it is
possible to have one keyboard layout for each language, or merge 
requirements if you will for two or more. When the A12n-collab group was 
active* one concept discussed at some length was a "pan-Sahelian" layout 
that could serve many languages across a number of countries.

But even then, considering variations by country (orthographies often 
set by country not by language), there can be several possible sets of 
language requirements, in a "pan-Sahelian" layout. And that's just one 
example.
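
To picture the "merge requirements" idea, here is a minimal Python sketch
that treats each language/country pair as a set of extended Latin letters.
The three sets are small illustrative samples I am assuming for the sake of
the example, not authoritative inventories, and the function names are made
up; real orthographies would need to be checked country by country.

```python
from collections import Counter

# Illustrative samples only: extended Latin letters beyond basic A-Z.
SAMPLE_REQUIREMENTS = {
    "Bambara (Mali)": {"ɛ", "ɔ", "ɲ", "ŋ"},
    "Fula (Mali)": {"ɓ", "ɗ", "ŋ", "ɲ", "ƴ"},
    "Hausa (Niger)": {"ɓ", "ɗ", "ƙ", "ƴ"},
}


def merged_repertoire(requirements: dict) -> set:
    """Every letter a single shared layout would have to provide."""
    return set().union(*requirements.values())


def shared_letters(requirements: dict, minimum: int = 2) -> set:
    """Letters needed by at least `minimum` of the listed orthographies."""
    counts = Counter(ch for letters in requirements.values() for ch in letters)
    return {ch for ch, n in counts.items() if n >= minimum}


merged = merged_repertoire(SAMPLE_REQUIREMENTS)
shared = shared_letters(SAMPLE_REQUIREMENTS)
```

With these sample sets the merged layout needs eight extra letters, five of
which are shared by at least two of the three entries; adding another
country's orthography for the same language is just one more entry.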

Then there is the question of key assignments for any given character. 
Unfortunately in Africa there are not established layouts to deal with - 
most formally educated people will be most familiar with QWERTY or 
AZERTY for the official languages. Everything else is pretty much a 
matter of choice, although some small communities of users may have 
developed familiarity with particular layouts (perhaps a reason for 
persistence of something like Bambara Arial). So another reason there 
are a zillion keyboards is that people are inventing them - for good 
reasons and intent, we can admit, but often without awareness of other 
efforts, or communication with other communities of users.

You are right however that none of these are standards (with a possible 
exception - would have to go back and check) - I was trying to be clever 
- but there are different layouts.

Another thing about user needs is that the polyglot/pluriliterate user 
may prefer something that reflects that, as opposed to having multiple 
keyboards for languages whose character repertoires are much the same. 
 From a national or regional (sub-continental) point of view I would 
think a one-size fits all/many standard or set of keyboard standards 
would be ideal. But no one seems to be going there yet, after all these 
years.

And one could go on. To get this a little on-topic for the list, the 
good news is that Unicode means we're talking just about keyboards and 
not about multiple incompatible fonts as well.

Don

* I'm floating the idea of a new list on the full spectrum of African 
languages & technology issues. Anyone interested or who has thoughts on 
that idea one way or another, please contact me offline.



From doug at ewellic.org  Sun May  8 13:31:59 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 8 May 2016 12:31:59 -0600
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
 <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
Message-ID: <00D2CA093AAB485D9C618AC09856ECCD@DougEwell>

Don Osborn wrote:

> In the multilingual settings I'm most interested in, the language
> requirements often overlap, sometimes considerably (thinking here of
> extended Latin alphabets). This is because many languages use
> characters that are part of the African Reference Alphabet. So it is
> possible to have one keyboard layout for each language, or merge
> requirements if you will for two or more. When the A12n-collab group
> was active* one concept discussed at some length was a "pan-Sahelian"
> layout that could serve many languages across a number of countries.

I wonder if there is a good and fairly comprehensive reference to the 
most common Latin-based alphabets used for African languages, comparable 
to Michael Everson's "The Alphabets of Europe" [1]. Such would be 
helpful for determining the level of effort to create a pan-African 
keyboard layout, or to adapt (if necessary) an existing multilingual 
layout like John Cowan's Moby Latin [2].

[1] http://www.evertype.com/alphabets/
[2] 
http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

--
Doug Ewell | http://ewellic.org | Thornton, CO


From dzo at bisharat.net  Sun May  8 14:15:20 2016
From: dzo at bisharat.net (dzo at bisharat.net)
Date: Sun, 8 May 2016 19:15:20 +0000
Subject: Non-standard 8-bit fonts still in use
Message-ID: <132317239-1462734909-cardhu_decombobulator_blackberry.rim.net-615064916-@b2.c1.bise6.blackberry>

Rhonda Hartell did a compilation based on available info, published 23 yrs ago by SIL. Christian Chanard put that info into a database, Systemes alphabetiques, accessible via links from http://www.bisharat.net/wikidoc/pmwiki.php/PanAfrLoc/WritingSystems#toc11

All I have right now (taking break from shoveling leaf compost). 

Don




From charupdate at orange.fr  Mon May  9 10:16:42 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 9 May 2016 17:16:42 +0200 (CEST)
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
References: <56204330.6010106@bisharat.net>
 <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
 <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
 <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
Message-ID: <1339982434.13627.1462807003259.JavaMail.www@wwinf1n25>

On Sun, 8 May 2016 10:19:54 -0400, Don Osborn  wrote:

> Marcel, I would be very interested to know more about what you are
> working on wrt Bambara - perhaps offline.

Thank you for your interest. I'm glad to get in touch
with on-going work and I have already started mailing, but
eventually would like to acknowledge on-list; although...

On Sun, 8 May 2016 14:11:20 -0400, Don Osborn  wrote:

> To get this a little on-topic for the list, the
> good news is that Unicode means we're talking just about keyboards and
> not about multiple incompatible fonts as well.

Indeed; however, font issues are IMHO even more suitable
for the List (though strictly they are out of scope too)
than keyboard layouts, which must not be discussed
on the Unicode List. Only giving some hints is suitable,
as has been done in this thread up to now. Consequently
I switched off-list immediately. But here I'm doing some
metadiscussion, so please disregard.

> In the background one should bring in the issue of whether computer
> science students and IT experts in Africa had any introduction to
> Unicode. That could be a big missing piece in the equation.

For future archive readers there may be some need to recall
that this phenomenon is a global one. A lack of training in Unicode
is observed in Europe as well, and on other continents.
Please see the following recent thread:

Unicode in the Curriculum? 
from Andre Schappo on 2015-12-30 (Unicode Mail List Archive). 
Retrieved March 11, 2016, from 
http://www.unicode.org/mail-arch/unicode-ml/y2015-m12/0073.html

> On the font side, my impression (a bit dated) is that there is/was a
> policy dimension or gap. Back when Unicode was becoming more widely
> adopted, there were new computers marketed in Africa without the then
> limited repertoire of fonts with extended Latin. Even when these were
> included, there are some instances where it is possible that 8-bit fonts
> with extended characters were created on machines that already had one
> or two Unicode fonts - evidently unbeknownst to the user. So there was,
> and always has been, a public education side to this that none of us in
> position or interest to do so have been able to address.

Please see also the capital left-hook N glyph issue Don documented 
at the very beginning of this thread:

Non-standard 8-bit fonts still in use from Don Osborn on 2015-10-15 (Unicode Mail List Archive). 
(2015, October 21). Retrieved October 21, 2015, from 
http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0135.html

For one more comment on that issue:
http://unicode.org/mail-arch/unicode-ml/y2015-m10/0214.html


On Sun, 8 May 2016 12:31:59 -0600, Doug Ewell  wrote:

> Don Osborn wrote:
>
> > In the multilingual settings I'm most interested in, the language
> > requirements often overlap, sometimes considerably (thinking here of
> > extended Latin alphabets). This is because many languages use
> > characters that are part of the African Reference Alphabet. So it is
> > possible to have one keyboard layout for each language, or merge
> > requirements if you will for two or more. When the A12n-collab group
> > was active* one concept discussed at some length was a "pan-Sahelian"
> > layout that could serve many languages across a number of countries.
>
> I wonder if there is a good and fairly comprehensive reference to the
> most common Latin-based alphabets used for African languages, comparable
> to Michael Everson's "The Alphabets of Europe" [1]. Such would be
> helpful for determining the level of effort to create a pan-African
> keyboard layout, or to adapt (if necessary) an existing multilingual
> layout like John Cowan's Moby Latin [2].
>
> [1] http://www.evertype.com/alphabets/
> [2]
> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

On Sun, 8 May 2016 19:15:20 +0000, dzo at bisharat.net replied:

> Rhonda Hartell did a compilation based on available info,
> published 23 yrs ago by SIL. Christian Chanard put that info
> into a database, Systemes alphabetiques, accessible via links from
> http://www.bisharat.net/wikidoc/pmwiki.php/PanAfrLoc/WritingSystems#toc11
>
> All I have right now (taking break from shoveling leaf compost).

Thanks for this resource. I've taken a look and I like the interface.
But an update is missing, or more accurately, the source was outdated,
as shows up when looking at the Bambara section, which does not take into
account the new orthography, though that had already been in force for
over a decade (1982..1993).

Sadly this valuable database is unreliable unless the data is revised.
I hope that can be done soon. Unfortunately, I'm unable to do this job.


Best regards,

Marcel


From otto.stolz at uni-konstanz.de  Tue May 10 05:10:35 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Tue, 10 May 2016 12:10:35 +0200
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
In-Reply-To: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
 <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
Message-ID: <5731B39B.8000501@uni-konstanz.de>

Hello,

On 2016-05-08 at 20:11, Don Osborn wrote:
> Another thing about user needs is that the polyglot/pluriliterate user
> may prefer something that reflects that, as opposed to having multiple
> keyboards for languages whose character repertoires are much the same.
>  From a national or regional (sub-continental) point of view I would
> think a one-size fits all/many standard or set of keyboard standards
> would be ideal. But no one seems to be going there yet, after all these
> years.

Yes, there is somebody going there. E. g., the German standard
DIN 2137:2012-06 defines a "T2" layout which is meant
for all official, Latin-based orthographies worldwide, and
additionally for the Latin-based minority languages of Germany
and Austria. The layout is based on the traditional QWERTZU layout
for German and Austrian keyboards (which is now dubbed "T1").
Cf. <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.

There is also a "T3" layout defined which comprises all characters
mentioned in ISO/IEC 9995-3:2010.

You can even buy a hardware T2 keyboard; however I have not tried it,
because I have defined my own keyboard layout suite (pan-European Latin,
pan-European Cyrillic, monotonic Greek, and Yiddish) for personal use,
long ago.

Best wishes,
   Otto Stolz

From doug at ewellic.org  Tue May 10 09:55:42 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 May 2016 07:55:42 -0700
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
Message-ID: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>

Otto Stolz wrote:

> Yes, there is somebody going there. E. g., the German standard
> DIN 2137:2012-06 defines a "T2" layout which is meant
> for all official, Latin-based orthographies worldwide, and
> additionally for the Latin-based minority languages of Germany
> and Austria. The layout is based on the traditional QWERTZU layout
> for German and Austrian keyboards (which is now dubbed "T1").
> Cf. <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.

Yes, but there's the rub. QWERTY users are about as willing to switch to
QWERTZ in the name of global standardization as Germans would be to
switch to QWERTY.

--
Doug Ewell | http://ewellic.org | Thornton, CO


From verdy_p at wanadoo.fr  Tue May 10 10:30:25 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 10 May 2016 17:30:25 +0200
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
In-Reply-To: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
Message-ID: <CAGa7JC25n5BCMjHm-aY++OCReKYb7qvDLBKhD5taco-KsHPCow@mail.gmail.com>

Very true, and this will likely not change.
Even users of "ergonomic" layouts want to keep these ergonomics for their
letters (and letter pairs).
All that can reasonably be done is to extend existing layouts with minimal
changes: basic letters, decimal digits, and basic punctuation must remain
at the same place (and there's also some resistance for the most common few
additional letters used in each language that are typically placed on the
1st row, or near the Enter key).
What is likely to change is the placement of combinations using AltGr on
the first row (but on non-US keyboards, these also include some ASCII
characters considered essential on a computer like the backslash, hash
sign, tilde, at sign, or underscore).

This leaves little freedom for changes except for keys currently assigned
to less essential characters such as the degree sign, the micro sign, the
pound sign (in countries not using this symbol daily), the "universal"
currency sign, the paragraph mark... Those can be used to fit better
candidates for extensions.

But without an extension of keyboard rows, it will be difficult to have a
wide adoption on physical keyboards. Function keys F1..F12 may be easily
reduced to fit additional keys for letters and diacritics.

Keyboards have instead been extended with many things that most people
almost never use or don't need there, such as multimedia keys,
shortcuts to launch the browser or calculator app, the contextual
menu/options key (added by Windows), or TWO (sic!) Windows keys
(keep only one and map the few additional keys found on Japanese keyboards).

But it is challenging to keep keys at a decent size on notebook keyboards,
which are already extremely packed (F1..F12 are already reduced
vertically). Manufacturers invented another way: a new "Fn" mode key for
additional multimedia keys (or keys for switching the Wifi, Bluetooth or
display adapters, or for controlling the display brightness or sound
volume/mute), which also let them eliminate the dedicated PrintScreen key
and the ScrollLock and NumLock mode switch keys. A few of them added a
couple of character keys for currency units ($ and €) instead of the
Japanese mode keys.

In fact every brand has done what it wanted to extend its keyboards...
except actually extending the usable alphabets.

For virtual on-screen layouts, there's much more freedom, as the display
panel is adaptive and allows more innovative input methods, or things
never found on physical keyboards such as entering emojis.


2016-05-10 16:55 GMT+02:00 Doug Ewell <doug at ewellic.org>:

> Otto Stolz wrote:
>
> > Yes, there is somebody going there. E.g., the German standard
> > DIN 2137:2012-06 defines a “T2” layout which is meant
> > for all official, Latin-based orthographies worldwide, and
> > additionally for the Latin-based minority languages of Germany
> > and Austria. The layout is based on the traditional QWERTZU layout
> > for German and Austrian keyboards (which is now dubbed “T1”).
> > Cf. <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.
>
> Yes, but there's the rub. QWERTY users are about as willing to switch to
> QWERTZ in the name of global standardization as Germans would be to
> switch to QWERTY.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
>

From otto.stolz at uni-konstanz.de  Tue May 10 11:42:27 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Tue, 10 May 2016 18:42:27 +0200
Subject: Polyglot keyboards
In-Reply-To: <CAGa7JC25n5BCMjHm-aY++OCReKYb7qvDLBKhD5taco-KsHPCow@mail.gmail.com>
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
 <CAGa7JC25n5BCMjHm-aY++OCReKYb7qvDLBKhD5taco-KsHPCow@mail.gmail.com>
Message-ID: <57320F73.2010001@uni-konstanz.de>

Hello,

I had written:
> <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.

On 2016-05-10 16:55 GMT+02:00, Doug Ewell wrote:
> QWERTY users are about as willing to switch to QWERTZ

I never meant that QWERTY (or AZERTY) users should
switch to QWERTZ. I just wanted to point to one instance of
an officially standardized polyglot keyboard layout.

E.g., there is already the Canadian multilingual keyboard, cf.
<https://en.wikipedia.org/wiki/File:KB_Canadian_Multilingual_Standard_comment-en.svg>,
based on the traditional QWERTY layout.  I do hope that other
standards bodies will follow suit and define their own QWERTY,
or AZERTY, or whatever versions of polyglot keyboard layouts,
in accordance with ISO/IEC 9995.

On 2016-05-10 at 17:30, Philippe Verdy wrote:
> All that can be made reasonable is to extend existing layouts with minimal
> changes:
[...]
> This leaves little freedom for changes except for keys currently assigned
> to less essential characters such as the degree sign, the micro sign, the
> pound sign (in countries not using this symbol daily), the "universal"
> currency sign, the paragraph mark... Those can be used to fit better
> candidates for extensions.

Another option (which I exploited for my personal keyboard layouts) is
the re-definition of a special-character key to work as a dead key.
E.g., on my personal keyboard, the “"” key gives access to all sorts
of quote characters (for French, German, English, ..., even ASCII),
depending on the following key; the “~” key works as tilde accent
on the letter typed subsequently; and so on. This scheme allows the
conventional QWERTZ hardware to be used for multilingual typing,
with minimal re-learning and training. And still the ??? key  produces
the ??? character :-)
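
A minimal sketch of such a dead-key scheme, in Python. The table entries
and the name type_keys are assumptions made up for illustration; they are
not the actual personal layout described above, only the general idea of
a key whose output depends on the key typed after it.

```python
# Hypothetical dead-key table: each dead key maps a follow-up key to the
# character that the two-key sequence should produce.
DEAD_KEYS = {
    '"': {
        "<": "\u00ab",  # left guillemet (French)
        ">": "\u00bb",  # right guillemet (French)
        ",": "\u201e",  # low opening quote (German)
        "6": "\u201c",  # left curly quote (English)
        "9": "\u201d",  # right curly quote (English)
        " ": '"',       # dead key + space yields the plain ASCII quote
    },
    "~": {
        "n": "\u00f1",  # n with tilde
        "a": "\u00e3",  # a with tilde
        "o": "\u00f5",  # o with tilde
        " ": "~",       # dead key + space yields the plain tilde
    },
}


def type_keys(keys):
    """Translate a sequence of key presses into the resulting text."""
    out = []
    pending = None  # dead key waiting for its follow-up key, if any
    for key in keys:
        if pending is not None:
            # Unknown follow-up: fall back to emitting both keys literally.
            out.append(DEAD_KEYS[pending].get(key, pending + key))
            pending = None
        elif key in DEAD_KEYS:
            pending = key
        else:
            out.append(key)
    if pending is not None:
        out.append(pending)  # dead key pressed last, nothing followed it
    return "".join(out)
```

Every ordinary key keeps its usual meaning, so the only thing to relearn
is the handful of two-key sequences.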

Best wishes,
   Otto Stolz

From charupdate at orange.fr  Tue May 10 18:09:42 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 11 May 2016 01:09:42 +0200 (CEST)
Subject: Polyglot keyboards
In-Reply-To: <57320F73.2010001@uni-konstanz.de>
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
 <CAGa7JC25n5BCMjHm-aY++OCReKYb7qvDLBKhD5taco-KsHPCow@mail.gmail.com>
 <57320F73.2010001@uni-konstanz.de>
Message-ID: <220047750.20065.1462921782885.JavaMail.www@wwinf1h10>

On Tue, 10 May 2016 12:10:35 +0200, Otto Stolz  wrote:

> [...] the German standard
> DIN 2137:2012-06 defines a “T2” layout which is meant
> for all official, Latin-based orthographies worldwide, and
> additionally for the Latin-based minority languages of Germany
> and Austria. The layout is based on the traditional QWERTZU layout
> for German and Austrian keyboards (which is now dubbed “T1”).
> Cf. <https://de.wikipedia.org/wiki/T2_(Tastaturbelegung)>.
> 
> There is also a “T3” layout defined which comprises all characters
> mentioned in ISO/IEC 9995-3:2010.

Wasn't it the other way round? As far as I remember the sources, 
to stick with the tradition of referring to an ISO subset of Unicode 
(MES-1 for ISO/IEC 9995-3:2002), the German NB urged ISO to adopt 
a new subset tailored for the then-upcoming ISO/IEC 9995-3:2010, 
which in turn was intended to accommodate the cited DIN 2137:2012, which 
was overflowing the ISO keyboard framework in other respects too, 
leading to the addition of part 11 last year.

As for the extent of the new Unicode subset, other problems 
arose from its being tailored to a given keyboard layout 
that did not make full use of the existing keyboard resources 
of the mainstream operating system. As a result, several Latin letters 
are missing, leaving a murky mix of supported and unsupported languages 
across the continents that use the Latin script. While coverage of 
several African and American languages is claimed, several other African and 
American languages are unsupported, notably through the lack 
of ?, ?, ?. Remember that Bamanankan is an official language of Mali.

Having promised not to keep discussing keyboard layouts on 
the Unicode List, I can't help recalling in this *new* thread 
the harm done to communities using the Latin script by excluding 
their alphabets from an internationally designed keyboard standard 
in the era of globalisation.

Everybody on this List remembers the oddities that followed 
the launch of the Multilingual Latin Subset, renamed on the spot 
from the originally proposed “Multilingual International Subset” because 
it covers neither Greek nor Cyrillic, and subsequently annotated 
at the request of ANSI, prompted by a paper from Denis Jacquerye, 
as not covering all languages written in the Latin script, in order to 
avoid misleading future font designers.

Marcel


From rwhlk142 at gmail.com  Tue May 10 18:55:23 2016
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Tue, 10 May 2016 19:55:23 -0400
Subject: The Hebrew Extended (Proposed) Block
Message-ID: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>

Hello again, y'all!

BAD NEWS! (CRUCIALLY IMPORTANT):  The Unicode Consortium has assigned
OTHER characters into the U+00860-U+008FF areas in the BMP of
Unicode: Malayalam extended additional characters for Garshuni, and more
additional Arabic characters.

We'll need to find a DIFFERENT subblock to plant down our Hebrew extended
characters...  either somewhere ELSE within the BMP, *or* somewhere within
either SMP areas 1 or 2.
It'll be the same arrangement originally planned for the U+00860 area, but
relocated and expanded upon!

• Additional characters for correct typesetting of Hebrew
• Hebrew Palestinian vowel and pronunciation points
• The small superscript signs *śin* and *shin* for the letter *shin*
• Hebrew Palestinian cantillation
• Hebrew Babylonian vowel and pronunciation points
• Hebrew Babylonian cantillation
• Hebrew Samaritan vowel and pronunciation points
• Additional Hebrew characters for other Jewish languages

A new TXT listing of this subblock (with the new CORRECT location) will be
forthcoming.  STAY TUNED!

From rwhlk142 at gmail.com  Tue May 10 20:08:58 2016
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Tue, 10 May 2016 21:08:58 -0400
Subject: Moving The Hebrew Extended Block Into The SMP
Message-ID: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>

Hello again!  Shalom!

After reading through the V. 9? code charts PDF document, I DID find a new
area to relocate our new Hebrew Extended block (a very important area to
add into Unicode):
THE AREA FROM U+30000 TO U+3014F (336 codepoints)
• U+30000–U+30014 (21 codepoints):  Additional characters for typesetting
Biblical/Classical Hebrew
• U+30015–U+3001F (11 codepoints):  Palestinian vowel and pronunciation
points for Hebrew and Galilean Aramaic
• U+30020–U+30021 (2 codepoints):  Small superscript top-left signs for the
letter *shin*: superscript śin and superscript shin
• U+30022–U+30041 (32 codepoints):  Palestinian cantillation signs for
Hebrew and Galilean Aramaic
• U+30042 is reserved
• U+30043–U+3005C (26 codepoints):  Babylonian vowel and pronunciation
points for Hebrew
• U+3005D–U+3005F are reserved
• U+30060–U+30071 (18 codepoints):  Babylonian cantillation signs for Hebrew
• U+30072–U+3007D are reserved
• U+3007E–U+3008F (18 codepoints):  Samaritan vowel points, pronunciation
points, and cantillation signs for Hebrew (copies of those also being used
for Samaritan script in BMP)
• U+30090–U+3010F (128 codepoints):  Additional characters in Hebrew script
for other Jewish languages (these are pointed like the corresponding Arabic
characters in the BMP)
• U+30110–U+3012F (32 codepoints):  Basic Hebrew superscript characters
(regular letters+5 final forms+top-left pointed *śin*+top-right pointed
*shin*+*maqqef*)
• U+30130–U+3014F (32 codepoints):  Basic Hebrew subscript characters
(regular letters+5 final forms+top-left pointed *śin*+top-right pointed
*shin*+*maqqef*)
Please STAY TUNED for updates.  Thank You!
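
The arithmetic in the list above can be checked mechanically. The sketch
below is only an illustration (the names SUBRANGES, RESERVED, span and
plane are invented here); it recomputes each subrange size from its end
points and also reports which plane a code point belongs to. Note that
U+30000 lies in plane 3, whereas the SMP is plane 1 (U+10000..U+1FFFF).

```python
# (first, last, size claimed in the message) for each assigned subrange.
SUBRANGES = [
    (0x30000, 0x30014, 21),
    (0x30015, 0x3001F, 11),
    (0x30020, 0x30021, 2),
    (0x30022, 0x30041, 32),
    (0x30043, 0x3005C, 26),
    (0x30060, 0x30071, 18),
    (0x3007E, 0x3008F, 18),
    (0x30090, 0x3010F, 128),
    (0x30110, 0x3012F, 32),
    (0x30130, 0x3014F, 32),
]

# Ranges the message marks as reserved.
RESERVED = [(0x30042, 0x30042), (0x3005D, 0x3005F), (0x30072, 0x3007D)]


def span(first, last):
    """Number of code points in the inclusive range first..last."""
    return last - first + 1


def plane(code_point):
    """Unicode plane number: 0 is the BMP, 1 is the SMP, and so on."""
    return code_point >> 16


def check():
    """Return the list of subranges whose claimed size is wrong."""
    return [(f, l, n) for f, l, n in SUBRANGES if span(f, l) != n]


if __name__ == "__main__":
    print("mismatches:", check())
    total = sum(span(f, l) for f, l, _ in SUBRANGES)
    total += sum(span(f, l) for f, l in RESERVED)
    print("total code points:", total)
    print("plane of U+30000:", plane(0x30000))
```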

From mark at kli.org  Tue May 10 21:23:58 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 10 May 2016 22:23:58 -0400
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
Message-ID: <301fc24e-eb15-74f9-415b-bb1d24c9bf3a@kli.org>

Sounds like a plan; most additional Hebrew characters can probably 
safely live in the SMP, as they are not all that common (except, of 
course, TETRAGRAMMATON, which I'll be writing another proposal about).

What Samaritan vowel and accent points did we miss when we did Samaritan 
the first time around?  We tried to be pretty comprehensive with it, 
including contact with the user community and inspecting books and MSS.

Somewhere I have a list of signs I started making by reading an entry in 
an encyclopedia (Encyclopedia Judaica?) s.v. "Masorah". Ah, found it.  
Various lines, strokes, dots, colons, pairs of dots in assorted 
configurations around letters (Palestinian and Babylonian vowel points, 
etc)...  A bunch of combining letters (COMBINING SAMEKH ABOVE, etc), 
some not exactly normal (SLANTED NUN ABOVE)... I think I had about 
sixty.  But it isn't particularly well-organized or researched.

There is also the "Expanded" Tiberian cantillation system I have seen 
mentioned (in Yeivin's book on Masorah for example, in the part on 
accents, para. #220).  It seems to distinguish things like different 
flavors of MUNAH; I have never really found much about it, so I don't 
know if it needs special graphemes.  The only examples in the Yeivin 
book that I see appear to use existing symbols in combinations (e.g. 
MUNAH plus a MERKHA KEFULA for a "mekarbel").

What other Hebrew characters have you got in mind?  Could be 
interesting.  Are you considering symbols for PETUHA and SETUMA 
pericopes in your "typesetting" section?  Are those fit to be encoded?  
I think they've been mentioned before, but it's hard to show that they 
are anything other than specialized uses of PEH and SAMEKH (unless we're 
talking about using them as formatters, and then they're pretty 
definitely out of scope).

~mark

From mark at kli.org  Tue May 10 21:32:33 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 10 May 2016 22:32:33 -0400
Subject: Moving The Hebrew Extended Block Into The SMP
In-Reply-To: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
References: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
Message-ID: <fd052fc5-eb57-93d9-3d3b-76823864f3dd@kli.org>

Oh yeah.  I also wonder a bit about things like the "half-letters" that 
were used sometimes in early Hebrew printing to fill out space left at 
the end of a line.  They would often write part of the next word, the 
first few letters, but maybe the last letter was missing part of it, or 
just random semi-characters (things like a SHIN with only two heads 
shows up a lot, or even complete SHINs). http://xkcd.com/1676/ got me 
thinking of it.  They're probably not encodable... or are they?

I'll have to find some example scans.  If it's as common as I say, that 
should be easy... unless I'm wrong about that, which I guess would make 
the whole question easier too.

~mark

From mark at kli.org  Tue May 10 21:46:04 2016
From: mark at kli.org (Mark Shoulson)
Date: Tue, 10 May 2016 22:46:04 -0400
Subject: Moving The Hebrew Extended Block Into The SMP
In-Reply-To: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
References: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
Message-ID: <8f8e40e7-c930-1988-9ea4-1d8aceb3900c@kli.org>

On 05/10/2016 09:08 PM, Robert Wheelock wrote:
>
> • U+30000–U+30014 (21 codepoints):  Additional characters for 
> typesetting Biblical/Classical Hebrew

Do you have this list available yet?  I'm curious about these points, 
and others.

> • U+30015–U+3001F (11 codepoints):  Palestinian vowel and pronunciation 
> points for Hebrew and Galilean Aramaic
> • U+30020–U+30021 (2 codepoints):  Small superscript top-left signs for 
> the letter /shin/: superscript śin and superscript shin

I thought SIN was indicated sometimes by a SAMEKH written above the 
letter.  How would putting a SIN (which is just a SHIN with a dot on the 
left instead of the right) on top of the letter be any improvement (or 
difference) over just putting the dot on the left of the base letter in 
the first place?

> • U+30022–U+30041 (32 codepoints):  Palestinian cantillation signs for 
> Hebrew and Galilean Aramaic
> • U+30042 is reserved
> • U+30043–U+3005C (26 codepoints):  Babylonian vowel and pronunciation 
> points for Hebrew
> • U+3005D–U+3005F are reserved
> • U+30060–U+30071 (18 codepoints):  Babylonian cantillation signs for 
> Hebrew
> • U+30072–U+3007D are reserved
> • U+3007E–U+3008F (18 codepoints):  Samaritan vowel points, 
> pronunciation points, and cantillation signs for Hebrew (copies of 
> those also being used for Samaritan script in BMP)

OK, here I'm confused.  Why do we need copies?  Unicode doesn't like to 
encode redundant things, and it only makes for messes (when do you use 
which ZIQAA?)  If we have the characters in the BMP, we don't need them 
in the SMP.

> • U+30090–U+3010F (128 codepoints):  Additional characters in Hebrew 
> script for other Jewish languages (these are pointed like the 
> corresponding Arabic characters in the BMP)

So additional Hebrew "letters" that take Arabic vowel-points?  Makes 
sense; I saw some of that with Samaritan (particularly with DAMMA). We 
should probably just use the Arabic vowel code-points though.

> • U+30110–U+3012F (32 codepoints):  Basic Hebrew superscript characters 
> (regular letters+5 final forms+top-left pointed /śin/+top-right 
> pointed /shin/+/maqqef/)
> • U+30130–U+3014F (32 codepoints):  Basic Hebrew subscript characters 
> (regular letters+5 final forms+top-left pointed /śin/+top-right 
> pointed /shin/+/maqqef/)

When you say "superscript" (or "subscript"), do you mean "spacing 
character that's written small and raised/lowered"?  Or do you mean 
"combining character that's written above/below another character"? (Cf. 
the difference between U+2071 SUPERSCRIPT LATIN SMALL LETTER I and 
U+0365 COMBINING LATIN SMALL LETTER I.)  If the former, is there a 
reason this has to be done as plain-text and can't be handled by 
higher-level markup?  Probably every major script has been written small 
and high in some places, but we don't have superscript versions of every 
letter in Unicode.
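
The distinction between the two kinds of "superscript" shows up directly
in the character properties. A small sketch using Python's standard
unicodedata module (the helper name describe is just an illustration):
the spacing superscript is a letter with a <super> compatibility
decomposition, while the combining one is a nonspacing mark with a
non-zero combining class.

```python
import unicodedata


def describe(code_point):
    """Collect the properties that separate spacing from combining forms."""
    ch = chr(code_point)
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),           # L* = letter, M* = mark
        "combining_class": unicodedata.combining(ch),   # 0 for spacing characters
        "decomposition": unicodedata.decomposition(ch),
    }


if __name__ == "__main__":
    for cp in (0x2071, 0x0365):
        print(f"U+{cp:04X}", describe(cp))
```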


~mark

From markus.icu at gmail.com  Tue May 10 22:34:05 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 10 May 2016 20:34:05 -0700
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
Message-ID: <CAN49p6oS1y-KrNv50FRfS60tCvGfzbk6iH9W4VsSJNBEkJTdMw@mail.gmail.com>

FYI
It seems like 08xx is reserved for RTL scripts.

http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt

# The unassigned code points that default to R are in the ranges:
#     [\u0590-\u05FF *\u07C0-\u089F* \uFB1D-\uFB4F
\U00010800-\U00010FFF \U0001E800-\U0001EDFF \U0001EF00-\U0001EFFF]


http://unicode.org/roadmaps/bmp/
08 Samaritan <http://www.unicode.org/charts/PDF/U0800.pdf> Mandaic
<http://www.unicode.org/charts/PDF/U0840.pdf> (SyrSup)
<http://www.unicode.org/L2/L2015/15156-syriac-malayalam.pdf> ??? ??? ??? Arabic
Extended-A <http://www.unicode.org/charts/PDF/U08A0.pdf>

http://unicode.org/roadmaps/smp/

00010800-00010FFF Alphabetic and syllabic RTL scripts

0001E800-0001EFFF RTL scripts


   - Color highlighting is used to indicate blocks and unassigned ranges
   which default to right-to-left character behavior.
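
As a rough illustration, the default-R ranges quoted above from
DerivedBidiClass.txt can be turned into a simple lookup. The ranges are
copied from this message as it stood in May 2016 and may differ in later
versions of the file; the names DEFAULT_R_RANGES and defaults_to_r are
invented for this sketch.

```python
# Unassigned code points in these ranges default to bidi class R,
# as quoted from DerivedBidiClass.txt in the message above.
DEFAULT_R_RANGES = [
    (0x0590, 0x05FF),
    (0x07C0, 0x089F),
    (0xFB1D, 0xFB4F),
    (0x10800, 0x10FFF),
    (0x1E800, 0x1EDFF),
    (0x1EF00, 0x1EFFF),
]


def defaults_to_r(code_point):
    """True if an unassigned code point here would default to class R."""
    return any(lo <= code_point <= hi for lo, hi in DEFAULT_R_RANGES)


if __name__ == "__main__":
    for cp in (0x0860, 0x08FF, 0x10800, 0x30000):
        print(f"U+{cp:04X} defaults to R: {defaults_to_r(cp)}")
```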

markus

From verdy_p at wanadoo.fr  Wed May 11 07:46:10 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 14:46:10 +0200
Subject: Moving The Hebrew Extended Block Into The SMP
In-Reply-To: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
References: <CAPKujtTPk0=isdZdDJqYGhr7mUZpzRxy-uRB45v=r98C4w7nuQ@mail.gmail.com>
Message-ID: <CAGa7JC3hYawnK=D0sntW219K7q+4srf=vT2sJ-r0M=gXMuNVLg@mail.gmail.com>

Indeed, if you need Arabic diacritics on top of Hebrew letters, just
use them. This causes no problem for script breaking, except in strict
security checks for identifiers, where such usage is very unlikely or only
"aspirational".

You could just as well use Latin/generic diacritics if needed, such as a
circumflex or cedilla. You could also use Latin letter-like combining
diacritics, but not the spacing ones, such as superscript o.

Combining characters should not be disunified even if they are used in
several scripts, and even if those scripts have different directions, unless
they behave differently, i.e. when they don't stack properly.

Hebrew diacritics written above or below normally don't stack vertically
but are ordered horizontally; even so, this can be inferred
from the base letter, which determines the effective layout and even the
effective glyph to use for the diacritic (e.g. the cedilla, which
sometimes attaches above left instead of below on Latin letters
that have descenders like "g", or accents added to Greek
letters, which are placed to the left of capital letters instead of above).

Disunification of these diacritics is needed, however, when layouts are
distinguished both visually and semantically (such as the sin vs. shin
dots), and when their normalisation would cause major problems requiring
systematic use of CGJ to block their reordering.

So don't be afraid to use Arabic points or Latin accents: on top of Hebrew
letters they will be interpreted correctly in their Hebrew context, and by
themselves those combining diacritics have no direction (for the Bidi
algorithm, which preserves the combining clusters).
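
Both points can be checked with Python's standard unicodedata module; the
following is only a sketch, and the helper names mark_info and nfd are
made up for it. Combining marks carry the bidi class NSM (nonspacing
mark) and so take their direction from the base letter, and canonical
reordering sorts marks by combining class unless a CGJ (U+034F) sits
between them.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT
DAMMA = "\u064F"     # ARABIC DAMMA
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER


def mark_info(ch):
    """Return (bidi class, canonical combining class) of a character."""
    return unicodedata.bidirectional(ch), unicodedata.combining(ch)


def nfd(text):
    """Canonical decomposition, which includes canonical reordering."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    for ch in (SHIN, SHIN_DOT, SIN_DOT, DAMMA):
        print(unicodedata.name(ch), mark_info(ch))
    # Marks are reordered by ascending combining class...
    print([f"U+{ord(c):04X}" for c in nfd(SHIN + DAMMA + SHIN_DOT)])
    # ...unless a CGJ (combining class 0) blocks the reordering.
    print([f"U+{ord(c):04X}" for c in nfd(SHIN + DAMMA + CGJ + SHIN_DOT)])
```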

From verdy_p at wanadoo.fr  Wed May 11 08:01:09 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 15:01:09 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
Message-ID: <CAGa7JC2finrzN+jBtEhaze1iHVcQwukv5S9YewGyYp5Z4tRWUw@mail.gmail.com>

So this assignment does not respect the default RTL property of the range.
That would not be a problem for combining characters, but for LTR base
letters in Malayalam it is a major problem...
Indic scripts are already complex enough without this additional
incompatibility, which will work against experimentation and effective use
later.

The UTC should reconsider its beta allocation before the approval by ISO.

The SMP is not a problem: there are already several Indic scripts in
the SMP, which also borrow some Devanagari non-letter signs such as
punctuation without re-encoding them.

We'll also have Latin extensions in the SMP, just as there are
ideographic extensions outside the BMP.

It's more important to preserve the default properties for compatibility.

From doug at ewellic.org  Wed May 11 09:40:41 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 11 May 2016 07:40:41 -0700
Subject: The Hebrew Extended (Proposed) Block
Message-ID: <20160511074041.665a7a7059d7ee80bb4d670165c8327d.08c76f277a.wbe@email03.godaddy.com>

Robert Wheelock wrote:

> BAD NEWS! (CRUCIALLY IMPORTANT): The Unicode Consortium has assigned
> OTHER characters into the U+00860-U+008FF areas in the BMP of
> Unicode: Malayalam extended additional characters for Garshuni, and
> more additional Arabic characters.

Philippe Verdy replied:

> So this assignment does not respect the default RTL property of the
> range. That would not be a problem for combining characters, but for
> LTR base letters in Malayalam it is a major problem...

The characters proposed for U+0860 through U+086A are Syriac letters
used for writing the Malayalam language. Pandey's proposal suggests they
should have General Category AL, like other Syriac letters. There is no
conflict in assigning these to a range designated for RTL scripts.

--
Doug Ewell | http://ewellic.org | Thornton, CO


From doug at ewellic.org  Wed May 11 09:47:01 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 11 May 2016 07:47:01 -0700
Subject: The Hebrew Extended (Proposed) Block
Message-ID: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>

I wrote:

> Pandey's proposal suggests they
> should have General Category AL, like other Syriac letters.

AL is a bidi type, not a General Category. Still.

http://www.unicode.org/reports/tr9/#AL
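
The two properties are easy to tell apart with Python's standard
unicodedata module. In this sketch (the helper name properties is made
up), an existing Syriac letter stands in for the proposed ones, which
were not yet assigned at the time of writing: its General Category is Lo
and its bidi class is AL, whereas a Malayalam letter is also Lo but has
bidi class L.

```python
import unicodedata


def properties(ch):
    """Return (General Category, bidi class) for one character."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)


if __name__ == "__main__":
    samples = {
        "SYRIAC LETTER ALAPH": "\u0710",
        "HEBREW LETTER ALEF": "\u05D0",
        "MALAYALAM LETTER A": "\u0D05",
    }
    for name, ch in samples.items():
        print(f"U+{ord(ch):04X} {name}: {properties(ch)}")
```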

--
Doug Ewell | http://ewellic.org | Thornton, CO


From verdy_p at wanadoo.fr  Wed May 11 11:05:04 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 18:05:04 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>
References: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>
Message-ID: <CAGa7JC0A+2Hr1oJnceGB7Mv4B9nQNyXs4x1FDySmoWcHMN7eFQ@mail.gmail.com>

But are these supplemental Malayalam letters borrowed from Syriac really
RTL as in the Syriac script? I have doubts (it would seriously impact
the Malayalam script, which is LTR).

Maybe the letter forms are identical (or similar) but they are changed to
LTR (so the disunification is justified if these are really letters).

Or these are combining diacritics (working within the Indic letter
clusters), i.e. in a "C*" general category but not in a "L*" general
category (in which case they are Bidi neutral and don't really need to be
in the RTL range).

2016-05-11 16:47 GMT+02:00 Doug Ewell <doug at ewellic.org>:

> I wrote:
>
> > Pandey's proposal suggests they
> > should have General Category AL, like other Syriac letters.
>
> AL is a bidi type, not a General Category. Still.
>
> http://www.unicode.org/reports/tr9/#AL
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>

From doug at ewellic.org  Wed May 11 11:24:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 11 May 2016 09:24:29 -0700
Subject: The Hebrew Extended (Proposed) Block
Message-ID: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com>

Philippe Verdy wrote:

> But are these supplemental Malayalam letters borrowed from Syriac
> really RTL as in the Syriac script? I have doubts (it would
> seriously impact the Malayalam script, which is LTR).
>
> Maybe the letter forms are identical (or similar) but they are
> changed to LTR (so the disunification is justified if these are
> really letters).
>
> Or these are combining diacritics (working within the Indic letter
> clusters), i.e. in a "C*" general category but not in a "L*" general
> category (in which case they are Bidi neutral and don't really need to
> be in the RTL range).

It might help to read the proposal:

http://www.unicode.org/L2/L2015/15088-syriac-malayalam.pdf

--
Doug Ewell | http://ewellic.org | Thornton, CO


From frederic.grosshans at gmail.com  Wed May 11 11:32:50 2016
From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=)
Date: Wed, 11 May 2016 18:32:50 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAGa7JC0A+2Hr1oJnceGB7Mv4B9nQNyXs4x1FDySmoWcHMN7eFQ@mail.gmail.com>
References: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>
 <CAGa7JC0A+2Hr1oJnceGB7Mv4B9nQNyXs4x1FDySmoWcHMN7eFQ@mail.gmail.com>
Message-ID: <57335EB2.4090501@gmail.com>

On 11/05/2016 at 18:05, Philippe Verdy wrote:
> But are these supplemental Malayalam letters borrowed from Syriac 
> really RTL like in the Syriac script ? I have doubts (it would 
> seriously impact the Malayalam script which is LTR).
Since these characters are used to write the Malayalam *language* in the 
Syriac *script*, the borrowing is the other way around: they are 
essentially Malayalam (script) characters borrowed into Syriac.

Fig 17 of http://www.unicode.org/L2/L2015/15156-syriac-malayalam.pdf 
shows an example of the look of this text.


From verdy_p at wanadoo.fr  Wed May 11 12:07:54 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 19:07:54 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com>
References: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com>
Message-ID: <CAGa7JC0Y-tCRUQHCxFbqF8aZZm++zLL8gRerPkFG8DDZybaakg@mail.gmail.com>

2016-05-11 18:24 GMT+02:00 Doug Ewell <doug at ewellic.org>:

> It might help to read the proposal:
>
> http://www.unicode.org/L2/L2015/15088-syriac-malayalam.pdf


Thanks for pointing out this document. Initially I had incorrectly understood
that this was an extension of the Malayalam script.

But it now appears to be an extension of the Syriac script instead (used to
write a variant of the Malayalam language, but fully in the Syriac script
instead of the Malayalam Indic script, so OK, it is fully RTL).

So OK, the assignment in the RTL range (of the BMP) is correct (though it
could still have been in an RTL range of the SMP).

And this is clearly not a duplication of the existing Malayalam letters (due
to the different properties). The encoding is justified.
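
The difference in properties can be checked directly. Here is a minimal
sketch in Python, assuming only the standard unicodedata module; the helper
name bidi_class and the three sample code points are chosen purely for
illustration, and none of them comes from the proposal itself:

```python
import unicodedata


def bidi_class(code_point: int) -> str:
    """Return the Bidi_Class short alias (e.g. 'L', 'R', 'AL') of a code point."""
    return unicodedata.bidirectional(chr(code_point))


# Illustrative samples: one Malayalam letter, one Syriac letter, one Hebrew letter.
samples = {
    0x0D05: "MALAYALAM LETTER A",
    0x0710: "SYRIAC LETTER ALAPH",
    0x05D0: "HEBREW LETTER ALEF",
}

for cp, label in samples.items():
    print(f"U+{cp:04X} {label}: Bidi_Class={bidi_class(cp)}")
```

Existing Malayalam letters report L while Syriac letters report AL, which is
the kind of property difference that rules out unifying the two.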

I apologize.

Note: where is the ISO form containing the formal summary of
characteristics and justifications (the list of questions and checkboxes) ?

From petercon at microsoft.com  Wed May 11 18:40:32 2016
From: petercon at microsoft.com (Peter Constable)
Date: Wed, 11 May 2016 23:40:32 +0000
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
Message-ID: <SN1PR0301MB19662A63D508C25FF4F9CA77D5720@SN1PR0301MB1966.namprd03.prod.outlook.com>

Robert, your statement seems to have an implicit assumption that the range 0860..08FF has somehow been reserved for Hebrew. That is not the case. As Markus referenced elsewhere, people can refer to the Roadmap charts to see what is tentatively planned for a given range:

http://unicode.org/roadmaps/bmp/

If you or others are working on or considering working on a proposal for additional Hebrew characters, you should not make any firm assumptions about code point assignments until some indication of suitable ranges has been given by the Unicode Technical Committee and added to the Roadmap.


Peter


From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Robert Wheelock
Sent: Tuesday, May 10, 2016 4:55 PM
To: unicode at unicode.org
Subject: RE: The Hebrew Extended (Proposed) Block

Hello again, y'all!

BAD NEWS! (CRUCIALLY IMPORTANT):  The Unicode Consortium has assigned OTHER characters into the U+0860-U+08FF area in the BMP of Unicode: Malayalam extended additional characters for Garshuni, and more additional Arabic characters.

We'll need to find a DIFFERENT subblock to plant down our Hebrew extended characters...  either somewhere ELSE within the BMP, or somewhere within either SMP areas 1 or 2.
It'll be the same arrangement originally planned for the U+0860 area, but relocated and expanded upon!

* Additional characters for correct typesetting of Hebrew
* Hebrew Palestinian vowel and pronunciation points
* The small superscript signs śin and shin for the letter shin
* Hebrew Palestinian cantillation
* Hebrew Babylonian vowel and pronunciation points
* Hebrew Babylonian cantillation
* Hebrew Samaritan vowel and pronunciation points
* Additional Hebrew characters for other Jewish languages

A new TXT listing of this subblock (with the new CORRECT location) will be forthcoming.  STAY TUNED!



From ori at avtalion.name  Fri May 13 12:31:35 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Fri, 13 May 2016 20:31:35 +0300
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
Message-ID: <CALgdb5KDjswXSWZrj5VrC3pLDyik+jHPoNdXcbO=YNQzmf=GzQ@mail.gmail.com>

On Wed, May 11, 2016 at 2:55 AM, Robert Wheelock <rwhlk142 at gmail.com> wrote:
> * Additional characters for correct typesetting of Hebrew
Will this include BROKEN VAV?
http://www.sofer.co.uk/html/broken_vav.html

> * Additional Hebrew characters for other Jewish languages
Can you please provide some examples?

Any plans for Rashi Script? It doesn't seem to fit any of the
categories you listed. Arguably, it's just a font, but there's
precedent in Unicode :)
https://en.wikipedia.org/wiki/Rashi_script


From everson at evertype.com  Fri May 13 12:59:22 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 13 May 2016 18:59:22 +0100
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <CALgdb5KDjswXSWZrj5VrC3pLDyik+jHPoNdXcbO=YNQzmf=GzQ@mail.gmail.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
 <CALgdb5KDjswXSWZrj5VrC3pLDyik+jHPoNdXcbO=YNQzmf=GzQ@mail.gmail.com>
Message-ID: <DFE8EDBC-E250-4906-AC68-8DE18D81F825@evertype.com>

On 13 May 2016, at 18:31, Ori Avtalion <ori at avtalion.name> wrote:

> Any plans for Rashi Script? It doesn't seem to fit any of the
> categories you listed. Arguably, it's just a font, but there's
> precedent in Unicode :)

Not good precedent, I think. Rashi would be best considered like Fraktur and Latin.

Michael Everson

From jonathan.rosenne at gmail.com  Fri May 13 14:10:15 2016
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Fri, 13 May 2016 22:10:15 +0300
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <DFE8EDBC-E250-4906-AC68-8DE18D81F825@evertype.com>
References: <CAPKujtTnwWfw+MM58v2mZUzyfEcegz7fzTdwW0e4nfL-CnE2cg@mail.gmail.com>
 <CALgdb5KDjswXSWZrj5VrC3pLDyik+jHPoNdXcbO=YNQzmf=GzQ@mail.gmail.com>
 <DFE8EDBC-E250-4906-AC68-8DE18D81F825@evertype.com>
Message-ID: <000001d1ad4b$15ddb9a0$41992ce0$@gmail.com>

Rashi is a font, not a script. It has a one-to-one correspondence with
standard Hebrew.

Best Regards,

Jonathan Rosenne

054-4246522
-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael
Everson
Sent: Friday, May 13, 2016 8:59 PM
To: unicode at unicode.org
Subject: Re: The Hebrew Extended (Proposed) Block

On 13 May 2016, at 18:31, Ori Avtalion <ori at avtalion.name> wrote:

> Any plans for Rashi Script? It doesn't seem to fit any of the 
> categories you listed. Arguably, it's just a font, but there's 
> precedent in Unicode :)

Not good precedent, I think. Rashi would be best considered like Fraktur and
Latin.

Michael Everson


From jameskasskrv at gmail.com  Sat May 14 15:41:11 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 14 May 2016 12:41:11 -0800
Subject: Klingon text in legal brief
In-Reply-To: <CALBHtZz_gtCuyMpuhPk591-UgHZFUJJyWgaND8k4ai3u0KOQtg@mail.gmail.com>
References: <CALBHtZz_gtCuyMpuhPk591-UgHZFUJJyWgaND8k4ai3u0KOQtg@mail.gmail.com>
Message-ID: <CABPY6Z21xSFmskeK+29MYtNDFVgWXKxdcH5LHHKoqcoY6EO7fg@mail.gmail.com>

As a certain character from TOS would say, "fascinating".

It's surprising that nobody commented on Ken Shirriff's post.  If this had
been posted ten years ago it probably would have generated more activity on
this list.

Best regards,

James Kass

On Thu, Apr 28, 2016 at 7:49 AM, Ken Shirriff <ken.shirriff at gmail.com>
wrote:

> Since encoding Klingon in Unicode comes up occasionally, you might be
> amused to see a legal brief that was written partly in Klingon:
> https://drive.google.com/file/d/0BzmetJxi-p0VM19nbUpyNXE0a28/view
>
> Details are here: http://conlang.org/axanar/
>
> Ken
>
>
>

From everson at evertype.com  Sat May 14 18:29:18 2016
From: everson at evertype.com (Michael Everson)
Date: Sun, 15 May 2016 00:29:18 +0100
Subject: Klingon text in legal brief
In-Reply-To: <CABPY6Z21xSFmskeK+29MYtNDFVgWXKxdcH5LHHKoqcoY6EO7fg@mail.gmail.com>
References: <CALBHtZz_gtCuyMpuhPk591-UgHZFUJJyWgaND8k4ai3u0KOQtg@mail.gmail.com>
 <CABPY6Z21xSFmskeK+29MYtNDFVgWXKxdcH5LHHKoqcoY6EO7fg@mail.gmail.com>
Message-ID: <A2211797-6520-47F4-A234-43A36E7685D4@evertype.com>

One keeps one's cards to one's chest.

> On 14 May 2016, at 21:41, James Kass <jameskasskrv at gmail.com> wrote:
> 
> 
> As a certain character from TOS would say, "fascinating".
> 
> It's surprising that nobody commented on Ken Shirriff's post.  If this had been posted ten years ago it probably would have generated more activity on this list.
> 
> Best regards,
> 
> James Kass
> 
> 
> On Thu, Apr 28, 2016 at 7:49 AM, Ken Shirriff <ken.shirriff at gmail.com> wrote:
> Since encoding Klingon in Unicode comes up occasionally, you might be amused to see a legal brief that was written partly in Klingon: https://drive.google.com/file/d/0BzmetJxi-p0VM19nbUpyNXE0a28/view
> 
> Details are here: http://conlang.org/axanar/
> 
> Ken
> 
> 
> 


From jameskasskrv at gmail.com  Sat May 14 21:46:44 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 14 May 2016 18:46:44 -0800
Subject: Klingon text in legal brief
Message-ID: <CABPY6Z2=u-AXVo2ZbEOftUt3cST9o9DS8fnWM=N-Se-5ZQ00JQ@mail.gmail.com>

Best wishes towards a winning hand.

Best regards,

James Kass

From haberg-1 at telia.com  Sun May 15 13:57:54 2016
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Sun, 15 May 2016 20:57:54 +0200
Subject: Math upright Latin and Greek styles
Message-ID: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>

Are there any plans to add math upright Latin and Greek styles, in order to distinguish them from regular (non-math) Latin and Greek? In programs like TeX, the latter are normally used for italics, so it means that there is a conflict with using them for upright.


From haberg-1 at telia.com  Sun May 15 16:47:03 2016
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Sun, 15 May 2016 23:47:03 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
Message-ID: <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>


> On 15 May 2016, at 23:19, Murray Sargent <murrays at exchange.microsoft.com> wrote:
> 
> Hans Åberg asked, "Are there any plans to add math upright Latin and Greek styles, in order to distinguish them from regular (non-math) Latin and Greek? In programs like TeX, the latter are normally used for italics, so it means that there is a conflict with using them for upright".
>  
> Math upright Latin is unified with the ASCII alphabetics and math upright Greek is unified with Unicode Greek letters in the U+0390 block. TeX and MathML upright Latin and upright lower-case Greek letters are converted to math italic by default. In the Linear Format, upright letters are enclosed in quotes and marked as "ordinary text". In Microsoft Word and other Microsoft Office apps, you can control math italicization in math zones using the italics hot key Ctrl+I and other italic formatting tools.
>  
> There is ambiguity as to whether a span of upright ASCII alphabetics is a function name or a product or a combination of the two. Such ambiguities are rare since spans of upright ASCII alphabetics are usually words or abbreviations of some kind such as function names. Individual upright letters can be distinguished as individual variables if desired by inserting appropriate invisible times (U+2062) characters.
>  
> We are thinking about adding other math alphabets as discussed in the post Unicode Math Calligraphic Alphabets. Comments are welcome.

The question arose on the ConTeXt mailing list [1]. Changing Basic Latin and Greek to upright does not seem practical, due to legacy and lack of efficient input methods. So the idea came up to have these reserved for text and computer input, while a specific math upright style would be used when wanting to indicate that.

1. https://mailman.ntg.nl/pipermail/ntg-context/2016/085523.html
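
Murray's remark about separating individual upright letters with invisible
times (U+2062) can be made concrete. Here is a minimal sketch in Python using
only built-in string operations; the function name as_product is hypothetical
and serves only as an illustration, not as anything Word or TeX provides:

```python
INVISIBLE_TIMES = "\u2062"  # U+2062 INVISIBLE TIMES


def as_product(variables: str) -> str:
    """Join single-letter variables with U+2062 so they read as a product, not a word."""
    return INVISIBLE_TIMES.join(variables)


marked = as_product("ab")
# Print the code points, since the separator itself has no visible glyph.
print(" ".join(f"U+{ord(c):04X}" for c in marked))
```

The span "ab" then carries an explicit marker between the two letters, so a
consumer of the text can tell the product a times b from a two-letter name.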


From haberg-1 at telia.com  Sun May 15 17:25:51 2016
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Mon, 16 May 2016 00:25:51 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
 <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>
 <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
Message-ID: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>


> On 16 May 2016, at 00:05, Murray Sargent <murrays at exchange.microsoft.com> wrote:
> 
> Hans Åberg mentioned "Changing Basic Latin and Greek to upright does not seem practical, due to legacy and lack of efficient input methods."
> 
> Have to say that it's really easy for the user to switch between math upright, italic, bold, and bold italic letters in Microsoft Word by just using the usual hot keys as discussed in 
> 
> https://blogs.msdn.microsoft.com/murrays/2007/05/30/using-math-italic-and-bold-in-word-2007/. 
> 
> This capability has been shipping for over 10 years now. But admittedly implementing such input functionality is a little tricky since the alphanumerics need to be converted to the desired Unicode Math Alphanumerics.

I am not familiar with the product, so it is unclear to me whether it produces a UTF-8 text file with the correct Unicode code points, as is a requirement for the LuaTeX engine that ConTeXt defaults to. One can design a new key map on OS X that selects the correct Unicode code points, but that is a huge task, given the large number of math symbols.

The legacy issue is that there are already loads of TeX code that translates the Basic Latin into Unicode math italic style. So it is hard to break the habit, and old code cannot readily be reused.

And one can ignore the problem altogether, and use the traditional TeX backslash "\..." commands, but using Unicode helps the readability of the source code. This is even more so in the case of theorem proof assistants.
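
The translation from Basic Latin to Unicode math italic mentioned above is
mechanical. The following is a minimal sketch in Python, not the code of any
TeX macro package; the function name to_math_italic is hypothetical. It
relies on the Mathematical Alphanumeric Symbols block, where the italic
capitals start at U+1D434 and the italic small letters at U+1D44E, with the
italic small h encoded separately as U+210E PLANCK CONSTANT:

```python
import unicodedata


def to_math_italic(text: str) -> str:
    """Map ASCII letters to Unicode math italic letters; leave everything else alone."""
    out = []
    for ch in text:
        if ch == "h":
            # U+1D455 is reserved; the italic h was already encoded as U+210E.
            out.append("\u210E")
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


for ch in to_math_italic("xhA"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The reverse direction (math italic back to Basic Latin) would need the same
table inverted, which is part of why old sources cannot simply be reused.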


From verdy_p at wanadoo.fr  Sun May 15 20:30:42 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 May 2016 03:30:42 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
 <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>
 <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
 <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
Message-ID: <CAGa7JC21Cyr=43qe5wy+d6xdHp7_8auOAhKajLUALc4fNTNAtw@mail.gmail.com>

Isn't it specified in TeX using a font selection package instead of the
default one? Also, the only upright letters I saw were for inserting normal
text (not mathematical symbols) or comments/descriptions, or when using the
standardized "monospace" or "serif" fonts (which are not italic by default).

2016-05-16 0:25 GMT+02:00 Hans Åberg <haberg-1 at telia.com>:

>
> > On 16 May 2016, at 00:05, Murray Sargent <murrays at exchange.microsoft.com>
> wrote:
> >
> > Hans Åberg mentioned "Changing Basic Latin and Greek to upright does not
> seem practical, due to legacy and lack of efficient input methods."
> >
> > Have to say that it's really easy for the user to switch between math
> upright, italic, bold, and bold italic letters in Microsoft Word by just
> using the usual hot keys as discussed in
> >
> >
> https://blogs.msdn.microsoft.com/murrays/2007/05/30/using-math-italic-and-bold-in-word-2007/
> .
> >
> > This capability has been shipping for over 10 years now. But admittedly
> implementing such input functionality is a little tricky since the
> alphanumerics need to be converted to the desired Unicode Math
> Alphanumerics.
>
> I am not familiar with the product, so it is unclear to me whether it
> produces a UTF-8 text file with the correct Unicode code points, as is a
> requirement for the LuaTeX engine that ConTeXt defaults to. One can design
> a new key map on OS X that selects the correct Unicode code points, but
> that is a huge task, given the large number of math symbols.
>
> The legacy issue is that there are already loads of TeX code that
> translates the Basic Latin into Unicode math italic style. So it is hard to
> break the habit, and old code cannot readily be reused.
>
> And one can ignore the problem altogether, and use the traditional TeX
> backslash "\..." commands, but using Unicode helps the readability of the
> source code. This is even more so in the case of theorem proof assistants.
>
>
>

From haberg-1 at telia.com  Mon May 16 03:05:11 2016
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Mon, 16 May 2016 10:05:11 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <CAGa7JC21Cyr=43qe5wy+d6xdHp7_8auOAhKajLUALc4fNTNAtw@mail.gmail.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
 <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>
 <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
 <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
 <CAGa7JC21Cyr=43qe5wy+d6xdHp7_8auOAhKajLUALc4fNTNAtw@mail.gmail.com>
Message-ID: <A9A43821-413D-45C7-8B8C-153667B1C107@telia.com>


> On 16 May 2016, at 03:30, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> Isn't it specified in TeX using a font selection package instead of the default one? Also, the only upright letters I saw were for inserting normal text (not mathematical symbols) or comments/descriptions, or when using the standardized "monospace" or "serif" fonts (which are not italic by default).

Most use a macro package like ConTeXt, which is more recent and modern than LaTeX, and it is not difficult to change so that the Basic Latin produces math upright style. But legacy is that it is used for math italic, and it is hard to change that legacy.


From verdy_p at wanadoo.fr  Mon May 16 11:56:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 May 2016 18:56:23 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <A9A43821-413D-45C7-8B8C-153667B1C107@telia.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
 <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>
 <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
 <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
 <CAGa7JC21Cyr=43qe5wy+d6xdHp7_8auOAhKajLUALc4fNTNAtw@mail.gmail.com>
 <A9A43821-413D-45C7-8B8C-153667B1C107@telia.com>
Message-ID: <CAGa7JC2mmxtDDDgQiyF1TZOZfR6xj1zqZ=TKtnuD=ug4iovWnA@mail.gmail.com>

I do not advocate changing that, but these legacy *TeX variants have their
own builtin sets of supported fonts with their implicit style and use them
with the normal letters, just like what is done in HTML when you apply an
italic style. As these *TeX variants exist this way, they don't need these
additions, which will be needed only by newer *TeX variants that do not use
explicit font variants in their encoding, but directly use new distinguished
code points (without explicit font style tagging).

There are now many *TeX variants each one having its own local assumptions
about the default styles (and layouts) they will apply. If you want to
convert any one of them to HTML (or similar rich-text format), you always
need to know how these *TeX variants have been "profiled": you cannot
simply use the same conversion rules for all *TeX.

Now, if new upright maths characters are added, this will just add new
complications in the rules used by these converters, with little benefit.
The benefit will be visible only when converting to plain-text only (but
such conversion is already defective in many aspects, as the maths layout
is not representable directly without adding additional notations such as
parentheses or some "\"-escaped notations: such conversion to plain-text is
in fact, most often, keeping the original *TeX syntax/notation if they want
to "preserve" the original semantics)

2016-05-16 10:05 GMT+02:00 Hans Åberg <haberg-1 at telia.com>:

>
> > On 16 May 2016, at 03:30, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> >
> > Isn't it specified in TeX using a font selection package instead of the
> default one? Also, the only upright letters I saw were for inserting normal
> text (not mathematical symbols) or comments/descriptions, or when using the
> standardized "monospace" or "serif" fonts (which are not italic by default).
>
> Most use a macro package like ConTeXt, which is more recent and modern
> than LaTeX, and it is not difficult to change so that the Basic Latin
> produces math upright style. But legacy is that it is used for math italic,
> and it is hard to change that legacy.
>
>
>

From haberg-1 at telia.com  Mon May 16 12:02:38 2016
From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=)
Date: Mon, 16 May 2016 19:02:38 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <CAGa7JC2mmxtDDDgQiyF1TZOZfR6xj1zqZ=TKtnuD=ug4iovWnA@mail.gmail.com>
References: <AC07F0F8-7C48-4989-B221-B79B15FA53F0@telia.com>
 <b5dd94a6ccdc4decac024409b06e634d@exchange.microsoft.com>
 <AC8157A6-9A33-4B79-B89C-55FB183F7F6C@telia.com>
 <e596ccea73ea468dbc9099a306cb4132@exchange.microsoft.com>
 <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
 <CAGa7JC21Cyr=43qe5wy+d6xdHp7_8auOAhKajLUALc4fNTNAtw@mail.gmail.com>
 <A9A43821-413D-45C7-8B8C-153667B1C107@telia.com>
 <CAGa7JC2mmxtDDDgQiyF1TZOZfR6xj1zqZ=TKtnuD=ug4iovWnA@mail.gmail.com>
Message-ID: <C52F4344-02B3-4BAC-8991-BE7212359AE2@telia.com>


> On 16 May 2016, at 18:56, Philippe Verdy <verdy_p at wanadoo.fr> wrote:
> 
> I do not advocate changing that, but these legacy *TeX variants have their own builtin sets of supported fonts with their implicit style and use them with the normal letters, just like what is done in HTML when you apply an italic style. As these *TeX variants exist this way, they don't need these additions, which will be needed only by newer *TeX variants that do not use explicit font variants in their encoding, but directly use new distinguished code points (without explicit font style tagging).

The ConTeXt macro package default engine is LuaTeX, which uses UTF-8 for text files and UTF-32 internally, and combines the effort of several of those other, older versions. Then one can use the STIX fonts (or XITS) which are Unicode.
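
To illustrate the two encoding forms mentioned above, here is a minimal
sketch in Python of how one math letter looks as UTF-8 bytes (the file form)
and as a single UTF-32 code unit. It says nothing about how LuaTeX itself
stores text; the function name encoding_forms is hypothetical:

```python
def encoding_forms(ch: str) -> tuple:
    """Return the UTF-8 and big-endian UTF-32 byte sequences for one character."""
    return ch.encode("utf-8"), ch.encode("utf-32-be")


# U+1D465 MATHEMATICAL ITALIC SMALL X, a character outside the BMP.
utf8, utf32 = encoding_forms("\U0001D465")
print("UTF-8: ", utf8.hex(" "))   # four bytes in the text file
print("UTF-32:", utf32.hex(" "))  # one 32-bit unit holding the code point
```

The UTF-32 form is simply the code point value padded to four bytes, whereas
the UTF-8 form spreads the same value over a lead byte and three continuation
bytes.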


From ori at avtalion.name  Thu May 19 10:53:46 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Thu, 19 May 2016 18:53:46 +0300
Subject: Broken link on 2016 Document Register
Message-ID: <CALgdb5JHFg=VsZt43B-d8bWNhi-TcLbq-TXoK28shgX_Q_F84A@mail.gmail.com>

On the page:
  http://www.unicode.org/L2/L-curdoc.htm

The link for "L2/16-164" points at:
  http://www.unicode.org/L2/L2016/

when it should point at:
  http://www.unicode.org/L2/L2016/16164-ucas-font-support.pdf

From davidj_faulks at yahoo.ca  Thu May 19 13:06:05 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Thu, 19 May 2016 18:06:05 +0000 (UTC)
Subject: Proposal not reviewed, what to do?
References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com>

Hello,

Although I am glad that most of my recent proposals have been accepted, it does seem that one of them:
http://www.unicode.org/L2/L2016/16080-add-astrology.pdf
was not reviewed at the recent UTC meeting.

I'm feeling a bit unsure of what to make of that, especially since that was the proposal I was most unsure about. The SEI recommended that some of the characters proposed there be accepted and others not.

Do I do nothing, and wait until after the next UTC meeting? Should I try to submit a revised proposal? Should I assume some characters are likely to be encoded and only concentrate on others?

I would like any feedback.

David Faulks


From dwanders at sonic.net  Thu May 19 13:23:06 2016
From: dwanders at sonic.net (Deborah W. Anderson)
Date: Thu, 19 May 2016 11:23:06 -0700
Subject: Proposal not reviewed, what to do?
In-Reply-To: <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com>
References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com>
 <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <007201d1b1fb$7d5539a0$77fface0$@sonic.net>

Hi David,
I was present last week, and can relate the outcome. We ran short on time at the UTC, so L2/16-080 was postponed until the next meeting. What would be helpful, I think, would be to take on board the comments from http://www.unicode.org/L2/L2016/16156-script-recs.pdf and revise your doc accordingly (i.e., include the ones recommended for encoding, and, if you can, see if you can provide additional information on others).
With best wishes,
Debbie Anderson


-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of David Faulks
Sent: Thursday, May 19, 2016 11:06 AM
To: Unicode Mailing List <unicode at unicode.org>
Subject: Proposal not reviewed, what to do?

Hello,

Although I am glad that most of my recent proposals have been accepted, it does seem that one of them:
http://www.unicode.org/L2/L2016/16080-add-astrology.pdf
was not reviewed at the recent UTC meeting.

I'm feeling a bit unsure of what to make of that, especially since that was the proposal I was most unsure about. The SEI recommended that some of the characters proposed there be accepted and others not.

Do I do nothing, and wait until after the next UTC meeting? Should I try to submit a revised proposal? Should I assume some characters are likely to be encoded and only concentrate on others?

I would like any feedback.

David Faulks


From verdy_p at wanadoo.fr  Thu May 19 14:13:42 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 19 May 2016 21:13:42 +0200
Subject: Proposal not reviewed, what to do?
In-Reply-To: <007201d1b1fb$7d5539a0$77fface0$@sonic.net>
References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com>
 <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com>
 <007201d1b1fb$7d5539a0$77fface0$@sonic.net>
Message-ID: <CAGa7JC0Y2sE14Y6oR6Br3Sd=MYPO+XJsAKfMQejqXGDQA7ah9A@mail.gmail.com>

Why would those extra punctuation marks need a separate allocation?
Couldn't they be encoded as *variants* of existing punctuation marks (i.e.
the existing standard punctuation followed by a VS)?

I think they are exactly in the scope of encoding of variants (even if most
encoded variants are for the Sinographic scripts, there should not be any
prohibition for them in the Latin script).

Remarks:

- For the EXCLAMATIVUS PUNCTUS (from which the current "!" character
derives directly), the alternative encoding would be to use the standard
exclamation mark "!" followed by either a combining dot below (but this dot
would be too low, under the baseline), or a (more appropriate) combining
middle dot (note how this middle dot combines specially with the Latin
letter l to appear on the right of the ascender, rather than over it, and
for the capital it fits in the middle of the gap left by the lower right
leg: this is already handled as exception pairs in fonts for Catalan and a
few other languages; we also already have examples of punctuation used
with diacritics such as the macron).

- By contrast, the two variants of "colon" with sideways comma could
in fact simply be a pair of characters (the standard colon or semicolon
followed by the character for the sideways comma), without needing any VS.
The sideways comma is not really a variant, as it has its own spacing glyph
and does not really attach to the colon or semicolon on the left; such a
combination is akin to other combinations of punctuation signs (such as "::"
or "!?" or ":-" or "--"). I don't think there is a case for encoding the
sideways comma as a diacritic. If there are cases where the two characters
need to be ligated, we could bind them with a joiner control in the middle.


2016-05-19 20:23 GMT+02:00 Deborah W. Anderson <dwanders at sonic.net>:

> Hi David,
> I was present last week, and can relate the outcome. We ran short on time
> at the UTC, so L2/16-080 was postponed until the next meeting. What would
> be helpful, I think, would be to take on board the comments from
> http://www.unicode.org/L2/L2016/16156-script-recs.pdf and revise your doc
> accordingly (i.e., include the ones recommended for encoding, and, if you
> can, see if you can provide additional information on others).
>

From doug at ewellic.org  Wed May 25 10:27:49 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 25 May 2016 08:27:49 -0700
Subject: Emoji for subdivision flags
Message-ID: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>

Now that UTR #52 has been suspended, are any *specific* alternative
plans for representing subdivision flags being bandied about?

--
Doug Ewell | http://ewellic.org | Thornton, CO


From petercon at microsoft.com  Wed May 25 13:28:23 2016
From: petercon at microsoft.com (Peter Constable)
Date: Wed, 25 May 2016 18:28:23 +0000
Subject: Emoji for subdivision flags
In-Reply-To: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>
References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>
Message-ID: <SN1PR0301MB196699D2C18A43BA5D65DB2AD5400@SN1PR0301MB1966.namprd03.prod.outlook.com>

Nothing discussed at this point. The highest priority item that UTR #52 might have covered is female emoji, and that's where the main emoji attention is at present.

After all, there's only so much attention we should be spending on emoji, right? ;-)


Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell
Sent: Wednesday, May 25, 2016 8:28 AM
To: Unicode Mailing List <unicode at unicode.org>
Subject: Emoji for subdivision flags

Now that UTR #52 has been suspended, are any *specific* alternative plans for representing subdivision flags being bandied about?

--
Doug Ewell | http://ewellic.org | Thornton, CO


From doug at ewellic.org  Wed May 25 13:55:50 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 25 May 2016 11:55:50 -0700
Subject: Emoji for subdivision flags
Message-ID: <20160525115550.665a7a7059d7ee80bb4d670165c8327d.90f758a44f.wbe@email03.godaddy.com>

Peter Constable wrote:

> After all, there's only so much attention we should be spending on
> emoji, right? ;-)

But my expectations have been exceeded so many times before...

I remember when flags were considered the #1 use case for these
extensions, at least among those publicly discussed. That was a year ago
and I guess that's a long time.

--
Doug Ewell | http://ewellic.org | Thornton, CO


From public at khwilliamson.com  Wed May 25 19:34:57 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Wed, 25 May 2016 18:34:57 -0600
Subject: Emoji for subdivision flags
In-Reply-To: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>
References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>
Message-ID: <574644B1.8070609@khwilliamson.com>

On 05/25/2016 09:27 AM, Doug Ewell wrote:
> Now that UTR #52 has been suspended, are any *specific* alternative
> plans for representing subdivision flags being bandied about?
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
>

What I'd like to know is how does one find out about such decisions in a 
timely manner?


From petercon at microsoft.com  Wed May 25 22:47:36 2016
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 26 May 2016 03:47:36 +0000
Subject: Emoji for subdivision flags
In-Reply-To: <574644B1.8070609@khwilliamson.com>
References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com>
 <574644B1.8070609@khwilliamson.com>
Message-ID: <SN1PR0301MB196685DD7BEE8525B4C3FAB5D5410@SN1PR0301MB1966.namprd03.prod.outlook.com>

Watch for UTC minutes to be posted?


Peter

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Karl Williamson
Sent: Wednesday, May 25, 2016 5:35 PM
To: Doug Ewell <doug at ewellic.org>; Unicode Mailing List <unicode at unicode.org>
Subject: Re: Emoji for subdivision flags

On 05/25/2016 09:27 AM, Doug Ewell wrote:
> Now that UTR #52 has been suspended, are any *specific* alternative 
> plans for representing subdivision flags being bandied about?
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
>

What I'd like to know is how does one find out about such decisions in a timely manner?


From mathias at qiwi.be  Thu May 26 03:17:02 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 26 May 2016 10:17:02 +0200
Subject: Canonical block names: spaces vs. underscores
Message-ID: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>

`Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.

However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.

Which is it?

If proper canonical block names use spaces instead of underscores, why doesn't `PropertyValueAliases.txt` reflect that?
If proper canonical block names use underscores instead of spaces, why doesn't `Blocks.txt` reflect that?


From mathias at qiwi.be  Thu May 26 08:44:51 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 26 May 2016 15:44:51 +0200
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
Message-ID: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>


> On 26 May 2016, at 10:17, Mathias Bynens <mathias at qiwi.be> wrote:
> 
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
> 
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> 
> Which is it?
> 
> If proper canonical block names use spaces instead of underscores, why doesn't `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn't `Blocks.txt` reflect that?
> 

Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in addition to the underscores, the case of the `A` changed as well. Which is the canonical name?

The same goes for other blocks with "and" in the name, e.g. `Miscellaneous Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.

From doug at ewellic.org  Thu May 26 10:43:35 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 May 2016 08:43:35 -0700
Subject: Emoji for subdivision flags
Message-ID: <20160526084335.665a7a7059d7ee80bb4d670165c8327d.e24d72e063.wbe@email03.godaddy.com>

Peter Constable replied to Karl Williamson:

>>> Now that UTR #52 has been suspended, are any *specific* alternative
>>> plans for representing subdivision flags being bandied about?
>>
>> What I'd like to know is how does one find out about such decisions
>> in a timely manner?
>
> Watch for UTC minutes to be posted?

Apparently the key is to look at this list [1], which is up to date, and
not this one [2], which isn't.

The relevant minutes are at [3]. Search for "Issue 321" and in
particular look through the review comments at [4] to find out what
happened to the original scope and intent of PDUTS #52.

[1] http://www.unicode.org/L2/meetings/utc-meetings.html
[2] http://www.unicode.org/consortium/utc-minutes.html
[3] http://www.unicode.org/L2/L2016/16121.htm
[4] http://www.unicode.org/review/pri321/feedback.html

--
Doug Ewell | http://ewellic.org | Thornton, CO


From mark at macchiato.com  Thu May 26 10:47:27 2016
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Thu, 26 May 2016 08:47:27 -0700
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
Message-ID: <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>

The canonical property and property value formats are in the *Alias* files.

{phone}
On May 26, 2016 06:57, "Mathias Bynens" <mathias at qiwi.be> wrote:

>
> > On 26 May 2016, at 10:17, Mathias Bynens <mathias at qiwi.be> wrote:
> >
> > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists
> blocks such as `Cyrillic Supplement`.
> >
> > However, `PropertyValueAliases.txt` (
> http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to
> this block as `Cyrillic_Supplement`, with an underscore instead of a space.
> >
> > Which is it?
> >
> > If proper canonical block names use spaces instead of underscores, why
> doesn't `PropertyValueAliases.txt` reflect that?
> > If proper canonical block names use underscores instead of spaces, why
> doesn't `Blocks.txt` reflect that?
> >
>
> Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas
> `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in
> addition to the underscores, the case of the `A` changed as well. Which is
> the canonical name?
>
> The same goes for other blocks with "and" in the name, e.g. `Miscellaneous
> Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc.
>

From doug at ewellic.org  Thu May 26 10:56:44 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 May 2016 08:56:44 -0700
Subject: Canonical block names: spaces vs. underscores
Message-ID: <20160526085644.665a7a7059d7ee80bb4d670165c8327d.9e0b0bde9f.wbe@email03.godaddy.com>

Mathias Bynens wrote:

> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists
> blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt`
> (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to
> this block as `Cyrillic_Supplement`, with an underscore instead of a
> space.
>
> Which is it?

It's both:

http://www.unicode.org/reports/tr44/#Matching_Symbolic

--
Doug Ewell | http://ewellic.org | Thornton, CO


From kenwhistler at att.net  Thu May 26 11:03:20 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 26 May 2016 09:03:20 -0700
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
Message-ID: <31a8a43d-90d8-fdd8-ea13-4ecd5974e571@att.net>


On 5/26/2016 1:17 AM, Mathias Bynens wrote:
> `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`.
>
> However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space.
>
> Which is it?
>
> If proper canonical block names

Well, first of all, "canonical block name" is not a defined term in the
standard. Unlike normalization of Unicode strings, there is no
"normalization" of property values that defines a particular form as
*the* canonical form to which other strings normalize.

>   use spaces instead of underscores, why doesn't `PropertyValueAliases.txt` reflect that?
> If proper canonical block names use underscores instead of spaces, why doesn't `Blocks.txt` reflect that?
>
>
>

See the matching rules in UAX #44:

http://www.unicode.org/reports/tr44/#Matching_Rules

and in particular, the matching rule for symbolic values, which applies 
in this case:

http://www.unicode.org/reports/tr44/#UAX44-LM3

For enumerated properties, and especially for catalog properties such as
Block and Script, the value of the property may be multi-word, and the
best form to use in one context might not be exactly (as in binary
string equality exact) the same as in another.

For Blocks.txt, all block names are given with spaces and with the
casing conventions that would be most consistent with returning values
for a block name in an API. The property values used in
PropertyValueAliases.txt, on the other hand, are systematically turned
into forms that are more identifier friendly, as the typical context of
use for those values is in regex expressions and the like.

There are invariant rules in place that guarantee that any new property
values for properties subject to the Loose Matching Rule #3 noted above
are always unique in their namespace, given the application of that
matching rule.
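
As a rough illustration of that matching rule (a sketch only, not text
from UAX #44; the helper name `loose_key` is made up for this example),
both spellings of a block name can be folded to the same comparison key
in Python. The "is" prefix handling here is deliberately naive and is
simply applied to both sides of a comparison:

```python
import re

def loose_key(value: str) -> str:
    """Fold a symbolic property value along the lines of UAX44-LM3."""
    # Ignore case, whitespace, underscores, and hyphens.
    key = re.sub(r"[\s_\-]+", "", value).lower()
    # Ignore an initial prefix string "is" (naive: applied after folding).
    if key.startswith("is"):
        key = key[2:]
    return key

# The Blocks.txt form and the PropertyValueAliases.txt form fold alike.
print(loose_key("Cyrillic Supplement") == loose_key("Cyrillic_Supplement"))
print(loose_key("Superscripts and Subscripts") ==
      loose_key("Superscripts_And_Subscripts"))
```

Under this folding the underscore and the changed case of the `A` in
`Superscripts_And_Subscripts` both disappear before comparison.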

--Ken


From mathias at qiwi.be  Thu May 26 12:05:05 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 26 May 2016 19:05:05 +0200
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
 <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
Message-ID: <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>


> On 26 May 2016, at 17:47, Mark Davis ☕️ <mark at macchiato.com> wrote:
> 
> The canonical property and property value formats are in the *Alias* files.

Thanks for confirming!

Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

> On 26 May 2016, at 18:03, Ken Whistler <kenwhistler at att.net> wrote:
> 
> [...] "canonical block name" is not a defined term in the standard.

I didn't mean to imply it was -- it's just an English word. I meant "canonical" as in "without loose matching applied".

> See the matching rules in UAX #44:
> 
> http://www.unicode.org/reports/tr44/#Matching_Rules
> 
> and in particular, the matching rule for symbolic values, which applies in this case:
> 
> http://www.unicode.org/reports/tr44/#UAX44-LM3

I know about loose matching, having recently implemented it (https://github.com/mathiasbynens/unicode-loose-match).

> For enumerated properties, and especially for catalog properties such as Block and Script,
> the value of the property may be multi-word, and the best form to use in one context might
> not be exactly (as in binary string equality exact) the same as in another.

That makes sense, but shouldn't it be consistent throughout the Unicode database text files?

From kenwhistler at att.net  Thu May 26 13:07:14 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Thu, 26 May 2016 11:07:14 -0700
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
 <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
 <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>
Message-ID: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>


On 5/26/2016 10:05 AM, Mathias Bynens wrote:
>> On 26 May 2016, at 17:47, Mark Davis ☕️ <mark at macchiato.com> wrote:
>>
>> The canonical property and property value formats are in the *Alias* files.
> Thanks for confirming!

Well, not quite... See below.

>
> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

There's always a chance, I guess. But if we did so, we'd end up having
to just invent some other more-or-less ad hoc property:
Block_Name_Usable_For_Display, with the values we already have in the
Blocks.txt file. Or we would have to change the format to include the
block short alias as an additional field in the file, which would have
its own maintenance and consistency issues. Or we would be introducing
a historical inconsistency in the UCD between versions, which would
*complicate* certain other scripts that parse the UCD.

>
>> On 26 May 2016, at 18:03, Ken Whistler <kenwhistler at att.net> wrote:
>>
>> [...] "canonical block name" is not a defined term in the standard.
> I didn't mean to imply it was -- it's just an English word. I meant "canonical" as in "without loose matching applied".

Ah, but "canonical" is a very freighted word in Unicode parlance. There
are 58 instances of the word "canonical" in the current version of
UAX #44, Unicode Character Database. Every one of them is a term of art,
and none of them means what you mean there. ;-)

What are actually in PropertyValueAliases.txt are "preferred aliases"
(one "abbreviated", and one "long"), plus a few "other aliases" for
various compatibility reasons.

UAX #42 follows suit. The block property is represented by the blk
attribute, and the enumerated values of the blk attribute:

http://www.unicode.org/reports/tr42/#w1aac13c13c19b1

use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt.

>
>> For enumerated properties, and especially for catalog properties such as Block and Script,
>> the value of the property may be multi-word, and the best form to use in one context might
>> not be exactly (as in binary string equality exact) the same as in another.
> That makes sense, but shouldn't it be consistent throughout the Unicode database text files?

Well, let's take an example. The entry in Blocks.txt for the Arabic 
Presentation Forms-A block is:

FB50..FDFF; Arabic Presentation Forms-A

The entry for that block in PropertyValueAliases.txt is:

blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A

So then which would it be? Should Blocks.txt be changed to the long 
preferred alias:

FB50..FDFF; Arabic_Presentation_Forms_A

or to the abbreviated preferred alias:

FB50..FDFF; Arabic_PF_A

which would be more consistent with the XML attribute and with most
regex usage? If the latter, you would end up with systematically less
identifiable labels in Blocks.txt, which would make it a bit more
obscure for other uses, and which would also then create ambiguities
about what might be the "best" or "preferred" label for blocks for an
API returning a block name -- which certainly wouldn't be the
abbreviated "preferred alias".

I suppose a proposal to the UTC to further modify the UCD handling of
block names could change this situation. But I'm not convinced that we
shouldn't just leave things as they stand -- for stability. And then
live with the complications required for scripts or other parsing
algorithms that actually need to deal with Blocks.txt to either parse
out block ranges (its main function) or to get usable block names (its
subsidiary function).
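
For what it's worth, the parsing side of that need not be large. The
following Python sketch is an illustration only: the function name
`parse_blocks` and the regular expression are assumptions based on the
`FB50..FDFF; Arabic Presentation Forms-A` line format shown above, not
anything specified by the UCD documentation. It pulls out both the
ranges and the display names:

```python
import re

# One data line looks like: "FB50..FDFF; Arabic Presentation Forms-A"
BLOCK_LINE = re.compile(r"^([0-9A-F]{4,6})\.\.([0-9A-F]{4,6});\s*(.+?)\s*$")

def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for line in text.splitlines():
        # Drop comments (everything after '#') and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        match = BLOCK_LINE.match(line)
        if match is None:
            continue  # skip anything that is not a range line
        start, end, name = match.groups()
        blocks.append((int(start, 16), int(end, 16), name))
    return blocks

sample = """# comment lines and blank lines are skipped

FB50..FDFF; Arabic Presentation Forms-A
"""
print(parse_blocks(sample))
```

The name comes back exactly as written in the file, spaces included, so
any comparison against the alias file still has to go through loose
matching.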

--Ken



From verdy_p at wanadoo.fr  Thu May 26 13:44:55 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 May 2016 20:44:55 +0200
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
 <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
 <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>
 <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>
Message-ID: <CAGa7JC0UaS5NgxOOWVEeiN-CZpann5e5NH4fHM8E66i5B2AYXQ@mail.gmail.com>

2016-05-26 20:07 GMT+02:00 Ken Whistler <kenwhistler at att.net>:

> Well, let's take an example. The entry in Blocks.txt for the Arabic
> Presentation Forms-A block is:
>
> FB50..FDFF; Arabic Presentation Forms-A
>
> The entry for that block in PropertyValueAliases.txt is:
>
> blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ;
> Arabic_Presentation_Forms-A
>
> So then which would it be? Should Blocks.txt be changed to the long
> preferred alias:
>
> FB50..FDFF; Arabic_Presentation_Forms_A
>
> or to the abbreviated preferred alias:
>
> FB50..FDFF; Arabic_PF_A
>

I think that this would break parsers that expect the alias used in
Blocks.txt to be directly "readable" with spaces. My opinion is to keep
Blocks.txt untouched (with spaces), as it has been part of the core
standard for a long time (and in sync with the ISO standard) as the
*normative* block name.

But we could add this normative value (with spaces) into
PropertyValueAliases.txt (that ISO 10646 does not have or need in its
standard):

blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A

The other solution would be to *add* the abbreviated preferred alias in
Blocks.txt:

FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A

But this could break existing Blocks.txt parsers, whereas parsers should
not break when they find new aliases in PropertyValueAliases.txt.

Another solution would be to properly explain that to look up values in
PropertyValueAliases.txt, you can search it by replacing spaces in block
names with underscores, or make sure that underscores and spaces in the
*middle* of values are considered equivalent (so that even if they are
rendered visually, we can also display the listed aliases using spaces
instead of underscores).

However it must be clear that these aliases are case-sensitive by default
("Arabic_Presentation_Forms_A" is not the same as
"Arabic_presentation_forms_A" but is the same as "Arabic Presentation_Forms
A"), unless the block names property is normatively said to be
case-insensitive (in that case the following are also aliases:
"arabic_pf_a", "arabic pf a"). But adding case insensitivity has a cost,
which is much higher than *only* allowing basic replacements of spaces and
underscores (this will work, provided that there are no "special" aliases
starting with underscores, or using pairs of underscores: I doubt ISO will
use pairs of spaces in block names, which are supposed to be trimmed, with
whitespace in the middle compressed).

Removing or replacing the space-separated words in block names in the UCD
would break the compatibility and synchronization with the ISO standard,
which lists them with spaces.

From mathias at qiwi.be  Thu May 26 13:48:48 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Thu, 26 May 2016 20:48:48 +0200
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
 <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
 <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>
 <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>
Message-ID: <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be>


> On 26 May 2016, at 20:07, Ken Whistler <kenwhistler at att.net> wrote:
> 
> Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is:
> 
> FB50..FDFF; Arabic Presentation Forms-A
> 
> The entry for that block in PropertyValueAliases.txt is:
> 
> blk; Arabic_PF_A                      ; Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A
> 
> So then which would it be? Should Blocks.txt be changed to the long preferred alias:
> 
> FB50..FDFF; Arabic_Presentation_Forms_A
> 
> or to the abbreviated preferred alias:
> 
> FB50..FDFF; Arabic_PF_A
> 
> which would be more consistent with the XML attribute and with most regex usage?

This sounds like a strawman argument (?). The long preferred alias definitely seems more suitable for a "canonical" name.

> I suppose a proposal to the UTC to further modify the UCD handling of block names
> could change this situation. But I'm not convinced that we shouldn't just leave
> things as they stand -- for stability. And then live with the complications required
> for scripts or other parsing algorithms that actually need to deal with Blocks.txt to
> either parse out block ranges (its main function) or to get usable block names
> (its subsidiary function).

Perhaps the "Note:" in the commented header in `Blocks.txt` could be extended to point out that the ~~canonical block names~~, nay, ++preferred block aliases++ are listed in `PropertyValueAliases.txt`? That would've been enough to avoid the question that spawned this thread.

From verdy_p at wanadoo.fr  Thu May 26 14:32:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 May 2016 21:32:12 +0200
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be>
References: <D03FDD88-24D2-41AB-981B-F27D76166ED4@qiwi.be>
 <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be>
 <CAJ2xs_Hk3_QviioSg4zxN--L=pqJCbPGh73ZVKMa3XWWStXBBw@mail.gmail.com>
 <D3041ACA-0B7C-45C1-8E36-9E1CDDA5D75D@qiwi.be>
 <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net>
 <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be>
Message-ID: <CAGa7JC1Z_+=HCCz-mvYhOiEok_c3Ynst8-QxjsDRA6g+hFKr8g@mail.gmail.com>

2016-05-26 20:48 GMT+02:00 Mathias Bynens <mathias at qiwi.be>:

>
> > On 26 May 2016, at 20:07, Ken Whistler <kenwhistler at att.net> wrote:
>
> Perhaps the "Note:" in the commented header in `Blocks.txt` could be
> extended to point out that the ~~canonical block names~~, nay, ++preferred
> block aliases++ are listed in `PropertyValueAliases.txt`? That would've
> been enough to avoid the question that spawned this thread.
>

I'd say that the "preferred block aliases" should be stable and always in
the first entry.

And the last entry should be the preferred version for display and
unabbreviated (but not necessarily stable: it may change over time, and
applications are free to use better display names, including translations;
this last entry should be the one best suited to US English in a *technical*
glossary and preferably used in Unicode documentation and proposals, but
may be different for British English, or for vernacular names; for
reference, the 1st entry should not change).

Note also that the 1st entry in property aliases is not necessarily the
most abbreviated one: there may be other aliases in the middle of the list
using shorter names, provided that they don't conflict with others; or
special aliases used for specific lookups matching some pattern with
known prefixes/suffixes (e.g. Hangul syllable types), so that another
specification specific to this usage could simply drop those implied
prefixes/suffixes, using even shorter aliases internally than the listed
aliases.

The rules for looking up aliases in PropertyAliases should be independent
of the property type:
- capitalization should be preserved (with lookups always case-sensitive,
even if the listed values for a property type currently use only
ASCII capital letters, or only ASCII lowercase letters): the capitalization
form may need to be distinguished in some future version of the standard
(without having to use a broken orthography to distinguish them), and we
should not be using a slow UCA collator to match entries.
- only underscores/spaces should be considered equivalent, and there will
NEVER be special entries using leading or trailing underscores, or pairs of
underscores, or pairs of whitespaces (all aliases are assumed to be
trimmable and compressible, like in XML or HTML by default): applications
may then choose the "canonicalization" form they prefer (with underscores,
or with spaces)
- some "camelCased" bijective transform could suppress spaces/underscores,
provided that the transform includes an "escaping" mechanism for case
distinctions; but alternatively we could also list conforming "camelCased"
aliases (from which lowercase-only aliases with ASCII hyphens could be
inferred for use in CSS selectors also with a bijective transform)
- however some programming languages (e.g. BASIC) do not have any case
distinction for identifiers (and there's no easy escaping mechanism without
using separators like underscores, which should also not be used in leading
or trailing positions), or use lettercase (of the initial) for special
meaning (e.g. in several AI languages to distinguish variables and atoms:
the escaping mechanism may need to prepend a leading underscore or some
common prefix).

From doug at ewellic.org  Thu May 26 15:41:49 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 May 2016 13:41:49 -0700
Subject: Canonical block names: spaces vs. underscores
Message-ID: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>

Mathias Bynens wrote:

> Any chance the canonical names can be used in `Blocks.txt` as well,
> for consistency? This would simplify scripts that parse the Unicode
> database text files. 

I don't see the problem here. The loose-matching rule is well-defined
and not complicated, either visually or algorithmically; and if Mathias
has an implementation up on GitHub, he should be able to use it wherever
it's needed.

--
Doug Ewell | http://ewellic.org | Thornton, CO


From markus.icu at gmail.com  Fri May 27 00:14:44 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 26 May 2016 22:14:44 -0700
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>
References: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>
Message-ID: <CAN49p6rvNTXGEV7GZwQ_8V_Fscp=jo8YiWDHRsOBZYDBJetevQ@mail.gmail.com>

Note that the Block property is an artifact of how the committee organizes
the encoding of characters. It is not very useful for processing. For that,
the Script property, Script_Extensions, and others are normally much better.

markus

From doug at ewellic.org  Sat May 28 10:51:55 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 28 May 2016 09:51:55 -0600
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <mailman.1.1464368401.10525.unicode@unicode.org>
References: <mailman.1.1464368401.10525.unicode@unicode.org>
Message-ID: <A19815E26C1A49E7A06B9B0A4877D806@DougEwell>

Philippe Verdy wrote:

> However it must be clear that these aliases are case-sensitive by
> default ("Arabic_Presentation_Forms_A" is not the same as
> "Arabic_presentation_forms_A" but is the same as "Arabic
> Presentation_Forms A"), unless the block names property is normatively
> said to be case-insensitive (in that case the following are also
> aliases: "arabic_pf_a", "arabic pf a"). But adding case insensitivity
> has a cost, which is much higher than *only* allowing basic
> replacements of spaces and underscores [...]

UAX #44 says:

> 5.9.2 Matching Character Names
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> 5.9.3 Matching Symbolic Values
>
> UAX44-LM3. Ignore case, whitespace, underscore ('_'), hyphens, and any
> initial prefix string "is".

I read the words "ignore case" in these two rules to mean that case 
should be ignored.
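
To make that concrete, here is a small Python sketch (illustrative only;
the names `loose_key` and `build_alias_index` are invented for this
example, and the "is" prefix handling is a naive reading of UAX44-LM3).
It indexes the aliases from a PropertyValueAliases.txt-style line by
their loosely folded form, so that all of the spellings quoted above
resolve to the same block:

```python
import re

def loose_key(value):
    """Fold a symbolic value: ignore case, whitespace, '_', '-', leading 'is'."""
    key = re.sub(r"[\s_\-]+", "", value).lower()
    return key[2:] if key.startswith("is") else key

def build_alias_index(lines, prop="blk"):
    """Map every folded alias of `prop` to the long alias (third field)."""
    index = {}
    for line in lines:
        line = line.split("#", 1)[0]  # drop trailing comments
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3 or fields[0] != prop:
            continue
        long_alias = fields[2]
        for alias in fields[1:]:
            index[loose_key(alias)] = long_alias
    return index

entry = ("blk; Arabic_PF_A                      ; "
         "Arabic_Presentation_Forms_A      ; Arabic_Presentation_Forms-A")
index = build_alias_index([entry])

for spelling in ("Arabic_presentation_forms_A",
                 "Arabic Presentation_Forms A",
                 "arabic_pf_a",
                 "arabic pf a"):
    print(spelling, "->", index.get(loose_key(spelling)))
```

The lookup is a lowercase-and-strip pass followed by a dictionary hit;
no collator is involved.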

--
Doug Ewell | http://ewellic.org | Thornton, CO