From doug at ewellic.org Sun May 1 14:32:19 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 1 May 2016 13:32:19 -0600
Subject: Non-standard 8-bit fonts still in use
In-Reply-To:
References:
Message-ID: <1FD941D6110E4641A78751CC2BB73CCB@DougEwell>

Don Osborn wrote:

> Substituting characters such that the key for an otherwise unused
> character yields a hooked letter or a tone-marked vowel may be seen as
> sufficient for their purposes and easier than switching to Unicode and
> sorting out a new keyboard system.

The myth is that switching to Unicode requires switching to a new and { unfamiliar, complex, hard to adopt } keyboard layout. Even when the "new" part is true, the rest need not be.

Assuming they are currently using a Windows U.S. English layout, someone could easily provide them with a layout that either:

1. puts the non-ASCII letters on the keys corresponding to the ASCII symbols currently repurposed by their font (for example, pressing q yields ɔ), or

2. puts them on AltGr combinations (for example, pressing AltGr+e yields ɛ).

In the first case, there would be no apparent change for the user, but the mapping from q to ɔ would be moved out of the font and into the input process. The second case would allow access to both English and (e.g.) Bambara characters, but would require a change for the user typing Bambara, so would probably meet with more resistance.

Tools could be easily written to convert existing text like "tqgq" to the real spelling, so compatibility with the hacked fonts would become less of a concern.

--
Doug Ewell | http://ewellic.org | Thornton, CO
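
A conversion tool of the kind mentioned above might look roughly like the following Python sketch. The mapping table is an assumption made for illustration: q to ɔ is inferred from the "tqgq" example (Bambara tɔgɔ), while the other entries and the names HACKED_TO_UNICODE and convert are hypothetical and would have to be replaced by the substitutions a given font actually makes.

```python
# Hypothetical mapping from the ASCII characters repurposed by a hacked
# 8-bit font to the Unicode letters they were meant to display.
# Only "q" -> open o is inferred from the thread; the rest are placeholders.
HACKED_TO_UNICODE = {
    "q": "\u0254",  # LATIN SMALL LETTER OPEN O
    "Q": "\u0186",  # LATIN CAPITAL LETTER OPEN O
    "x": "\u025B",  # LATIN SMALL LETTER OPEN E (assumed key)
    "X": "\u0190",  # LATIN CAPITAL LETTER OPEN E (assumed key)
}

# Build the translation table once so that convert() is a single pass.
_TABLE = str.maketrans(HACKED_TO_UNICODE)


def convert(text: str) -> str:
    """Replace each repurposed ASCII character with its Unicode letter."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(convert("tqgq"))
```

This only makes sense for plain text that is known to have been typed for the hacked font; ordinary English or French text containing q or x would be altered as well.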
From duerst at it.aoyama.ac.jp Mon May 2 02:34:08 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 2 May 2016 16:34:08 +0900
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
Message-ID: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>

Hello Don,

I agree with Doug that creating a good keyboard layout is a good thing to do. Among the people on this list, you probably have the best contacts, and can help create some test layouts and see how people react.

Also, creating fonts that have the necessary coverage but are encoded in Unicode may help, depending on how well the necessary characters are supported out of the box in the OS version in use on the ground (which may be quite old).

Also, a conversion program will help. It shouldn't be too difficult, because as far as I understand, it's essentially just a few characters that need conversion, and it's 1 byte to multibyte. Even in a low level language such as C, that's just a few lines, and any of the students in my programming course could write that (they just wrote something similar as an exercise last week).

On 2016/05/01 02:27, Don Osborn wrote:

> Last October I posted about persistence of old modified/hacked 8-bit
> fonts, with an example from Mali. This is a quick follow up, with
> belated thanks to those who responded to that post on and off list, and
> a set of examples from China and Nigeria. I conclude below with some
> thoughts about what this says about dissemination of information about
> Unicode.

I'm not familiar with the actual situation on the ground, which may vary in each place, but in general, what will convince people is not theoretical information, but practical tools and examples about what works better with Unicode (e.g.: if you do it this way, it will show correctly in the Web browser on your new smart phone, or if you do it this way, even your relative in Europe can read it without installing a special font,...).

Even in the developed world, where most people these days are using Unicode, most don't know what it is, and that's just fine, because it just works.

Regards, Martin.

From ed.trager at gmail.com Mon May 2 11:03:58 2016
From: ed.trager at gmail.com (Ed Trager)
Date: Mon, 2 May 2016 12:03:58 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net> <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
Message-ID:

In addition to creating platform-specific keyboard layouts as Doug suggested, I would also like to point out that it is now also possible (and possibly even easier) to create web-based keyboard and input method engines that may allow a greater degree of cross-platform support, reducing platform-specific work.

Also with web applications the "software installation" issue is eliminated. Remember that while it is easy for technologically savvy folks like members of this mailing list to install keyboard drivers on any platform we like, this process is somewhat beyond the reach of many people I know, even when they are otherwise fairly comfortable using computers.

As an example, see http://unifont.org/keycurry/, a JavaScript/jQuery-based web app that I wrote and use for myself all of the time.

One limitation of keycurry is that currently almost all of the keyboard maps assume an American QWERTY layout.
But honestly it would not be very difficult to generate variant maps for AZERTY or whatever else one wants. I just have not bothered myself to do that extra work because I bought my laptop in the U.S. and the default QWERTY layout works fine for me, especially now that I can write new keyboard maps for most scripts and languages in a matter of a few minutes (unifont.org/keycurry now uses JSON-based keyboard maps with UTF-8, in addition to an older format based on Yudit; obviously IMEs for scripts like Korean or Chinese take a lot longer to write, but simple keymaps for Latin and many other scripts are super easy to make).

In fact, with web-based solutions, users don't even have to download or install the fonts, as obviously we can just use web fonts to supply Unicode-based fonts to the web app. (In fact this is exactly what I do for the Tai Tham keyboards in keycurry, inter alia).

Best - Ed

On Mon, May 2, 2016 at 3:34 AM, Martin J. Dürst wrote:

> Hello Don,
>
> I agree with Doug that creating a good keyboard layout is a good thing to
> do. Among the people on this list, you probably have the best contacts, and
> can help create some test layouts and see how people react.
>
> Also, creating fonts that have the necessary coverage but are encoded in
> Unicode may help, depending on how well the necessary characters are
> supported out of the box in the OS version in use on the ground (which may
> be quite old).
>
> Also, a conversion program will help. It shouldn't be too difficult,
> because as far as I understand, it's essentially just a few characters that
> need conversion, and it's 1 byte to multibyte. Even in a low level language
> such as C, that's just a few lines, and any of the students in my
> programming course could write that (they just wrote something similar as
> an exercise last week).
>
> On 2016/05/01 02:27, Don Osborn wrote:
>
>> Last October I posted about persistence of old modified/hacked 8-bit
>> fonts, with an example from Mali.
>> This is a quick follow up, with
>> belated thanks to those who responded to that post on and off list, and
>> a set of examples from China and Nigeria. I conclude below with some
>> thoughts about what this says about dissemination of information about
>> Unicode.
>>
>
> I'm not familiar with the actual situation on the ground, which may vary
> in each place, but in general, what will convince people is not theoretical
> information, but practical tools and examples about what works better with
> Unicode (e.g.: if you do it this way, it will show correctly in the Web
> browser on your new smart phone, or if you do it this way, even your
> relative in Europe can read it without installing a special font,...).
>
> Even in the developed world, where most people these days are using
> Unicode, most don't know what it is, and that's just fine, because it just
> works.
>
> Regards, Martin.
>

From oren.watson at gmail.com Mon May 2 11:31:36 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Mon, 2 May 2016 12:31:36 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net> <322f13f4-1e52-2b44-f3ff-44156884b6f1@it.aoyama.ac.jp>
Message-ID:

Hm... I don't think that simply search-and-replacing ASCII characters with the characters the font uses them for will work, except on .txt files. Microsoft Word documents, HTML files, and any other non-plaintext files will almost certainly be corrupted by such a program, because the tags might contain those letters. (In addition, unlike .docx files, .doc files from Windows XP contain binary data which could have arbitrary bytes.)

Probably in practical terms a good solution is to make a Microsoft Word macro to do the replacement, and post instructions for copy-pasting it.
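
The concern about markup can be illustrated with a short Python sketch. It assumes a hypothetical font that displays q as ɔ, and compares a naive replacement with one that leaves anything between < and > alone. The names TABLE, TAG, convert_naive and convert_text_only are made up for this example. It is only an illustration for very simple HTML: it does not handle comments, scripts, or text that is not in the hacked font, and it does nothing for binary .doc files.

```python
import re

# Hypothetical mapping: assumes the hacked font shows "q" as the open o.
TABLE = str.maketrans({"q": "\u0254"})

# Matches anything that looks like an HTML tag. The capture group makes
# re.split() keep the tags in its result list.
TAG = re.compile(r"(<[^>]*>)")


def convert_naive(html: str) -> str:
    """Replace everywhere, including inside tags (the failure mode)."""
    return html.translate(TABLE)


def convert_text_only(html: str) -> str:
    """Replace only in the text between tags, leaving tags untouched."""
    parts = TAG.split(html)
    # With one capture group, the odd-indexed parts are the tags themselves.
    return "".join(
        part if i % 2 else part.translate(TABLE)
        for i, part in enumerate(parts)
    )


if __name__ == "__main__":
    sample = "<blockquote>tqgq</blockquote>"
    print(convert_naive(sample))      # the tag names are damaged
    print(convert_text_only(sample))  # the tag names survive
```

A real converter would more likely walk the document's own structure (for instance through a Word macro, as suggested above) and restrict itself to runs formatted in the hacked font.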

From jcb+unicode at inf.ed.ac.uk Wed May 4 01:54:48 2016
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Wed, 4 May 2016 07:54:48 +0100 (BST)
Subject: non-breaking snakes
Message-ID:

See
http://xkcd.com/1676/
(making sure to look at the mouse-over text)

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

From mark at macchiato.com Wed May 4 02:07:19 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 4 May 2016 09:07:19 +0200
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID:

Very nice!

Mark

On Wed, May 4, 2016 at 8:54 AM, Julian Bradfield wrote:

> See
> http://xkcd.com/1676/
> (making sure to look at the mouse-over text)
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

From samjnaa at gmail.com Wed May 4 02:14:00 2016
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Wed, 4 May 2016 12:44:00 +0530
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID:

Isn't there some Japanese orthography feature that already does something like this?

--
Shriramana Sharma

From mark at macchiato.com Wed May 4 02:23:04 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 4 May 2016 09:23:04 +0200
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID:

Arabic has tatweel/kashida for justification; rather similar in principle.

https://en.wikipedia.org/wiki/Kashida

Mark

On Wed, May 4, 2016 at 9:14 AM, Shriramana Sharma wrote:

> Isn't there some Japanese orthography feature that already does
> something like this?
>
> --
> Shriramana Sharma

From textexin at xencraft.com Wed May 4 02:23:54 2016
From: textexin at xencraft.com (Tex Texin)
Date: Wed, 4 May 2016 00:23:54 -0700
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID: <00e401d1a5d5$ea92a070$bfb7e150$@xencraft.com>

Non-breaking snake is English for Kashida right?

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Julian Bradfield
Sent: Tuesday, May 03, 2016 11:55 PM
To: unicode at unicode.org
Subject: non-breaking snakes

See
http://xkcd.com/1676/
(making sure to look at the mouse-over text)

--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

From richard.wordingham at ntlworld.com Wed May 4 02:27:55 2016
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Wed, 4 May 2016 08:27:55 +0100
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID: <20160504082755.6b3b9f9d@JRWUBU2>

On Wed, 4 May 2016 07:54:48 +0100 (BST) Julian Bradfield wrote:

> See
> http://xkcd.com/1676/
> (making sure to look at the mouse-over text)

I thought kashida (TATWEEL) was a precedent not to be followed. The issue, of course, is that chained snakes do not reflow well, just as filler text doesn't.

Richard.

From khaledhosny at eglug.org Wed May 4 05:46:58 2016
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Wed, 4 May 2016 12:46:58 +0200
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID: <20160504104658.GA24870@macbook>

That sounds more like traditional Tibetan justification than kashida:
http://rishida.net/scripts/tibetan/#justification

On Wed, May 04, 2016 at 09:23:04AM +0200, Mark Davis ☕️ wrote:
> Arabic has tatweel/kashida for justification; rather similar in principle.
>
> https://en.wikipedia.org/wiki/Kashida
>
> Mark
>
> On Wed, May 4, 2016 at 9:14 AM, Shriramana Sharma wrote:
>
> > Isn't there some Japanese orthography feature that already does
> > something like this?
> >
> > --
> > Shriramana Sharma
> >

From simon at simon-cozens.org Wed May 4 06:07:05 2016
From: simon at simon-cozens.org (Simon Cozens)
Date: Wed, 4 May 2016 21:07:05 +1000
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID: <5729D7D9.7020201@simon-cozens.org>

On 04/05/2016 17:07, Mark Davis ☕️ wrote:
> Very nice!

The SILE typesetting engine now implements full support for this new justification strategy. Please see http://www.sile-typesetter.org/

From verdy_p at wanadoo.fr Wed May 4 06:15:08 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 4 May 2016 13:15:08 +0200
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID:

Those "snakes" do exist in Arabic for justification purposes (they are formatting controls insertable between pairs of joined letters and possibly used as base holders for diacritics). Otherwise they are just normal "filler" (punctuation-like symbols like leader dots, otherwise "crap text").

The Arabic tatweel is very smart (better than extending the only spacing that applies only between words and better than breaking words with interletter spacing or changing the shape of letters, or packing letters to remove their normal spacing gap and creating collisions).

Technically such a "tatweel" also exists in Latin in its cursive form (with joined letters), and possibly as well in cursive forms of Greek and Cyrillic. But they are still not encoded at all (as formatting controls), even if they could also be used as base holders for some left-side or right-side diacritics.

2016-05-04 9:07 GMT+02:00 Mark Davis ☕️ :

> Very nice!
>
> Mark
>
> On Wed, May 4, 2016 at 8:54 AM, Julian Bradfield wrote:
>
>> See
>> http://xkcd.com/1676/
>> (making sure to look at the mouse-over text)
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.

From leoboiko at namakajiri.net Wed May 4 07:59:04 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Wed, 4 May 2016 09:59:04 -0300
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID:

2016-05-04 4:14 GMT-03:00 Shriramana Sharma:
> Isn't there some Japanese orthography feature that already does
> something like this?

Japanese (and Chinese) vertical calligraphy can do arbitrary-length stretching of lines (like the Arabic kashida under discussion, and like most cursive scripts in the world, I guess). Notice e.g. the long lines here: https://www.instagram.com/seiichirou_uemura/ . The hiragana letter ?? in particular, often becomes a long vertical line.

However, traditionally this is used for æsthetic rhythm, not for justification. In fact, most kinds of Japanese calligraphy prize variation in line length, not uniformity. And when uniformity is sought (e.g. certain sutras), they don't use stretched lines, but just fill a grid with non-cursive, block (kaisho) characters.

I'm not aware of similar features for typography. Because the script doesn't separate words, justification is comparatively simple: you just break lines mid-word, mostly wherever (with a few restrictions to avoid hanging punctuation and so on.)

From doug at ewellic.org Wed May 4 09:29:20 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 04 May 2016 07:29:20 -0700
Subject: non-breaking snakes
Message-ID: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>

1F40D FE0F

The VS just makes extra, extra sure that it's emoji.

--
Doug Ewell | http://ewellic.org | Thornton, CO
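
For readers who do not have the code points memorized, the two-element sequence above can be inspected with Python's standard unicodedata module. The names SNAKE_EMOJI and describe are made up for this sketch.

```python
import unicodedata

# The two code points from the message: U+1F40D followed by U+FE0F.
SNAKE_EMOJI = "\U0001F40D\uFE0F"


def describe(text: str) -> list:
    """Return a 'U+XXXX NAME' string for each code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for line in describe(SNAKE_EMOJI):
        print(line)
```

The second code point is the emoji presentation selector; its counterpart for text presentation is U+FE0E.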

From charupdate at orange.fr Thu May 5 21:35:59 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 May 2016 04:35:59 +0200 (CEST)
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net>
Message-ID: <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>

On Sat, 30 Apr 2016 13:27:02 -0400, Don Osborn wrote:

> If the latter be the case, that would seem to have implications
> regarding dissemination of information about Unicode. "If you
> standardize it, they will adopt" certainly holds for industry and
> well-informed user communities (such as in open source software), but
> not necessarily for more localized initiatives. This is not to seek to
> assign blame in any way, but rather to point out what seems to be a
> persistent issue with long term costs in terms of usability of text in
> writing systems as diverse as Bambara, Hausa boko, and Chinese pinyin.

The situation Don describes is challenging the work that is already done and on-going in Mali, with several keyboard layouts at hand. If widening the range is really suitable, one might wish to test a couple of solutions other than those already mentioned, which roughly fall into two subsets:

1) Letters on the digits row. Thanks to a kindly shared resource, I'm able to tell that over one dozen Windows layouts (mainly French, as used in Mali, but also Lithuanian, Czech, Slovak, and Vietnamese) have the digits in the Shift or AltGr shift states. The latter is the only useful way of mapping letters on digit keys and becomes handy if the Kana toggle is added, either alone or in synergy with the Kana modifier instead of AltGr. With all bracketing characters in group 2 level 1 on the home row and so on, there is enough room to have all characters for Bambara and French directly accessible.

2) Letters through dead keys.
This is the ISO/IEC 9995 way of making more characters available in additional groups with dead key group selectors (referred to as remnant modifiers but actually implemented as dead keys). This is also one way SIL/Tavultesoft's layouts work for African and notably for Malian languages. IME-based keyboarding software may additionally offer a transparent input experience.

On Mon, 2 May 2016 12:03:58 -0400, Ed Trager wrote:

> Also with web applications the "software installation" issue is eliminated.
> Remember that while it is easy for technologically savvy folks like members
> of this mailing list to install keyboard drivers on any platform we like,
> this process is somewhat beyond the reach of many people I know, even when
> they are otherwise fairly comfortable using computers.

I can't easily believe that people who are comfortable with computers may have trouble using the widely automated keyboard layout installation feature. I have both experienced it myself and had the opportunity to observe in other people I know that there is in fact some kind of reluctance, based on the belief (call it a myth or an urban legend) that Windows plus preinstalled software plus MS Office comes with everything any user may need until the next update. Informing people about Microsoft's help for customizing the keyboard is more complicated, though, in that the display is part of the hardware, and the mechanism behind it is more of a black box.

As I am currently working on such a project for the fr-FR locale, I've already got some ideas for Bambara. I hope it can soon be on-line.

Kind regards,

Marcel

From charupdate at orange.fr Fri May 6 10:21:28 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 6 May 2016 17:21:28 +0200 (CEST)
Subject: non-breaking snakes
In-Reply-To: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>
References: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com>
Message-ID: <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>

On Wed, 4 May 2016 08:27:55 +0100, Richard Wordingham wrote:

> On Wed, 4 May 2016 07:54:48 +0100 (BST)
> Julian Bradfield wrote:
>
> > See
> > http://xkcd.com/1676/
> > (making sure to look at the mouse-over text)
>
> I thought kashida (TATWEEL) was a precedent not to be followed. The
> issue of course, is that chained snakes do not reflow well, just as
> filler text doesn't.

On Wed, 4 May 2016 13:15:08 +0200, Philippe Verdy wrote:

> Those "snakes" do exist in Arabic for justification purposes (they are
> formatting controls insertable between pairs of joined letters and possibly
> used as base holders for diacritics).
>
> […]

On Wed, 4 May 2016 09:59:04 -0300, Leonardo Boiko wrote:

> 2016-05-04 4:14 GMT-03:00 Shriramana Sharma:
> > Isn't there some Japanese orthography feature that already does
> > something like this?
>
> […] In fact, most kinds of Japanese calligraphy prize
> variation in line length, not uniformity. […]

On Wed, 04 May 2016 07:29:20 -0700, Doug Ewell wrote:

> 1F40D FE0F
>
> The VS just makes extra, extra sure that it's emoji.

Hmm… I guess the principle of diversity should then allow for other long animals too: various caterpillars, squirrel running on a branch…

More seriously, if animal pictographs are downgraded to mere line-fillers, I'm not sure whether the text style variation selector U+FE0E would not be a good choice.

Why not tackle it the other way around: standardize sequences of U+2012..U+2015, U+2E3A with some of the other ~250 variation selectors to make them look like extensible vegetal or animal ornaments. Or simply chain the VSes with repeated U+002D.

If there were a vote, I'd prefer word breaks in scripts that allow them, in case justification is really required (to make a hieratic look); or, in scripts that cannot break words, such as Hebrew, using the letter extension mechanisms.

As for letter spacing, abusing it for justification purposes is common in some languages but is not semantically neutral (as TUS recalls) in others that may be very close geographically. What helps make a proper layout on one side of the Rhine amounts to yelling on the other.

So yes, then abusing emoji is the lesser evil :)

Marcel

From steve at swales.us Fri May 6 10:49:09 2016
From: steve at swales.us (Steve Swales)
Date: Fri, 6 May 2016 08:49:09 -0700
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To:
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
Message-ID:

This discussion seems to have fizzled out, but I'm concerned that there's a real world problem here which is at least partially the concern of the consortium, so let me stir the pot and see if there's still any meat left.

On the current release of MacOS (including the developer beta, for your reference, Peter), if you use Calibri font, for example, in any app (e.g. notes), to write words with "ti" (like internationalization), then press "Print" and "Open PDF in Preview", you get a PDF document with the joined "ti". Subsequently cutting and pasting produces mojibake, and searching the document for words with "ti" doesn't work, as previously noted.

I suppose we can look on this as purely a font handling/MacOS bug, but I'm wondering if we should be providing accommodations or conveniences in Unicode for it to work as desired.

-steve

> On Mar 21, 2016, at 1:40 AM, Philippe Verdy wrote:
>
> Are those PDFs supposed to be searchable inside of them? For archival purposes, the PDFs are stored in their final form, and search is performed by creating a database of descriptive metadata. Each time one wants formal details, they have to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents which originally were not even in digital plain text, they were printed or typewritten, frequently they include graphics, handwritten signatures, stamped seals...)
>
> Being able to search plain text inside a PDF is not the main objective (and not the priority). The archival however is a top priority (and there's no money to finance a digitization and no human resource available to redo this old work, if needed other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that fall in the public domain).
>
> PDF/A-1a is meant only for creating new documents from an original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken, if there's the need to include handwritten or supplementary elements (signatures, seals...)
> whose source is not the original electronic document but the printed paper over which the annotations were made: it is this paper document, not the electronic document, which is the official final source (we've got some important legal paper whose original has other marks including traces of beer or coffee, or partly burnt, the paper itself has several alterations, but it is the original "as is", and for legal purposes the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, not meant to represent originals accurately).
>
> 2016-03-20 20:52 GMT+01:00 Tom Gewecke:
>
> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) wrote:
> > >
> > > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document.
> >
> > My understanding is that PDF/A-1a is supposed to be searchable.

From verdy_p at wanadoo.fr Fri May 6 12:24:12 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 6 May 2016 19:24:12 +0200
Subject: non-breaking snakes
In-Reply-To: <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>
References: <20160504072920.665a7a7059d7ee80bb4d670165c8327d.2b95910407.wbe@email03.godaddy.com> <1586659303.7147.1462548088451.JavaMail.www@wwinf1h39>
Message-ID:

My opinion is that the choice of graphics for these fillers is just a matter of style. A single filler (format control) would be enough to encode (simplifying later text handling in order to ignore them for plain text searches or collation).

These fillers are only made for specific text layouts with specific fonts at specific sizes; the number of actual symbols/graphics you would need is unpredictable in all other cases. The format control would only be used to mark where these fillers are safely insertable automatically (just like SHY marks).

The situation however would be different if these marks are also used as bases for holding diacritics (this is the case of the Arabic Tatweel). But using CGJ (or some other control with combining class 0) is generally enough to mark their separation from the base letter to which they would normally attach. The diacritic will be positioned relative to this zero-width CGJ, above or below. But CGJ itself is not freely "extensible" in width for line justification.

So the encoding would be if you want all diacritics to remain attached located to the start side of the filler. If the diacritics should come at the end side of the filler, they would be encoded as .

In summary that FILLER would be just another form for CGJ, except that it is extensible like whitespaces for line justification purposes. Also the FILLER would not necessarily hold diacritics and could be used alone, even without letters on either side of it.

The Arabic Tatweel behaves mostly like CGJ (diacritics are normally rendered on the start side of the filler, but there are some cases where the Arabic diacritics are centered on the filler: it behaves more like a normal letter for rendering, even if it's ignorable for plain-text searches, and may not be rendered at all if there's no need to justify lines or diacritics may still fit around the base letter before it or even in its normal position with that base letter).

2016-05-06 17:21 GMT+02:00 Marcel Schneider :

> On Wed, 4 May 2016 08:27:55 +0100, Richard Wordingham wrote:
>
> > On Wed, 4 May 2016 07:54:48 +0100 (BST)
> > Julian Bradfield wrote:
> >
> > > See
> > > http://xkcd.com/1676/
> > > (making sure to look at the mouse-over text)
> >
> > I thought kashida (TATWEEL) was a precedent not to be followed. The
> > issue of course, is that chained snakes do not reflow well, just as
> > filler text doesn't.
From lang.support at gmail.com Fri May 6 16:22:16 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 7 May 2016 07:22:16 +1000
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To:
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
Message-ID:

My understanding is that searchability comes down to two factors:

1) The ToUnicode mapping, which maps glyphs in the font or subsetted font to Unicode codepoints. Mappings take the form of one glyph to one codepoint, or one glyph to two or more codepoints. Obviously any glyph that doesn't resolve by default to a codepoint isn't in the mapping, nor does the mapping handle glyphs that have been visually reordered during rendering.

2) The next step is to tag the PDF, then use the ActualText label of each tag.

So for some languages with the right fonts, step one is all that is needed, and this is fairly standard in PDF generation tools. The font itself can impact on this, of course.

But for other languages you need to go to the second step.

With the languages I work with, I might have some PDFs that just require the visible text layer. Others will have a visible text layer and ActualText as well.

For the PDF to be searchable, the search tools not only need to be able to handle the text layer but also ActualText attributes when necessary. That all comes down to decisions the tool developer has taken on how to handle searching when both visible text layers and ActualText labels are present. I have been told on accessibility lists that the PDF specs leave that implementation detail to the developer, based on their requirements.

So in some cases you need to go the extra step and add ActualText.
But you also need to evaluate your search tools to ensure they do what you expect.

Andrew

On Saturday, 7 May 2016, Steve Swales wrote:
> This discussion seems to have fizzled out, but I'm concerned that there's a real world problem here which is at least partially the concern of the consortium, so let me stir the pot and see if there's still any meat left.
> On the current release of MacOS (including the developer beta, for your reference, Peter), if you use Calibri font, for example, in any app (e.g. notes), to write words with "ti" (like internationalization), then press "Print" and "Open PDF in Preview", you get a PDF document with the joined "ti". Subsequently cutting and pasting produces mojibake, and searching the document for words with "ti" doesn't work, as previously noted.
> I suppose we can look on this as purely a font handling/MacOS bug, but I'm wondering if we should be providing accommodations or conveniences in Unicode for it to work as desired.
> -steve
>
> > On Mar 21, 2016, at 1:40 AM, Philippe Verdy wrote:
> Are those PDF supposed to be searchable inside of them? For archival purposes, the PDFs are stored in their final form, and search is performed by creating a database of descriptive metadata. Each time one wants formal details, they have to read the original the way it was presented (many PDFs are just scanned facsimiles of old documents which originally were not even in digital plain text; they were printed or typewritten, and frequently they include graphics, handwritten signatures, stamped seals...)
> Being able to search plain text inside a PDF is not the main objective (and not the priority). The archival however is a top priority (and there's no money to finance a digitization and no human resource available to redo this old work; if needed, other contributors will recreate a plain-text version, possibly with rich-text features, e.g. in Wikisource for old documents that fall in the public domain).
> PDF/A-1a is meant only for creating new documents from a original plain-text or rich-text document created with modern word-processing applications. But this specification will frequently have to be broken, if there's the need to include handwritten or supplementary elements (signatures, seals...) whose source is not the original electronic document but the printed paper over which the annotations were made: it is this paper document, not the electronic document which is the official final source (we've got some important legal paper whose original has other marks including traces of beer or coffee, or partly burnt, the paper itself has several alterations, but it is the original "as is", and for legal purpose the only acceptable archival form as a PDF must ignore all the PDF/A-1a constraints, not meant to represent originals accurately). > 2016-03-20 20:52 GMT+01:00 Tom Gewecke : >> >> > On Mar 20, 2016, at 12:24 PM, Asmus Freytag (t) < asmus-inc at ix.netcom.com> wrote: >> > >> > Usually, the archive feature pertains only to the fact that you can reproduce the final form, not to being able to get at the correct source (plain text backbone) for the document. >> >> My understanding is that PDF/A-1a is supposed to be searchable. >> >> >> > > > -- Andrew Cunningham lang.support at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... 
From tuvalkin at gmail.com Fri May 6 23:54:57 2016
From: tuvalkin at gmail.com (António Martins-Tuválkin)
Date: Sat, 7 May 2016 05:54:57 +0100
Subject: non-breaking snakes
In-Reply-To:
References:
Message-ID: <572D7521.9090203@gmail.com>

On 2016.05.04 07:54, Julian Bradfield wrote:

> See http://xkcd.com/1676/
> (making sure to look at the mouse-over text)

The new snake character needs to have in its remarks field see-also links to these:

U+115F HANGUL CHOSEONG FILLER
U+1160 HANGUL JUNGSEONG FILLER
U+3164 HANGUL FILLER : chaeum
U+A8F9 DEVANAGARI GAP FILLER
U+FFA0 HALFWIDTH HANGUL FILLER (decomp.: U+3164)
U+10AF6 MANICHAEAN PUNCTUATION LINE FILLER

--                                                             ____.
António MARTINS-Tuválkin                                      |  ()|
                                                              |####|
PT-1500-124 Lisboa                  Não me invejo de quem tem      |
PT-2695-010 Bobadela LRS            carros, parelhas e montes      |
+351 934 821 700, +351 212 463 477  só me invejo de quem bebe      |
facebook.com/profile.php?id=744658416  a água em todas as fontes   |
---------------------------------------------------------------------
De sable uma fonte e bordadura escaqueada de jalde e goles por timbre
bandeira por mote o 1º verso acima e por grito de guerra "Mi rajtas!"
---------------------------------------------------------------------

From leob at mailcom.com Sat May 7 00:35:34 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Fri, 6 May 2016 22:35:34 -0700
Subject: non-breaking snakes
In-Reply-To: <572D7521.9090203@gmail.com>
References: <572D7521.9090203@gmail.com>
Message-ID:

Also, or rather foremost, to U+2766 ❦ FLORAL HEART

❦❦❦❦❦ - what does the (almost) connecting vine remind me of? Hmmm...
Leo

From verdy_p at wanadoo.fr Sat May 7 08:05:24 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 7 May 2016 15:05:24 +0200
Subject: non-breaking snakes
In-Reply-To:
References: <572D7521.9090203@gmail.com>
Message-ID:

This is the same thing as:

____________
.....................
:::::::::::::::::::::
############
*****************
===========
/////////////////////
---------------------
TTTTTTTTTTT

You can use any characters (punctuation, symbols, even letters) or graphics aligned in a row to create such fillers. But in isolation these characters have their own meaning, independently of their "snake" usage.

The vine symbol is not special. It also maps to the "leader" dots used in TOCs or input forms.
That's why I suggest treating this usage as only a matter of style for graphically representing a "snake" (with known fonts and layouts). The snake itself may still be represented by a format control, placed where such insertions are allowed for line justification, instead of just whitespace. A stylesheet, specific to a page layout, will then do the rest, specifying the graphics or characters to use for these insertions, without the document having to specify a particular number of signs.

In a plain-text format with unspecified layout, it should not even be visible.
From hospes02 at scholarsfonts.net Sat May 7 12:00:31 2016
From: hospes02 at scholarsfonts.net (David Perry)
Date: Sat, 07 May 2016 13:00:31 -0400
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To:
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org>
Message-ID: <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>

I agree that it's a real-world problem -- PDFs really should be searchable -- but I do not see that it's a Unicode issue. Unicode defines the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I; that's its job. Unicode is not responsible for font construction or creating PDF software. Furthermore, even if Unicode did want to do something about it, I can't imagine what that could be -- aside perhaps from using its bully pulpit to urge PDF creators and font creators to do their jobs better.

The fact that some PDF apps do not search and copy/paste text correctly when unencoded characters are given PUA values has been known for many years. In the case of Calibri, I looked at the font (version installed on my Win7 system) and found that the 'ti' ligature is named t_i, which follows good naming practices, and it does not have a PUA assignment. Given this, any well-constructed PDF app should be able to decode the ligature correctly.

David

On 5/6/2016 11:49 AM, Steve Swales wrote:
> This discussion seems to have fizzled out, but I'm concerned that
> there's a real world problem here which is at least partially the
> concern of the consortium, so let me stir the pot and see if there's
> still any meat left.
>
> On the current release of MacOS (including the developer beta, for
> your reference, Peter), if you use Calibri font, for example, in any
> app (e.g. notes), to write words with "ti" (like
> internationalization), then press "Print" and "Open PDF in Preview",
> you get a PDF document with the joined "ti". Subsequently cutting and
> pasting produces mojibake, and searching the document for words
> with "ti" doesn't work, as previously noted.
>
> I suppose we can look on this as purely a font handling/MacOS bug, but
> I'm wondering if we should be providing accommodations or conveniences
> in Unicode for it to work as desired.
>
> -steve
>

From lang.support at gmail.com Sun May 8 03:13:48 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sun, 8 May 2016 18:13:48 +1000
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To: <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org> <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
Message-ID:

The t_i instance will depend on the quality of the font. If it's a standard ligature, there should be a glyph-to-codepoints assignment in the cmap table or in the ToUnicode mapping in the PDF file.

As David indicates, it isn't a Unicode issue.

It is an issue with the font used and/or the tools used.

PDFs have always been problematic. That isn't going to change anytime soon. Particularly for archivable or accessible PDFs, the person generating the PDFs should select the best tools for the job and test the PDF. Then fix any problems.
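As a rough illustration of the glyph-to-codepoint mapping discussed in this thread, the following Python sketch models a ToUnicode-style table as a plain dictionary. The glyph IDs, the `extract_text` helper, and both tables are hypothetical and made up for this example; a real PDF stores the mapping in a ToUnicode CMap stream, which this sketch does not parse.

```python
# Hypothetical glyph IDs for the word "nation" set in a font with a t_i ligature.
# Glyph 300 stands for the single ligature glyph that replaces "t" + "i".
GLYPH_RUN = [81, 68, 300, 82, 81]

# A complete table: the ligature glyph maps to two code points.
COMPLETE_MAP = {68: "a", 81: "n", 82: "o", 300: "ti"}

# An incomplete table: the ligature glyph has no entry at all.
INCOMPLETE_MAP = {68: "a", 81: "n", 82: "o"}


def extract_text(glyph_ids, to_unicode, missing="\ufffd"):
    """Rebuild the text layer from glyph IDs, as a search tool would see it."""
    return "".join(to_unicode.get(glyph, missing) for glyph in glyph_ids)


searchable = extract_text(GLYPH_RUN, COMPLETE_MAP)
broken = extract_text(GLYPH_RUN, INCOMPLETE_MAP)

print("ti" in searchable)  # the ligature resolves to "t" + "i"
print("ti" in broken)      # the ligature became U+FFFD, so a search misses it
```

With the complete table, a search for "ti" can succeed even though the page shows one joined glyph; with the incomplete one, the word is no longer findable, which is the symptom reported for the Calibri case.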
Andrew On Sunday, 8 May 2016, David Perry wrote: > I agree that it's a real-world problem -- PDFs really should be searchable -- but I do not see that it's a Unicode issue. Unicode defines the basic building blocks of LATIN SMALL LETTER T and LATIN SMALL LETTER I; that's its job. Unicode is not responsible for font construction or creating PDF software. Furthermore, even if Unicode did want to do something about it, I can't imagine what that could be -- aside perhaps from using its bully pulpit to urge PDF creators and font creators to do their jobs better. > > The fact that some PDF apps do not search and copy/paste text correctly when unencoded characters are given PUA values has been known for many years. In the case of Calibri, I looked at the font (version installed on my Win7 system) and found that the 'ti' ligature is named t_i, which follows good naming practices, and it does not have a PUA assignment. Given this, any well-constructed PDF app should be able to decode the ligature correctly. > > David > > On 5/6/2016 11:49 AM, Steve Swales wrote: >> >> This discussion seems to have fizzled out, but I?m concerned that >> there?s a real world problem here which is at least partially the >> concern of the consortium, so let me stir the pot and see if there?s >> still any meat left. >> >> On the current release of MacOS (including the developer beta, for >> your reference, Peter), if you use Calibri font, for example, in any >> app (e.g. notes), to write words with ?ti? (like >> internationalization), then press ?Print" and ?Open PDF in Preview?, >> you get a PDF document with the joined ?ti?. Subsequently cutting and >> pasting produces mojibake, and searching the document for words >> with?ti? doesn?t work, as previously noted. >> >> I suppose we can look on this as purely a font handling/MacOS bug, but >> I?m wondering if we should be providing accommodations or conveniences >> in Unicode for it to work as desired. 
>>
>> -steve
>>
>

--
Andrew Cunningham
lang.support at gmail.com

From dzo at bisharat.net Sun May 8 07:42:13 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 08:42:13 -0400
Subject: Joined "ti" coded as "O" in PDF
In-Reply-To:
References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org> <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net>
Message-ID:

Could it be said that a PDF conversion app generating unusual coding of characters, and doing so without advising users, is an instance of "Unicode malpractice"? (per David's mention of using the "bully pulpit")

Some earlier posts in this thread made the observation that PDF is for presentation, not archiving. However, since the format makes it possible to search text instead of having just an image of the pages, it seems that distinction is at least somewhat blurred. PDFs are archived and searched, and people expect to use those functions. So yes, this font/coding issue in PDFs is a real world problem, but of the sort that Unicode was created to relegate to the past.

An analogy that comes to mind is continued use of old hacked 8-bit fonts, which were created before Unicode was widely adopted, for printing and limited sharing ("you need to install this font to view correctly"). Documents produced with them, however, are shared as PDFs (such as some Chinese-Hausa learning materials up to at least 2010, which of course look and print fine, but which run into the same search and re-use issues), and even escape into the wild as text (with unhappy results like a Bambara translation of a handwashing poster during the Ebola crisis).
Any digital text these days can't be treated as just producing something visually correct. By the way, the "?" in the original title changed to "O" somewhere back in the thread. A luta continua. Don On 5/8/2016 4:13 AM, Andrew Cunningham wrote: > The t_i instance will depend on the quality of the font. If its a > standard ligature there should be a glyph to codepoints assignment in > the cmap table or the ToUnicode mapping in the PDF file. > > As David indicates, it isnt a Unicode issue. > > It is an issue with the font used and/or the tools used. > > PDFs have always been problematic. That isn't going to change anytime > soon. Partly for archiveable or accessible PDFs, the person generating > the PDFs should select the best tools for the job and test the PDF. > Then fix any problems. > > Andrew > > On Sunday, 8 May 2016, David Perry > wrote: > > I agree that it's a real-world problem -- PDFs really should be > searchable -- but I do not see that it's a Unicode issue. Unicode > defines the basic building blocks of LATIN SMALL LETTER T and LATIN > SMALL LETTER I; that's its job. Unicode is not responsible for font > construction or creating PDF software. Furthermore, even if Unicode > did want to do something about it, I can't imagine what that could be > -- aside perhaps from using its bully pulpit to urge PDF creators and > font creators to do their jobs better. > > > > The fact that some PDF apps do not search and copy/paste text > correctly when unencoded characters are given PUA values has been > known for many years. In the case of Calibri, I looked at the font > (version installed on my Win7 system) and found that the 'ti' ligature > is named t_i, which follows good naming practices, and it does not > have a PUA assignment. Given this, any well-constructed PDF app should > be able to decode the ligature correctly. 
> > > > David > > > > On 5/6/2016 11:49 AM, Steve Swales wrote: > >> > >> This discussion seems to have fizzled out, but I?m concerned that > >> there?s a real world problem here which is at least partially the > >> concern of the consortium, so let me stir the pot and see if there?s > >> still any meat left. > >> > >> On the current release of MacOS (including the developer beta, for > >> your reference, Peter), if you use Calibri font, for example, in any > >> app (e.g. notes), to write words with ?ti? (like > >> internationalization), then press ?Print" and ?Open PDF in Preview?, > >> you get a PDF document with the joined ?ti?. Subsequently cutting and > >> pasting produces mojibake, and searching the document for words > >> with?ti? doesn?t work, as previously noted. > >> > >> I suppose we can look on this as purely a font handling/MacOS bug, but > >> I?m wondering if we should be providing accommodations or conveniences > >> in Unicode for it to work as desired. > >> > >> -steve > >> > > > > -- > Andrew Cunningham > lang.support at gmail.com > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sun May 8 08:35:15 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 8 May 2016 15:35:15 +0200 Subject: Joined "ti" coded as "O" in PDF In-Reply-To: References: <20160317111144.665a7a7059d7ee80bb4d670165c8327d.7a26808532.wbe@email03.secureserver.net> <261BDFD9-7927-4EE2-A9FF-19F6542D0DB1@swales.us> <56EB1723.7030301@bisharat.net> <56ED91DE.5080700@bisharat.net> <20160320081122.24495yrwlpeei7mi@mail.mimuw.edu.pl> <56EEF8DD.2090808@ix.netcom.com> <34E7E8C6-B1AB-4C99-94EB-005781DE02AE@bluesky.org> <6f2fffd3-3a52-4dc3-a554-634972ad540e@scholarsfonts.net> Message-ID: 2016-05-08 14:42 GMT+02:00 Don Osborn : > Some earlier posts in this thread made the observation that PDF is for > presentation not archiving. > I tend to disagree. 
PDFs are hugely used for archiving, and for that purpose it does not matter how a PDF was generated: it is only meant to be a facsimile, possibly with equal value to the original (printed) paper. The initial digital format is just a working draft with no legal value in most cases. That's why PDF files can contain a digital signature, to give them the same value as the original paper. The initial digital draft has no value, even if it's easier to search in it.

Many (many!) laws and treaties in the world are kept only as PDF, not all of them being searchable in plain text, unless there's been some OCR (and often correction of its output). The original papers (which have legal value) are kept in museums or official national libraries and are no longer freely accessible to the public, and that's why facsimile PDFs are created to make them accessible (and possibly signed digitally by the official library or some national authority).

Lots of organisations archive their legal papers only as PDF and recycle the original paper. This is authorized by national laws, provided they insert a verifiable signature in them, certifying their date. No alteration of the content is then authorized, as these PDFs become the new original (except adding new digital signatures, or possibly dropping some of them other than the initial dated one, whose security may have weakened over time and which the legitimate right holder then needs to supplement with new, stronger signatures; the history of signatures will be kept).

Being able to search in a PDF is a distinct goal, not meant directly for archiving, but for using PDFs on their own as *working* documents. For archives, however, the ability to search in them may be provided by separate data (without legal standing) stored in the archive index, along with the unaltered (and legal) PDF.
PDFs are not meant to be used for presentation (there are much better ways to present the content and *adapt* it to the audience or presentation medium). But presentation is also a different goal than being able to search in it.

A PDF is just a collection of rendered pages (possibly with a limited resolution, where rendered characters may be a bit fuzzy or some non-meaningful color distinctions may be deliberately lost) to be used "as is" and meant to be read by human eyes (even being able to produce an accurate OCR is not a goal of this format).

When producing the PDF, the human editor may choose to reduce the resolution, reduce the colorspace and so on, if this helps reduce the storage size and helps archiving, or helps protect the author's rights. E.g. there are different PDF versions for free online editions of newspapers, where text may be too fuzzy to be read. But there are versions for subscribers with much better quality (but possibly fewer ads), kept in archives if needed, but still not really meant to be searchable in plain text. In fact the producer may want to limit the searchability so that readers will have to look at the pages directly, and see the embedded advertising boxes even if they are not related directly to what is being searched for. The producer may provide only a limited plain-text index for some headings, but not for the content itself: readers have to scan it visually so that they cannot completely ignore the surrounding context.

The producer of the PDF then has the choice between the different options, and has different goals for the document. For legal use, there are some goals to follow, but these do not (most often) include the need to perform plain-text search in them.
This may mean that some OCR or human work will be needed later in order to index it, but this operation may be limited by author's rights, and the user assumes his own responsibility if he makes a false interpretation when using only automated tools. PDFs are meant to be read and interpreted by humans, not machines.

From dzo at bisharat.net Sun May 8 09:19:54 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 10:19:54 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net> <1246651385.10.1462502159794.JavaMail.www@wwinf1k18>
Message-ID: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>

Thanks all for the replies on this matter.

Concerning the keyboard side of the issue, there has been a lot of discussion about unified standards over the years, but what we end up with is maybe another case of "The nice thing about standards is that there are so many to choose from."

Within that, there seem to be two main questions addressed by keyboard creation: production and popular use. Many keyboards are made with production in one or maybe a couple of languages in mind - this is in line with the thinking behind creation of old 8-bit modified fonts. On the other hand is the need for keyboard layouts that can be accessed broadly without the users having to learn new key assignments at each new device. In terms of philosophy, I'd see common keyboards as more in line with the intent of Unicode.

In the ideal world, there would be no distinction between keyboards created with limited/focused production in mind (limited in the sense of one language in a multilingual society and/or focused on a particular production need), and keyboards intended to facilitate broad usage. Like a QWERTY+ or AZERTY+ perhaps?
That has not been easy - kind of another theory of everything problem.

The flexibility of touchpad keyboards in theory gets beyond the limitations of the physical keyboards - has anyone tried adding a row to, say, a QWERTY layout, which includes additional characters, rather than sweating the issues about shoehorning them in other levels or key sequences? Is that even possible? It still would be helpful to have standards, but where something is visible, it is easy to use.

On the font side, my impression (a bit dated) is that there is/was a policy dimension or gap. Back when Unicode was becoming more widely adopted, there were new computers marketed in Africa without the then limited repertoire of fonts with extended Latin. Even when these were included, there are some instances where it is possible that 8-bit fonts with extended characters were created on machines that already had one or two Unicode fonts - evidently unbeknownst to the user. So there was, and always has been, a public education side to this that none of us in a position to do so, or with an interest in doing so, have been able to address.

In the background one should bring in the issue of whether computer science students and IT experts in Africa had any introduction to Unicode. That could be a big missing piece in the equation.

The case of the Chinese publications using modified 8-bit fonts for both Hausa boko and Chinese pinyin is a specialized one. Given the small number of people working on both those languages, it may be just the chance outcome of their not being aware that Unicode already had their needs covered. A specialized keyboard for production of text including hooked consonants and tone-marked vowels, plus awareness of Unicode, would probably set them on a new course.

Marcel, I would be very interested to know more about what you are working on wrt Bambara - perhaps offline.
Don

On 5/5/2016 10:35 PM, Marcel Schneider wrote:
> On Sat, 30 Apr 2016 13:27:02 -0400, Don Osborn wrote:
>
>> If the latter be the case, that would seem to have implications regarding dissemination of information about Unicode. "If you standardize it, they will adopt" certainly holds for industry and well-informed user communities (such as in open source software), but not necessarily for more localized initiatives. This is not to seek to assign blame in any way, but rather to point out what seems to be a persistent issue with long term costs in terms of usability of text in writing systems as diverse as Bambara, Hausa boko, and Chinese pinyin.
>
> The situation Don describes challenges the work that is already done and ongoing in Mali, with several keyboard layouts at hand. If widening the range is really suitable, one might wish to test a couple of solutions other than those already mentioned, which roughly fall into two subsets:
>
> 1) Letters on the digits row. Thanks to a kindly shared resource, I'm able to tell that over a dozen Windows layouts (mainly French, as used in Mali, but also Lithuanian, Czech, Slovak, and Vietnamese) have the digits in the Shift or AltGr shift states. The latter is the only useful way of mapping letters on digit keys, and it becomes handy if the Kana toggle is added, either alone or in synergy with the Kana modifier instead of AltGr. With all bracketing characters in group 2 level 1 on the home row and so on, there is enough room to have all characters for Bambara and French directly accessible.
>
> 2) Letters through dead keys. This is the ISO/IEC 9995 way of making more characters available in additional groups with dead key group selectors (referred to as remnant modifiers but actually implemented as dead keys). This is also one way SIL/Tavultesoft's layouts work for African and notably for Malian languages. IME-based keyboarding software may additionally offer a transparent input experience.
>
> On Mon, 2 May 2016 12:03:58 -0400, Ed Trager wrote:
>
>> Also with web applications the "software installation" issue is eliminated. Remember that while it is easy for technologically savvy folks like members of this mailing list to install keyboard drivers on any platform we like, this process is somewhat beyond the reach of many people I know, even when they are otherwise fairly comfortable using computers.
>
> I can't easily believe that people who are comfortable with computers would have trouble using the largely automated keyboard layout installation feature. I have both experienced myself and observed in other people I know that there is instead a kind of reluctance, based on the belief (call it a myth or an urban legend) that Windows plus preinstalled software plus MS Office comes with everything any user may need until the next update. Informing people about Microsoft's help for customizing the keyboard is more complicated, though, in that the display is part of the hardware and the functioning behind it is more of a black box.
>
> As I am currently working on such a project for the fr-FR locale, I already have some ideas for Bambara. I hope it can be online soon.
>
> Kind regards,
>
> Marcel
>

From verdy_p at wanadoo.fr Sun May 8 10:24:17 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 8 May 2016 17:24:17 +0200
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net> <1246651385.10.1462502159794.JavaMail.www@wwinf1k18> <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
Message-ID:

2016-05-08 16:19 GMT+02:00 Don Osborn :

> The flexibility of touchpad keyboards in theory gets beyond the limitations of the physical keyboards - has anyone tried adding a row to say a QWERTY layout, which includes additional characters, rather than sweating the issues about shoehorning them in other levels or key sequences? Is that even possible? Still would be helpful to have standards, but where something is visible, it is easy to use.
>

It is technically possible, but the problem is to add distinctive hardware "scan codes" to keys in this row. See this table: https://msdn.microsoft.com/en-us/library/aa299374(v=vs.60).aspx

You'll note that almost all scancodes in the 7-bit range are used. So you'd need "extended scancodes", i.e. prefixing the special virtual scancode 00 on Windows (or the hardware scancode E0) before the extended scan code for the actual key. (The special scancode "00" turns the 7-bit table into an equivalent 8-bit table, but note that keyboards use 7-bit scancodes only, as the 8th bit is used for the press/release flag.)

For that, you could then reuse the scancodes of the first row (those for digits). Note that the scancodes for the row of "standard" function keys (F1..F12) are already extended this way (for additional function keys).
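
A minimal sketch of the prefix-and-flag scheme described above, assuming the classic PC (set 1 style) byte stream in which E0 announces an extended key and the 8th bit marks a release. The names `KeyEvent` and `decode_scancodes` are made up for illustration, the E0-02 key in the sample input is hypothetical, and the E1 prefix is deliberately not handled:

```python
from typing import Iterable, Iterator, NamedTuple

EXTENDED_PREFIX = 0xE0  # hardware prefix byte announcing an extended key
RELEASE_FLAG = 0x80     # 8th bit: set on release, clear on press
CODE_MASK = 0x7F        # the remaining 7 bits identify the key


class KeyEvent(NamedTuple):
    code: int        # 7-bit scancode
    extended: bool   # True if the byte was preceded by the E0 prefix
    pressed: bool    # True for a press, False for a release


def decode_scancodes(stream: Iterable[int]) -> Iterator[KeyEvent]:
    """Turn raw keyboard bytes into (code, extended, pressed) events."""
    extended = False
    for byte in stream:
        if byte == EXTENDED_PREFIX:
            # Remember the prefix; it applies to the next byte only.
            extended = True
            continue
        yield KeyEvent(byte & CODE_MASK, extended, not (byte & RELEASE_FLAG))
        extended = False


# Press and release of the "1" key (scancode 02), then press and release
# of a hypothetical extra-row key that reuses the same code as E0-02.
events = list(decode_scancodes([0x02, 0x82, 0xE0, 0x02, 0xE0, 0x82]))
for event in events:
    print(event)
```

The point of the sketch is only that the same 7-bit code can identify two different keys once the prefix is taken into account, which is what would let an added row reuse the digit-row codes.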
But note also this table: https://www.win.tue.nl/~aeb/linux/kbd/scancodes-10.html

You'll see that the hardware scancodes E0-0A and E0-0B are already assigned on PC for special functions, and so cannot be used to "extend" the keys for digits 9 and 0 on the first row (whose scancodes are 0A and 0B respectively). This is not so critical: you can perfectly well have additional keys assigned for a row using non-contiguous hardware scancodes (after all, the alphabetic part of the keyboard already uses multiple ranges of hardware and virtual scancodes).

But you'd need a new keyboard driver (and an extension to MSKLC on Windows) to allow mapping this supplementary row, and an industry agreement to assign new extended keys in non-conflicting ways. These days, it is the Microsoft hardware labs that centralize the extensions used on PC-compatible hardware. Apple used to have its own registry for its own keyboards, but now Macs are PCs and can use the same keyboards, not necessarily built by Apple (e.g. by Logitech). The connectors are compatible with the same USB interface.

There are some differences in the hardware scancodes used on the USB interface. Windows internally translates hardware scancodes for some interfaces into the same virtual scancodes before sending them to upper keyboard drivers and applications: this is where scancode E0 on the old PC-keyboard interface, the newer PS/2 interface, the USB interface, or the old BIOS interface is remapped into the same virtual scancode 00 for Windows drivers and apps.

There's also an additional hardware extension code E1 for a few function keys (it is used for a few functions encoded on 3 bytes, for upward compatibility reasons, such as the "Pause" key). Various other vendors have used specific hardware scancodes, but today almost everyone agrees to the same PC standard.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Sun May 8 11:50:29 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 8 May 2016 10:50:29 -0600
Subject: Non-standard 8-bit fonts still in use
Message-ID: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>

Don Osborn wrote:

> Concerning the keyboard side of the issue, there has been a lot of discussion about unified standards over the years, but what we end up with is maybe another case of "The nice thing about standards is that there are so many to choose from."

There are a zillion keyboard layouts, not because of too many conflicting standards per se, but primarily because people don't want to change away from the layout they're familiar with, and secondarily because different languages have different needs.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From dzo at bisharat.net Sun May 8 13:11:20 2016
From: dzo at bisharat.net (Don Osborn)
Date: Sun, 8 May 2016 14:11:20 -0400
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell>
Message-ID: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>

Thanks Doug. You're right as far as that goes, but I'd suggest there's more to it.

Languages (by which of course we mean their written forms) have requirements, and for cross-border languages, requirements may be defined differently by the different countries where they are spoken. And users have needs and experience.

In the multilingual settings I'm most interested in, the language requirements often overlap, sometimes considerably (thinking here of extended Latin alphabets). This is because many languages use characters that are part of the African Reference Alphabet. So it is possible to have one keyboard layout for each language, or merge requirements if you will for two or more.
When the A12n-collab group was active* one concept discussed at some length was a "pan-Sahelian" layout that could serve many languages across a number of countries. But even then, considering variations by country (orthographies often set by country not by language), there can be several possible sets of language requirements, in a "pan-Sahelian" layout. And that's just one example.

Then there is the question of key assignments for any given character. Unfortunately in Africa there are not established layouts to deal with - most formally educated people will be most familiar with QWERTY or AZERTY for the official languages. Everything else is pretty much a matter of choice, although some small communities of users may have developed familiarity with particular layouts (perhaps a reason for persistence of something like Bambara Arial).

So another reason there are a zillion keyboards is that people are inventing them - for good reasons and intent, we can admit, but often without awareness of other efforts, or communication with other communities of users.

You are right however that none of these are standards (with a possible exception - would have to go back and check) - I was trying to be clever - but there are different layouts.

Another thing about user needs is that the polyglot/pluriliterate user may prefer something that reflects that, as opposed to having multiple keyboards for languages whose character repertoires are much the same. From a national or regional (sub-continental) point of view I would think a one-size fits all/many standard or set of keyboard standards would be ideal. But no one seems to be going there yet, after all these years.

And one could go on.

To get this a little on-topic for the list, the good news is that Unicode means we're talking just about keyboards and not about multiple incompatible fonts as well.

Don

* I'm floating the idea of a new list on the full spectrum of African languages & technology issues.
Anyone interested or who has thoughts on that idea one way or another, please contact me offline.

On 5/8/2016 12:50 PM, Doug Ewell wrote:
> Don Osborn wrote:
>
>> Concerning the keyboard side of the issue, there has been a lot of discussion about unified standards over the years, but what we end up with is maybe another case of "The nice thing about standards is that there are so many to choose from."
>
> There are a zillion keyboard layouts, not because of too many conflicting standards per se, but primarily because people don't want to change away from the layout they're familiar with, and secondarily because different languages have different needs.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From doug at ewellic.org Sun May 8 13:31:59 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 8 May 2016 12:31:59 -0600
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell> <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
Message-ID: <00D2CA093AAB485D9C618AC09856ECCD@DougEwell>

Don Osborn wrote:

> In the multilingual settings I'm most interested in, the language requirements often overlap, sometimes considerably (thinking here of extended Latin alphabets). This is because many languages use characters that are part of the African Reference Alphabet. So it is possible to have one keyboard layout for each language, or merge requirements if you will for two or more. When the A12n-collab group was active* one concept discussed at some length was a "pan-Sahelian" layout that could serve many languages across a number of countries.

I wonder if there is a good and fairly comprehensive reference to the most common Latin-based alphabets used for African languages, comparable to Michael Everson's "The Alphabets of Europe" [1].
Such would be helpful for determining the level of effort to create a pan-African keyboard layout, or to adapt (if necessary) an existing multilingual layout like John Cowan's Moby Latin [2].

[1] http://www.evertype.com/alphabets/
[2] http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

--
Doug Ewell | http://ewellic.org | Thornton, CO

From dzo at bisharat.net Sun May 8 14:15:20 2016
From: dzo at bisharat.net (dzo at bisharat.net)
Date: Sun, 8 May 2016 19:15:20 +0000
Subject: Non-standard 8-bit fonts still in use
Message-ID: <132317239-1462734909-cardhu_decombobulator_blackberry.rim.net-615064916-@b2.c1.bise6.blackberry>

Rhonda Hartell did a compilation based on available info, published 23 yrs ago by SIL. Christian Chanard put that info into a database, Systemes alphabetiques, accessible via links from http://www.bisharat.net/wikidoc/pmwiki.php/PanAfrLoc/WritingSystems#toc11

All I have right now (taking break from shoveling leaf compost).

Don

------Original Message------
From: Doug Ewell
Sender: Unicode
To: unicode at unicode.org
To: Don Osborn
Subject: Re: Non-standard 8-bit fonts still in use
Sent: May 8, 2016 2:31 PM

Don Osborn wrote:

> In the multilingual settings I'm most interested in, the language requirements often overlap, sometimes considerably (thinking here of extended Latin alphabets). This is because many languages use characters that are part of the African Reference Alphabet. So it is possible to have one keyboard layout for each language, or merge requirements if you will for two or more. When the A12n-collab group was active* one concept discussed at some length was a "pan-Sahelian" layout that could serve many languages across a number of countries.

I wonder if there is a good and fairly comprehensive reference to the most common Latin-based alphabets used for African languages, comparable to Michael Everson's "The Alphabets of Europe" [1].
Such would be helpful for determining the level of effort to create a pan-African keyboard layout, or to adapt (if necessary) an existing multilingual layout like John Cowan's Moby Latin [2].

[1] http://www.evertype.com/alphabets/
[2] http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

--
Doug Ewell | http://ewellic.org | Thornton, CO

Sent via BlackBerry by AT&T

From charupdate at orange.fr Mon May 9 10:16:42 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 9 May 2016 17:16:42 +0200 (CEST)
Subject: Non-standard 8-bit fonts still in use
In-Reply-To: <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
References: <56204330.6010106@bisharat.net> <0cf8d7fb-d651-592a-c0fc-15e1cf77d87d@bisharat.net> <1246651385.10.1462502159794.JavaMail.www@wwinf1k18> <4135a342-82b1-0dd2-8c3e-006a757554bc@bisharat.net>
Message-ID: <1339982434.13627.1462807003259.JavaMail.www@wwinf1n25>

On Sun, 8 May 2016 10:19:54 -0400, Don Osborn wrote:

> Marcel, I would be very interested to know more about what you are working on wrt Bambara - perhaps offline.

Thank you for your interest. I'm glad to come in touch with ongoing work, and I have already started mailing, but eventually I would like to acknowledge it on-list; although...

On Sun, 8 May 2016 14:11:20 -0400, Don Osborn wrote:

> To get this a little on-topic for the list, the good news is that Unicode means we're talking just about keyboards and not about multiple incompatible fonts as well.

Indeed; however, font issues are IMHO even more suitable for the List (though strictly they are out of scope too) than keyboard layouts, which must not be discussed on the Unicode List. Only giving some hints is suitable, as has been done in this thread up to now. Consequently I switched off-list immediately. But here I'm doing some metadiscussion, so please disregard.

> In the background one should bring in the issue of whether computer science students and IT experts in Africa had any introduction to Unicode. That could be a big missing piece in the equation.

For future archive readers there may be some need to recall that this phenomenon is a global one. A lack of Unicode training is observed in Europe as well, and on other continents. Please see the following recent thread:

Unicode in the Curriculum? from Andre Schappo on 2015-12-30 (Unicode Mail List Archive). Retrieved March 11, 2016, from http://www.unicode.org/mail-arch/unicode-ml/y2015-m12/0073.html

> On the font side, my impression (a bit dated) is that there is/was a policy dimension or gap. Back when Unicode was becoming more widely adopted, there were new computers marketed in Africa without the then limited repertoire of fonts with extended Latin. Even when these were included, there are some instances where it is possible that 8-bit fonts with extended characters were created on machines that already had one or two Unicode fonts - evidently unbeknownst to the user. So there was, and always has been, a public education side to this that none of us in position or interest to do so have been able to address.

Please see also the capital left-hook N glyph issue Don documented at the very beginning of this thread:

Non-standard 8-bit fonts still in use from Don Osborn on 2015-10-15 (Unicode Mail List Archive). (2015, October 21). Retrieved October 21, 2015, from http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0135.html

For one more comment on that issue: http://unicode.org/mail-arch/unicode-ml/y2015-m10/0214.html

On Sun, 8 May 2016 12:31:59 -0600, Doug Ewell wrote:
> Don Osborn wrote:
>
>> In the multilingual settings I'm most interested in, the language requirements often overlap, sometimes considerably (thinking here of extended Latin alphabets).
>> This is because many languages use characters that are part of the African Reference Alphabet. So it is possible to have one keyboard layout for each language, or merge requirements if you will for two or more. When the A12n-collab group was active* one concept discussed at some length was a "pan-Sahelian" layout that could serve many languages across a number of countries.
>
> I wonder if there is a good and fairly comprehensive reference to the most common Latin-based alphabets used for African languages, comparable to Michael Everson's "The Alphabets of Europe" [1]. Such would be helpful for determining the level of effort to create a pan-African keyboard layout, or to adapt (if necessary) an existing multilingual layout like John Cowan's Moby Latin [2].
>
> [1] http://www.evertype.com/alphabets/
> [2] http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html

On Sun, 8 May 2016 19:15:20 +0000, dzo at bisharat.net replied:
> Rhonda Hartell did a compilation based on available info, published 23 yrs ago by SIL. Christian Chanard put that info into a database, Systemes alphabetiques, accessible via links from http://www.bisharat.net/wikidoc/pmwiki.php/PanAfrLoc/WritingSystems#toc11
>
> All I have right now (taking break from shoveling leaf compost).

Thanks for this resource. I've taken a look and I like the interface. But an update is missing or, more accurately, the source was outdated, as shows up in the Bambara section, which does not take into account the new orthography even though it had already been valid for over a decade (1982..1993). Sadly, this valuable database is unreliable unless the data is revised. I hope that can be done soon. Unfortunately, I'm unable to do this job.
Best regards,

Marcel

From otto.stolz at uni-konstanz.de Tue May 10 05:10:35 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Tue, 10 May 2016 12:10:35 +0200
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
In-Reply-To: <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
References: <8C690B446EEC4E728B4A7A98A0C6A759@DougEwell> <72eec00e-958e-fd45-e492-ff67449c8309@bisharat.net>
Message-ID: <5731B39B.8000501@uni-konstanz.de>

Hello,

on 2016-05-08 at 20:11, Don Osborn wrote:

> Another thing about user needs is that the polyglot/pluriliterate user may prefer something that reflects that, as opposed to having multiple keyboards for languages whose character repertoires are much the same. From a national or regional (sub-continental) point of view I would think a one-size fits all/many standard or set of keyboard standards would be ideal. But no one seems to be going there yet, after all these years.

Yes, there is somebody going there. E. g., the German standard DIN 2137:2012-06 defines a "T2" layout which is meant for all official, Latin-based orthographies worldwide, and additionally for the Latin-based minority languages of Germany and Austria. The layout is based on the traditional QWERTZU layout for German and Austrian keyboards (which is now dubbed "T1"). Cf. .

There is also a "T3" layout defined which comprises all characters mentioned in ISO/IEC 9995-3:2010.

You can even buy a hardware T2 keyboard; however I have not tried it, because I have defined my own keyboard layout suite (pan-European Latin, pan-European Cyrillic, monotonic Greek, and Yiddish) for personal use, long ago.
Best wishes,

Otto Stolz

From doug at ewellic.org Tue May 10 09:55:42 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 10 May 2016 07:55:42 -0700
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
Message-ID: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>

Otto Stolz wrote:

> Yes, there is somebody going there. E. g., the German standard DIN 2137:2012-06 defines a "T2" layout which is meant for all official, Latin-based orthographies worldwide, and additionally for the Latin-based minority languages of Germany and Austria. The layout is based on the traditional QWERTZU layout for German and Austrian keyboards (which is now dubbed "T1").
> Cf. .

Yes, but there's the rub. QWERTY users are about as willing to switch to QWERTZ in the name of global standardization as Germans would be to switch to QWERTY.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue May 10 10:30:25 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 10 May 2016 17:30:25 +0200
Subject: Polyglot keyboards (was: Non-standard 8-bit fonts still in use)
In-Reply-To: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
Message-ID:

Very true, and this will likely not change. Even users of "ergonomic" layouts want to keep this ergonomy for their letters (and letter pairs).

All that can reasonably be done is to extend existing layouts with minimal changes: basic letters, decimal digits, and basic punctuation must remain in the same place (and there's also some resistance regarding the few most common additional letters used in each language, which are typically placed on the 1st row or near the Enter key).
What is likely to change is the placement of combinations using AltGr on the first row (but on non-US keyboards, these also include some ASCII characters considered essential on a computer, like the backslash, hash sign, tilde, at sign, or underscore). This leaves little freedom for changes except for keys currently assigned to less essential characters such as the degree sign, the micro sign, the pound sign (in countries not using this symbol daily), the "universal" currency sign, the paragraph mark... Those can be used to fit better candidates for extensions.

But without an extension of keyboard rows, it will be difficult to achieve wide adoption on physical keyboards. Function keys F1..F12 may be easily reduced to fit additional keys for letters and diacritics. Keyboards have instead been extended for many things that most people in fact almost never use or don't need there, such as multimedia keys, shortcuts to launch the browser or calculator app, the contextual menu/options key (added by Windows), or TWO (sic!) keys for the Windows key (keep only one and map the few additional keys found on Japanese keyboards).

But it is challenging to have decent sizes for keys on notebook keyboards, which are already extremely packed (F1..F12 are already reduced vertically). They invented another way: using a new "Fn" mode key for additional multimedia keys (or keys for switching the Wifi, Bluetooth or display adapters, or to control the display brightness or sound volume/mute, or to eliminate the PrintScreen function, or the ScrollLock or NumLock mode switch keys). A few of them added a couple of character keys for currency units ($ and ?) instead of the Japanese mode keys.

In fact every brand has done what it wanted to extend the keyboards... except for really extending the usable alphabets.
For virtual on-screen layouts, there's much more freedom, as the display panel is adaptive and allows more innovative input methods, or things never found on physical keyboards such as entering emojis.

2016-05-10 16:55 GMT+02:00 Doug Ewell :

> Otto Stolz wrote:
>
> > Yes, there is somebody going there. E. g., the German standard DIN 2137:2012-06 defines a "T2" layout which is meant for all official, Latin-based orthographies worldwide, and additionally for the Latin-based minority languages of Germany and Austria. The layout is based on the traditional QWERTZU layout for German and Austrian keyboards (which is now dubbed "T1").
> > Cf. .
>
> Yes, but there's the rub. QWERTY users are about as willing to switch to QWERTZ in the name of global standardization as Germans would be to switch to QWERTY.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>

From otto.stolz at uni-konstanz.de Tue May 10 11:42:27 2016
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Tue, 10 May 2016 18:42:27 +0200
Subject: Polyglot keyboards
In-Reply-To:
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com>
Message-ID: <57320F73.2010001@uni-konstanz.de>

Hello,

I had written:

> .

On 2016-05-10 16:55 GMT+02:00 Doug Ewell has written:

> QWERTY users are about as willing to switch to QWERTZ

I never meant that QWERTY (or AZERTY) users should switch to QWERTZ. I just wanted to point to one instance of an officially standardized polyglot keyboard layout. E. g., there is already the Canadian multilingual keyboard, cf. , based on the traditional QWERTY layout.

I do hope that other standards bodies will follow suit and define their own QWERTY, or AZERTY, or whatever versions of polyglot keyboard layouts, in accordance with ISO/IEC 9995.
On 2016-05-10 at 17:30, Philippe Verdy wrote:

> All that can be made reasonable is to extend existing layouts with minimal changes: ...
> This leaves little freedom for changes except for keys currently assigned to less essential characters such as the degree sign, the micro sign, the pound sign (in countries not using this symbol daily), the "universal" currency sign, the paragraph mark... Those can be used to fit better candidates for extensions.

Another option (which I exploited for my personal keyboard layouts) is the re-definition of a special-character key to work as a dead key. E. g., on my personal keyboard, the '"' key gives access to all sorts of quote characters (for French, German, English, ..., even ASCII), depending on the following key; the '~' key works as a tilde accent on the letter typed subsequently; and so on.

This scheme allows the conventional QWERTZ hardware to be used for multilingual typing, with minimal re-learning and training. And still the ??? key produces the ??? character :-)

Best wishes,

Otto Stolz

From charupdate at orange.fr Tue May 10 18:09:42 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 11 May 2016 01:09:42 +0200 (CEST)
Subject: Polyglot keyboards
In-Reply-To: <57320F73.2010001@uni-konstanz.de>
References: <20160510075542.665a7a7059d7ee80bb4d670165c8327d.4bfaba0439.wbe@email03.godaddy.com> <57320F73.2010001@uni-konstanz.de>
Message-ID: <220047750.20065.1462921782885.JavaMail.www@wwinf1h10>

On Tue, 10 May 2016 12:10:35 +0200, Otto Stolz wrote:

> [...] the German standard DIN 2137:2012-06 defines a "T2" layout which is meant for all official, Latin-based orthographies worldwide, and additionally for the Latin-based minority languages of Germany and Austria. The layout is based on the traditional QWERTZU layout for German and Austrian keyboards (which is now dubbed "T1").
> Cf. .
>
> There is also a "T3" layout defined which comprises all characters mentioned in ISO/IEC 9995-3:2010.
Wasn't it the other way round? As far as I remember the sources, to stick with the tradition of referring to an ISO subset of Unicode (MES-1 for ISO/IEC 9995-3:2002), the German NB urged ISO to adopt a new subset tailored for the then upcoming ISO/IEC 9995-3:2010, which in turn was intended to hold the invoked DIN 2137:2012, which was overflowing the ISO keyboard framework on other sides too, leading to the addition of part 11 last year.

As for the extent of the new Unicode subset, other problems arose from its being tailored for a given keyboard layout that did not make full use of the existing keyboard resources of the mainstream operating system. As a result, several Latin letters are missing, ending up in a murky mix of support and non-support across the continents that use the Latin script. While the subset claims coverage of several African and American languages, other African and American languages are unsupported, notably through the lack of ?, ?, ?. Remember that Bamanankan is an official language of Mali.

Having promised not to stay discussing keyboard layouts on the Unicode List, I can't help recalling in this *new* thread the harm done to communities that use the Latin script by excluding their alphabets from an internationally designed keyboard standard in the era of globalisation.

Everybody on this List remembers the oddities that followed the launch of the Multilingual Latin Subset, renamed on the spot from the originally proposed "Multilingual International Subset" because it covers neither Greek nor Cyrillic, and subsequently annotated at the request of ANSI, prompted by a paper from Denis Jacquerye, as not covering all languages that use the Latin script, in order to avoid misleading future font designers.

Marcel

From rwhlk142 at gmail.com Tue May 10 18:55:23 2016
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Tue, 10 May 2016 19:55:23 -0400
Subject: The Hebrew Extended (Proposed) Block
Message-ID:

Hello again, y'all!

BAD NEWS!
(CRUCIALLY IMPORTANT): The Unicode Consortium has assigned OTHER characters into the U+00860-U+008FF areas in the BMP of Unicode: Malayalam extended additional characters for Garshuni, and more additional Arabic characters.

We'll need to find a DIFFERENT subblock to plant down our Hebrew extended characters... either somewhere ELSE within the BMP, *or* somewhere within either SMP areas 1 or 2. It'll be the same arrangement originally planned for the U+00860 area, but relocated and expanded upon!

- Additional characters for correct typesetting of Hebrew
- Hebrew Palestinian vowel and pronunciation points
- The small superscript signs *?in* and *shin* for the letter *shin*
- Hebrew Palestinian cantillation
- Hebrew Babylonian vowel and pronunciation points
- Hebrew Babylonian cantillation
- Hebrew Samaritan vowel and pronunciation points
- Additional Hebrew characters for other Jewish languages

A new TXT listing of this subblock (with the new CORRECT location) will be forthcoming. STAY TUNED!

From rwhlk142 at gmail.com Tue May 10 20:08:58 2016
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Tue, 10 May 2016 21:08:58 -0400
Subject: Moving The Hebrew Extended Block Into The SMP
Message-ID:

Hello again! Shalom!

After reading through the V. 9?
code charts PDF document, I DID find a new area to relocate our new Hebrew Extended block (a very important area to add into Unicode):

THE AREA FROM U+30000 TO U+3014F (336 codepoints)

- U+30000-U+30014 (21 codepoints): Additional characters for typesetting Biblical/Classical Hebrew
- U+30015-U+3001F (11 codepoints): Palestinian vowel and pronunciation points for Hebrew and Galilean Aramaic
- U+30020-U+30021 (2 codepoints): Small superscript top-left signs for the letter *shin*: superscript ?in and superscript shin
- U+30022-U+30041 (32 codepoints): Palestinian cantillation signs for Hebrew and Galilean Aramaic
- U+30042 is reserved
- U+30043-U+3005C (26 codepoints): Babylonian vowel and pronunciation points for Hebrew
- U+3005D-U+3005F are reserved
- U+30060-U+30071 (18 codepoints): Babylonian cantillation signs for Hebrew
- U+30072-U+3007D are reserved
- U+3007E-U+3008F (18 codepoints): Samaritan vowel points, pronunciation points, and cantillation signs for Hebrew (copies of those also being used for Samaritan script in BMP)
- U+30090-U+3010F (128 codepoints): Additional characters in Hebrew script for other Jewish languages (these are pointed like the corresponding Arabic characters in the BMP)
- U+30110-U+3012F (32 codepoints): Basic Hebrew superscript characters (regular letters+5 final forms+top-left pointed *?in*+top-right pointed *shin*+*maqqef*)
- U+30130-U+3014F (32 codepoints): Basic Hebrew subscript characters (regular letters+5 final forms+top-left pointed *?in*+top-right pointed *shin*+*maqqef*)

Please STAY TUNED for updates. Thank You!

From mark at kli.org Tue May 10 21:23:58 2016
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 10 May 2016 22:23:58 -0400
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To:
References:
Message-ID: <301fc24e-eb15-74f9-415b-bb1d24c9bf3a@kli.org>

Sounds like a plan; most additional Hebrew characters can probably safely live in the SMP, as they are not all that common (except, of course, TETRAGRAMMATON, which I'll be writing another proposal about).

What Samaritan vowel and accent points did we miss when we did Samaritan the first time around? We tried to be pretty comprehensive with it, including contact with the user community and inspecting books and MSS.

Somewhere I have a list of signs I started making by reading an entry in an encyclopedia (Encyclopedia Judaica?) s.v. "Masorah". Ah, found it. Various lines, strokes, dots, colons, pairs of dots in assorted configurations around letters (Palestinian and Babylonian vowel points, etc)... A bunch of combining letters (COMBINING SAMEKH ABOVE, etc), some not exactly normal (SLANTED NUN ABOVE)... I think I had about sixty. But it isn't particularly well-organized or researched.

There is also the "Expanded" Tiberian cantillation system I have seen mentioned (in Yeivin's book on Masorah for example, in the part on accents, para. #220). It seems to distinguish things like different flavors of MUNAH; I have never really found much about it, so I don't know if it needs special graphemes. The only examples in the Yeivin book that I see appear to use existing symbols in combinations (e.g. MUNAH plus a MERKHA KEFULA for a "mekarbel").

What other Hebrew characters have you got in mind? Could be interesting. Are you considering symbols for PETUHA and SETUMA pericopes in your "typesetting" section? Are those fit to be encoded? I think they've been mentioned before, but it's hard to show that they are anything other than specialized uses of PEH and SAMEKH (unless we're talking about using them as formatters, and then they're pretty definitely out of scope).
~mark On 05/10/2016 07:55 PM, Robert Wheelock wrote: > Hello again, y?all! > > ?BAD NEWS! (CRUCIALLY IMPORTANT): The Unicode Consortium has assigned > OTHER characters into the U+00860-U+008FF areas in the BMP of > Unicode?Malayalam extended additional characters for Garshuni, and > more additional Arabic characters. > > We?ll need to find a DIFFERENT subblock to plant down our Hebrew > extended characters... either somewhere ELSE within the BMP, > _or_ somewhere within either SMP areas 1 or 2. > It?ll be the same arrangement originally planned for the U+00860 > area?but relocated and expanded upon! > > ?Additional characters for correct typesetting of Hebrew > ?Hebrew Palestinian vowel and pronunciation points > ?The small superscript signs /?in/ and /shin/ for the letter /shin/ > ?Hebrew Palestinian cantillation > ?Hebrew Babylonian vowel and pronunciation points > ?Hebrew Babylonian cantillation > ?Hebrew Samaritan vowel and pronunciation points > ?Additional Hebrew characters for other Jewish languages > A new TXT listing of this subblock (with the new CORRECT location) > will be forthcoming. STAY TUNED! > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue May 10 21:32:33 2016 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 10 May 2016 22:32:33 -0400 Subject: Moving The Hebrew Extended Block Into The SMP In-Reply-To: References: Message-ID: Oh yeah. I also wonder a bit about things like the "half-letters" that were used sometimes in early Hebrew printing to fill out space left at the end of a line. They would often write part of the next word, the first few letters, but maybe the last letter was missing part of it, or just random semi-characters (things like a SHIN with only two heads shows up a lot, or even complete SHINs). http://xkcd.com/1676/ got me thinking of it. They're probably not encodable... or are they? I'll have to find some example scans. If it's as common as I say, that should be easy... 
unless I'm wrong about that, which I guess would make the whole question easier too. ~mark From mark at kli.org Tue May 10 21:46:04 2016 From: mark at kli.org (Mark Shoulson) Date: Tue, 10 May 2016 22:46:04 -0400 Subject: Moving The Hebrew Extended Block Into The SMP In-Reply-To: References: Message-ID: <8f8e40e7-c930-1988-9ea4-1d8aceb3900c@kli.org> On 05/10/2016 09:08 PM, Robert Wheelock wrote: > > ?U+30000?U+30014 (21 codepoints): Additional characters for > typesetting Biblical/Classical Hebrew Do you have this list available yet? I'm curious about these points, and others. > ?U+30015?U+3001F (11 codepoints): Palestinian vowel and pronunciation > points for Hebrew and Galilean Aramaic > ?U+30020?U+30021 (2 codepoints): Small superscript top-left signs for > the letter /shin/?superscript ?in and superscript shin I thought SIN was indicated sometimes by a SAMEKH written above the letter. How would putting a SIN (which is just a SHIN with a dot on the left instead of the right) on top of the letter be any improvement (or difference) over just putting the dot on the left of the base letter in the first place? > ?U+30022?U+30041 (32 codepoints): Palestinian cantillation signs for > Hebrew and Galilean Aramaic > ?U+30042 is reserved > ?U+30043?U+3005C (26 codepoints): Babylonian vowel and pronunciation > points for Hebrew > ?U+3005D?U+3005F are reserved > ?U+30060?U+30071 (18 codepoints): Babylonian cantillation signs for > Hebrew > ?U+30072?U+3007D are reserved > ?U+3007E?U+3008F (18 codepoints): Samaritan vowel points, > pronunciation points, and cantillation signs for Hebrew (copies of > those also being used for Samaritan script in BMP) OK, here I'm confused. Why do we need copies? Unicode doesn't like to encode redundant things, and it only makes for messes (when do you use which ZIQAA?) If we have the characters in the BMP, we don't need them in the SMP. 
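The ranges quoted in this thread can be sanity-checked with a little arithmetic. The sketch below is only an illustration and is not part of any message; `plane` and `span` are hypothetical helper names. It relies solely on the fact that each plane holds 0x10000 code points, which suggests that U+30000 falls in plane 3 rather than in the SMP (plane 1), and that U+30000 to U+3014F does cover the 336 code points mentioned.

```python
# Hypothetical helper names; only the plane arithmetic itself is standard.
PLANE_NAMES = {0: "BMP", 1: "SMP", 2: "SIP", 3: "TIP"}


def plane(cp: int) -> int:
    """Return the plane number (0..16) of a code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    return cp >> 16  # each plane holds 0x10000 code points


def span(first: int, last: int) -> int:
    """Number of code points in an inclusive range."""
    return last - first + 1


if __name__ == "__main__":
    for cp in (0x0860, 0x10800, 0x30000, 0x3014F):
        p = plane(cp)
        print(f"U+{cp:04X}: plane {p} ({PLANE_NAMES.get(p, 'other')})")
    print("size of proposed block:", span(0x30000, 0x3014F))
```

The same `span` helper can be used to cross-check the per-range counts given in the proposal, for example U+30090 to U+3010F for the 128 additional letters.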
> ?U+30090?U+3010F (128 codepoints): Additional characters in Hebrew > script for other Jewish languages (these are pointed like the > corresponding Arabic characters in the BMP) So additional Hebrew "letters" that take Arabic vowel-points? Makes sense; I saw some of that with Samaritan (particularly with DAMMA). We should probably just use the Arabic vowel code-points though. > ?U+30110?U+3012F (32 codepoints): Basic Hebrew superscript characters > (regular letters+5 final forms+top-left pointed /?in/+top-right > pointed /shin/+/maqqef/) > ?U+30130?U+3014F (32 codepoints): Basic Hebrew subscript characters > (regular letters+5 final forms+top-left pointed /?in/+top-right > pointed /shin/+/maqqef/) When you say "superscript" (or "subscript"), do you mean "spacing character that's written small and raised/lowered"? Or do you mean "combining character that's written above/below another character"? cf. the difference between U+2071 SUPERSCRIPT LATIN SMALL LETTER I and U+0365 COMBINING LATIN SMALL LETTER I). If the former, is there a reason this has to be done as plain-text and can't be handled by higher-level markup? Probably every major script has been written small and high in some places, but we don't have superscript versions of every letter in Unicode. ~mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Tue May 10 22:34:05 2016 From: markus.icu at gmail.com (Markus Scherer) Date: Tue, 10 May 2016 20:34:05 -0700 Subject: The Hebrew Extended (Proposed) Block In-Reply-To: References: Message-ID: FYI It seems like 08xx is reserved for RTL scripts. http://www.unicode.org/Public/UCD/latest/ucd/extracted/DerivedBidiClass.txt # The unassigned code points that default to R are in the ranges: # [\u0590-\u05FF *\u07C0-\u089F* \uFB1D-\uFB4F \U00010800-\U00010FFF \U0001E800-\U0001EDFF \U0001EF00-\U0001EFFF] http://unicode.org/roadmaps/bmp/ 08 Samaritan Mandaic (SyrSup) ??? ??? ??? 
Arabic Extended-A

http://unicode.org/roadmaps/smp/
00010800-00010FFF Alphabetic and syllabic RTL scripts
0001E800-0001EFFF RTL scripts
- Color highlighting is used to indicate blocks and unassigned ranges which default to right-to-left character behavior.

markus

From verdy_p at wanadoo.fr Wed May 11 07:46:10 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 14:46:10 +0200
Subject: Moving The Hebrew Extended Block Into The SMP
In-Reply-To:
References:
Message-ID:

Indeed, if you need Arabic diacritics on top of Hebrew letters, just use them. There will be no defect on script breaking, except in strict security checks for identifiers, where such usage is very unlikely or only "aspirational". You could as well use Latin/generic diacritics if needed, such as a circumflex or cedilla. You could also use Latin letter-like diacritics, but not the spacing ones, such as superscript o. Combining characters should not be disunified even if they are used in several scripts, and even if those scripts have different directions, unless they behave differently, i.e. when they don't stack properly. Hebrew diacritics written above or below normally don't stack vertically but are ordered horizontally, but even in this case this can be inferred from the base letter, which determines the effective layout and even the effective glyph to use for the diacritic (e.g. with the cedilla, which sometimes attaches above left instead of below with some Latin letters that have descenders like "g", or when some accents are added to Greek letters and placed on the left of capital letters instead of above). Disunification of these diacritics, however, is needed when layouts are distinguished both visually and semantically (such as the sin vs. shin dots), and when their normalisation would cause major problems requiring systematic use of CGJ to block their reordering.
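The point about normalisation reordering and CGJ can be illustrated with a short sketch. This is a minimal example under stated assumptions, not something taken from the thread: it assumes Python 3 and its bundled `unicodedata` module, and the helper name `ccc_list` is hypothetical. It shows that the shin dot and sin dot carry distinct canonical combining classes, that canonical ordering moves a dagesh in front of a shin dot, and that an intervening U+034F COMBINING GRAPHEME JOINER keeps the typed order.

```python
import unicodedata

SHIN = "\u05E9"      # HEBREW LETTER SHIN (a starter, combining class 0)
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER (combining class 0)


def ccc_list(text):
    """Canonical combining class of every character in text (hypothetical helper)."""
    return [unicodedata.combining(ch) for ch in text]


# The two dots are separate characters with separate combining classes.
print(ccc_list(SHIN_DOT + SIN_DOT + DAGESH))

# Typed order: shin dot first, then dagesh. Canonical ordering sorts marks
# by combining class, so the dagesh ends up before the shin dot.
typed = SHIN + SHIN_DOT + DAGESH
print(ccc_list(typed), "->", ccc_list(unicodedata.normalize("NFD", typed)))

# CGJ has combining class 0, so it splits the run of marks and nothing
# is reordered across it.
blocked = SHIN + SHIN_DOT + CGJ + DAGESH
print(unicodedata.normalize("NFD", blocked) == blocked)
```

Because reordering only happens among marks with different non-zero classes, a sequence that must keep its typed order needs a CGJ at every such boundary, which is the "systematic use" cost mentioned above.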
So don't fear using Arabic points or Latin accents: on top of Hebrew letters they will be interpreted correctly in their Hebrew context, and by themselves those combining diacritics have no direction (for the Bidi algorithm, which preserves the combining clusters).

On 11 May 2016 03:28, "Robert Wheelock" wrote:
> Hello again! Shalom!
>
> After reading through the V. 9? code charts PDF document, I DID find a new
> area to relocate our new Hebrew Extended block (a very important area to
> add into Unicode):
> THE AREA FROM U+30000 TO U+3014F (336 codepoints)
> •U+30000–U+30014 (21 codepoints): Additional characters for typesetting
> Biblical/Classical Hebrew
> •U+30015–U+3001F (11 codepoints): Palestinian vowel and pronunciation
> points for Hebrew and Galilean Aramaic
> •U+30020–U+30021 (2 codepoints): Small superscript top-left signs for the
> letter *shin* – superscript ?in and superscript shin
> •U+30022–U+30041 (32 codepoints): Palestinian cantillation signs for
> Hebrew and Galilean Aramaic
> •U+30042 is reserved
> •U+30043–U+3005C (26 codepoints): Babylonian vowel and pronunciation
> points for Hebrew
> •U+3005D–U+3005F are reserved
> •U+30060–U+30071 (18 codepoints): Babylonian cantillation signs for Hebrew
> •U+30072–U+3007D are reserved
> •U+3007E–U+3008F (18 codepoints): Samaritan vowel points, pronunciation
> points, and cantillation signs for Hebrew (copies of those also being used
> for Samaritan script in BMP)
> •U+30090–U+3010F (128 codepoints): Additional characters in Hebrew script
> for other Jewish languages (these are pointed like the corresponding Arabic
> characters in the BMP)
> •U+30110–U+3012F (32 codepoints): Basic Hebrew superscript characters
> (regular letters+5 final forms+top-left pointed *?in*+top-right pointed
> *shin*+*maqqef*)
> •U+30130–U+3014F (32 codepoints): Basic Hebrew subscript characters
> (regular letters+5 final forms+top-left pointed *?in*+top-right pointed
> *shin*+*maqqef*)
> Please STAY TUNED for updates. Thank You!
>
>
>

From verdy_p at wanadoo.fr Wed May 11 08:01:09 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 15:01:09 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To:
References:
Message-ID:

So this assignment does not respect the default RTL property of the range. It would not be a problem for combining characters, but for LTR base letters in Malayalam this is a major problem... Indic scripts are already complex enough without this additional incompatibility, which will act against experimentation and effective use later. The UTC should reconsider its beta allocation before the approval by ISO.

The SMP is not a problem, and there are already several Indic scripts in the SMP that also borrow some Devanagari non-letter signs, such as punctuation, without reencoding them. We'll also have Latin extensions in the SMP, just like there are ideographic extensions outside the BMP. It's more important to preserve the default properties for compatibility.

On 11 May 2016 02:13, "Robert Wheelock" wrote:
> Hello again, y'all!
>
> ?BAD NEWS! (CRUCIALLY IMPORTANT): The Unicode Consortium has assigned
> OTHER characters into the U+00860-U+008FF areas in the BMP of
> Unicode – Malayalam extended additional characters for Garshuni, and more
> additional Arabic characters.
>
> We'll need to find a DIFFERENT subblock to plant down our Hebrew extended
> characters... either somewhere ELSE within the BMP, *or* somewhere
> within either SMP areas 1 or 2.
> It'll be the same arrangement originally planned for the U+00860 area – but
> relocated and expanded upon!
> > ?Additional characters for correct typesetting of Hebrew > ?Hebrew Palestinian vowel and pronunciation points > ?The small superscript signs *?in* and *shin* for the letter *shin* > ?Hebrew Palestinian cantillation > ?Hebrew Babylonian vowel and pronunciation points > ?Hebrew Babylonian cantillation > ?Hebrew Samaritan vowel and pronunciation points > ?Additional Hebrew characters for other Jewish languages > A new TXT listing of this subblock (with the new CORRECT location) will be > forthcoming. STAY TUNED! > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed May 11 09:40:41 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 11 May 2016 07:40:41 -0700 Subject: The Hebrew Extended (Proposed) Block Message-ID: <20160511074041.665a7a7059d7ee80bb4d670165c8327d.08c76f277a.wbe@email03.godaddy.com> Robert Wheelock wrote: > ?BAD NEWS! (CRUCIALLY IMPORTANT): The Unicode Consortium has assigned > OTHER characters into the U+00860-U+008FF areas in the BMP of > Unicode?Malayalam extended additional characters for Garshuni, and > more additional Arabic characters. Philippe Verdy replied: > Si this assignent does not respect the default rtl property of the > range. It would not be a probleme for combining characters, but for > LTR base letters in Malayalam this is a major problem... The characters proposed for U+0860 through U+086A are Syriac letters used for writing the Malayalam language. Pandey's proposal suggests they should have General Category AL, like other Syriac letters. There is no conflict in assigning these to a range designated for RTL scripts. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
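The property claim in the message above can be checked against whatever Unicode data a given runtime ships. The sketch below is a minimal illustration, assuming Python 3 and its bundled `unicodedata` module; `describe` is a hypothetical helper name. Note that AL is a Bidi_Class value, while the General Category of an ordinary letter is Lo, so the two properties are queried separately. U+0860 is only known to Python builds whose Unicode data includes the Syriac Supplement block, so the output for it is deliberately not asserted here.

```python
import unicodedata


def describe(cp):
    """Return (name, General_Category, Bidi_Class) for a code point.

    Hypothetical helper; the name falls back to a placeholder when the
    code point is unassigned in this Python's Unicode data.
    """
    ch = chr(cp)
    return (
        unicodedata.name(ch, "<not in this Unicode version>"),
        unicodedata.category(ch),
        unicodedata.bidirectional(ch),
    )


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    # An established Syriac letter, a Malayalam-script letter, and the
    # first code point of the range discussed in the thread.
    for cp in (0x0710, 0x0D05, 0x0860):
        name, gc, bidi = describe(cp)
        print(f"U+{cp:04X} {name}: gc={gc} bidi={bidi!r}")
```

On a runtime with recent Unicode data, the line for U+0860 shows whether the encoded letters follow the other Syriac letters or the Malayalam-script ones.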
From doug at ewellic.org Wed May 11 09:47:01 2016
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 11 May 2016 07:47:01 -0700
Subject: The Hebrew Extended (Proposed) Block
Message-ID: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>

I wrote:

> Pandey's proposal suggests they
> should have General Category AL, like other Syriac letters.

AL is a bidi type, not a General Category. Still.

http://www.unicode.org/reports/tr9/#AL

--
Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Wed May 11 11:05:04 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Wed, 11 May 2016 18:05:04 +0200
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>
References: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com>
Message-ID:

But are these supplemental Malayalam letters borrowed from Syriac really RTL like in the Syriac script? I have doubts (it would seriously impact the Malayalam script, which is LTR).

Maybe the letter forms are identical (or similar) but they are changed to LTR (so the disunification is justified if these are really letters).

Or these are combining diacritics (working within the Indic letter clusters), i.e. in an "M*" general category but not in an "L*" general category (in which case they are Bidi neutral and don't really need to be in the RTL range).

2016-05-11 16:47 GMT+02:00 Doug Ewell:
> I wrote:
>
> > Pandey's proposal suggests they
> > should have General Category AL, like other Syriac letters.
>
> AL is a bidi type, not a General Category. Still.
>
> http://www.unicode.org/reports/tr9/#AL
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From doug at ewellic.org Wed May 11 11:24:29 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 11 May 2016 09:24:29 -0700 Subject: The Hebrew Extended (Proposed) Block Message-ID: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com> Philippe Verdy wrote: > But are these supplemental Malayalam letters borrowed from Syriac > really RTL like in the Syriac script ? I have doubts (it would > seriously impact the Malayalam script which is LTR). > > May be the letter forms are identical (or similar) but they are > changed to LTR (so the disunicification is justified if these are > really letters). > > Or these are combining diacritics (working within the Indic letter > clusters), i.e. in a "C*" general category but not in a "L*" general > category (in which case they are Bidi neutral and don't really need to > be in the RTL range). It might help to read the proposal: http://www.unicode.org/L2/L2015/15088-syriac-malayalam.pdf -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From frederic.grosshans at gmail.com Wed May 11 11:32:50 2016 From: frederic.grosshans at gmail.com (=?UTF-8?Q?Fr=c3=a9d=c3=a9ric_Grosshans?=) Date: Wed, 11 May 2016 18:32:50 +0200 Subject: The Hebrew Extended (Proposed) Block In-Reply-To: References: <20160511074701.665a7a7059d7ee80bb4d670165c8327d.2369646f2f.wbe@email03.godaddy.com> Message-ID: <57335EB2.4090501@gmail.com> Le 11/05/2016 18:05, Philippe Verdy a ?crit : > But are these supplemental Malayalam letters borrowed from Syriac > really RTL like in the Syriac script ? I have doubts (it would > seriously impact the Malayalam script which is LTR). Since these character are uses to write the Malayalam *language* in the Syriac *script*, the borrowing is the other way around, they are essentially Malayalam (script) characters borrowed into Syriac. Fig 17 of http://www.unicode.org/L2/L2015/15156-syriac-malayalam.pdf shows an example of the look of this text. 
From verdy_p at wanadoo.fr Wed May 11 12:07:54 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 11 May 2016 19:07:54 +0200 Subject: The Hebrew Extended (Proposed) Block In-Reply-To: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com> References: <20160511092429.665a7a7059d7ee80bb4d670165c8327d.57f8d801df.wbe@email03.godaddy.com> Message-ID: 2016-05-11 18:24 GMT+02:00 Doug Ewell : > It might help to read the proposal: > > http://www.unicode.org/L2/L2015/15088-syriac-malayalam.pdf Thanks for pointing this document. Initially I had incorrectly understood that this was an extension of the Malayalam script. But it appears now to be an extension of the Syriac script instead (used to write a variant of the Malayalam language, but fully in the Syriac script instead of the Malayalam Indic script, so OK it is fully RTL). So OK the assignment in the RTL range (of the BMP) is correct (though it could still have been in an RTL range of the SMP planes). And this is clearly not a duplication of the existing Malayalam letters due to the different properties). The encoding is justified. I apologize. Note: where is the ISO form containing the formal summary of characteristics and justifications (the list of questions and checkboxes) ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Wed May 11 18:40:32 2016 From: petercon at microsoft.com (Peter Constable) Date: Wed, 11 May 2016 23:40:32 +0000 Subject: The Hebrew Extended (Proposed) Block In-Reply-To: References: Message-ID: Robert, your statement seems to have an implicit assumption that the range 0860..08FF has somehow been reserved for Hebrew. That is not the case. 
As Markus reference elsewhere, people can refer to the Roadmap charts to see what is tentatively planned for a given range: http://unicode.org/roadmaps/bmp/ If you or others are working on or considering working on a proposal for additional Hebrew characters, you should not make any firm assumptions about code point assignments until some indication of suitable ranges have been given by the Unicode Technical Committee and that has been added to the Roadmap. Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Robert Wheelock Sent: Tuesday, May 10, 2016 4:55 PM To: unicode at unicode.org Subject: RE: The Hebrew Extended (Proposed) Block Hello again, y?all! ?BAD NEWS! (CRUCIALLY IMPORTANT): The Unicode Consortium has assigned OTHER characters into the U+00860-U+008FF areas in the BMP of Unicode?Malayalam extended additional characters for Garshuni, and more additional Arabic characters. We?ll need to find a DIFFERENT subblock to plant down our Hebrew extended characters... either somewhere ELSE within the BMP, or somewhere within either SMP areas 1 or 2. It?ll be the same arrangement originally planned for the U+00860 area?but relocated and expanded upon! ?Additional characters for correct typesetting of Hebrew ?Hebrew Palestinian vowel and pronunciation points ?The small superscript signs ?in and shin for the letter shin ?Hebrew Palestinian cantillation ?Hebrew Babylonian vowel and pronunciation points ?Hebrew Babylonian cantillation ?Hebrew Samaritan vowel and pronunciation points ?Additional Hebrew characters for other Jewish languages A new TXT listing of this subblock (with the new CORRECT location) will be forthcoming. STAY TUNED! -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From ori at avtalion.name Fri May 13 12:31:35 2016
From: ori at avtalion.name (Ori Avtalion)
Date: Fri, 13 May 2016 20:31:35 +0300
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To:
References:
Message-ID:

On Wed, May 11, 2016 at 2:55 AM, Robert Wheelock wrote:
> •Additional characters for correct typesetting of Hebrew

Will this include BROKEN VAV?
http://www.sofer.co.uk/html/broken_vav.html

> •Additional Hebrew characters for other Jewish languages

Can you please provide some examples?

Any plans for Rashi Script? It doesn't seem to fit any of the categories you listed. Arguably, it's just a font, but there's precedent in Unicode :)
https://en.wikipedia.org/wiki/Rashi_script

From everson at evertype.com Fri May 13 12:59:22 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 13 May 2016 18:59:22 +0100
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To:
References:
Message-ID:

On 13 May 2016, at 18:31, Ori Avtalion wrote:

> Any plans for Rashi Script? It doesn't seem to fit any of the
> categories you listed. Arguably, it's just a font, but there's
> precedence in Unicode :)

Not good precedent, I think. Rashi would be best considered like Fraktur and Latin.

Michael Everson

From jonathan.rosenne at gmail.com Fri May 13 14:10:15 2016
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Fri, 13 May 2016 22:10:15 +0300
Subject: The Hebrew Extended (Proposed) Block
In-Reply-To:
References:
Message-ID: <000001d1ad4b$15ddb9a0$41992ce0$@gmail.com>

Rashi is a font, not a script. It has a one-to-one correspondence with standard Hebrew.

Best Regards,

Jonathan Rosenne
054-4246522

-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Everson
Sent: Friday, May 13, 2016 8:59 PM
To: unicode at unicode.org
Subject: Re: The Hebrew Extended (Proposed) Block

On 13 May 2016, at 18:31, Ori Avtalion wrote:

> Any plans for Rashi Script? It doesn't seem to fit any of the
> categories you listed.
Arguably, it's just a font, but there's > precedence in Unicode :) Not good precedent, I think. Rashi would be best considered like Fraktur and Latin. Michael Everson From jameskasskrv at gmail.com Sat May 14 15:41:11 2016 From: jameskasskrv at gmail.com (James Kass) Date: Sat, 14 May 2016 12:41:11 -0800 Subject: Klingon text in legal brief In-Reply-To: References: Message-ID: As a certain character from TOS would say, "fascinating". It's surprising that nobody commented on Ken Shirriff's post. If this had been posted ten years ago it probably would have generated more activity on this list. Best regards, James Kass On Thu, Apr 28, 2016 at 7:49 AM, Ken Shirriff wrote: > Since encoding Klingon in Unicode comes up occasionally, you might be > amused to see a legal brief that was written partly in Klingon: > https://drive.google.com/file/d/0BzmetJxi-p0VM19nbUpyNXE0a28/view > > Details are here: http://conlang.org/axanar/ > > Ken > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sat May 14 18:29:18 2016 From: everson at evertype.com (Michael Everson) Date: Sun, 15 May 2016 00:29:18 +0100 Subject: Klingon text in legal brief In-Reply-To: References: Message-ID: One keeps one?s cards to one?s chest. > On 14 May 2016, at 21:41, James Kass wrote: > > > As a certain character from TOS would say, "fascinating". > > It's surprising that nobody commented on Ken Shirriff's post. If this had been posted ten years ago it probably would have generated more activity on this list. 
> > Best regards, > > James Kass > > > On Thu, Apr 28, 2016 at 7:49 AM, Ken Shirriff wrote: > Since encoding Klingon in Unicode comes up occasionally, you might be amused to see a legal brief that was written partly in Klingon: https://drive.google.com/file/d/0BzmetJxi-p0VM19nbUpyNXE0a28/view > > Details are here: http://conlang.org/axanar/ > > Ken > > > From jameskasskrv at gmail.com Sat May 14 21:46:44 2016 From: jameskasskrv at gmail.com (James Kass) Date: Sat, 14 May 2016 18:46:44 -0800 Subject: Klingon text in legal brief Message-ID: Best wishes towards a winning hand. Best regards, James Kass From haberg-1 at telia.com Sun May 15 13:57:54 2016 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Sun, 15 May 2016 20:57:54 +0200 Subject: Math upright Latin and Greek styles Message-ID: Are there any plans to add math upright Latin and Greek styles, in order to distinguish them from regular (non-math) Latin and Greek? ?In programs like TeX, the latter are normally used for italics, so it means that there is a conflict with using them for upright. From haberg-1 at telia.com Sun May 15 16:47:03 2016 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Sun, 15 May 2016 23:47:03 +0200 Subject: Math upright Latin and Greek styles In-Reply-To: References: Message-ID: > On 15 May 2016, at 23:19, Murray Sargent wrote: > > Hans ?berg asked, ?Are there any plans to add math upright Latin and Greek styles, in order to distinguish them from regular (non-math) Latin and Greek? ?In programs like TeX, the latter are normally used for italics, so it means that there is a conflict with using them for upright?. > > Math upright Latin is unified with the ASCII alphabetics and math upright Greek is unified with Unicode Greek letters in the U+0390 block. TeX and MathML upright Latin and upright lower-case Greek letters are converted to math italic by default. In the Linear Format, upright letters are enclosed in quotes and marked as ?ordinary text?. 
In Microsoft Word and other Microsoft Office apps, you can control math italicization in math zones using the italics hot key Ctrl+I and other italic formatting tools. > > There is ambiguity as to whether a span of upright ASCII alphabetics is a function name or a product or a combination of the two. Such ambiguities are rare since spans of upright ASCII alphabetics are usually words or abbreviations of some kind such as function names. Individual upright letters can be distinguished as individual variables if desired by inserting appropriate invisible times (U+2062) characters. > > We are thinking about adding other math alphabets as discussed in the post Unicode Math Calligraphic Alphabets. Comments are welcome. The question arose on the ConTeXt mailing list [1]. Changing Basic Latin and Greek to upright does not seem practical, due to legacy and lack of efficient input methods. So the idea came up to have these reserved for text and computer input, while a specific math upright style would be used when wanting to indicate that. 1. https://mailman.ntg.nl/pipermail/ntg-context/2016/085523.html From haberg-1 at telia.com Sun May 15 17:25:51 2016 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 16 May 2016 00:25:51 +0200 Subject: Math upright Latin and Greek styles In-Reply-To: References: Message-ID: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com> > On 16 May 2016, at 00:05, Murray Sargent wrote: > > Hans ?berg mentioned "Changing Basic Latin and Greek to upright does not seem practical, due to legacy and lack of efficient input methods." > > Have to say that it's really easy for the user to switch between math upright, italic, bold, and bold italic letters in Microsoft Word by just using the usual hot keys as discussed in > > https://blogs.msdn.microsoft.com/murrays/2007/05/30/using-math-italic-and-bold-in-word-2007/. > > This capability has been shipping for over 10 years now. 
But admittedly implementing such input functionality is a little tricky since the alphanumerics need to be converted to the desired Unicode Math Alphanumerics.

I am not familiar with the product, so it is unclear to me whether it produces a UTF-8 text file with the correct Unicode code points, as is a requirement for the LuaTeX engine that ConTeXt defaults to. One can design a new key map on OS X that selects the correct Unicode code points, but that is a huge task, given the large number of math symbols.

The legacy issue is that there is already a lot of TeX code that translates the Basic Latin into Unicode math italic style. So it is hard to break the habit, and old code cannot readily be reused.

And one can ignore the problem altogether, and use the traditional TeX backslash "\..." commands, but using Unicode helps the readability of the source code. This is even more so in the case of theorem proof assistants.

From verdy_p at wanadoo.fr Sun May 15 20:30:42 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 May 2016 03:30:42 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
References: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
Message-ID:

Isn't it specified in TeX using a font selection package instead of the default one? Also, the only upright letters I saw were for inserting normal text (not mathematical symbols) or comments/descriptions, or when using the standardized "monospace" or "serif" font (which are not italic by default).

2016-05-16 0:25 GMT+02:00 Hans Åberg:
>
> > On 16 May 2016, at 00:05, Murray Sargent
> wrote:
> >
> > Hans Åberg mentioned "Changing Basic Latin and Greek to upright does not
> seem practical, due to legacy and lack of efficient input methods."
> > > > Have to say that it's really easy for the user to switch between math > upright, italic, bold, and bold italic letters in Microsoft Word by just > using the usual hot keys as discussed in > > > > > https://blogs.msdn.microsoft.com/murrays/2007/05/30/using-math-italic-and-bold-in-word-2007/ > . > > > > This capability has been shipping for over 10 years now. But admittedly > implementing such input functionality is a little tricky since the > alphanumerics need to be converted to the desired Unicode Math > Alphanumerics. > > I am not familiar with the product, so it unclear to me whether it it > produces a UTF-8 text file with the correct Unicode code points, as is a > requirement for the LuaTeX engine that ConTeXt defaults to. One can design > a new key map on OS X that selects the correct Unicode code points, but > that is a huge task, given the large number of math symbols. > > The legacy issue is that there are already loads of TeX code that > translates the Basic Latin into Unicode math italic style. So it is hard to > break the habit, and old code cannot readily be reused. > > And one can ignore the problem altogether, and use the traditional TeX > backslash ?\?? commands, but using Unicode helps the readability of the > source code. This is even more so in the case of theorem proof assistants. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From haberg-1 at telia.com Mon May 16 03:05:11 2016 From: haberg-1 at telia.com (=?utf-8?Q?Hans_=C3=85berg?=) Date: Mon, 16 May 2016 10:05:11 +0200 Subject: Math upright Latin and Greek styles In-Reply-To: References: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com> Message-ID: > On 16 May 2016, at 03:30, Philippe Verdy wrote: > > isn't it specified in TeX using a font selection package instead of the default one? 
Also the only upright letters I saw was for inserting normal text (not mathematical symbols) or comments/descriptions, or when using the standardized "monospace", or "serif" font (which are not italic by default).

Most use a macro package like ConTeXt, which is more recent and modern than LaTeX, and it is not difficult to change so that the Basic Latin produces math upright style. But the legacy is that it is used for math italic, and it is hard to change that legacy.

From verdy_p at wanadoo.fr Mon May 16 11:56:23 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 May 2016 18:56:23 +0200
Subject: Math upright Latin and Greek styles
In-Reply-To:
References: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com>
Message-ID:

I do not advocate changing that, but these legacy *TeX variants have their own builtin sets of supported fonts with their implicit style and use them with the normal letters, just like what is done in HTML when you apply an italic style. As these *TeX variants exist this way, they don't need these additions, which will be needed only by newer *TeX variants that will not use explicit font variants in their encoding, but directly new distinguished code points (without explicit font style tagging).

There are now many *TeX variants, each one having its own local assumptions about the default styles (and layouts) it will apply. If you want to convert any one of them to HTML (or a similar rich-text format), you always need to know how these *TeX variants have been "profiled": you cannot simply use the same conversion rules for all *TeX. Now, if new upright maths characters are added, this will just add new complications in the rules used by these converters, with little benefit.
The benefit will be visible only when converting to plain text (but such conversion is already defective in many respects, as the maths layout is not representable directly without adding notations such as parentheses or "\"-escaped notations; in fact, such a conversion most often keeps the original *TeX syntax/notation in order to "preserve" the original semantics). From haberg-1 at telia.com Mon May 16 12:02:38 2016 From: haberg-1 at telia.com (Hans Åberg) Date: Mon, 16 May 2016 19:02:38 +0200 Subject: Math upright Latin and Greek styles In-Reply-To: References: <58915D9F-5A34-4548-978B-66F0F27C224B@telia.com> Message-ID: > On 16 May 2016, at 18:56, Philippe Verdy wrote: > > I do not advocate changing that, but these legacy *TeX variants have their own builtin sets of supported fonts with their implicit style and use them with the normal letters, just like what is done in HTML when you apply an italic style.
As these *TeX variants exist this way, they don't need these additions, which will be needed only on newer *TeX variants that will not use explicit font variants in their encoding, but directly new distinguished code points (without explicit font style tagging). The ConTeXt macro package default engine is LuaTeX, which uses UTF-8 for text files and UTF-32 internally, and combines the effort of several of those other, older versions. Then one can use the STIX fonts (or XITS) which are Unicode. From ori at avtalion.name Thu May 19 10:53:46 2016 From: ori at avtalion.name (Ori Avtalion) Date: Thu, 19 May 2016 18:53:46 +0300 Subject: Broken link on 2016 Document Register Message-ID: On the page: http://www.unicode.org/L2/L-curdoc.htm The link for "L2/16-164" points at: http://www.unicode.org/L2/L2016/ when it should point at: http://www.unicode.org/L2/L2016/16164-ucas-font-support.pdf From davidj_faulks at yahoo.ca Thu May 19 13:06:05 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Thu, 19 May 2016 18:06:05 +0000 (UTC) Subject: Proposal not reviewed, what to do? References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com> Hello, Although I am glad that most of my recent proposals have been accepted, it does seem that one of them: http://www.unicode.org/L2/L2016/16080-add-astrology.pdf was not reviewed at the recent UTC meeting. I'm feeling a bit unsure of what to make of that, especially since that was the proposal I was most unsure about. The SEI recommended that some of the characters proposed there be accepted and others not. Do I do nothing, and wait until after the next UTC meeting? Should I try to submit a revised proposal? Should I assume some characters are likely to be encoded and only concentrate on others? I would like any feedback. David Faulks From dwanders at sonic.net Thu May 19 13:23:06 2016 From: dwanders at sonic.net (Deborah W.
Anderson) Date: Thu, 19 May 2016 11:23:06 -0700 Subject: Proposal not reviewed, what to do? In-Reply-To: <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com> References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com> <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com> Message-ID: <007201d1b1fb$7d5539a0$77fface0$@sonic.net> Hi David, I was present last week, and can relate the outcome. We ran short on time at the UTC, so L2/16-080 was postponed until the next meeting. What would be helpful, I think, would be to take on board the comments from http://www.unicode.org/L2/L2016/16156-script-recs.pdf and revise your doc accordingly (i.e., include the ones recommended for encoding, and, if you can, see if you can provide additional information on others). With best wishes, Debbie Anderson From verdy_p at wanadoo.fr Thu May 19 14:13:42 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 19 May 2016 21:13:42 +0200 Subject: Proposal not reviewed, what to do?
In-Reply-To: <007201d1b1fb$7d5539a0$77fface0$@sonic.net> References: <616133108.4978547.1463681165994.JavaMail.yahoo.ref@mail.yahoo.com> <616133108.4978547.1463681165994.JavaMail.yahoo@mail.yahoo.com> <007201d1b1fb$7d5539a0$77fface0$@sonic.net> Message-ID: Why would those extra punctuation marks need a separate allocation? Couldn't they be encoded as *variants* of existing punctuation marks (i.e., the existing standard punctuation followed by a VS)? I think they are exactly in the scope of encoding of variants (even if most encoded variants are for the Sinographic scripts, there should not be any prohibition for them in the Latin script). Remark: - with the EXCLAMATIVUS PUNCTUS (from which the current "!" character derives directly), the alternative encoding would be to use the standard exclamation mark "!" followed by either a combining dot below (but this dot would be too low, under the baseline), or a (more appropriate) combining middle dot (note how this middle dot combines specially with the Latin letter l to appear on the right of the ascender, rather than over it, and for the capital it fits in the middle of the gap left by the lower right leg: this is already handled as exception pairs in fonts for Catalan and a few other languages; we also already have examples of punctuation used with diacritics such as the macron). - conversely, the two variants of "colon" with sideway comma could in fact simply be a pair of characters (the standard colon or semicolon followed by the character for the sideway comma), without needing any VS. The sideway comma is not really a variant, as it has its own spacing glyph and does not really attach to the colon or semicolon on the left; such a combination is akin to other combinations of punctuation signs (such as "::" or "!?" or ":-" or "--"), and I don't think it is a case for encoding the sideway comma as a diacritic.
If there are cases where the two characters may need to be ligated, we could bind them with a joiner control in the middle. From doug at ewellic.org Wed May 25 10:27:49 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 25 May 2016 08:27:49 -0700 Subject: Emoji for subdivision flags Message-ID: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> Now that UTR #52 has been suspended, are any *specific* alternative plans for representing subdivision flags being bandied about? -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From petercon at microsoft.com Wed May 25 13:28:23 2016 From: petercon at microsoft.com (Peter Constable) Date: Wed, 25 May 2016 18:28:23 +0000 Subject: Emoji for subdivision flags In-Reply-To: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> Message-ID: Nothing discussed at this point. The highest priority item that UTS#52 might have covered is female emoji, and that's where the main emoji attention is at present. After all, there's only so much attention we should be spending on emoji, right?
;-) Peter From doug at ewellic.org Wed May 25 13:55:50 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 25 May 2016 11:55:50 -0700 Subject: Emoji for subdivision flags Message-ID: <20160525115550.665a7a7059d7ee80bb4d670165c8327d.90f758a44f.wbe@email03.godaddy.com> Peter Constable wrote: > After all, there's only so much attention we should be spending on > emoji, right? ;-) But my expectations have been exceeded so many times before... I remember when flags were considered the #1 use case for these extensions, at least among those publicly discussed. That was a year ago and I guess that's a long time. -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From public at khwilliamson.com Wed May 25 19:34:57 2016 From: public at khwilliamson.com (Karl Williamson) Date: Wed, 25 May 2016 18:34:57 -0600 Subject: Emoji for subdivision flags In-Reply-To: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> Message-ID: <574644B1.8070609@khwilliamson.com> On 05/25/2016 09:27 AM, Doug Ewell wrote: > Now that UTR #52 has been suspended, are any *specific* alternative > plans for representing subdivision flags being bandied about? > > -- > Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 > > > What I'd like to know is how does one find out about such decisions in a timely manner?
From petercon at microsoft.com Wed May 25 22:47:36 2016 From: petercon at microsoft.com (Peter Constable) Date: Thu, 26 May 2016 03:47:36 +0000 Subject: Emoji for subdivision flags In-Reply-To: <574644B1.8070609@khwilliamson.com> References: <20160525082749.665a7a7059d7ee80bb4d670165c8327d.ae90c32975.wbe@email03.godaddy.com> <574644B1.8070609@khwilliamson.com> Message-ID: Watch for UTC minutes to be posted? Peter From mathias at qiwi.be Thu May 26 03:17:02 2016 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 26 May 2016 10:17:02 +0200 Subject: Canonical block names: spaces vs. underscores Message-ID: `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`. However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space. Which is it? If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that? If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that? From mathias at qiwi.be Thu May 26 08:44:51 2016 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 26 May 2016 15:44:51 +0200 Subject: Canonical block names: spaces vs.
underscores In-Reply-To: References: Message-ID: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> > On 26 May 2016, at 10:17, Mathias Bynens wrote: > > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`. > > However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space. > > Which is it? > > If proper canonical block names use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that? > If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that? > Another example: `Blocks.txt` has `Superscripts and Subscripts`, whereas `PropertyValueAliases.txt` has `Superscripts_And_Subscripts`. Note that in addition to the underscores, the case of the `A` changed as well. Which is the canonical name? The same goes for other blocks with “and” in the name, e.g. `Miscellaneous Symbols and Pictographs`, `Supplemental Symbols and Pictographs`, etc. From doug at ewellic.org Thu May 26 10:43:35 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 26 May 2016 08:43:35 -0700 Subject: Emoji for subdivision flags Message-ID: <20160526084335.665a7a7059d7ee80bb4d670165c8327d.e24d72e063.wbe@email03.godaddy.com> Peter Constable replied to Karl Williamson: >>> Now that UTR #52 has been suspended, are any *specific* alternative >>> plans for representing subdivision flags being bandied about? >> >> What I'd like to know is how does one find out about such decisions >> in a timely manner? > > Watch for UTC minutes to be posted? Apparently the key is to look at this list [1], which is up to date, and not this one [2], which isn't. The relevant minutes are at [3]. Search for "Issue 321" and in particular look through the review comments at [4] to find out what happened to the original scope and intent of PDUTS #52.
[1] http://www.unicode.org/L2/meetings/utc-meetings.html [2] http://www.unicode.org/consortium/utc-minutes.html [3] http://www.unicode.org/L2/L2016/16121.htm [4] http://www.unicode.org/review/pri321/feedback.html -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From mark at macchiato.com Thu May 26 10:47:27 2016 From: mark at macchiato.com (Mark Davis ☕️) Date: Thu, 26 May 2016 08:47:27 -0700 Subject: Canonical block names: spaces vs. underscores In-Reply-To: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> Message-ID: The canonical property and property value formats are in the *Alias* files. {phone}
From doug at ewellic.org Thu May 26 10:56:44 2016 From: doug at ewellic.org (Doug Ewell) Date: Thu, 26 May 2016 08:56:44 -0700 Subject: Canonical block names: spaces vs. underscores Message-ID: <20160526085644.665a7a7059d7ee80bb4d670165c8327d.9e0b0bde9f.wbe@email03.godaddy.com> Mathias Bynens wrote: > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists > blocks such as `Cyrillic Supplement`. > > However, `PropertyValueAliases.txt` > (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to > this block as `Cyrillic_Supplement`, with an underscore instead of a > space. > > Which is it? It's both: http://www.unicode.org/reports/tr44/#Matching_Symbolic -- Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸 From kenwhistler at att.net Thu May 26 11:03:20 2016 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 26 May 2016 09:03:20 -0700 Subject: Canonical block names: spaces vs. underscores In-Reply-To: References: Message-ID: <31a8a43d-90d8-fdd8-ea13-4ecd5974e571@att.net> On 5/26/2016 1:17 AM, Mathias Bynens wrote: > `Blocks.txt` (http://unicode.org/Public/UNIDATA/Blocks.txt) lists blocks such as `Cyrillic Supplement`. > > However, `PropertyValueAliases.txt` (http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt) refers to this block as `Cyrillic_Supplement`, with an underscore instead of a space. > > Which is it? > > If proper canonical block names Well, first of all, "canonical block name" is not a defined term in the standard. Unlike normalization of Unicode strings, there is no "normalization" of property values that defines a particular form as *the* canonical form to which other strings normalize. > use spaces instead of underscores, why doesn’t `PropertyValueAliases.txt` reflect that? > If proper canonical block names use underscores instead of spaces, why doesn’t `Blocks.txt` reflect that?
See the matching rules in UAX #44: http://www.unicode.org/reports/tr44/#Matching_Rules and in particular, the matching rule for symbolic values, which applies in this case: http://www.unicode.org/reports/tr44/#UAX44-LM3 For enumerated properties, and especially for catalog properties such as Block and Script, the value of the property may be multi-word, and the best form to use in one context might not be exactly (as in binary string equality exact) the same as in another. For Blocks.txt, all block names are given with spaces and with the casing conventions that would be most consistent with returning values for a block name in an API. The property values used in PropertyValueAliases.txt, on the other hand, are systematically turned into forms that are more identifier friendly, as the typical context of use for those values is in regex expressions and the like. There are invariant rules in place that guarantee that any new property values for properties subject to the Loose Matching Rule #3 noted above are always unique in their namespace, given the application of that matching rule. --Ken From mathias at qiwi.be Thu May 26 12:05:05 2016 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 26 May 2016 19:05:05 +0200 Subject: Canonical block names: spaces vs. underscores In-Reply-To: References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> Message-ID: > On 26 May 2016, at 17:47, Mark Davis ☕️ wrote: > > The canonical property and property value formats are in the *Alias* files. Thanks for confirming! Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files. > On 26 May 2016, at 18:03, Ken Whistler wrote: > > […] "canonical block name" is not a defined term in the standard. I didn’t mean to imply it was; it’s just an English word. I meant “canonical” as in “without loose matching applied”.
> See the matching rules in UAX #44: > > http://www.unicode.org/reports/tr44/#Matching_Rules > > and in particular, the matching rule for symbolic values, which applies in this case: > > http://www.unicode.org/reports/tr44/#UAX44-LM3 I know about loose matching, having recently implemented it (https://github.com/mathiasbynens/unicode-loose-match). > For enumerated properties, and especially for catalog properties such as Block and Script, > the value of the property may be multi-word, and the best form to use in one context might > not be exactly (as in binary string equality exact) the same as in another. That makes sense, but shouldn’t it be consistent throughout the Unicode database text files? From kenwhistler at att.net Thu May 26 13:07:14 2016 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 26 May 2016 11:07:14 -0700 Subject: Canonical block names: spaces vs. underscores In-Reply-To: References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> Message-ID: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> On 5/26/2016 10:05 AM, Mathias Bynens wrote: >> On 26 May 2016, at 17:47, Mark Davis ☕️ wrote: >> >> The canonical property and property value formats are in the *Alias* files. > Thanks for confirming! Well, not quite... See below. > > Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files. There's always a chance, I guess. But if we did so, we'd end up having to just invent some other more-or-less ad hoc property: Block_Name_Usable_For_Display, with the values we already have in the Blocks.txt file. Or we would have to change the format to include the block short alias as an additional field in the file, which would have its own maintenance and consistency issues. Or we would be introducing a historical inconsistency in the UCD between versions, which would *complicate* certain other scripts that parse the UCD.
> >> On 26 May 2016, at 18:03, Ken Whistler wrote: >> >> […] "canonical block name" is not a defined term in the standard. > I didn’t mean to imply it was; it’s just an English word. I meant “canonical” as in “without loose matching applied”. Ah, but "canonical" is a very freighted word in Unicode parlance. There are 58 instances of the word "canonical" in the current version of UAX #44, Unicode Character Database. Every one of them is a term of art, and none of them means what you mean there. ;-) What are actually in PropertyValueAliases.txt are "preferred aliases" (one "abbreviated", and one "long"), plus a few "other aliases" for various compatibility reasons. UAX #42 follows suit. The block property is represented by the blk attribute, and the enumerated values of the blk attribute: http://www.unicode.org/reports/tr42/#w1aac13c13c19b1 use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt. > >> For enumerated properties, and especially for catalog properties such as Block and Script, >> the value of the property may be multi-word, and the best form to use in one context might >> not be exactly (as in binary string equality exact) the same as in another. > That makes sense, but shouldn’t it be consistent throughout the Unicode database text files? Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is: FB50..FDFF; Arabic Presentation Forms-A The entry for that block in PropertyValueAliases.txt is: blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A So then which would it be? Should Blocks.txt be changed to the long preferred alias: FB50..FDFF; Arabic_Presentation_Forms_A or to the abbreviated preferred alias: FB50..FDFF; Arabic_PF_A which would be more consistent with the XML attribute and with most regex usage?
If the latter, you would end up with systematically less identifiable labels in Blocks.txt, which would make it a bit more obscure for other uses, and which would also then create ambiguities about what might be the "best" or "preferred" label for blocks for an API returning a block name -- which certainly wouldn't be the abbreviated "preferred alias". I suppose a proposal to the UTC to further modify the UCD handling of block names could change this situation. But I'm not convinced that we shouldn't just leave things as they stand -- for stability. And then live with the complications required for scripts or other parsing algorithms that actually need to deal with Blocks.txt to either parse out block ranges (its main function) or to get usable block names (its subsidiary function). --Ken From verdy_p at wanadoo.fr Thu May 26 13:44:55 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 26 May 2016 20:44:55 +0200 Subject: Canonical block names: spaces vs. underscores In-Reply-To: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> Message-ID: 2016-05-26 20:07 GMT+02:00 Ken Whistler : > Well, let's take an example. The entry in Blocks.txt for the Arabic > Presentation Forms-A block is: > > FB50..FDFF; Arabic Presentation Forms-A > > The entry for that block in PropertyValueAliases.txt is: > > blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; > Arabic_Presentation_Forms-A > > So then which would it be? Should Blocks.txt be changed to the long > preferred alias: > > FB50..FDFF; Arabic_Presentation_Forms_A > > or to the abbreviated preferred alias: > > FB50..FDFF; Arabic_PF_A > I think that this would break parsers that expect the alias used in Blocks.txt to be directly "readable" with spaces.
My opinion is to keep Blocks.txt untouched (with spaces), as it has been part of the core standard for a very long time (and in sync with the ISO standard) as the *normative* block name. But we could add this normative value (with spaces) into PropertyValueAliases.txt (which ISO 10646 does not have or need in its standard): blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A The other solution would be to *add* the abbreviated preferred alias in Blocks.txt: FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A But this could break existing Blocks.txt parsers, whereas parsers should not break when they find new aliases in PropertyValueAliases.txt. Another solution would be to properly explain that to look up values in PropertyValueAliases.txt, you can search it by replacing spaces in block names with underscores, or make sure that underscores and spaces in the *middle* of values are considered equivalent (so that even if they are rendered visually, we can also display the listed aliases using spaces instead of underscores). However, it must be clear that these aliases are case-sensitive by default ("Arabic_Presentation_Forms_A" is not the same as "Arabic_presentation_forms_A" but is the same as "Arabic Presentation_Forms A"), unless the block names property is normatively said to be case-insensitive (in that case the following are also aliases: "arabic_pf_a", "arabic pf a"). But adding case insensitivity has a cost, which is much higher than *only* allowing basic replacements of spaces and underscores (this will work, provided that there are no "special" aliases starting with underscores, or using pairs of underscores: I doubt ISO will use pairs of spaces in block names, which are supposed to be trimmed, with whitespace in the middle compressed). Removing or replacing the space-separated words in block names in the UCD would break the compatibility and synchronization with the ISO standard, which lists them with spaces.
From mathias at qiwi.be Thu May 26 13:48:48 2016 From: mathias at qiwi.be (Mathias Bynens) Date: Thu, 26 May 2016 20:48:48 +0200 Subject: Canonical block names: spaces vs. underscores In-Reply-To: <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> Message-ID: <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be> > On 26 May 2016, at 20:07, Ken Whistler wrote: > > Well, let's take an example. The entry in Blocks.txt for the Arabic Presentation Forms-A block is: > > FB50..FDFF; Arabic Presentation Forms-A > > The entry for that block in PropertyValueAliases.txt is: > > blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A > > So then which would it be? Should Blocks.txt be changed to the long preferred alias: > > FB50..FDFF; Arabic_Presentation_Forms_A > > or to the abbreviated preferred alias: > > FB50..FDFF; Arabic_PF_A > > which would be more consistent with the XML attribute and with most regex usage? This sounds like a strawman argument (?). The long preferred alias definitely seems more suitable for a “canonical” name. > I suppose a proposal to the UTC to further modify the UCD handling of block names > could change this situation. But I'm not convinced that we shouldn't just leave > things as they stand -- for stability. And then live with the complications required > for scripts or other parsing algorithms that actually need to deal with Blocks.txt to > either parse out block ranges (its main function) or to get usable block names > (its subsidiary function). Perhaps the “Note:” in the commented header in `Blocks.txt` could be extended to point out that the ~~canonical block names~~, nay, ++preferred block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve been enough to avoid the question that spawned this thread.
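[Editorial sketch, not part of any message in the thread.] For scripts that need to deal with both files as discussed above, the two spellings can be reconciled in a few lines. The Python below is a minimal sketch under stated assumptions: the function names (`lm3_key`, `parse_blocks`, `parse_block_aliases`) are hypothetical, the sample data consists only of the two lines Ken quotes, and the key function is one plain reading of UAX44-LM3 (ignore case, whitespace, underscores, hyphens, and an initial "is"). It does not fetch or validate the real UCD files.

```python
import re

# Sample data: only the two entries quoted in this thread.
BLOCKS_SAMPLE = """\
# Start Code..End Code; Block Name
FB50..FDFF; Arabic Presentation Forms-A
"""

ALIASES_SAMPLE = """\
blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A
"""


def lm3_key(value):
    """Comparison key: drop whitespace, '_' and '-', lowercase, strip a leading 'is'."""
    key = re.sub(r"[\s_-]+", "", value).lower()
    return key[2:] if key.startswith("is") else key


def strip_comment(line):
    """Remove a trailing '#' comment and surrounding whitespace."""
    return line.split("#", 1)[0].strip()


def parse_blocks(text):
    """Yield (first, last, display_name) for each data line of Blocks.txt-style text."""
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        code_range, name = (field.strip() for field in line.split(";", 1))
        first, last = (int(part, 16) for part in code_range.split(".."))
        yield first, last, name


def parse_block_aliases(text):
    """Map the loose key of every 'blk' alias to (abbreviated alias, long alias)."""
    table = {}
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if fields[0] != "blk":
            continue
        short_alias, long_alias = fields[1], fields[2]
        for alias in fields[1:]:
            table[lm3_key(alias)] = (short_alias, long_alias)
    return table


aliases = parse_block_aliases(ALIASES_SAMPLE)
blocks = [
    (first, last, name, aliases.get(lm3_key(name)))
    for first, last, name in parse_blocks(BLOCKS_SAMPLE)
]
```

With this arrangement the display name from `Blocks.txt` is kept as-is for output, while the preferred aliases are looked up through the loose key, so neither file needs to change.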
From verdy_p at wanadoo.fr Thu May 26 14:32:12 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 26 May 2016 21:32:12 +0200 Subject: Canonical block names: spaces vs. underscores In-Reply-To: <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be> References: <934267C8-FB36-42CE-AD79-10FDA079FB27@qiwi.be> <715da82c-b053-74df-ee9a-3dcc5df540e8@att.net> <40EE1677-FDEE-4234-9847-26EAB3C0FCBB@qiwi.be> Message-ID: 2016-05-26 20:48 GMT+02:00 Mathias Bynens : > > > On 26 May 2016, at 20:07, Ken Whistler wrote: > > Perhaps the “Note:” in the commented header in `Blocks.txt` could be > extended to point out that the ~~canonical block names~~, nay, ++preferred > block aliases++ are listed in `PropertyValueAliases.txt`? That would’ve > been enough to avoid the question that spawned this thread. > I'd say that the "preferred block aliases" should be stable and always in the first entry. And the last entry should be the preferred version for display and unabbreviated (but not necessarily stable: it may change over time, and applications are free to use better display names, including translations; this last entry should be the one best suited to US English in a *technical* glossary and preferably used in Unicode documentation and proposals, but may be different for British English, or for vernacular names, but for reference the 1st entry should not change). Note also that the 1st entry in property aliases is not necessarily the most abbreviated one: there may be other aliases in the middle of the list using shorter names, provided that they don't conflict with others; or special aliases used for specific lookups matching some pattern with known prefixes/suffixes (e.g.
Hangul syllable types) so that another specification specific to this usage could simply drop those implied prefixes/suffixes, using even shorter aliases internally than the listed aliases). The rules for looking up aliases in PropertyAliases should be independent of the property type: - capitalization should be preserved (with lookups always case-sensitive, even if the listed values for a property type are currently using only ASCII capital letters, or only ASCII lowercase letters): the capitalization form may need to be distinguished in some future version of the standard (without having to use a broken orthography to distinguish them), and we should not be using a slow UCA collator to match entries. - only underscores/spaces should be considered equivalent, and there will NEVER be special entries using leading or trailing underscores, or pairs of underscores, or pairs of whitespaces (all aliases are assumed to be trimmable and compressible, like in XML or HTML by default): applications may then choose the "canonicalization" form they prefer (with underscores, or with spaces) - some "camelCased" bijective transform could suppress spaces/underscores, provided that the transform includes an "escaping" mechanism for case distinctions; but alternatively we could also list conforming "camelCased" aliases (from which lowercase-only aliases with ASCII hyphens could be inferred for use in CSS selectors, also with a bijective transform) - however some programming languages (e.g. BASIC) do not have any case distinction for identifiers (and there's no easy escaping mechanism without using separators like underscores, which should also not be used in leading or trailing positions), or use lettercase (of the initial) for special meaning (e.g. in several AI languages to distinguish variables and atoms: the escaping mechanism may need to prepend a leading underscore or some common prefix).

From doug at ewellic.org Thu May 26 15:41:49 2016
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 May 2016 13:41:49 -0700
Subject: Canonical block names: spaces vs. underscores
Message-ID: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>

Mathias Bynens wrote:

> Any chance the canonical names can be used in `Blocks.txt` as well,
> for consistency? This would simplify scripts that parse the Unicode
> database text files.

I don't see the problem here. The loose-matching rule is well-defined and not complicated, either visually or algorithmically; and if Mathias has an implementation up on GitHub, he should be able to use it wherever it's needed.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From markus.icu at gmail.com Fri May 27 00:14:44 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 26 May 2016 22:14:44 -0700
Subject: Canonical block names: spaces vs. underscores
In-Reply-To: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>
References: <20160526134149.665a7a7059d7ee80bb4d670165c8327d.2839425136.wbe@email03.godaddy.com>
Message-ID:

Note that the Block property is an artifact of how the committee organizes the encoding of characters. It is not very useful for processing. For that, the Script property, Script_Extensions, and others are normally much better.

markus

From doug at ewellic.org Sat May 28 10:51:55 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 28 May 2016 09:51:55 -0600
Subject: Canonical block names: spaces vs.
underscores
In-Reply-To:
References:
Message-ID:

Philippe Verdy wrote:

> However it must be clear that these aliases are case-sensitive by
> default ("Arabic_Presentation_Forms_A" is not the same as
> "Arabic_presentation_forms_A" but is the same as "Arabic
> Presentation_Forms A"), unless the block names property is normatively
> said to be case-insensitive (in that case the following are also
> aliases: "arabic_pf_a", "arabic pf a"). But adding case insensitivity
> has a cost, which is much higher than *only* allowing basic
> replacements of spaces and underscores [...]

UAX #44 says:

> 5.9.2 Matching Character Names
>
> UAX44-LM2. Ignore case, whitespace, underscore ('_'), and all medial
> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>
> 5.9.3 Matching Symbolic Values
>
> UAX44-LM3. Ignore case, whitespace, underscore ('_'), hyphens, and any
> initial prefix string "is".

I read the words "ignore case" in these two rules to mean that case should be ignored.

--
Doug Ewell | http://ewellic.org | Thornton, CO
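
For readers comparing the two positions in this thread, here is a minimal Python sketch. It is only an illustration: the function names `loose_key`, `loose_equal`, and `strict_key` are hypothetical and come from neither UAX #44 nor any poster. `loose_key` follows the wording of UAX44-LM3 as quoted above; `strict_key` approximates the stricter scheme Philippe Verdy describes, in which case is preserved and only spaces and underscores are interchangeable.

```python
import re

# Characters that the quoted UAX44-LM3 wording says to ignore:
# whitespace, underscore, and hyphen.
_LM3_IGNORED = re.compile(r"[\s_\-]+")

# Runs of whitespace and/or underscores, for the stricter scheme.
_SEPARATORS = re.compile(r"[\s_]+")


def loose_key(value: str) -> str:
    """Fold a symbolic value following the quoted UAX44-LM3 wording."""
    key = _LM3_IGNORED.sub("", value).lower()
    # Drop an initial "is" prefix, as the rule states. Both sides of a
    # comparison are folded the same way, so the result stays consistent.
    if key.startswith("is"):
        key = key[2:]
    return key


def loose_equal(a: str, b: str) -> bool:
    """Compare two symbolic values under the loose-matching sketch."""
    return loose_key(a) == loose_key(b)


def strict_key(value: str) -> str:
    """Hypothetical stricter folding: case is kept, and runs of spaces
    or underscores collapse to a single underscore (trimmed at the ends)."""
    return _SEPARATORS.sub("_", value).strip("_")


if __name__ == "__main__":
    a = "Arabic_Presentation_Forms_A"
    b = "Arabic Presentation_Forms A"
    c = "Arabic_presentation_forms_A"
    print(loose_equal(a, b), loose_equal(a, c))
    print(strict_key(a) == strict_key(b), strict_key(a) == strict_key(c))
```

Under the loose rule all three spellings compare equal; under the stricter sketch the mixed space/underscore spelling still matches, but the differently capitalized one does not. Either way the folding is a single regular-expression substitution plus, for the loose rule, a lowercase conversion and a prefix check.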