From roozbeh at unicode.org Fri Mar 13 21:00:52 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Fri, 13 Mar 2015 19:00:52 -0700
Subject: Android 5.1 ships with support for several minority scripts
Message-ID:

Android 5.1, released earlier this week, has added support for 25 minority scripts. The wide coverage can be reproduced by almost everybody for free, thanks to the Noto and HarfBuzz projects, both of which are open source. (Android itself is open source too.)

By my count, these are the new scripts added in Android 5.1: Balinese, Batak, Buginese, Buhid, Cham, Coptic, Glagolitic, Hanunoo, Javanese, Kayah Li, Lepcha, Limbu, Meetei Mayek, Ol Chiki, Oriya, Rejang, Saurashtra, Sundanese, Syloti Nagri, Tagbanwa, Tai Le, Tai Tham, Tai Viet, Thaana, and Tifinagh.

(Android 5.0, released last year, had already added the Georgian lari, complete Unicode 7.0 coverage for Latin, Greek, and Cyrillic, and seven new scripts: Braille, Canadian Aboriginal Syllabics, Cherokee, Gujarati, Gurmukhi, Sinhala, and Yi.)

Note that different Android vendors and carriers may choose to ship more or fewer fonts, but Android One phones and most Nexus devices will support all the above scripts out of the box.

None of this would have been possible without the efforts of Unicode volunteers who worked hard to encode the scripts in Unicode. Thanks to the efforts of Unicode, Noto, and HarfBuzz, thousands of communities around the world can now read and write their language on smartphones and tablets for the first time.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From alolita.sharma at gmail.com Fri Mar 13 21:53:16 2015
From: alolita.sharma at gmail.com (Alolita Sharma)
Date: Fri, 13 Mar 2015 19:53:16 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Roozbeh,

This is great news!
Thanks for your efforts in integrating Noto and Harfbuzz in Android and @unicode too :-)

Is there a link to a blog post or release notes listing the improved language support?

Best,
Alolita
From lang.support at gmail.com Fri Mar 13 22:27:55 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 14 Mar 2015 14:27:55 +1100
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Hi Roozbeh,

a point of clarification and a question:

* The Cham font is actually an Eastern Cham font supporting Akhar Thrah, the Eastern variety of the script. Akhar Srak, the Western Cham script, remains unsupported.

Which languages was the Tai Tham font designed to support? And which variety of the script?

Andrew

--
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

From mark at macchiato.com Sat Mar 14 01:59:53 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Sat, 14 Mar 2015 07:59:53 +0100
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Congrats!

{phone}

From luke at dashjr.org Sat Mar 14 11:17:44 2015
From: luke at dashjr.org (Luke Dashjr)
Date: Sat, 14 Mar 2015 16:17:44 +0000
Subject: CSUR Tonal
Message-ID: <201503141617.47987.luke@dashjr.org>

On Friday, August 06, 2010 9:02:39 AM Andrew West wrote:
> Looking at the examples shown on , it seems to me that 0-8 are ordinary digits, and the symbols for 9 through 15 are inverted or inverted+modified forms of the digits '7' through '1', so that there is some sort of imperfect bilateral symmetry on the clock and compass faces, with '0' and '8' as the axis of symmetry. Thus the '9' is an inverted '6' (as 16-6=10) not an ordinary '9'. So except for the odd glyph forms for 9, 11, 12 and 15 (would be expected to be simple inversions of '7', '5', '4' and '1') it makes sense as a system to me.
>
> Anyhow, I do not think the ordinary digits '0' through '8' should be encoded in the CSUR.

The question was recently raised whether this reasoning is sound, considering that the digits '0' through '8' are not the same when being "rendered" by text-to-speech. A number '100' (written in tonal digits) ought not be spoken the same as ASCII '100' although it is visually identical when displayed as writing.

Does Unicode give any relevance to non-visual rendering, or do TTS systems just need to settle for environmental hints (eg, the user explicitly telling it tonal numbers are in use)?

For additional context, the pronunciation vs visual differentiation came up when trying to get the BTC "B with double lines" encoded in Ubuntu's fonts.
The font designers argue that because the BTC version should be pronounced "bitcoins", it is not appropriate to use the existing encoding which might be pronounced differently[1].

Thank you for any insight on this matter,

Luke

1. https://bugs.launchpad.net/ubuntu-font-family/+bug/1061115

P.S. Can someone disable (or soften) Spamhaus RBL for the Unicode mailing list (and/or private email servers for some of those CC'd)? Spamhaus has recently been abusing their RBL for putting political pressure on entire ISPs including the one my mail server is within the IP subnet of, despite that my mail server has never been involved in any spam.

From roozbeh at unicode.org Sat Mar 14 12:24:24 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Sat, 14 Mar 2015 10:24:24 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Alolita, there is no blog post or release notes mentioning the scripts or languages yet.

These are the links to the commit messages, which provide more information, listing languages supported by the scripts:

https://android.googlesource.com/platform/external/noto-fonts/+/41fe586
https://android.googlesource.com/platform/external/noto-fonts/+/ea4709d
https://android.googlesource.com/platform/external/lohit-fonts/+/de90084

Here's the list of the languages and language families newly supported: Balinese, Batak languages, Berber languages, Buginese languages, Buhid, Cham, Church Slavonic, Coptic, Divehi, Hanunoo, Javanese, Kayah languages, Khün, Lepcha, Limbu, Makassarese languages, Mandar languages, Manipuri/Meithei, Northern Thai, Oriya, Rejang, Santali, Saurashtra, Sundanese, Sylheti, Tagbanwa languages, Tai Dam, Tai Dón, Tai Lü, Tai Nüa, and Thai Song.
From roozbeh at unicode.org Sat Mar 14 12:28:14 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Sat, 14 Mar 2015 10:28:14 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Andrew,

I don't know the answer to your questions unfortunately. You can investigate the fonts yourself (they are available at https://code.google.com/p/noto/), or ask for support for Western Cham (assuming it's already properly encoded in Unicode) at the Noto issue tracker at https://code.google.com/p/noto/issues/entry.

From lang.support at gmail.com Sat Mar 14 16:18:26 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sun, 15 Mar 2015 08:18:26 +1100
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

Comment on Cham was informational. What is in the Unicode charts was based on Eastern Cham only. Proposals to add the Cham and Arabic characters needed to support Western Cham are under development.

Testing on Tai Tham will occur ... I was curious as to what the original design parameters for the font were. It is easier to evaluate a font's language support knowing what was originally intended.

For instance I do not assume that the Myanmar font was designed to support all languages that use the Myanmar script. I can also make assumptions about Latin script coverage and languages that are supported/unsupported.
Andrew

From prosfilaes at gmail.com Sat Mar 14 16:27:56 2015
From: prosfilaes at gmail.com (David Starner)
Date: Sat, 14 Mar 2015 14:27:56 -0700
Subject: CSUR Tonal
In-Reply-To: <201503141617.47987.luke@dashjr.org>
References: <201503141617.47987.luke@dashjr.org>
Message-ID:

On Sat, Mar 14, 2015 at 9:17 AM, Luke Dashjr wrote:
> Does Unicode give any relevance to non-visual rendering, or do TTS just need to settle for environmental hints (eg, the user explicitly telling it tonal numbers are in use)?

How do you tell a chemist from the general populace? Ask them to pronounce "unionized" (that is, is it "un-ion-ized" or "union-ized")?
Is 700-9000 pronounced "seven hundred minus nine thousand" or "seven oh oh dash nine oh oh oh"? Which is part of the reason Unicode doesn't try to handle this problem; there is a Unicode solution, where the minus sign or the en-dash can be used, yet I and many others still use the hyphen-minus. Separating these things has been a mess.

--
Kie ekzistas vivo, ekzistas espero.

From luke at dashjr.org Sat Mar 14 16:47:26 2015
From: luke at dashjr.org (Luke Dashjr)
Date: Sat, 14 Mar 2015 21:47:26 +0000
Subject: CSUR Tonal
In-Reply-To:
References: <201503141617.47987.luke@dashjr.org>
Message-ID: <201503142147.30428.luke@dashjr.org>

On Saturday, March 14, 2015 9:27:56 PM David Starner wrote:
> On Sat, Mar 14, 2015 at 9:17 AM, Luke Dashjr wrote:
> > Does Unicode give any relevance to non-visual rendering, or do TTS just need to settle for environmental hints (eg, the user explicitly telling it tonal numbers are in use)?
>
> How do you tell a chemist from the general populace? Ask them to pronounce "unionized" (that is, is it "un-ion-ized" or "union-ized")? Is 700-9000 pronounced "seven hundred minus nine thousand" or "seven oh oh dash nine oh oh oh"?

These are mere pronunciation/linguistic differences, though. The cases I am talking about are entirely different meanings/words.

That is, 100 decimal is "one hundred" with a binary value of 110 0100. But the same "100" in tonal would be "san" with a binary value of 1 0000 0000.

And in the other example, one is "B with double lines" vs "bitcoins".

Luke

From roozbeh at unicode.org Sat Mar 14 17:14:30 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Sat, 14 Mar 2015 15:14:30 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

On Sat, Mar 14, 2015 at 2:18 PM, Andrew Cunningham wrote:
> Testing on Tai Tham will occur ... I was curious as to what the original design parameters for the font were.
> It is easier to evaluate a font's language support knowing what was originally intended.

I unfortunately don't know that.

> For instance I do not assume that the Myanmar font was designed to support all languages that use the Myanmar script.

I was involved in development of the Myanmar font. We have tried to cover as many languages as we found requirements for, mostly from the information available in Martin Hosken's UTN 11.

From doug at ewellic.org Sat Mar 14 17:15:39 2015
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 14 Mar 2015 16:15:39 -0600
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID: <5959475C5A634BE6A63167722FC0D1D1@DougEwell>

Roozbeh Pournader wrote:

> (Android 5.0, released last year, had already added the Georgian lari,

The one that's going to be officially standardized with the release of Unicode 8.0 three months from now?

--
Doug Ewell | http://ewellic.org | Thornton, CO

From roozbeh at unicode.org Sat Mar 14 17:56:13 2015
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Sat, 14 Mar 2015 15:56:13 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To: <5959475C5A634BE6A63167722FC0D1D1@DougEwell>
References: <5959475C5A634BE6A63167722FC0D1D1@DougEwell>
Message-ID:

Yes, the one and only ₾ (sent from my Android phone).
From richard.wordingham at ntlworld.com Sat Mar 14 19:36:26 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 15 Mar 2015 00:36:26 +0000
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID: <20150315003626.4e83c26f@JRWUBU2>

On Sun, 15 Mar 2015 08:18:26 +1100 Andrew Cunningham wrote:

> Testing on Tai Tham will occur ... I was curious as to what the original design parameters for the font were. It is easier to evaluate a font's language support knowing what was originally intended.

The target language (dubious concept) for Noto Sans Tai Tham appears to be the Unicode code chart language! The strongest evidence of any affiliation seems to be for Pali! I may just be jealous, but I find it an appalling font. I'm reporting on the 2013 version, Version 1.01, which I downloaded this week.

The main problem the font has is that it has not solved the problem of vertical spacing. The problem is that subscript and superscript marks (including non-spacing sequences) are not naturally significantly smaller than base letters, and eye strain results from their being artificially shrunk. Another problem is that it seems to have been written without any clear model for Indic rearrangement, but merely a hope that it happens.

These two problems combine for consonant stacks. To reduce the vertical space problem, base consonants are shrunk when they are followed by SAKOT + consonant. However, this does not occur if a vowel intervenes in the logical order, whether preposed (in which case it will not intervene in the final glyph order) or superscript (in which case it will intervene in the final glyph order). Of course, these combinations don't occur in Pali, so perhaps the font's target language is Pali!

There's an egregious bug in the handling of the sequences MEDIAL RA plus preposed vowel (E, AE, OO, AI, THAM AI).
They are formed into a ligature with MEDIAL RA on the left, whereas it should appear on the right!

The font also exhibits a problem relating to , and . There are two subscript forms relating to BA and PA. The common form is a spacing subscript form, generally used for final consonants and as the second element of the Pali clusters -pp- and -mp-. (It's also often used for some other clusters.) Note that the normal way of writing the phonetically final consonant as a base consonant generally uses BA, though PA may occasionally occur as a result of influence from Standard Thai. The rarer, non-spacing form represents /b/ at the start of a phonetic cluster.

The final encoding proposal for the Tai Tham script (then designated Lanna) assigned to the non-spacing form and to the spacing form - this corresponded to their uses as phonetically initial consonants. There was thus an unwelcome difference between the two default ways of writing a final consonant - BA or , depending on what preceded. (There was already another pair like this - RA or for final /n/.)

However, during the ISO standardisation process, U+1A5D TAI THAM CONSONANT SIGN BA was introduced to represent the non-spacing form. This made the original distinction of and redundant, and most fonts and writers seem to use for the spacing subscript form. and generally yield the same glyph. However, in the Noto Sans Tai Tham font, is rendered as a non-spacing glyph, similar to that for .

The font only uses mark2mark positioning for tone marks. As a result, a subscript vowel will overstrike a subscript consonant resulting in an illegible blob, e.g. in the word ???? /mu?/ 'pig'.

There are other issues with the font. I was wondering how to organise a bug report or set of bug reports on it when this thread arose.

Richard.
From richard.wordingham at ntlworld.com Sat Mar 14 20:17:56 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Sun, 15 Mar 2015 01:17:56 +0000
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID: <20150315011756.63263a13@JRWUBU2>

On Sat, 14 Mar 2015 10:24:24 -0700 Roozbeh Pournader wrote:

> These are the links to the commit messages, which provide more information, listing languages supported by the scripts:
> https://android.googlesource.com/platform/external/noto-fonts/+/41fe586
> https://android.googlesource.com/platform/external/noto-fonts/+/ea4709d
> https://android.googlesource.com/platform/external/lohit-fonts/+/de90084
>
> Here's the list of the languages and language families newly supported: Balinese, Batak languages, Berber languages, Buginese languages, Buhid, Cham, Church Slavonic, Coptic, Divehi, Hanunoo, Javanese, Kayah languages, Khün, Lepcha, Limbu, Makassarese languages, Mandar languages, Manipuri/Meithei, Northern Thai, Oriya, Rejang, Santali, Saurashtra, Sundanese, Sylheti, Tagbanwa languages, Tai Dam, Tai Dón, Tai Lü, Tai Nüa, and Thai Song.

Just to nit-pick - Saurashtra isn't normally written in the Saurashtra script, Northern Thai is normally written in the Thai script, not the Tai Tham script, and the main script for Tai Lü may well be New Tai Lue. Or am I partly wrong - have you now added line-breaking dictionaries for Northern Thai?

I hope you haven't broken New Tai Lue support (probably not needed) by adding a shaping engine for the current standard (Unicode up to 7.0)!

Richard.

From verdy_p at wanadoo.fr Sat Mar 14 23:32:12 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 15 Mar 2015 05:32:12 +0100
Subject: CSUR Tonal
In-Reply-To: <201503141617.47987.luke@dashjr.org>
References: <201503141617.47987.luke@dashjr.org>
Message-ID:

2015-03-14 17:17 GMT+01:00 Luke Dashjr:

> P.S.
> Can someone disable (or soften) Spamhaus RBL for the Unicode mailing list (and/or private email servers for some of those CC'd)? Spamhaus has recently been abusing their RBL for putting political pressure on entire ISPs including the one my mail server is within the IP subnet of, despite that my mail server has never been involved in any spam.

RBLs should never be used blindly. They are only defaults, and a service (the Unicode list servers) using such an RBL should also maintain its own whitelist. And yes, this is a problem of Spamhaus too if it cannot whitelist your own subnet within its blacklisted IP range.

But is your subnet really declared with a stable IP range within a *secured* whois or DNS database? Contact your ISP to get a stable IP range or have it declared instead of being within the same shared block (maybe they will want you to subscribe and pay for a fixed IP setting for your server, allocated by your ISP outside its generic shared block, or publicly marked as excluded from it with a specific subdelegation).

From prosfilaes at gmail.com Sun Mar 15 04:32:36 2015
From: prosfilaes at gmail.com (David Starner)
Date: Sun, 15 Mar 2015 02:32:36 -0700
Subject: CSUR Tonal
In-Reply-To: <201503142147.30428.luke@dashjr.org>
References: <201503141617.47987.luke@dashjr.org> <201503142147.30428.luke@dashjr.org>
Message-ID:

On Sat, Mar 14, 2015 at 2:47 PM, Luke Dashjr wrote:
> On Saturday, March 14, 2015 9:27:56 PM David Starner wrote:
>> On Sat, Mar 14, 2015 at 9:17 AM, Luke Dashjr wrote:
>> > Does Unicode give any relevance to non-visual rendering, or do TTS just need to settle for environmental hints (eg, the user explicitly telling it tonal numbers are in use)?
>>
>> How do you tell a chemist from the general populace? Ask them to pronounce "unionized" (that is, is it "un-ion-ized" or "union-ized")? Is 700-9000 pronounced "seven hundred minus nine thousand" or "seven oh oh dash nine oh oh oh"?
>
> These are mere pronunciation/linguistic differences, though. The cases I am talking about are entirely different meanings/words.

Un-ion-ized and union-ized are two completely different words. Different meanings, different pronunciations, different etymologies.

> That is, 100 decimal is "one hundred" with a binary value of 110 0100.
> But the same "100" in tonal would be "san" with a binary value of 1 0000 0000.

And -8300 is the same thing as a phone number? And as I said, there are a number of characters that look similar in Unicode, and the one found on a keyboard is commonly used for the less common one. Adding new characters doesn't magically add support for them, and even very popular characters suffer from the fact that a keyboard supports about 100 characters easily, with typists learning a handful of necessary multikey characters. There is no reason to assume that anyone would use your new digits even if they were encoded.

> And in the other example, one is "B with double lines" vs "bitcoins".

And ! is "exclamation point" and "not" and "factorial". Currency symbols are not usually unified with other symbols, and combining overlays are usually discouraged, so LATIN CAPITAL LETTER B + COMBINING DOUBLE VERTICAL STROKE OVERLAY will probably not block the encoding of a bitcoin symbol. Being unifiable with ฿, on the other hand, will. A lack of actual use will as well.

--
Kie ekzistas vivo, ekzistas espero.

From luke at dashjr.org Sun Mar 15 07:34:07 2015
From: luke at dashjr.org (Luke Dashjr)
Date: Sun, 15 Mar 2015 12:34:07 +0000
Subject: CSUR Tonal
In-Reply-To:
References: <201503141617.47987.luke@dashjr.org>
Message-ID: <201503151234.11585.luke@dashjr.org>

On Sunday, March 15, 2015 4:32:12 AM Philippe Verdy wrote:
> But is your subnet really declared with a stable IP range within a *secured* whois or DNS database?
Contact your ISP to get a stable IP range > or have it declared instead of being within the same shared block (may be > they will want you to subscribe and pay a fixed IP setting for your > server,allocated by your ISP outside its generic shared block,or publicly > marked as excluded from it with a specific subdelegation). I'm not familiar with this concept. It would also be unlikely to help, since Spamhaus is *intentionally* including innocent IPs in the blacklisting because they want the ISP to feel pressured into paying them for services. Luke From mark at kli.org Sun Mar 15 16:42:55 2015 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 15 Mar 2015 17:42:55 -0400 Subject: Revenge of HEBREW TETRAGRAMMATON Message-ID: <5505FCDF.4070301@kli.org> An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Mar 15 17:50:13 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 15 Mar 2015 16:50:13 -0600 Subject: CSUR Tonal In-Reply-To: References: Message-ID: <1AAE6043A8854DCCB51B556F98F4A61D@DougEwell> Luke Dashjr wrote: > That is, 100 decimal is "one hundred" with a binary value of 110 0100. > But the same "100" in tonal would be "san" with a binary value of > 1 0000 0000. "100" with the meaning of "one hundred" is spoken as "ciento" in Spanish, "ekat?n" in Greek, "sto" in Russian, etc. So pronunciation by itself doesn't necessarily justify separate encoding. Within English-speaking contexts, "100" can also be a binary number, or an octal number with a binary value of 100 0000. In my world as a developer, it's often a hex number, as in tonal. In most of these cases it's typically pronounced "one zero zero" or "one oh oh." So the numeric value of a string of digits within a positional system also doesn't necessarily justify separate encoding. TTS systems always have to rely on environmental hints. Anyone who has worked on them will agree. > And in the other example, one is "B with double lines" vs "bitcoins". 
As David pointed out, currency symbols really aren't an analogy to anything else. They are never built from combining characters, and are never decomposable to them. This has nothing really to do with TTS or pronunciation. One person in the Ubuntu thread mentioned that, but that is not the primary reason.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From luke at dashjr.org Sun Mar 15 18:23:18 2015
From: luke at dashjr.org (Luke Dashjr)
Date: Sun, 15 Mar 2015 23:23:18 +0000
Subject: CSUR Tonal
In-Reply-To: <1AAE6043A8854DCCB51B556F98F4A61D@DougEwell>
References: <1AAE6043A8854DCCB51B556F98F4A61D@DougEwell>
Message-ID: <201503152323.24059.luke@dashjr.org>

On Sunday, March 15, 2015 10:50:13 PM Doug Ewell wrote:
> > And in the other example, one is "B with double lines" vs "bitcoins".
>
> As David pointed out, currency symbols really aren't an analogy to
> anything else. They are never built from combining characters, and are
> never decomposable to them. This has nothing really to do with TTS or
> pronunciation. One person in the Ubuntu thread mentioned that, but that
> is not the primary reason.

Why is that?

From lisam at us.ibm.com Sun Mar 15 21:00:15 2015
From: lisam at us.ibm.com (Lisa Moore)
Date: Sun, 15 Mar 2015 19:00:15 -0700
Subject: Android 5.1 ships with support for several minority scripts
In-Reply-To:
References:
Message-ID:

+1...this is great!

Lisa

From: Mark Davis
To: Roozbeh Pournader
Cc: Norbert Lindenberg, Behdad Esfahbod, "unicode at unicode.org"
Date: 03/14/2015 12:17 AM
Subject: Re: Android 5.1 ships with support for several minority scripts
Sent by: "Unicode"

Congrats!

{phone}

From wjgo_10009 at btinternet.com Mon Mar 16 05:15:59 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Mon, 16 Mar 2015 10:15:59 +0000 (GMT)
Subject: Fun open-ended coding puzzles
Message-ID: <7436024.16384.1426500959149.JavaMail.defaultUser@defaultHost>

Fun open-ended coding puzzles

The following post includes a link to a multi-colour "multi-do-not" glyph in Venice, the image from Venice being in Google Street View.
http://forum.high-logic.com/viewtopic.php?p=20763#p20763

Using Unicode 8.0, is it possible to encode a plain text representation of that glyph, using a sequence of characters, some displayable and maybe including some joiners and possibly including one or more uses of the emoji variation selector?

If so, what is the most appropriate sequence of characters?

If not, what necessary character or characters is or are missing?

If one searches for Clos Lucé in http://maps.google.com then one can use Google Street View to look around in the grounds of the château where Leonardo da Vinci spent his last few years, the name of the château having changed since the time of Leonardo da Vinci.

There are various exhibits including a large banner amongst the trees showing an enlargement of part of the Mona Lisa.

What is the closest plain text Unicode 8.0 sequence to represent the exhibit, including at least two trees?

Readers are invited to post comments and additional fun open-ended coding puzzles.

William Overington

16 March 2015

From doug at ewellic.org Mon Mar 16 11:39:37 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 16 Mar 2015 09:39:37 -0700
Subject: CSUR Tonal
Message-ID: <20150316093937.665a7a7059d7ee80bb4d670165c8327d.20150cc036.wbe@email03.secureserver.net>

Luke Dashjr wrote:

>> As David pointed out, currency symbols really aren't an analogy to
>> anything else. They are never built from combining characters, and
>> are never decomposable to them. This has nothing really to do with
>> TTS or pronunciation. One person in the Ubuntu thread mentioned that,
>> but that is not the primary reason.
>
> Why is that?

Why are currency symbols not decomposable to combining characters, or equivalently, composable from them?

Well, I (rather famously) don't speak for UTC or WG2, and for all I know, the official thinking on this has changed. But the impression I got from 20 years following the Unicode Standard was that a currency symbol such as $ is fundamentally different from a capital S with a combining vertical line, even if that was the original derivation of the symbol a few centuries ago. Likewise, even if the euro sign € was designed as a stylized E with an equals-sign overlay symbolizing equality, or something, that is no longer its essential nature as a character.

By contrast, an A with an acute accent is, and will always be, an A with an acute accent, regardless of whether it is encoded as a precomposed character or as a combining sequence, or whether it is perceived as a unitary letter in any language's orthography.

In particular with regard to the bitcoin hack^H^H^H^Hworkaround, U+20E6 COMBINING DOUBLE VERTICAL STROKE OVERLAY appears to have been encoded for a specific math purpose, and not meant to be applied to just any arbitrary base character on the basis of its appearance. (There was also a corollary principle that characters should be used for what they are meant to be, not just because the glyph "looks right.")

As I said, YMMV, and you are way better off checking with a real UTC or WG2 member or even writing a proposal.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From wjgo_10009 at btinternet.com Tue Mar 17 12:45:49 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 17 Mar 2015 17:45:49 +0000 (GMT)
Subject: Encoding of HEBREW TETRAGRAMMATON
In-Reply-To: <5505FCDF.4070301@kli.org>
References: <5505FCDF.4070301@kli.org>
Message-ID: <28937181.67158.1426614349858.JavaMail.defaultUser@defaultHost>

> Whatever the fate or merits of the initial proposal we made back in 1998, we need to work out how these documents are to be encoded, I think. What do you think?

I opine that these items need to be encoded and encoded such that they can be accessed and applied using basic systems rather than needing rare and expensive specialist software.
Suppose that plane 4 were introduced with a rule that if a rendering system cannot find in a font a glyph for U+4PQRS then it first looks in the font for a glyph for U+4P000 before using the .notdef glyph. P, Q, R, S are here used as placeholders, each placeholding for a hexadecimal value in the range 0 .. 15. In the example, Q, R and S are not all simultaneously zero. This would allow for a graceful use of a generic glyph indicating the meaning and indicating that there is an underlying encoding that could be rendered using a font that supplies a correct glyph. That would allow for sixteen independent blocks each consisting of one generic glyph and up to 4095 specific glyphs. One of those sixteen blocks could be used for this encoding project. The other blocks would be available for other encoding projects. Maybe smaller blocks and more of them with this particular encoding project having several such blocks ringfenced for it. William Overington 17 March 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Tue Mar 17 20:28:55 2015 From: mark at kli.org (Mark E. Shoulson) Date: Tue, 17 Mar 2015 21:28:55 -0400 Subject: Encoding of HEBREW TETRAGRAMMATON In-Reply-To: <28937181.67158.1426614349858.JavaMail.defaultUser@defaultHost> References: <5505FCDF.4070301@kli.org> <28937181.67158.1426614349858.JavaMail.defaultUser@defaultHost> Message-ID: <5508D4D7.8090905@kli.org> I think you're conflating glyphs with characters. Multiple glyphs for a single character, selectable by the font or the software is fine; we don't necessarily need codepoints for things that are to be considered the same character. (This begs the question of whether or not these count as "the same character" as proposed originally, or something else.) 
~mark On 03/17/2015 01:45 PM, William_J_G Overington wrote: > > Suppose that plane 4 were introduced with a rule that if a rendering > system cannot find in a font a glyph for U+4PQRS then it first looks > in the font for a glyph for U+4P000 before using the .notdef glyph. P, > Q, R, S are here used as placeholders, each placeholding for a > hexadecimal value in the range 0 .. 15. In the example, Q, R and S are > not all simultaneously zero. > > This would allow for a graceful use of a generic glyph indicating the > meaning and indicating that there is an underlying encoding that could > be rendered using a font that supplies a correct glyph. > > That would allow for sixteen independent blocks each consisting of one > generic glyph and up to 4095 specific glyphs. One of those sixteen > blocks could be used for this encoding project. The other blocks would > be available for other encoding projects. > > Maybe smaller blocks and more of them with this particular encoding > project having several such blocks ringfenced for it. > > William Overington > > 17 March 2015 > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Mar 18 12:14:35 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 18 Mar 2015 10:14:35 -0700 Subject: Encoding of HEBREW TETRAGRAMMATON Message-ID: <20150318101435.665a7a7059d7ee80bb4d670165c8327d.4ebb5b4075.wbe@email03.secureserver.net> William_J_G Overington wrote: > Suppose that plane 4 were introduced with a rule that if a rendering > system cannot find in a font a glyph for U+4PQRS then it first looks > in the font for a glyph for U+4P000 before using the .notdef glyph. P, > Q, R, S are here used as placeholders, each placeholding for a > hexadecimal value in the range 0 .. 15. In the example, Q, R and S are > not all simultaneously zero. 
> > This would allow for a graceful use of a generic glyph indicating the > meaning and indicating that there is an underlying encoding that could > be rendered using a font that supplies a correct glyph. http://www.unicode.org/policies/lastresortfont_eula.html -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From wjgo_10009 at btinternet.com Wed Mar 18 14:06:15 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 18 Mar 2015 19:06:15 +0000 (GMT) Subject: Encoding of HEBREW TETRAGRAMMATON In-Reply-To: <5508D4D7.8090905@kli.org> References: <5505FCDF.4070301@kli.org> <28937181.67158.1426614349858.JavaMail.defaultUser@defaultHost> <5508D4D7.8090905@kli.org> Message-ID: <22354037.65561.1426705575630.JavaMail.defaultUser@defaultHost> > Multiple glyphs for a single character, selectable by the font or the software is fine; we don't necessarily need codepoints for things that are to be considered the same character. Well, the encoding needs to be of practical usefulness. The only other way that I can at present think might be used is variation selectors and that might be quite complicated and limited to 255 variants at present and need specialist software to use. For completeness I am trying to find out how that could be done. http://forum.high-logic.com/viewtopic.php?f=37&t=5487 > (This begs the question of whether or not these count as "the same character" as proposed originally, or something else.) Maybe the solution to that question is to regard it as a paradox and just submit a petition to the Unicode Technical Committee requesting encoding of the items each with an individual code point. William Overington 18 March 2015 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From software at ogee-group.com Sat Mar 21 09:33:37 2015 From: software at ogee-group.com (software at ogee-group.com) Date: Sat, 21 Mar 2015 14:33:37 +0000 Subject: Looking for Unicode Plain Text Math parser/converter Message-ID: <20150321143337.Horde.dP0g2UtHhb-fM7AB-KSGsw3@just132.justhost.com> Hello, I am new to this list. I am looking for a parser/converter in C++/C# to interpret a Unicode Plain Text Math expression (see UTN 28 or UTR 25) and convert it to MathML. Example expression: \sum_(i=1)^n \naryand x_i Any help would be greatly appreciated Thank you, s. From wjgo_10009 at btinternet.com Mon Mar 23 10:35:15 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 23 Mar 2015 15:35:15 +0000 (GMT) Subject: Origin of the digital encoding of accented characters for Esperanto Message-ID: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> Origin of the digital encoding of accented characters for Esperanto Twelve accented characters (uppercase versions and lowercase versions of six accented letters) used for Esperanto are encoded in Unicode. These may well be in Unicode as legacy encoded characters from one or more earlier standards. Does anyone know please how Esperanto characters first became encoded digitally? For example, was it that someone who was interested in Esperanto happened to be a member of a committee that was working on encoding accented characters? Or did one or more people, or a group of people, or an Esperanto society, lobby for the characters to become included? Or what? It does not seem axiomatic that accented characters for Esperanto would necessarily be included in a digital encoding of the accented characters needed for the languages of Europe. William Overington 23 March 2015 -------------- next part -------------- An HTML attachment was scrubbed... 
From kenwhistler at att.net Mon Mar 23 11:58:45 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 23 Mar 2015 09:58:45 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost>
Message-ID: <55104645.3090609@att.net>

On 3/23/2015 8:35 AM, William_J_G Overington wrote:
> Origin of the digital encoding of accented characters for Esperanto
>
> Twelve accented characters (uppercase versions and lowercase versions
> of six accented letters) used for Esperanto are encoded in Unicode.

WJO is referring to U+0109, U+011D, U+0125, U+0135, U+015D, U+016D (and their uppercase pairs).

> These may well be in Unicode as legacy encoded characters from one or
> more earlier standards.

No.

> Does anyone know please how Esperanto characters first became encoded
> digitally?

In the Unicode Standard, the fact that these all occur in the Latin Extended-A block is a clue. The Latin Extended-A block dates back to Unicode 1.0. You can easily verify that by referring to the archival record. See:

http://www.unicode.org/versions/Unicode1.0.0/

And in fact, the exact set in the Latin Extended-A block can be traced even further back than the publication of Unicode 1.0 in 1991. That same repertoire was included in the charts distributed for public review in the Unicode 1.0 final review draft in December, 1990. So we know that the inclusion of the 12 accented characters for Esperanto in that set dates back at least that far -- which should eliminate a lot of fruitless alternative speculative theories about their origins in Unicode.

> For example, was it that someone who was interested in Esperanto
> happened to be a member of a committee that was working on encoding
> accented characters?

Well, sort of. See further explanation below.

> Or did one or more people, or a group of people, or an Esperanto
> society, lobby for the characters to become included?

No.

> Or what?

Well, the answer is sort of "or what". The repertoire of accented characters included in the Latin Extended-A block for the final review draft of Unicode 1.0 in December, 1990 was largely culled from the even earlier list of Latin letters proposed for encoding in the 2nd DP (Draft Proposal) for ISO/IEC 10646-1. Their inclusion in the Unicode Standard 1.0 repertoire was one of the early compatibility decisions, to ensure that repertoire that national bodies had thought important enough to be included in the early 10646 balloting was accounted for in some way in the first Unicode Standard draft.

The list of accented Latin letters in the Latin Extended-A block consisted of the union of all of the then-extant ISO 8859 8-bit standard repertoire for various Latin alphabets, *plus* the additional letters culled from the 2nd DP 10646-1.

For the record, the 2nd DP 10646 was JTC1/SC2 N2066 (=WG2 N551), dated December 1, 1989. In that era, documents were only distributed by paper, and I don't know of an extant online copy, so it is rather difficult to track down!

In any event, in that document from 1989, I consider it likely that the person who probably originally assembled the lists of various European language alphabets and included them in the drafts for balloting was Hugh McGregor Ross, the then British editor of 10646 and a person with a passion for details about lesser-used writing systems. Mr. Ross is, unfortunately, recently deceased, so we cannot ask him directly. But I suspect that examination of the early drafts of 10646 and papers related to it would confirm this speculation on my part.

--Ken

> It does not seem axiomatic that accented characters for Esperanto
> would necessarily be included in a digital encoding of the accented
> characters needed for the languages of Europe.
>
> William Overington

From leob at mailcom.com Mon Mar 23 12:10:46 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Mon, 23 Mar 2015 10:10:46 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <55104645.3090609@att.net>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> <55104645.3090609@att.net>
Message-ID:

Ken,

zgrep U011D /usr/share/i18n/charmaps/*
ANSI_X3.110-1983.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
EUC-JISX0213.gz: /xaa/xe0 LATIN SMALL LETTER G WITH CIRCUMFLEX
EUC-JP.gz: /x8f/xab/xba LATIN SMALL LETTER G WITH CIRCUMFLEX
EUC-JP-MS.gz: /x8f/xab/xba LATIN SMALL LETTER G WITH CIRCUMFLEX
EUC-TW.gz: /x8e/xa7/xac/xbc
GB18030.gz: /x81/x30/x8e/x34 LATIN SMALL LETTER G WITH CIRCUMFLEX
IBM905.gz: /x9b LATIN SMALL LETTER G WITH CIRCUMFLEX
ISO_6937-2-ADD.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
ISO_6937.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
ISO-8859-3.gz: /xf8 LATIN SMALL LETTER G WITH CIRCUMFLEX
ISO_8859-SUPP.gz: /xb8 LATIN SMALL LETTER G WITH CIRCUMFLEX
ISO-IR-90.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
SHIFT_JISX0213.gz: /x85/xde LATIN SMALL LETTER G WITH CIRCUMFLEX
T.101-G2.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
T.61-8BIT.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX
UTF-8.gz: /xc4/x9d LATIN SMALL LETTER G WITH CIRCUMFLEX
VIDEOTEX-SUPPL.gz: /xc3/x67 LATIN SMALL LETTER G WITH CIRCUMFLEX

How come this character is in ISO-8859-3? IBM905?

Leo

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tom at bluesky.org Mon Mar 23 12:26:19 2015 From: tom at bluesky.org (Tom Gewecke) Date: Mon, 23 Mar 2015 10:26:19 -0700 Subject: Origin of the digital encoding of accented characters for Esperanto In-Reply-To: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> Message-ID: <68A55FC1-0980-4F6E-9F50-CA6E627211A0@bluesky.org> On 23 mars 2015, at 08:35, William_J_G Overington wrote: > Origin of the digital encoding of accented characters for Esperanto > > These may well be in Unicode as legacy encoded characters from one or more earlier standards. ISO 6937 of 1983 seems to have been designed to support them. http://en.wikipedia.org/wiki/ISO/IEC_6937 From kenwhistler at att.net Mon Mar 23 12:44:10 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 23 Mar 2015 10:44:10 -0700 Subject: Origin of the digital encoding of accented characters for Esperanto In-Reply-To: References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> <55104645.3090609@att.net> Message-ID: <551050EA.1010904@att.net> For ISO 8859-3, the answer is in the wiki: http://en.wikipedia.org/wiki/ISO/IEC_8859-3 "It was designed to cover Turkish, Maltese and Esperanto, ..." The answer for IBM CP905 is simple -- it is simply the EBCDIC code page of June, 1986 that corresponded to ISO 8859-3. That also covers the answer for ISO-IR 109, which is simply the registration of the right-hand part of Latin-3. At any rate, since I didn't check first whether the Esperanto letters were in ISO 8859-3 before I wrote my initial response, this would certainly remove all proximate speculation about the occurrence of the accented letters for Esperanto in the Unicode 1.0 repertoire in Latin Extended-A. They were included by the exercise of doing the union of all the 8859 Latin alphabets. 
So the answer for Unicode is, instead, *yes*, they were in a pre-existing standard that was grandfathered in to the initial collection of accented Latin letters.

And the question, instead, then becomes tracking down through the ancient history of JTC1/SC2/WG3 (<-- Note *3*, not *2*), why the participants who drafted 8859-3 felt it was important to include the Esperanto letters in the repertoire for the South European set back in 1986. That date, by the way, is earlier than anything I have firsthand records for.

--Ken

On 3/23/2015 10:10 AM, Leo Broukhis wrote:
> How come this character is in ISO-8859-3? IBM905?
>
> Leo

From leob at mailcom.com Mon Mar 23 13:13:02 2015
From: leob at mailcom.com (Leo Broukhis)
Date: Mon, 23 Mar 2015 11:13:02 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <551050EA.1010904@att.net>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> <55104645.3090609@att.net> <551050EA.1010904@att.net>
Message-ID:

> So the answer for Unicode is, instead, *yes*, they were in
> a pre-existing standard that was grandfathered in to the
> initial collection of accented Latin letters.

That's what I was hinting at. :)

Leo

From doug at ewellic.org Mon Mar 23 13:22:38 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 23 Mar 2015 11:22:38 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
Message-ID: <20150323112238.665a7a7059d7ee80bb4d670165c8327d.3fdaa81729.wbe@email03.secureserver.net>

Ken wrote:

> The list of accented Latin letters in the Latin Extended-A block
> consisted of the union of all of the then-extant ISO 8859 8-bit
> standard repertoire for various Latin alphabets, *plus* the additional
> letters culled from the 2nd DP 10646-1.

The Esperanto letters can be found in ECMA-94, 2nd Edition (June 1986), pp. 17-21 (pp. 33-37 in the PDF), which is equivalent to ISO 8859-3.

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf

--
Doug Ewell | http://ewellic.org | Thornton, CO
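[Editorial aside, not part of the thread.] The point made above, that the Esperanto letters reached Latin Extended-A by way of ISO 8859-3, can be checked in a few lines. What follows is only a minimal sketch in Python, assuming that the standard library's `iso8859_3` codec faithfully reflects Latin-3; the names `ESPERANTO_LETTERS`, `latin3_byte` and `in_latin_extended_a` are illustrative and do not come from any of the messages.

```python
# Sketch: confirm that the twelve accented Esperanto letters are
# single-byte characters in ISO 8859-3 (Latin-3) and that their Unicode
# code points fall in the Latin Extended-A block (U+0100..U+017F).

# Illustrative list (assumption): C, G, H, J, S with circumflex and
# U with breve, each in upper and lower case, written as escapes.
ESPERANTO_LETTERS = (
    "\u0108\u0109\u011c\u011d\u0124\u0125"
    "\u0134\u0135\u015c\u015d\u016c\u016d"
)


def latin3_byte(ch: str) -> int:
    """Return the ISO 8859-3 byte value of a single character."""
    encoded = ch.encode("iso8859_3")  # raises UnicodeEncodeError if absent
    if len(encoded) != 1:
        raise ValueError("expected a single-byte encoding")
    return encoded[0]


def in_latin_extended_a(ch: str) -> bool:
    """True if the character lies in U+0100..U+017F."""
    return 0x0100 <= ord(ch) <= 0x017F


for ch in ESPERANTO_LETTERS:
    print(f"U+{ord(ch):04X} -> 0x{latin3_byte(ch):02X} in ISO 8859-3")
```

The loop prints one line per letter; if a letter were missing from Latin-3, the `encode` call would raise instead of returning a byte.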
From richard.wordingham at ntlworld.com Mon Mar 23 14:54:07 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Mon, 23 Mar 2015 19:54:07 +0000
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <551050EA.1010904@att.net>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> <55104645.3090609@att.net> <551050EA.1010904@att.net>
Message-ID: <20150323195407.0f7ebdda@JRWUBU2>

On Mon, 23 Mar 2015 10:44:10 -0700 Ken Whistler wrote:

> And the question, instead, then becomes tracking down through
> the ancient history of JTC1/SC2/WG3 (<-- Note *3*, not *2*),
> why the participants who drafted 8859-3 felt it was important
> to include the Esperanto letters in the repertoire for the South
> European set back in 1986. That date, by the way, is earlier than
> anything I have firsthand records for.

Perhaps it's more an odds and sods collection. Esperanto was once significant, so one should not be surprised that it should be supported.

Richard.

From prosfilaes at gmail.com Mon Mar 23 21:01:32 2015
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 23 Mar 2015 19:01:32 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost>
Message-ID:

On Mon, Mar 23, 2015 at 8:35 AM, William_J_G Overington wrote:

> It does not seem axiomatic that accented characters for Esperanto would
> necessarily be included in a digital encoding of the accented characters
> needed for the languages of Europe.

Where does "languages of Europe" come from? Latin Extended-A is not designed to exclusively cover Europe, and both ISO 8859-3 and Extended-A cover Turkish.
The largest Esperanto libraries have about 25,000 books, and there's a large collection of people wanting to use Esperanto on the Internet; moreover, the encoding decision is trivial, being a simple and uncontested set of twelve codepoints. Of all the Latin script characters not encoded in Unicode 1.1, I doubt any of them have 1% the use of the Esperanto characters. Not encoding them upfront would have been silly.

--
Kie ekzistas vivo, ekzistas espero.

From asmusf at ix.netcom.com Mon Mar 23 23:56:09 2015
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Mon, 23 Mar 2015 21:56:09 -0700
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To: <551050EA.1010904@att.net>
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost> <55104645.3090609@att.net> <551050EA.1010904@att.net>
Message-ID: <5510EE69.20003@ix.netcom.com>

On 3/23/2015 10:44 AM, Ken Whistler wrote:

> And the question, instead, then becomes tracking down through
> the ancient history of JTC1/SC2/WG3 (<-- Note *3*, not *2*),
> why the participants who drafted 8859-3 felt it was important
> to include the Esperanto letters in the repertoire for the South
> European set back in 1986. That date, by the way, is earlier than
> anything I have firsthand records for.

ECMA was actively involved in developing these sets and published them as a parallel standard (ECMA-94, second edition, 1986).

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-094.pdf

ECMA had a very close working relation with ISO; some of that history can be tracked down in snippets on the web, but sadly, most of the active participants in developing the early editions of the 8859 series would have passed away by now.

A./
From wjgo_10009 at btinternet.com Tue Mar 24 06:19:26 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Tue, 24 Mar 2015 11:19:26 +0000 (GMT)
Subject: Origin of the digital encoding of accented characters for Esperanto
In-Reply-To:
References: <24228925.47147.1427124915251.JavaMail.defaultUser@defaultHost>
Message-ID: <33262193.24001.1427195966706.JavaMail.defaultUser@defaultHost>

WJGO >> It does not seem axiomatic that accented characters for Esperanto would necessarily be included in a digital encoding of the accented characters needed for the languages of Europe.

DS > Where does languages of Europe come from?

It seems to me that an alternative scenario could quite easily, and possibly more probably, have been what happened: namely, that a list of the countries of Europe had been made and then, starting from that list, the main language of each country was added to a list of languages to be supported, with Esperanto not even having been thought about. Also, it could have been that, if Esperanto had been suggested, the idea would have been dismissed because Esperanto is not the language of a country, or dismissed for some negative opinion about Esperanto or some other purported reason.

It seems axiomatic that the accented characters for French and German would be included, yet not axiomatic that the accented characters for Esperanto would be included. So I wondered how they came to become encoded.

Back in the 1960s I saw a list of the accented characters needed to typeset in various European languages. It was in the Riscatype catalogue of metal type. Esperanto was included in that list. Is it possible that that list was used years later in deciding which accented characters to include in an electronic coding?

I remember that in the early 1970s two researchers were trying to translate what they thought was a paper in Spanish and having great problems. I glanced at the text and pointed out that it was Portuguese.
Asked if I spoke Portuguese I replied that I did not, but that, being interested in printing, I knew that the a tilde character was used in Portuguese and not in Spanish: so the Riscatype list was helpful to them. DS > Latin Extended-A is not designed to exclusively cover Europe, and both ISO 8859-3 and Extended-A cover Turkish. Well, part of Turkey is in Europe. DS > The largest Esperanto libraries have about 25,000 books, and there's a large collection of people wanting to use Esperanto on the Internet; ... Fine. DS > ... moreover, the encoding decision is trivial, being a simple and uncontested set of twelve codepoints. Well, the decision was not necessarily trivial nor uncontested: that is now a part of history and maybe some documentation will be found to describe what was the situation at the time. DS > Of all the Latin script characters not encoded in Unicode 1.1, I doubt any of them have 1% the use of the Esperanto characters. Not encoding them upfront would have been silly. I have been interested in Esperanto since the 1960s when I found an Esperanto dictionary in an antiquarian bookshop. I had not previously known of Esperanto. I asked the bookshop owner about this language and he explained and I bought the dictionary and a copy of the English version of the book The Life of Zamenhof, by Edmond Privat. Soon after I bought a copy of Teach Yourself Esperanto and some years later in the early 1970s I bought the Teach Yourself Esperanto Dictionary. In the late 1990s I gained two certificates in Esperanto, namely for Elementary and Intermediate levels. More recently I have written a song in Esperanto and I am hoping to record it and place it on the web so that it will become archived by the British Library. The song lyrics use g circumflex many times and s circumflex a few times and I was thinking that it is good that the characters are available in Unicode. DS > Kie ekzistas vivo, ekzistas espero. Dankon. 
For the benefit of readers who do not know any Esperanto, I translate to English what David wrote Where there exists life, there exists hope. thus Where there is life, there is hope. and the translation into English of my reply is Thanks. I am also trying to draft a petition to send to the Unicode Technical Committee about encoding some localizable sentences with their symbols in plane 13 and building localizable sentence technology as a part of Unicode for the future. As part of the introduction I am seeking to compare and contrast Esperanto and localizable sentence technology. Both are intended to assist communication through the language barrier. Neither is intended to replace natural languages. Esperanto can be used to construct a sentence for any meaning. Yet localizable sentences are for a finite set of sentences. Esperanto does need to be learned as a language before it can be used, quicker and simpler than learning French or German, yet still taking quite a lot of study. Localizable sentences could be used easily, just by learning how to use a cascading menu system with category headings and sentences localized into one's own native language: there is the capability to include names, not localizable, within a stream of localizable sentences and escape mechanisms for adding unlocalizable items in Esperanto or in a natural language. Before encoding as electronic characters, the letters used for Esperanto had been in use by a lot of people for many years, in handwriting and in print. Localizable sentence characters, by being part of a pure electronic technology, have no history of use, so an encoding would be so that the technology could become used. Whether that use would happen is open for opinions to be expressed, yet unless the encoding takes place no one can be certain that an encoding into regular Unicode would be used or would not be used. 
William Overington

24 March 2015

From rwhlk142 at gmail.com Wed Mar 25 12:20:08 2015
From: rwhlk142 at gmail.com (Robert Wheelock)
Date: Wed, 25 Mar 2015 13:20:08 -0400
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
Message-ID:

Hello!

When you're typing, do you find yourself winding up being CONFUSED over what you type?!?! It's a crucially SERIOUS matter, especially when typing on a computer!

For instance: When you type in a HOLLOW HEART SUIT (U+02661), it may show up as an IDENTICAL TO SIGN (U+02261) or a GREEK CAPITAL LETTER XI (U+0039E)... it all DEPENDS on whatever FONT you're using to type with!

The default Microsoft Sans Serif font (within Microsoft Windows) has this ABOMINABLE habit of substituting this IDENTICAL TO SIGN (which should be at U+02261), because Microsoft (regrettably) placed this math symbol where the HOLLOW HEART SUIT should be (at U+02661)! * ?AGONISTES!*

What Microsoft SHOULD DO *is* *THIS*: Please move the IDENTICAL TO SIGN from (U+02661, the location where the HOLLOW HEART SUIT goes) to its PROPER LOCATION at (U+02261)!! THAT would be MUCH better!!

What other CHARACTER CALAMITIES have you come across?!?!

Thank You!

From mike.mcglothlin at gmail.com Wed Mar 25 16:24:26 2015
From: mike.mcglothlin at gmail.com (Michael McGlothlin)
Date: Wed, 25 Mar 2015 16:24:26 -0500
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To:
References:
Message-ID: <15450011-1A3B-4745-A999-BC7EA12C8E03@gmail.com>

I'd like to see a free/open "default" font that has a correct, simply styled symbol for every Unicode character. Vendors should be pressured to use this font when other options aren't available. I get tired of seeing default symbols, incorrect symbols, and mystery white spaces that aren't really white space. It's pretty silly to have a code point without a default symbol, I think.

Thanks,
Michael McGlothlin
Sent from my iPhone

From shervinafshar at gmail.com Wed Mar 25 17:18:44 2015
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Wed, 25 Mar 2015 15:18:44 -0700
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To: <15450011-1A3B-4745-A999-BC7EA12C8E03@gmail.com>
References: <15450011-1A3B-4745-A999-BC7EA12C8E03@gmail.com>
Message-ID:

Just like Unicode Last Resort Font[1]?

[1]: http://www.unicode.org/policies/lastresortfont_eula.html

- Shervin
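[Editorial aside, not part of the thread.] The two code points at issue in this thread, U+2661 and U+2261, differ by a single hex digit, which makes them easy to mix up in a font table. As a minimal sketch, assuming only Python's standard `unicodedata` module (the helper name `describe` is illustrative), the formal name assigned to each code point can be looked up like this:

```python
import unicodedata


def describe(code_point: int) -> str:
    """Format a code point as 'U+XXXX NAME' from the Unicode Character Database."""
    name = unicodedata.name(chr(code_point), "<unnamed>")
    return f"U+{code_point:04X} {name}"


# The three code points mentioned in the original message.
for cp in (0x2661, 0x2261, 0x039E):
    print(describe(cp))

# XOR shows which bits differ between the heart and the math sign.
print(hex(0x2661 ^ 0x2261))
```

Such a lookup only reports what the standard assigns to a code point; which glyph a particular font draws there is a separate matter that this sketch does not inspect.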
From doug at ewellic.org Wed Mar 25 17:40:03 2015
From: doug at ewellic.org (Doug Ewell)
Date: Wed, 25 Mar 2015 15:40:03 -0700
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
Message-ID: <20150325154003.665a7a7059d7ee80bb4d670165c8327d.e6f1cc4b8c.wbe@email03.secureserver.net>

Robert Wheelock wrote:

> The default Microsoft Sans Serif font (within Microsoft Windows) has
> this ABOMINABLE habit of substituting this IDENTICAL TO SIGN (which
> should be at U+02261), because Microsoft (regrettably) placed this math
> symbol where the HOLLOW HEART SUIT should be (at U+02661)!
> * ?AGONISTES!*

It's a known font bug. It's been around since at least 2010. It's probably not the end of the world.

(Calling U+2661 by the name HOLLOW HEART SUIT instead of its real name, WHITE HEART SUIT, is also a bug.)

--
Doug Ewell | http://ewellic.org | Thornton, CO

From mark at kli.org Wed Mar 25 21:30:55 2015
From: mark at kli.org (Mark E. Shoulson)
Date: Wed, 25 Mar 2015 22:30:55 -0400
Subject: Avoidance variants
Message-ID: <55136F5F.6080808@kli.org>

An HTML attachment was scrubbed...

From lang.support at gmail.com Wed Mar 25 21:40:17 2015
From: lang.support at gmail.com (Andrew Cunningham)
Date: Thu, 26 Mar 2015 13:40:17 +1100
Subject: Avoidance variants
In-Reply-To: <55136F5F.6080808@kli.org>
References: <55136F5F.6080808@kli.org>
Message-ID:

Or is it a markup issue rather than something for plain text?

On 26 March 2015 at 13:30, Mark E. Shoulson wrote:

> So, not much in the way of discussion regarding the TETRAGRAMMATON issue
> I raised the other week. OK; someone'll eventually get to it I guess.
>
> Another thing I was thinking about, while toying with Hebrew fonts. > Often, letters are substituted in _nomina sacra_ in order to avoid writing > a holy name, much as the various symbols for the tetragrammaton are used. > And indeed, sometimes they're used in that name too, as I mentioned, usages > like ???? or ???? and so on.
There's an example in the paper that shows > ????? instead of ?????. Much more common today would be ????? and in fact > people frequently even pronounce it that way (when it refers to big-G God, > in non-sacred contexts. But for little-g gods, the same word is pronounced > without the avoidance, because it isn't holy. It's weird.) > > I wonder if it makes sense maybe to encode not a codepoint, but a variant > sequence(s) to represent this sort of "defaced" or "altered" letter HEH. > It's still a HEH, it just looks like another letter, right? (QOF or DALET > or occasionally HET) That would keep some consistency to the spelling. On > the other hand, the spelling with a QOF is already well entrenched in texts > all over the internet. But maybe it isn't right. And what about the use > of ?? or ?? for the tetragrammaton? Are they both HEHs, one "altered", or > is one really a DALET? Any thoughts? > > (and seriously, what to do about all those tetragrammaton symbols?) > > ~mark > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -- Andrew Cunningham Project Manager, Research and Development (Social and Digital Inclusion) Public Libraries and Community Engagement State Library of Victoria 328 Swanston Street Melbourne VIC 3000 Australia Ph: +61-3-8664-7430 Mobile: 0459 806 589 Email: acunningham at slv.vic.gov.au lang.support at gmail.com http://www.openroad.net.au/ http://www.mylanguage.gov.au/ http://www.slv.vic.gov.au/ -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Wed Mar 25 21:49:31 2015 From: mark at kli.org (Mark E. Shoulson) Date: Wed, 25 Mar 2015 22:49:31 -0400 Subject: Avoidance variants In-Reply-To: References: <55136F5F.6080808@kli.org> Message-ID: <551373BB.70200@kli.org> On 03/25/2015 10:40 PM, Andrew Cunningham wrote: > Or is it a markup issue rather than something for plain text? 
Maybe, but it doesn't really seem so. There's no such thing as plain text on paper (once it's printed it's arguably formatted somehow), but looking at all the examples it seems to be happening in contexts that are as plain-texty as you could wish for. Blocks of boring plain text, no italics or effects any more complex than justification, simple notes written all in one font with no formatting to speak of, etc. For that matter, there are all the existing plain-text-encoded cases I mentioned, using actual QOF letters instead of HEHs in undeniably plain, electronic text.

Maybe it would make more sense to encode such texts as they are written (using QOF codepoints, etc) and have the markup indicate that it's a HEH in disguise. But that doesn't gain us anything in terms of standardizing the spelling, so to speak, to have the same text/word represented with the same letters in different representations.

~mark

From petercon at microsoft.com Wed Mar 25 22:29:24 2015
From: petercon at microsoft.com (Peter Constable)
Date: Thu, 26 Mar 2015 03:29:24 +0000
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To: <20150325154003.665a7a7059d7ee80bb4d670165c8327d.e6f1cc4b8c.wbe@email03.secureserver.net>
References: <20150325154003.665a7a7059d7ee80bb4d670165c8327d.e6f1cc4b8c.wbe@email03.secureserver.net>
Message-ID:

It's the first time it was reported to us, AFAIK.

Sent from my IBM 3277/APL

From jonathan.rosenne at gmail.com Thu Mar 26 00:14:10 2015
From: jonathan.rosenne at gmail.com (Jonathan Rosenne)
Date: Thu, 26 Mar 2015 07:14:10 +0200
Subject: Avoidance variants
In-Reply-To: <55136F5F.6080808@kli.org>
References: <55136F5F.6080808@kli.org>
Message-ID: <004601d06783$b2d3b520$187b1f60$@gmail.com>

"It's still a HEH, it just looks like another letter, right?"

Wrong. It's a QOF. Just like the p in receipt is a p. Unicode should not concern itself with the reasons words are spelt the way they are spelt.

Best Regards,

Jonathan Rosenne

From mike.mcglothlin at gmail.com Thu Mar 26 02:53:56 2015
From: mike.mcglothlin at gmail.com (Michael McGlothlin)
Date: Thu, 26 Mar 2015 02:53:56 -0500
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To:
References: <15450011-1A3B-4745-A999-BC7EA12C8E03@gmail.com>
Message-ID: <0D6FAA7B-CDED-4819-B916-EEFEF7862B98@gmail.com>

Similar but with a couple differences. Most important would be getting vendors to actually use the font. Also it should be appropriate to actually display the characters rather than being debugging information.

Does this last resort font represent every character in some meaningful way? e.g. I've tried to use somewhat rare characters like runes before and it was a pretty big pain to find fonts that were free to distribute, weren't buggy, and displayed the correct symbol for that character. And some applications wouldn't display them correctly even after installing a font.
(Visual Studio let me use runes as variable names and compiled fine but wouldn't actually display the rune symbols.)

Sent from my iPad

From philip_chastney at yahoo.com Thu Mar 26 04:46:49 2015
From: philip_chastney at yahoo.com (philip chastney)
Date: Thu, 26 Mar 2015 02:46:49 -0700
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To: <0D6FAA7B-CDED-4819-B916-EEFEF7862B98@gmail.com>
Message-ID: <1427363209.31231.YahooMailBasic@web162606.mail.bf1.yahoo.com>

a couple of points that you might like to take into account:

1) who is going to finance the development of these freely distributable fonts?

2) how is old software expected to handle codepoints allocated after the software was released?

both of these issues are solvable, but I wouldn't expect them to be solved soon

/phil

From mark at macchiato.com Thu Mar 26 04:55:24 2015
From: mark at macchiato.com (Mark Davis ☕)
Date: Thu, 26 Mar 2015 10:55:24 +0100
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
In-Reply-To: <0D6FAA7B-CDED-4819-B916-EEFEF7862B98@gmail.com>
References: <15450011-1A3B-4745-A999-BC7EA12C8E03@gmail.com> <0D6FAA7B-CDED-4819-B916-EEFEF7862B98@gmail.com>
Message-ID:

The Last Resort font only provides a "stand-in" glyph if you don't otherwise have a font for that character on your system. That "stand-in" just indicates the type of character (e.g. script).

No single font with current technology can handle all of Unicode. The most complete open font set I know of is the Noto family: https://www.google.com/get/noto/. I don't think it has a full set of symbols (others: correct me if I'm wrong.) Symbola is pretty good for arbitrary symbols.

There are many other resources on http://unicode.org/resources/fonts.html.

Mark

*Il meglio è l'inimico del bene*

On Thu, Mar 26, 2015 at 8:53 AM, Michael McGlothlin < mike.mcglothlin at gmail.com> wrote:

> Similar but with a couple differences. Most important would be getting > vendors to actually use the font. Also it should be appropriate to actually > display the characters rather than being debugging information. > > Does this last resort font represent every character in some meaningful > way? e.g. I've tried to use somewhat rare characters like runes before and > it was a pretty big pain to find fonts that were free to distribute, > weren't buggy, and displayed the correct symbol for that character. And > some applications wouldn't display them correctly even after installing a > font.
(Visual Studio let me use runes as variable names and compiled fine > but wouldn't actually display the rune symbols.) > > > Sent from my iPad > > On Mar 25, 2015, at 5:18 PM, Shervin Afshar > wrote: > > Just like Unicode Last Resort Font[1]? > > [1]: http://www.unicode.org/policies/lastresortfont_eula.html > > ? Shervin > > On Wed, Mar 25, 2015 at 2:24 PM, Michael McGlothlin < > mike.mcglothlin at gmail.com> wrote: > >> I'd like to see a free/open "default" font that has a correct, simple >> styled, symbol for every Unicode character. Vendors should be pressured to >> use this font when other options aren't available. I get tired of seeing >> default symbols, incorrect symbols, and mystery white spaces that aren't >> really white space. It's pretty silly to have a code point without a >> default symbol I think. >> >> >> Thanks, >> Michael McGlothlin >> Sent from my iPhone >> >> On Mar 25, 2015, at 12:20 PM, Robert Wheelock wrote: >> >> Hello! >> >> When you?re typing, do you find yourself winding up being CONFUSED over >> what you type?!?! It?s a crucially SERIOUS matter?especially when typing >> on a computer! >> >> For instance: When you type in a HOLLOW HEART SUIT (U+02661), it may >> show up as an IDENTICAL TO SIGN (U+02261) or a GREEK CAPITAL LETTER XI >> (U+0039E)... it all DEPENDS on whatever FONT you?re using to type with! >> >> The default Microsoft Sans Serif font (within Microsoft Windows) has this >> ABOMINABLE habit of substituting this IDENTICAL TO SIGN (which should be at >> U+02261)?because Microsoft (regrettably) placed this math symbol where the >> HOLLOW HEART SUIT should be (at U+02661)! * ?AGONISTES!* >> >> What Microsoft SHOULD DO *is* *THIS*: Please move the IDENTICAL TO SIGN >> from (U+02661?the location where the HOLLOW HEART SUIT goes) to its PROPER >> LOCATION at (U+02261)!! THAT would be MUCH better!! >> >> What other CHARACTER CALAMITIES have you come across?!?! >> >> Thank You! 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Thu Mar 26 09:43:32 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 26 Mar 2015 07:43:32 -0700
Subject: Are you CONFUSED about WHAT CHARACTER(S) you type?!?!
Message-ID: <20150326074332.665a7a7059d7ee80bb4d670165c8327d.85373632cc.wbe@email03.secureserver.net>

Peter Constable wrote:

>> It's a known font bug. It's been around since at least 2010. It's
>> probably not the end of the world.
>
> It's the first time it was retorted to us, AFAIK.

(I assume "reported" unless the report was made in a caustic manner. :)

Here's a post to Microsoft Community from July 2010, where the poster didn't make the association with the particular font, and the respondent suggested ensuring that other fonts were installed, which of course doesn't really solve the problem:

http://answers.microsoft.com/en-us/windows/forum/windows_7-files/trouble-with-unicode-character-white-heart-suit/d0df7dcc-d5d1-4e95-94f6-e62a53eb8f9c

It was reported to Community again in August 2012, but might not have made it to the right people:

http://answers.microsoft.com/en-us/windows/forum/windows_7-desktop/font-glyph-bug-in-microsoft-sans-serif/a21c9c5a-19f0-4430-ac8f-cbb7af49d66a

--
Doug Ewell | http://ewellic.org | Thornton, CO
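
For anyone who wants to check which character a code point denotes, independent of whatever glyph a particular font happens to draw for it, here is a minimal sketch using Python's standard unicodedata module. It is only an illustration and was not posted by anyone in this thread; the helper name describe is made up for the example. Note that the formal character names are WHITE HEART SUIT for U+2661 and IDENTICAL TO for U+2261.

```python
import unicodedata


def describe(code_point):
    """Return 'U+XXXX NAME' for a code point (hypothetical helper, for illustration only)."""
    ch = chr(code_point)
    # unicodedata.name raises ValueError for unnamed code points unless a default is given.
    name = unicodedata.name(ch, "<no name>")
    return f"U+{code_point:04X} {name}"


# The three code points mentioned in the original report.
for cp in (0x2661, 0x2261, 0x039E):
    print(describe(cp))
```

The names come from the Unicode Character Database bundled with the interpreter, so they stay the same whichever font is installed; a wrong shape on screen is a matter for the font, not the encoding.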
From asmus-inc at ix.netcom.com Thu Mar 26 10:12:13 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 26 Mar 2015 08:12:13 -0700 Subject: Avoidance variants In-Reply-To: <004601d06783$b2d3b520$187b1f60$@gmail.com> References: <55136F5F.6080808@kli.org> <004601d06783$b2d3b520$187b1f60$@gmail.com> Message-ID: <551421CD.1090905@ix.netcom.com> On 3/25/2015 10:14 PM, Jonathan Rosenne wrote: > > ?It's still a HEH, it just looks like another letter, right?? Wrong. > It?s a QOF. Just like the p in receipt is a p. Unicode should not > concern itself with the reasons words are spelt the way they are spelt. > Identifying deliberate misspellings as such is a matter of markup. In other citations, one would use human readable mark-up (that is add "[sic]"), but in other contexts it might be useful to make a term searchable by providing identifying markup; what the protocol for that would be, I don't know, but character encoding, as Jony suggests, is surely the wrong level for dealing with issues of orthography. A./ > > Best Regards, > > Jonathan Rosenne > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Mark E. Shoulson > *Sent:* Thursday, March 26, 2015 4:31 AM > *To:* unicode at unicode.org > *Subject:* Avoidance variants > > So, not much in the way of discussion regarding the TETRAGRAMMATON > issue I raised the other week. OK; someone'll eventually get to it I > guess. > > Another thing I was thinking about, while toying with Hebrew fonts. > Often, letters are substituted in _nomina sacra_ in order to avoid > writing a holy name, much as the various symbols for the > tetragrammaton are used. And indeed, sometimes they're used in that > name too, as I mentioned, usages like ????or ????and so on. There's > an example in the paper that shows ?????instead of ?????. Much more > common today would be ?????and in fact people frequently even > pronounce it that way (when it refers to big-G God, in non-sacred > contexts. 
But for little-g gods, the same word is pronounced without > the avoidance, because it isn't holy. It's weird.) > > I wonder if it makes sense maybe to encode not a codepoint, but a > variant sequence(s) to represent this sort of "defaced" or "altered" > letter HEH. It's still a HEH, it just looks like another letter, > right? (QOF or DALET or occasionally HET) That would keep some > consistency to the spelling. On the other hand, the spelling with a > QOF is already well entrenched in texts all over the internet. But > maybe it isn't right. And what about the use of ??or ??for the > tetragrammaton? Are they both HEHs, one "altered", or is one really a > DALET? Any thoughts? > > (and seriously, what to do about all those tetragrammaton symbols?) > > ~mark > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Thu Mar 26 10:18:09 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 26 Mar 2015 15:18:09 +0000 (GMT) Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <551373BB.70200@kli.org> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> Message-ID: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> > Blocks of boring plain text, no italics or effects any more complex than justification, simple notes written all in one font with no formatting to speak of etc. I am wondering if it is considered a good idea to define into Plane 14 some formatting characters, so that plain text could in the future contain italics and so on. For example, written here with an asterisk included as I seem to remember that that is the convention so as to avoid a suggested new character being mistaken as an existing character, how about the following. 
*U+E1000 FORMAT NOT ITALICS

*U+E1001 FORMAT ITALICS

*U+E1002 FORMAT NOT BOLD

*U+E1003 FORMAT BOLD

Traditionally such a suggestion would be refuted as out of scope for plain text: use of markup would be suggested.

Yet that was then, this is now: ideas of what can, or should, be encoded in plain text have changed with time and could usefully continue to change where that is of use to consumers.

I have often wondered why use of markup is regarded as such a requirement when the capabilities of plain text could so easily be enhanced. Expanding the capabilities of plain text would increase interoperability.

William Overington

26 March 2015
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From asmusf at ix.netcom.com Thu Mar 26 11:41:08 2015
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Thu, 26 Mar 2015 09:41:08 -0700
Subject: Plain text (from Re: Avoidance variants)
In-Reply-To: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
Message-ID: <551436A4.6030304@ix.netcom.com>

An HTML attachment was scrubbed...
URL:

From Shawn.Steele at microsoft.com Thu Mar 26 11:42:04 2015
From: Shawn.Steele at microsoft.com (Shawn Steele)
Date: Thu, 26 Mar 2015 16:42:04 +0000
Subject: Plain text (from Re: Avoidance variants)
In-Reply-To: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
Message-ID:

> I am wondering if it is considered a good idea to define into Plane 14 some formatting characters, so that plain text could in the future contain italics and so on.

No, it would not then be "plain text"

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From textexin at xencraft.com Thu Mar 26 15:16:23 2015
From: textexin at xencraft.com (Tex Texin)
Date: Thu, 26 Mar 2015 13:16:23 -0700
Subject: Plain text (from Re: Avoidance variants)
In-Reply-To:
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost>
Message-ID: <00b901d06801$bba76470$32f62d50$@xencraft.com>

Or to put it another way, you are inventing what is essentially another markup language. If you are going to tag text with styling, why not just use one of the many existing markup schemes used in HTML, bulletin boards, wikis, etc.?

tex

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Shawn Steele
Sent: Thursday, March 26, 2015 9:42 AM
To: wjgo_10009 at btinternet.com; mark at kli.org; unicode at unicode.org; lang.support at gmail.com
Subject: RE: Plain text (from Re: Avoidance variants)

> I am wondering if it is considered a good idea to define into Plane 14 some formatting characters, so that plain text could in the future contain italics and so on.

No, it would not then be "plain text"

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From rick at unicode.org Thu Mar 26 15:34:30 2015
From: rick at unicode.org (Rick McGowan)
Date: Thu, 26 Mar 2015 13:34:30 -0700
Subject: IUC 39 call for participation - abstract submission reminder
Message-ID: <55146D56.1010905@unicode.org>

Hi everyone,

Just a quick reminder that the Call for participation in IUC #39 is now open, and the deadline for submitting an abstract is coming up quickly: April 3.

All the information is here, on the conference website:
http://www.unicodeconference.org/

The conference itself is October 26-28, in Santa Clara.

Watch this space in upcoming weeks for further announcements...
Regards, Rick From timothy at greenwood.name Thu Mar 26 16:07:46 2015 From: timothy at greenwood.name (Tim Greenwood) Date: Thu, 26 Mar 2015 21:07:46 +0000 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: Many years ago, in the initial days of Unicode development, I discussed it with a colleague at DEC. His response was that Unicode would become weighed down with all sorts of junk getting added to it. Over twenty years later Unicode's huge success comes from not having done that. It is not going to start now. - Tim Greenwood On Thu, Mar 26, 2015 at 11:37 AM William_J_G Overington < wjgo_10009 at btinternet.com> wrote: > > Blocks of boring plain text, no italics or effects any more complex than > justification, simple notes written all in one font with no formatting to > speak of etc. > > > I am wondering if it is considered a good idea to define into Plane 14 > some formatting characters, so that plain text could in the future contain > italics and so on. > > > For example, written here with an asterisk included as I seem to remember > that that is the convention so as to avoid a suggested new character being > mistaken as an existing character, how about the following. > > > *U+E1000 FORMAT NOT ITALICS > > > *U+E1001 FORMAT ITALICS > > > *U+E1002 FORMAT NOT BOLD > > > *U+E1003 FORMAT BOLD > > > Traditionally such a suggestion would be refuted as out of scope for plain > text: use of markup would be suggested. > > > Yet that was then, this is now: ideas of what can, or should, be encoded > in plain text have changed with time and could usefully continue to change > where that is of use to consumers. > > > I have often wondered why use of markup is regarded as such a requirement > when the capabilities of plain text could so easily be enhanced. 
Expanding > the capabilities of plain text would increase interoperability. > > > William Overington > > > 26 March 2015 > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu Mar 26 18:20:18 2015 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 26 Mar 2015 19:20:18 -0400 Subject: Avoidance variants In-Reply-To: <004601d06783$b2d3b520$187b1f60$@gmail.com> References: <55136F5F.6080808@kli.org> <004601d06783$b2d3b520$187b1f60$@gmail.com> Message-ID: <55149432.5050203@kli.org> On 03/26/2015 01:14 AM, Jonathan Rosenne wrote: > > ?It's still a HEH, it just looks like another letter, right?? Wrong. > It?s a QOF. Just like the p in receipt is a p. Unicode should not > concern itself with the reasons words are spelt the way they are spelt. > Good enough point. And I suppose when people were setting the type, they weren't thinking "this is a HEH, I'm just putting a QOF there"; they were reaching for the QOF box. And all the text online also supports this point of view. I guess I was just tinkering with some Hebrew fonts and experimenting with making these kinds of "variants" so that the same text could, say, be printed both "formally" and "informally". But that really makes more sense as Stylistic Alternates or something, not encoded. ~mark > Best Regards, > > Jonathan Rosenne > > *From:*Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of > *Mark E. Shoulson > *Sent:* Thursday, March 26, 2015 4:31 AM > *To:* unicode at unicode.org > *Subject:* Avoidance variants > > So, not much in the way of discussion regarding the TETRAGRAMMATON > issue I raised the other week. OK; someone'll eventually get to it I > guess. > > Another thing I was thinking about, while toying with Hebrew fonts. 
> Often, letters are substituted in _nomina sacra_ in order to avoid > writing a holy name, much as the various symbols for the > tetragrammaton are used. And indeed, sometimes they're used in that > name too, as I mentioned, usages like ????or ????and so on. There's > an example in the paper that shows ?????instead of ?????. Much more > common today would be ?????and in fact people frequently even > pronounce it that way (when it refers to big-G God, in non-sacred > contexts. But for little-g gods, the same word is pronounced without > the avoidance, because it isn't holy. It's weird.) > > I wonder if it makes sense maybe to encode not a codepoint, but a > variant sequence(s) to represent this sort of "defaced" or "altered" > letter HEH. It's still a HEH, it just looks like another letter, > right? (QOF or DALET or occasionally HET) That would keep some > consistency to the spelling. On the other hand, the spelling with a > QOF is already well entrenched in texts all over the internet. But > maybe it isn't right. And what about the use of ??or ??for the > tetragrammaton? Are they both HEHs, one "altered", or is one really a > DALET? Any thoughts? > > (and seriously, what to do about all those tetragrammaton symbols?) > > ~mark > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu Mar 26 18:27:25 2015 From: mark at kli.org (Mark E. 
Shoulson) Date: Thu, 26 Mar 2015 19:27:25 -0400 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: <551495DD.8000308@kli.org> On 03/26/2015 11:18 AM, William_J_G Overington wrote: > > Blocks of boring plain text, no italics or effects any more complex > than justification, simple notes written all in one font with no > formatting to speak of etc. > > > I am wondering if it is considered a good idea to define into Plane 14 > some formatting characters, so that plain text could in the future > contain italics and so on. And we could define "plain water" to include sugar and flavorings, and have Coke run out of our taps. But that isn't "plain water" anymore. And yes, we DO allow some additives in water and still call it "plain", even as we do have some formatting characters in Unicode and call it plain text (e.g. tab, formfeed, ZWJ, RLO, PDF, etc) Alternatively, you could say we already have such things encodable as plain text, using character sequences, like U+003C U+0069 U+003E to indicate "BEGIN ITALICS", etc... Just need the right reader... ~mark > > > For example, written here with an asterisk included as I seem to > remember that that is the convention so as to avoid a suggested new > character being mistaken as an existing character, how about the > following. > > > *U+E1000 FORMAT NOT ITALICS > > > *U+E1001 FORMAT ITALICS > > > *U+E1002 FORMAT NOT BOLD > > > *U+E1003 FORMAT BOLD > > > Traditionally such a suggestion would be refuted as out of scope for > plain text: use of markup would be suggested. > > > Yet that was then, this is now: ideas of what can, or should, be > encoded in plain text have changed with time and could usefully > continue to change where that is of use to consumers. 
> > > I have often wondered why use of markup is regarded as such a > requirement when the capabilities of plain text could so easily be > enhanced. Expanding the capabilities of plain text would increase > interoperability. > > > William Overington > > > 26 March 2015 > > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Mar 26 19:01:20 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 26 Mar 2015 17:01:20 -0700 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: Exact semantics of formatting characters aside, it is best to define plain text as a stateless stream. The characters you're proposing require a decoder to keep state, therefore they won't do. At most you may ask for *U+E1001 COMBINING ITALICIZER *U+E1003 COMBINING BOLDIFIER after all, we already have U+0332 COMBINING LOW LINE and U+0336 COMBINING LONG STROKE OVERLAY for and resp, thus adding their counterparts for and will merely complete the set. Leo On Thu, Mar 26, 2015 at 8:18 AM, William_J_G Overington < wjgo_10009 at btinternet.com> wrote: > > Blocks of boring plain text, no italics or effects any more complex than > justification, simple notes written all in one font with no formatting to > speak of etc. > > > I am wondering if it is considered a good idea to define into Plane 14 > some formatting characters, so that plain text could in the future contain > italics and so on. 
> > > For example, written here with an asterisk included as I seem to remember > that that is the convention so as to avoid a suggested new character being > mistaken as an existing character, how about the following. > > > *U+E1000 FORMAT NOT ITALICS > > > *U+E1001 FORMAT ITALICS > > > *U+E1002 FORMAT NOT BOLD > > > *U+E1003 FORMAT BOLD > > > Traditionally such a suggestion would be refuted as out of scope for plain > text: use of markup would be suggested. > > > Yet that was then, this is now: ideas of what can, or should, be encoded > in plain text have changed with time and could usefully continue to change > where that is of use to consumers. > > > I have often wondered why use of markup is regarded as such a requirement > when the capabilities of plain text could so easily be enhanced. Expanding > the capabilities of plain text would increase interoperability. > > > William Overington > > > 26 March 2015 > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at kli.org Thu Mar 26 22:10:46 2015 From: mark at kli.org (Mark E. Shoulson) Date: Thu, 26 Mar 2015 23:10:46 -0400 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: <5514CA36.8060704@kli.org> On 03/26/2015 08:01 PM, Leo Broukhis wrote: > Exact semantics of formatting characters aside, it is best to define > plain text as a stateless stream. Well, not strictly true. Or at least, Unicode text is not quite stateless. We have these directional overrides and embeddings and isolates... The embeddings can even be nested. And turning national digit shaping on and off, etc. Statelessness looks more like an ideal; the current reality already violates it. 
~mark From shervinafshar at gmail.com Thu Mar 26 22:34:43 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 26 Mar 2015 20:34:43 -0700 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <5514CA36.8060704@kli.org> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <5514CA36.8060704@kli.org> Message-ID: On Thu, Mar 26, 2015 at 8:10 PM, Mark E. Shoulson wrote: > Statelessness looks more like an ideal; the current reality already > violates it. The question is whether that ideal is "violated" because of choice or out of necessity; bidi-related stateful format codes specifically seem like a case of much needed addition to me. And no...I do not see any need to have **bold** and _italic_ modifiers in Unicode. ? Shervin On Thu, Mar 26, 2015 at 8:10 PM, Mark E. Shoulson wrote: > On 03/26/2015 08:01 PM, Leo Broukhis wrote: > >> Exact semantics of formatting characters aside, it is best to define >> plain text as a stateless stream. >> > Well, not strictly true. Or at least, Unicode text is not quite > stateless. We have these directional overrides and embeddings and > isolates... The embeddings can even be nested. And turning national digit > shaping on and off, etc. Statelessness looks more like an ideal; the > current reality already violates it. > > ~mark > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From everson at evertype.com Fri Mar 27 05:31:15 2015 From: everson at evertype.com (Michael Everson) Date: Fri, 27 Mar 2015 11:31:15 +0100 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: On 27 Mar 2015, at 01:01, Leo Broukhis wrote: > Exact semantics of formatting characters aside, it is best to define plain text as a stateless stream. The characters you're proposing require a decoder to keep state, therefore they won't do. At most you may ask for > *U+E1001 COMBINING ITALICIZER > *U+E1003 COMBINING BOLDIFIER COMBINING EMBOLDENER, surely. ;-) Michael Everson * http://www.evertype.com/ From neil at tonal.clara.co.uk Fri Mar 27 09:47:53 2015 From: neil at tonal.clara.co.uk (Neil Harris) Date: Fri, 27 Mar 2015 14:47:53 +0000 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <551495DD.8000308@kli.org> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <551495DD.8000308@kli.org> Message-ID: <55156D99.3070700@tonal.clara.co.uk> On 26/03/15 23:27, Mark E. Shoulson wrote: > On 03/26/2015 11:18 AM, William_J_G Overington wrote: >> > Blocks of boring plain text, no italics or effects any more complex >> than justification, simple notes written all in one font with no >> formatting to speak of etc. >> >> >> I am wondering if it is considered a good idea to define into Plane >> 14 some formatting characters, so that plain text could in the future >> contain italics and so on. > > And we could define "plain water" to include sugar and flavorings, and > have Coke run out of our taps. But that isn't "plain water" anymore. > And yes, we DO allow some additives in water and still call it > "plain", even as we do have some formatting characters in Unicode and > call it plain text (e.g. 
tab, formfeed, ZWJ, RLO, PDF, etc) > > Alternatively, you could say we already have such things encodable as > plain text, using character sequences, like U+003C U+0069 U+003E to > indicate "BEGIN ITALICS", etc... Just need the right reader... > > ~mark Or you could just redefine "&" and "<" as U+0026 START HTML ENTITY and U+003C START HTML TAG and be done with it, and just incorporate HTML5 into Unicode forever, thus eliminating these discussions from this list, and moving them to the W3C and WHATWG lists... -- Neil From wjgo_10009 at btinternet.com Fri Mar 27 08:00:09 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 27 Mar 2015 13:00:09 +0000 (GMT) Subject: Plain text (from Re: Avoidance variants) In-Reply-To: References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> Message-ID: <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> >> Exact semantics of formatting characters aside, it is best to define plain text as a stateless stream. The characters you're proposing require a decoder to keep state, therefore they won't do. At most you may ask for *U+E1001 COMBINING ITALICIZER *U+E1003 COMBINING BOLDIFIER after all, we already have U+0332 COMBINING LOW LINE and U+0336 COMBINING LONG STROKE OVERLAY for and resp, thus adding their counterparts for and will merely complete the set. > COMBINING EMBOLDENER, surely. ;-) So, if that were implemented, then to typeset, say, the word astrolabe within a plain text file, in italics, one would need to use nine instances of the COMBINING ITALICIZER, one instance after each letter of the word astrolabe. That would be fine and the characters discussed would be, in my opinion, two useful additions to Unicode. If a two word phrase were to be typeset within a plain text file then each letter of the two words would need to have an instance of the COMBINING ITALICIZER after each letter of the word. 
Would one add an instance after the space character that is between the words?

William Overington

27 March 2015
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Fri Mar 27 11:21:20 2015
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 27 Mar 2015 09:21:20 -0700
Subject: Plain text (from Re: Avoidance variants)
Message-ID: <20150327092119.665a7a7059d7ee80bb4d670165c8327d.04ce2bd3e6.wbe@email03.secureserver.net>

So one of the concerns I have is the implication that "interesting" styling comprises:

1. bold
2. italic

If formatting characters were encoded to support these two styling options, right away there would be calls to expand the set with:

3. underlining
4. strikeout
5. superscript
6. subscript
7. font face
8. font size
9. font color
10. highlighting
11. character spacing
12. line spacing
13. margins
14. page size
15. all the other styling options that word processors provide

The proposed, simplified solution doesn't scale.

It's the same as one of the concerns I have with encoding localizable sentences as characters. There aren't 20 or 50 or 100 sentences that people might want localized, but crores of them.

--
Doug Ewell | http://ewellic.org | Thornton, CO
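
To make the per-letter approach discussed above a little more concrete, here is a minimal Python sketch. It uses only characters that already exist: U+0332 COMBINING LOW LINE, mentioned earlier in the thread, and the italic letters of the Mathematical Alphanumeric Symbols block. The function names underline_each and math_italic are invented for this illustration, and the suggested COMBINING ITALICIZER is not used because no such character is encoded.

```python
import unicodedata

COMBINING_LOW_LINE = "\u0332"


def underline_each(text):
    """Append U+0332 after every character, spaces included."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)


def math_italic(text):
    """Map ASCII letters to Mathematical Italic letters; leave everything else alone."""
    out = []
    for ch in text:
        if ch == "h":
            # U+1D455 is reserved; the italic small h is U+210E PLANCK CONSTANT.
            out.append("\u210E")
        elif "A" <= ch <= "Z":
            out.append(chr(0x1D434 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D44E + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)


word = "astrolabe"
marked = underline_each(word)
print(len(word), len(marked))            # code point counts before and after marking
print(marked.count(COMBINING_LOW_LINE))  # one combining mark per letter
print(word in marked)                    # a plain search for the word no longer matches
print(unicodedata.name(math_italic("a")))
```

Under these assumptions the marked string doubles in length, and a plain search for the original word fails unless the marks are stripped first, which is one practical cost of carrying styling in the character stream.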
From asmus-inc at ix.netcom.com Fri Mar 27 11:25:42 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 27 Mar 2015 09:25:42 -0700
Subject: Plain text (from Re: Avoidance variants)
In-Reply-To: <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost>
Message-ID: <55158486.4070306@ix.netcom.com>

On 3/27/2015 6:00 AM, William_J_G Overington wrote:
>
> So, if that were implemented, then to typeset, say, the word astrolabe
> within a plain text file, in italics, one would need to use nine
> instances of the COMBINING ITALICIZER, one instance after each letter
> of the word astrolabe.
>
> That would be fine and the characters discussed would be, in my
> opinion, two useful additions to Unicode.

Just because you seem not to get this: those would not only be useless, they would lead to dual encoding, a problem that needs to be avoided, not added to.

With the Math Alphanumerics, Unicode decided not to use the combining mark approach, but to encode individual characters in their italic/bold shapes, for use when isolated letters need to have that appearance as part of technical notations.

In adopting that solution, nearly fifteen years ago, Unicode effectively set a precedent against the kind of character you are proposing.

As you apparently find it difficult to understand this, let me spell it out for you: any proposal for adding these kinds of characters is dead on arrival.

A./

From michaelanortonster at gmail.com Fri Mar 27 11:31:08 2015
From: michaelanortonster at gmail.com (Michael Norton)
Date: Fri, 27 Mar 2015 12:31:08 -0400
Subject: Usage stats?
Message-ID:

Hello and thank you for an incredible service (just joining the list). Is there a list of usage statistics per character of the Unicode set available somewhere?

Cheers,

--
Michael A. Norton, B.A.
Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." From wjgo_10009 at btinternet.com Fri Mar 27 10:15:58 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 27 Mar 2015 15:15:58 +0000 (GMT) Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <55156D99.3070700@tonal.clara.co.uk> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <551495DD.8000308@kli.org> <55156D99.3070700@tonal.clara.co.uk> Message-ID: <28234536.41082.1427469358202.JavaMail.defaultUser@defaultHost> > Or you could just redefine "&" and "<" as U+0026 START HTML ENTITY and U+003C START HTML TAG and be done with it, and just incorporate HTML5 into Unicode forever, thus eliminating these discussions from this list, and moving them to the W3C and WHATWG lists... ---- That encapsulates what I do not like about using markup other than in very precise limited circumstances such as designing a web page. The characters have defined meanings in Unicode: HTML changes those meanings for the purpose of writing web page source code. That use should not act as an Aunt Sally argument for stopping the addition of additional Unicode characters into regular Unicode. Adding some additional characters for producing italics, bold and maybe colour as well into regular Unicode, so that the facilities available for use in plain text format are extended, would, in my opinion, be a good thing. HTML is HTML. There we are. The existence of HTML should not, in my opinion, stop development of the capabilities of plain text. William Overington 27 March 2015 From kenwhistler at att.net Fri Mar 27 12:07:18 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 27 Mar 2015 10:07:18 -0700 Subject: Usage stats?
In-Reply-To: References: Message-ID: <55158E46.5010004@att.net> Search engine companies (and in particular, Google) have such information squirreled away in their index databases, at least as far as usage stats for Unicode characters on the web go -- but it is proprietary information, and they generally don't publish information about such statistics. Perhaps there are researchers out there who have set web crawlers on a mission to generate such web statistics for publication, and maybe somebody on this list knows of such research -- but it would be virtually impossible to generate such information for the much wider collection of documents and data that are not easily accessible for web indexing. (Behind password walls, in PDF document archives, in proprietary databases, ... ) As an example of why this is a problem, consider the fact that there are *peta*bytes of information picked up and stored in databases from scanners and other devices used at tens of millions of retail points of sale. Such data, by its nature, would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its character data. How would you end up weighting such (mostly publicly inaccessible) data in trying to count up for overall statistics on character use? There are more traditional usage count studies that focus on counts of character frequency within single language orthographies in single scripts (e.g., letter frequencies for French text), but I don't think that is what you were asking about. Here is some discussion of a similar question posted on Stack Overflow: http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics --Ken On 3/27/2015 9:31 AM, Michael Norton wrote: > Hello and thank you for an incredible service (just joining the list). > Is there a list of usage statistics per character of the Unicode set > available somewhere?
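The single-corpus character counts mentioned above (as opposed to global usage statistics, which nobody on the list claims to have) can be sketched in a few lines of Python. This is only a minimal illustration: the function names char_usage and report, the two sample sentences, and the decision to normalize to NFC before counting are all assumptions made for this sketch, not anything taken from the thread or from a Unicode data file.

```python
from collections import Counter
import unicodedata


def char_usage(texts):
    """Count how often each code point occurs across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Normalize to NFC so precomposed and decomposed spellings of the
        # same letter are counted together (a choice made for this sketch).
        counts.update(unicodedata.normalize("NFC", text))
    return counts


def report(counts, top=5):
    """Return (code point label, character name, relative frequency) rows."""
    total = sum(counts.values())
    rows = []
    for ch, n in counts.most_common(top):
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", name, n / total))
    return rows


# Hypothetical two-sentence "corpus"; any real study would need far more text.
sample = ["Je veux le faire.", "Ça dépend du corpus."]
for row in report(char_usage(sample)):
    print(row)
```

The result only describes the text that was fed in, which is the weighting problem raised in the message: a corpus of retail scanner data and a corpus of French prose would give very different tables.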
> > From michaelanortonster at gmail.com Fri Mar 27 13:04:06 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 14:04:06 -0400 Subject: magnetic limit Message-ID: A little more on what I had posted earlier today--> Today's periodic table consists of melting, boiling, and freezing points for elements; this magnetic point may now be added for each element in order to identify characteristics of a given volume for practitioners across which, via surface distributions per Maxwell's equations, transformations can positively occur. The synchronicity between the 99th & 1st elements appear evidently. On Wed, Mar 25, 2015 at 8:28 AM, Michael Norton < michaelanortonster at gmail.com> wrote: > spreadsheet attached > > On Wed, Mar 25, 2015 at 8:00 AM, Michael Norton < > michaelanortonster at gmail.com> wrote: > >> Dear Prof. Haynes, >> >> In a little under the past day's time, it had been bothering me in the >> back of my mind that at least here in the States there is a tremendous need >> for education reform; it seems that while tuition rates continue to >> increase, entropy in the formalism does as well. For example, recently I >> looked at an Honors Calculus book. It looked as if it had been packed >> together by a madman, without particular attention to practicality and >> applied science. >> >> Regardless of that book, the Lord (if you believe in one*) works in >> mysterious ways and, sure enough, this morning I decided to run the weights >> on the periodic table against my newfound love, this magnetic limit >> determined via Stephen's equation and my own. What I've found is a >> beautiful symmetry that only Nature can best express, though the formalized >> math here is nearly as equally impressive, and formally so. For example, >> our 99th element Einsteinium simplistically represents both the Hawking >> magnetic limit and Element #1, Hydrogen. Finding this is feeling as if >> God's answered my prayers within a day's thought. 
>> >> I cannot tell you how grateful I am for this to the movie industry--from >> supply chain production start to exhibition marketing end--for bringing >> forth Stephen's equation and putting it on my chest less than half a year >> ago now. It has opened up a growth spurt I have not felt since >> adolescence. Of course I am as grateful for you and the expression writer >> himself, Mr. Hawking. That the same determinations may be used to cure >> cancer and other ailments is icing on the cake. >> >> I hope to meet you soon; trying to emigrate as best I can and perhaps >> find a way to establish a dual citizenship. >> >> *I'm being politically correct here. >> >> Best, >> >> -- >> >> Michael A. Norton, B.A. Cinema, M.P.A. >> My Cinema Home: http://www.NortonsNook.com >> >> "All great actors are mere mathematical masters of speech and the human >> body." -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body."
From nospam-abuse at ilyaz.org Fri Mar 27 13:07:25 2015 From: nospam-abuse at ilyaz.org (Ilya Zakharevich) Date: Fri, 27 Mar 2015 11:07:25 -0700 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> Message-ID: <20150327180725.GA9968@math.berkeley.edu> On Fri, Mar 27, 2015 at 01:00:09PM +0000, William_J_G Overington wrote: > >> Exact semantics of formatting characters aside, it is best to define plain text as a stateless stream. The characters you're proposing require a decoder to keep state, therefore they won't do. At most you may ask for > *U+E1001 COMBINING ITALICIZER > *U+E1003 COMBINING BOLDIFIER > after all, we already have U+0332 COMBINING LOW LINE and U+0336 COMBINING LONG STROKE OVERLAY for <u> and <s> resp., thus adding their counterparts for <i> and <b> will merely complete the set. > > COMBINING EMBOLDENER, surely. ;-) > So, if that were implemented, then to typeset, say, the word astrolabe within a plain text file, in italics, one would need to use nine instances of the COMBINING ITALICIZER, one instance after each letter of the word astrolabe. > That would be fine and the characters discussed would be, in my opinion, two useful additions to Unicode. > If a two word phrase were to be typeset within a plain text file then each letter of the two words would need to have an instance of the COMBINING ITALICIZER after each letter of the word. Would one add an instance after the space character that is between the words? > 27 March 2015 Guys, it is just a 4-day wait. Then we can discuss the last question in depth until a consensus is reached!
Ilya From kenwhistler at att.net Fri Mar 27 13:32:30 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 27 Mar 2015 11:32:30 -0700 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <28234536.41082.1427469358202.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <551495DD.8000308@kli.org> <55156D99.3070700@tonal.clara.co.uk> <28234536.41082.1427469358202.JavaMail.defaultUser@defaultHost> Message-ID: <5515A23E.2020000@att.net> On 3/27/2015 8:15 AM, William_J_G Overington wrote: >> Or you could just redefine "&" and "<" as > ---- > > That encapsulates what I do not like about using markup other than in very precise limited circumstances such as designing a web page. > > The characters have defined meanings in Unicode: HTML changes those meanings for the purpose of writing web page source code. This represents a fundamental misunderstanding of what Unicode character encoding is all about. I realize that William is unlikely to be deterred from his project to incorporate various functions into what he conceives of as plain text, but in hopes of preventing other folk from following him down the garden path, let's consider this scenario further. The Unicode Standard specifies the character encoding for: U+003C "<" LESS-THAN SIGN U+003E ">" GREATER-THAN SIGN That specification clearly *identifies* the characters and their code points. The code charts give the representative glyphs, to help in the identification. And the Unicode Character Database provides precise specification of character properties for these characters (as for all others), to assist in uniform and correct implementations. What the Unicode Standard does *not* do is define the *meanings* of these characters, in the sense of their meaning in use. 
That is entirely up to the people who use them, and more particularly, to people or agencies or committees or whoever decides to apply such characters in particular orthographies, formal syntax definitions, conventions, or whatever. Examples: 1. if a < b and c > 0 then ac < bc Here we have a simple algebraic expression, with ">" meaning 'is greater than' and "<" meaning 'is less than'. Talk to the mathematicians for exact meaning and usage. 2. <i>a</i> Here we have the "<" and ">" being used as start and end markers of tags in a markup scheme for text. Furthermore, the entire strings "<i>" and "</i>" have further defined meaning as start and end of italic style runs. Talk to W3C for exact meaning and usage. 3. ==> look here <== Here we have a common ASCII plain text convention for use of "<" and ">" as arrowheads for constructed arrows. Talk to... well, whoever writes plain text email these days for exact meaning and usage. 4. Following is some quoted plain text email: > -R > >> >> -- Ken >> >> On Dec 7, 2011, at 6:41 AM, Richard COOK wrote: >> >>> On Dec 6, 2011, at 12:19 PM, Ken Lunde wrote: >>> >>>> Richard, Here we have another common ASCII plain text convention for use of ">" -- but this time it indicates both quotation and indentation. Repetition of use of the ">" indicates repeated re-quotation and further indentation. Talk to the implementers of plain text email clients for exact meaning and usage. 5. cout << "hello!" ; Here we have an instance from C++ program text, where two "<" in sequence represent a streaming operator. Talk to the documenters of the C++ standard for exact meaning and usage. 6. template Here we have a *different* instance from C++ program text, which looks a little like HTML tags, but is not. In this case we are using "<" and ">" again as paired delimiters ("angle brackets"), but the syntax and interpretation are distinct. This is not a "tag". Talk to the documenters of the C++ standard for exact meaning and usage. 7.
This is a convention used in email and other contexts, where 003C and 003E are used as paired delimiters (angle brackets) to mark off an email address or a URL. This might look like the HTML usage, but it isn't. This isn't a tag. Talk to the implementers of email clients and similar software for exact meaning and usage. 8. Jean a dit : << Je veux le faire. >> Oops, here we have something different again. This is a *substitution* use of 003C and 003E to emulate proper French guillemet punctuation marks. Poor guy doesn't have guillemets on his keyboard -- what is he gonna do?! I'm sure people could come up with many other examples in this vein. The point of the long-winded exemplification is that characters "mean" what people use them to "mean". As long as the *identity* of the character is not in question and the code points are correctly used and transmitted, then the plain text conformance requirements of the Unicode Standard have been met. And this is precisely as it should be. Just as it is not the business of the Unicode Standard to dictate to anyone how they should spell text, it is also not the business of the Unicode Standard to limit or otherwise constrain what conventions of interpretation and/or what additional layers of syntactic complexity (whether mathematics, markup, or anything else) people build on top of text characters. > > That use should not act as an Aunt Sally argument for stopping the addition of additional Unicode characters into regular Unicode. > > Adding some additional characters for producing italics, bold and maybe colour as well into regular Unicode, so that the facilities available for use in plain text format are extended, would, in my opinion, be a good thing. It would be a bad thing. As Asmus has noted in this thread, proposals of this ilk are dead on arrival at the UTC, because they do not understand the appropriate layering of text processing.
Just because a distinction is *in* text, does not mean that it should be, ipso facto, defined in plain text or encoded in characters. --Ken From michaelanortonster at gmail.com Fri Mar 27 14:57:57 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 15:57:57 -0400 Subject: Usage stats? In-Reply-To: <55158E46.5010004@att.net> References: <55158E46.5010004@att.net> Message-ID: Why wouldn't Unicode itself have it? -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." From doug at ewellic.org Fri Mar 27 15:15:17 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 27 Mar 2015 13:15:17 -0700 Subject: Usage stats? Message-ID: <20150327131517.665a7a7059d7ee80bb4d670165c8327d.ea3c46ecda.wbe@email03.secureserver.net> Michael Norton wrote: > Why wouldn't Unicode itself have it? Probably because the Unicode Consortium isn't responsible for indexing the entire web. Would you expect it to be? -- Doug Ewell | http://ewellic.org | Thornton, CO From michaelanortonster at gmail.com Fri Mar 27 15:16:06 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 16:16:06 -0400 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> Message-ID: (I know this is way too simplistic a response but it is kind of like giving everyone an invisible cloak and an invisible dagger and not telling them what a cloak and dagger is for [cutting butter & keeping warm]). On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton < michaelanortonster at gmail.com> wrote: > Why wouldn't Unicode itself have it?
-- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." From john at mitre.org Fri Mar 27 15:23:17 2015 From: john at mitre.org (John D. Burger) Date: Fri, 27 Mar 2015 16:23:17 -0400 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> Message-ID: <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> On Mar 27, 2015, at 15:57, Michael Norton wrote: > Why wouldn't Unicode itself have it? Because as Ken explained, acquiring (and constantly updating) such statistics would require roughly the effort that Google puts into its crawler. And it wouldn't include all the printed material that isn't on the web. Turning your question around, why would Unicode have this information? What would be the value, and how would it be worth the (considerable) effort required? - John Burger MITRE
From michaelanortonster at gmail.com Fri Mar 27 15:24:48 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 16:24:48 -0400 Subject: Usage stats? In-Reply-To: <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: Just using the tools and formulations we have at present ought to allow Unicode to produce a usage set without indexing the entire web, which would provide implementors with an indication of variances for traffic, overflow, and override purposes relative to users of the standard. If the figure varies significantly from page:website, website:region, region:language, for example, it simplifies our ability to standardize the set. I have particular concerns, but, like Google, they are proprietary. On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger wrote: > On Mar 27, 2015, at 15:57, Michael Norton wrote: > > Why wouldn't Unicode itself have it? > > > Because as Ken explained, acquiring (and constantly updating) such > statistics would require roughly the effort that Google puts into its > crawler. And it wouldn't include all the printed material that isn't on the > web. > > Turning your question around, why would Unicode have this information? > What would be the value, and how would it be worth the (considerable) > effort required?
-- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." From michaelanortonster at gmail.com Fri Mar 27 15:27:26 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 16:27:26 -0400 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: Easy example: what's the code for [blank space] U+0020 across all language sets of Unicode? Is it the same, i.e., 100%? On Fri, Mar 27, 2015 at 4:24 PM, Michael Norton < michaelanortonster at gmail.com> wrote: > Just using the tools and formulations we have at present ought to allow > Unicode to produce a usage set without indexing the entire web which would > provide implementors with an indication of variances for traffic, overflow, > and override purposes relative to users of the standard. If the figure > varies significantly from page:website, website:region, region:language, > for example, it simplifies our ability to standardize the set. > > I have particular concerns, but, like Google, they are proprietary.
-- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." From addison at lab126.com Fri Mar 27 15:40:35 2015 From: addison at lab126.com (Phillips, Addison) Date: Fri, 27 Mar 2015 20:40:35 +0000 Subject: Usage stats?
In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB52ECE3319@ex10-mbx-9007.ant.amazon.com> What you might be looking for would be the CLDR project's "exemplar sets" (see for example [1]), which describes which characters are customarily used for a given language and which are sometimes used. However, this is not the same thing as statistical distribution. One of the points of Unicode is that any character can be used at any time in any document--regardless of language. [1] http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Michael Norton Sent: Friday, March 27, 2015 1:25 PM To: John D. Burger Cc: Vint Cerf; Unicode at unicode.org Subject: Re: Usage stats? Just using the tools and formulations we have at present ought to allow Unicode to produce a usage set without indexing the entire web which would provide implementors with an indication of variances for traffic, overflow, and override purposes relative to users of the standard. If the figure varies significantly from page:website, website:region, region:language, for example, it simplifies our ability to standardize the set. I have particular concerns, but, like Google, they are proprietary. On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger > wrote: On Mar 27, 2015, at 15:57 , Michael Norton > wrote: Why wouldn't Unicode itself have it? Because as Ken explained, acquiring (and constantly updating) such statistics would require roughly the effort that Google puts into its crawler. And it wouldn't include all the printed material that isn't on the web. Turning your question around, why would Unicode have this information? What would be the value, and how would it be worth the (considerable) effort required? 
- John Burger MITRE On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler > wrote: Search engine companies (and in particular, Google) have such information squirreled away in their index databases, at least as far as usage stats for Unicode characters on the web go -- but it is proprietary information, and they generally don't publish information about such statistics. Perhaps there are researchers out there who have set web crawlers on a mission to generate such web statistics for publication, and maybe somebody on this list knows of such research -- but it would be virtually impossible to generate such information for the much wider collection of documents and data that are not easily accessible for web indexing. (Behind password walls, in pdf document archives, in proprietary databases, ... ) As an example of why this is a problem, consider the fact that there are *peta*bytes of information picked up and stored in databases from scanners and other devices used at tens of millions of retail points of sale. Such data, by its nature, would tend to skew heavily towards use of ASCII a-z and digits 0-9 in its character data. How would you end up weighting such (mostly publicly inaccessible) data in trying to count up for overall statistics on character use? There are more traditional usage count studies that focus on counts of character frequency within single language orthographies in single scripts (e.g., letter frequences for French text), but I don't think that is what you were asking about. Here is some discussion of a similar question posted on stackoverflow: http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics --Ken On 3/27/2015 9:31 AM, Michael Norton wrote: Hello and thank you for an incredible service (just joining the list). Is there a list of usage statistics per character of the Unicode set available somewhere? 
_______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." [Image removed by sender.] _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." [Image removed by sender.] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 594 bytes Desc: image001.jpg URL: From michaelanortonster at gmail.com Fri Mar 27 15:44:23 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 16:44:23 -0400 Subject: Usage stats? In-Reply-To: <7C0AF84C6D560544A17DDDEB68A9DFB52ECE3319@ex10-mbx-9007.ant.amazon.com> References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> <7C0AF84C6D560544A17DDDEB68A9DFB52ECE3319@ex10-mbx-9007.ant.amazon.com> Message-ID: Thank you. What's the count for "universal characters" at this time? Eg: [SP] On Fri, Mar 27, 2015 at 4:40 PM, Phillips, Addison wrote: > What you might be looking for would be the CLDR project's "exemplar > sets" (see for example [1]), which describes which characters are > customarily used for a given language and which are sometimes used. > However, this is not the same thing as statistical distribution. One of the > points of Unicode is that any character can be used at any time in any > document--regardless of language. 
> > > > > > [1] > http://www.unicode.org/cldr/charts/27/by_type/core_data.alphabetic_information.main.html > > > > *From:* Unicode [mailto:unicode-bounces at unicode.org] *On Behalf Of *Michael > Norton > *Sent:* Friday, March 27, 2015 1:25 PM > *To:* John D. Burger > *Cc:* Vint Cerf; Unicode at unicode.org > *Subject:* Re: Usage stats? > > > > Just using the tools and formulations we have at present ought to allow > Unicode to produce a usage set without indexing the entire web which would > provide implementors with an indication of variances for traffic, overflow, > and override purposes relative to users of the standard. If the figure > varies significantly from page:website, website:region, region:language, > for example, it simplifies our ability to standardize the set. > > > > I have particular concerns, but, like Google, they are proprietary. > > > > On Fri, Mar 27, 2015 at 4:23 PM, John D. Burger wrote: > > On Mar 27, 2015, at 15:57 , Michael Norton > wrote: > > > > Why wouldn't Unicode itself have it? > > > > Because as Ken explained, acquiring (and constantly updating) such > statistics would require roughly the effort that Google puts into its > crawler. And it wouldn't include all the printed material that isn't on the > web. > > > > Turning your question around, why would Unicode have this information? > What would be the value, and how would it be worth the (considerable) > effort required? > > > > - John Burger > > MITRE > > > > > > On Fri, Mar 27, 2015 at 1:07 PM, Ken Whistler wrote: > > Search engine companies (and in particular, Google) have such > information squirreled away in their index databases, at least as > far as usage stats for Unicode characters on the web go -- but it > is proprietary information, and they generally don't publish > information about such statistics. 
> > Perhaps there are researchers out there who have set web crawlers > on a mission to generate such web statistics for publication, and maybe > somebody on this list knows of such research -- but it would be > virtually impossible to generate such information for the much > wider collection of documents and data that are not easily accessible > for web indexing. (Behind password walls, in pdf document archives, > in proprietary databases, ... ) As an example of why this is a problem, > consider the fact that there are *peta*bytes of information picked up > and stored in databases from scanners and other devices used at > tens of millions of retail points of sale. Such data, by its nature, would > tend > to skew heavily towards use of ASCII a-z and digits 0-9 in its > character data. How would you end up weighting such (mostly > publicly inaccessible) data in trying to count up for overall statistics > on character use? > > There are more traditional usage count studies that focus on > counts of character frequency within single language orthographies > in single scripts (e.g., letter frequences for French text), but I don't > think that is what you were asking about. > > Here is some discussion of a similar question posted on stackoverflow: > > > http://stackoverflow.com/questions/22184624/unicode-character-usage-statistics > > --Ken > > On 3/27/2015 9:31 AM, Michael Norton wrote: > > Hello and thank you for an incredible service (just joining the list). > Is there a list of usage statistics per character of the Unicode set > available somewhere? > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > > -- > > > Michael A. Norton, B.A. Cinema, M.P.A. > > My Cinema Home: http://www.NortonsNook.com > > > > "All great actors are mere mathematical masters of speech and the human > body." > > [image: Image removed by sender.] 
> > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > > > > > > -- > > > Michael A. Norton, B.A. Cinema, M.P.A. > > My Cinema Home: http://www.NortonsNook.com > > > > "All great actors are mere mathematical masters of speech and the human > body." > > [image: Image removed by sender.] > > > > -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.jpg Type: image/jpeg Size: 594 bytes Desc: not available URL: From markus.icu at gmail.com Fri Mar 27 15:56:23 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Fri, 27 Mar 2015 13:56:23 -0700 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton < michaelanortonster at gmail.com> wrote: > Easy example: what's the code for [blank space] U+020 across all language > sets of Unicode? Is it the same ie: 100%? > I don't understand what you are asking, and I have a hunch you haven't said it in a way that anyone else understands it either. The code point value that the Unicode Standard assigns to the normal space is U+0020, but - not every language uses spaces - not every language that uses spaces uses them for the same purpose as English - there are some 30 other "space" characters in Unicode Statistics of character frequencies vary by corpus, as others have said. Even if you "only" look "on the web", that's undefined until you specify a crawling strategy. Dynamically generated content means that there is an infinite number of "web pages". Every crawler will come up with a different set. 
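To make the point about multiple space characters concrete, here is a small Python sketch that lists every code point whose General_Category is Zs (Space_Separator), using the standard `unicodedata` module. The helper name `space_separators` is invented for illustration. The exact list depends on the Unicode version bundled with the Python interpreter, and Zs is narrower than other notions of "space" (it excludes tabs and zero-width characters, for example), so the count it prints need not match the rough figure quoted above.

```python
import sys
import unicodedata


def space_separators():
    """Return all code points with General_Category Zs as (code, name) pairs."""
    found = []
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Zs":
            # Fall back to a placeholder if a character has no name.
            found.append((f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>")))
    return found


if __name__ == "__main__":
    # The result depends on the Unicode data shipped with this interpreter.
    print("Unicode data version:", unicodedata.unidata_version)
    for code, name in space_separators():
        print(code, name)
```

Counting how often each of these actually occurs still requires choosing a corpus, which is the harder problem discussed in this thread.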
Maybe you are asking about statistics of character encodings? On the web? Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.? markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaelanortonster at gmail.com Fri Mar 27 16:03:44 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 17:03:44 -0400 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: Doug Ewell's getting it. He sent this back to me, so I asked him if he could provide the same dataset drawn from his written reply to me: * For example, your original e-mail (327characters) consists of:U+0020 - 14.07%U+0065 - 10.09%U+0061 - 7.03%U+0074 - 6.73%U+006F - 5.81%* This is good because when the volumes of traffic begin to exponentially increase over a space, if there are predominant formulations of Unicode for each, they need to be recognized for a number of reasons depending on which sector or, as you say, corpus, you're in. In the above example, I think it's safe to say U+0020 online, though I would like to compare with the other 30 "space" characters you mentioned Markus. If I know traffic figures for where the other space characters are used, I can draw a pretty good estimation and correlation of it. On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer wrote: > On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton < > michaelanortonster at gmail.com> wrote: > >> Easy example: what's the code for [blank space] U+020 across all language >> sets of Unicode? Is it the same ie: 100%? >> > > I don't understand what you are asking, and I have a hunch you haven't > said it in a way that anyone else understands it either. 
> > The code point value that the Unicode Standard assigns to the normal space > is U+0020, but > - not every language uses spaces > - not every language that uses spaces uses them for the same purpose as > English > - there are some 30 other "space" characters in Unicode > > Statistics of character frequencies vary by corpus, as others have said. > Even if you "only" look "on the web", that's undefined until you specify a > crawling strategy. Dynamically generated content means that there is an > infinite number of "web pages". Every crawler will come up with a different > set. > > Maybe you are asking about statistics of character encodings? On the web? > Such as, Unicode vs. Shift-JIS vs. ISO 8859-2 etc.? > > markus > -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaelanortonster at gmail.com Fri Mar 27 16:23:11 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Fri, 27 Mar 2015 17:23:11 -0400 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: I'm trying to get a sense of the range and variance of the Unicode set in the same way I have with hypertext on the web: for every HTML or XHTML document URL, for example ,there is going to be a* >0* Minimum of* "<"* and* ">"* characters. Depending on which Markup set and schema(s) you are using, char-MIN's and (eventually) char-MAX's are useful to have. On Fri, Mar 27, 2015 at 5:03 PM, Michael Norton < michaelanortonster at gmail.com> wrote: > Doug Ewell's getting it. 
He sent this back to me, so I asked him if he > could provide the same dataset drawn from his written reply to me: > > > > > > > > > * For example, your original e-mail (327characters) consists of:U+0020 - > 14.07%U+0065 - 10.09%U+0061 - 7.03%U+0074 - 6.73%U+006F - 5.81%* > > This is good because when the volumes of traffic begin to exponentially > increase over a space, if there are predominant formulations of Unicode for > each, they need to be recognized for a number of reasons depending on which > sector or, as you say, corpus, you're in. > > In the above example, I think it's safe to say U+0020 online, though I > would like to compare with the other 30 "space" characters you mentioned > Markus. If I know traffic figures for where the other space characters > are used, I can draw a pretty good estimation and correlation of it. > > On Fri, Mar 27, 2015 at 4:56 PM, Markus Scherer > wrote: > >> On Fri, Mar 27, 2015 at 1:27 PM, Michael Norton < >> michaelanortonster at gmail.com> wrote: >> >>> Easy example: what's the code for [blank space] U+020 across all >>> language sets of Unicode? Is it the same ie: 100%? >>> >> >> I don't understand what you are asking, and I have a hunch you haven't >> said it in a way that anyone else understands it either. >> >> The code point value that the Unicode Standard assigns to the normal >> space is U+0020, but >> - not every language uses spaces >> - not every language that uses spaces uses them for the same purpose as >> English >> - there are some 30 other "space" characters in Unicode >> >> Statistics of character frequencies vary by corpus, as others have said. >> Even if you "only" look "on the web", that's undefined until you specify a >> crawling strategy. Dynamically generated content means that there is an >> infinite number of "web pages". Every crawler will come up with a different >> set. >> >> Maybe you are asking about statistics of character encodings? On the web? >> Such as, Unicode vs. Shift-JIS vs. 
ISO 8859-2 etc.? >> >> markus >> > > > > -- > > Michael A. Norton, B.A. Cinema, M.P.A. > My Cinema Home: http://www.NortonsNook.com > > "All great actors are mere mathematical masters of speech and the human > body." > > > > > -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Mar 27 16:32:24 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 27 Mar 2015 14:32:24 -0700 Subject: Usage stats? In-Reply-To: <55158E46.5010004@att.net> References: <55158E46.5010004@att.net> Message-ID: <5515CC68.9000402@efele.net> Would a corpus like Wikipedia or Project Gutenberg be appropriate for your purpose? Both are freely and easily accessible. Eric. From richard.wordingham at ntlworld.com Fri Mar 27 19:59:56 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 28 Mar 2015 00:59:56 +0000 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: <20150328005956.33afb1c5@JRWUBU2> On Fri, 27 Mar 2015 16:27:26 -0400 Michael Norton wrote: > Easy example: what's the code for [blank space] U+020 across all > language sets of Unicode? Is it the same ie: 100%? No. In China, U+3000 IDEOGRAPHIC SPACE, which is the appropriate ordinary intra-line white space character for use with ideographs, is also commonly used in scripts which elsewhere would use U+0020 SPACE. Richard. From prosfilaes at gmail.com Fri Mar 27 20:47:15 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 27 Mar 2015 18:47:15 -0700 Subject: Usage stats? 
In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: On Fri, Mar 27, 2015 at 2:03 PM, Michael Norton wrote: > > This is good because when the volumes of traffic begin to exponentially increase over a space, if there are predominant formulations of Unicode for each, they need to be recognized for a number of reasons depending on which sector or, as you say, corpus, you're in. Huh? > In the above example, I think it's safe to say U+0020 online, though I would like to compare with the other 30 "space" characters you mentioned Markus. If I know traffic figures for where the other space characters are used, I can draw a pretty good estimation and correlation of it. ASCII characters are the safest to use online (everyone supports them), except when they are the most dangerous (characters found outside ASCII can rarely be used for tag/SQL/code injection). If you want to know what people can display, look at the fonts that come with the OSes that you're interested in. There's interesting things you can do with this data, but if you want to know what's safe online, it's way more important to be familiar with the basic preexisting character sets then to know what the distribution of characters is. ~ and ^ will work about everywhere, whereas ? won't and ? is even worse, and that has nothing to do with their frequency online. -- Kie ekzistas vivo, ekzistas espero. From cph13 at case.edu Fri Mar 27 21:15:18 2015 From: cph13 at case.edu (Clive Hohberger) Date: Fri, 27 Mar 2015 21:15:18 -0500 Subject: Usage stats? In-Reply-To: References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> Message-ID: Interesting that you should bring up the ^ and tilde. Their OS independence and IBM mainframe compatibility is the reason in 1985 I chose them as the command prefixes for the now widely used ZPL label design programming language. ISO 646 IRV was chosen as the programming character set. 
On Friday, March 27, 2015, David Starner wrote: > On Fri, Mar 27, 2015 at 2:03 PM, Michael Norton > > wrote: > > > > This is good because when the volumes of traffic begin to exponentially > increase over a space, if there are predominant formulations of Unicode for > each, they need to be recognized for a number of reasons depending on which > sector or, as you say, corpus, you're in. > > Huh? > > > In the above example, I think it's safe to say U+0020 online, though I > would like to compare with the other 30 "space" characters you mentioned > Markus. If I know traffic figures for where the other space characters > are used, I can draw a pretty good estimation and correlation of it. > > ASCII characters are the safest to use online (everyone supports > them), except when they are the most dangerous (characters found > outside ASCII can rarely be used for tag/SQL/code injection). If you > want to know what people can display, look at the fonts that come with > the OSes that you're interested in. There's interesting things you can > do with this data, but if you want to know what's safe online, it's > way more important to be familiar with the basic preexisting character > sets then to know what the distribution of characters is. ~ and ^ will > work about everywhere, whereas ? won't and ? is even worse, and that > has nothing to do with their frequency online. > > > -- > Kie ekzistas vivo, ekzistas espero. > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -- Clive P. Hohberger, PhD MBA Managing Director Clive Hohberger, LLC +1 847 910 8794 cph13 at case.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From srl at icu-project.org Fri Mar 27 23:21:40 2015 From: srl at icu-project.org (Steven R. Loomis) Date: Fri, 27 Mar 2015 21:21:40 -0700 Subject: Usage stats? 
In-Reply-To: References: <55158E46.5010004@att.net> Message-ID: <7381ABF5-3D91-4627-A676-B3E3A83DC9E1@icu-project.org> Here's an analogy: it's more of a piano factory than a concert hall. S Sent from our iPhone. > On Mar 27, 2015, at 1:16 PM, Michael Norton wrote: > > (I know this is way too simplistic a response but it is kind of like giving everyone an invisible cloak and an invisible dagger and not telling them what a cloak and dagger is for [cutting butter & keeping warm]). > >> On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton wrote: >> Why wouldn't Unicode itself have it? -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaelanortonster at gmail.com Sat Mar 28 06:45:51 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Sat, 28 Mar 2015 07:45:51 -0400 Subject: Usage stats? In-Reply-To: References: <20150327143119.665a7a7059d7ee80bb4d670165c8327d.40466a238f.wbe@email03.secureserver.net> Message-ID: *Important correction from my last sent email*: *Only 34% from your list exceed 10% of **the average percentile (2.9%)**. * This is serendipitously common (eg. the Earth:Moon albedo ratio is .36). A relationship about motion and other natural properties and characteristics among the local texts begins to emerge. On Sat, Mar 28, 2015 at 7:30 AM, Michael Norton < michaelanortonster at gmail.com> wrote: > Thanks Doug. I did not know there exists a *representative* sample of > the world's text. :) I do know that 400 years ago there were about 10,000 > languages; now there are about 6,500. Time flies! > > Your frequency chart is great. The average char appearance is 2.91%. > Only 34% from your list exceed 10% of it. Therefore, U+0020 is the > elephant in the room (ie. 15.05% is far > 2.91%). In fact, it's almost > >50% greater than the next most-appearing character. > > So from the two frequency lists you've given me (my email and yours) we > begin to see some patterns emerge. 
Provided prior data and observation, > most useful patterns prevail over other more obscure ones and present a > provocative opportunity for webbers out there.... While this is probably out > of context for most of the 700 Unicode members, I can report that it's good > news. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From michaelanortonster at gmail.com Sat Mar 28 06:30:23 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Sat, 28 Mar 2015 07:30:23 -0400 Subject: Usage stats? In-Reply-To: <20150327143119.665a7a7059d7ee80bb4d670165c8327d.40466a238f.wbe@email03.secureserver.net> References: <20150327143119.665a7a7059d7ee80bb4d670165c8327d.40466a238f.wbe@email03.secureserver.net> Message-ID: Thanks Doug. I did not know there exists a *representative* sample of the world's text. :) I do know that 400 years ago there were about 10,000 languages; now there are about 6,500. Time flies! Your frequency chart is great. The average char appearance is 2.91%. Only 34% from your list exceed 10% of it. Therefore, U+0020 is the elephant in the room (ie. 15.05% is far > 2.91%). In fact, it's almost >50% greater than the next most-appearing character. So from the two frequency lists you've given me (my email and yours) we begin to see some patterns emerge. Provided prior data and observation, most useful patterns prevail over other more obscure ones and present a provocative opportunity for webbers out there.... While this is probably out of context for most of the 700 Unicode members, I can report that it's good news. On Fri, Mar 27, 2015 at 5:31 PM, Doug Ewell wrote: > Here is a frequency chart for my previous message. I used the Character > Frequency tool in Andrew West's BabelPad editor ( > http://www.babelstone.co.uk/Software/BabelPad.html) and sent the output > to Excel to calculate the percentages. To make Excel happy, I had to > manually add a single quote ' before the double quotation mark " . 
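A rough equivalent of the BabelPad-plus-Excel procedure described above can be sketched in Python. This is only an illustration, not the tool Doug used: the function name `char_frequencies` and the sample string are made up here, and the sketch relies only on the standard library's `collections.Counter` and `unicodedata.name`. Like the chart in this thread, it describes nothing beyond the text it is given.

```python
import unicodedata
from collections import Counter


def char_frequencies(text):
    """Return (code point, character, name, count, percent) rows,
    most frequent first; ties are ordered by code point."""
    counts = Counter(text)
    total = sum(counts.values())
    rows = []
    for ch, n in sorted(counts.items(), key=lambda kv: (-kv[1], ord(kv[0]))):
        # Control characters have no name, so fall back to a placeholder.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((f"U+{ord(ch):04X}", ch, name, n, round(100.0 * n / total, 2)))
    return rows


if __name__ == "__main__":
    # Hypothetical input; any string (for example, the body of an e-mail) works.
    sample = "Is there a list of usage statistics per character?"
    for code, ch, name, n, pct in char_frequencies(sample):
        print(f"{code}\t{ch}\t{name}\t{n}\t{pct:.2f}%")
```

The tab-separated output can be pasted into a spreadsheet if further calculation is wanted.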
>
> This is still *nowhere near* a realistic sample of which Unicode
> characters are used with what frequency in the entire world. There are
> still only 69 discrete characters, less than the printable ASCII set. And
> according to this sample, Regional Indicator Symbols occur as often as
> capital A, and capital R never occurs at all.
>
> In Japanese or Thai text you will have almost no instances of U+0020.
>
> If you search a non-representative sample of the world's text, you will
> get non-representative statistics.
>
>
> Code point  Character  Character Name                      Count  Percent
> U+0020                 SPACE                                 177   15.05%
> U+0065      e          LATIN SMALL LETTER E                   92    7.82%
> U+0074      t          LATIN SMALL LETTER T                   86    7.31%
> U+006F      o          LATIN SMALL LETTER O                   76    6.46%
> U+0061      a          LATIN SMALL LETTER A                   63    5.36%
> U+0069      i          LATIN SMALL LETTER I                   62    5.27%
> U+006E      n          LATIN SMALL LETTER N                   54    4.59%
> U+0072      r          LATIN SMALL LETTER R                   50    4.25%
> U+0073      s          LATIN SMALL LETTER S                   47    4.00%
> U+006C      l          LATIN SMALL LETTER L                   44    3.74%
> U+0063      c          LATIN SMALL LETTER C                   38    3.23%
> U+0068      h          LATIN SMALL LETTER H                   34    2.89%
> U+0075      u          LATIN SMALL LETTER U                   33    2.81%
> U+0064      d          LATIN SMALL LETTER D                   27    2.30%
> U+0079      y          LATIN SMALL LETTER Y                   25    2.13%
> U+0067      g          LATIN SMALL LETTER G                   18    1.53%
> U+002E      .          FULL STOP                              16    1.36%
> U+0030      0          DIGIT ZERO                             15    1.28%
> U+0062      b          LATIN SMALL LETTER B                   15    1.28%
> U+0066      f          LATIN SMALL LETTER F                   15    1.28%
> U+003E      >          GREATER-THAN SIGN                      13    1.11%
> U+0070      p          LATIN SMALL LETTER P                   13    1.11%
> U+0077      w          LATIN SMALL LETTER W                   12    1.02%
> U+002C      ,          COMMA                                  11    0.94%
> U+006D      m          LATIN SMALL LETTER M                   11    0.94%
> U+0055      U          LATIN CAPITAL LETTER U                  9    0.77%
> U+002D      -          HYPHEN-MINUS                            8    0.68%
> U+0076      v          LATIN SMALL LETTER V                    7    0.60%
> U+0078      x          LATIN SMALL LETTER X                    7    0.60%
> U+0027      '          APOSTROPHE                              6    0.51%
> U+0025      %          PERCENT SIGN                            5    0.43%
> U+002B      +          PLUS SIGN                               5    0.43%
> U+0037      7          DIGIT SEVEN                             5    0.43%
> U+006B      k          LATIN SMALL LETTER K                    5    0.43%
> U+0022      '"         QUOTATION MARK                          4    0.34%
> U+0031      1          DIGIT ONE                               4    0.34%
> U+0036      6          DIGIT SIX                               4    0.34%
> U+003A      :          COLON                                   4    0.34%
> U+003F      ?          QUESTION MARK                           4    0.34%
> U+0032      2          DIGIT TWO                               3    0.26%
> U+0033      3          DIGIT THREE                             3    0.26%
> U+0034      4          DIGIT FOUR                              3    0.26%
> U+0042      B          LATIN CAPITAL LETTER B                  3    0.26%
> U+004C      L          LATIN CAPITAL LETTER L                  3    0.26%
> U+004F      O          LATIN CAPITAL LETTER O                  3    0.26%
> U+0057      W          LATIN CAPITAL LETTER W                  3    0.26%
> U+002F      /          SOLIDUS                                 2    0.17%
> U+0035      5          DIGIT FIVE                              2    0.17%
> U+0043      C          LATIN CAPITAL LETTER C                  2    0.17%
> U+0045      E          LATIN CAPITAL LETTER E                  2    0.17%
> U+0046      F          LATIN CAPITAL LETTER F                  2    0.17%
> U+0049      I          LATIN CAPITAL LETTER I                  2    0.17%
> U+004E      N          LATIN CAPITAL LETTER N                  2    0.17%
> U+007C      |          VERTICAL LINE                           2    0.17%
> U+0028      (          LEFT PARENTHESIS                        1    0.09%
> U+0029      )          RIGHT PARENTHESIS                       1    0.09%
> U+0038      8          DIGIT EIGHT                             1    0.09%
> U+0039      9          DIGIT NINE                              1    0.09%
> U+003B      ;          SEMICOLON                               1    0.09%
> U+003C      <          LESS-THAN SIGN                          1    0.09%
> U+0041      A          LATIN CAPITAL LETTER A                  1    0.09%
> U+0044      D          LATIN CAPITAL LETTER D                  1    0.09%
> U+004A      J          LATIN CAPITAL LETTER J                  1    0.09%
> U+004D      M          LATIN CAPITAL LETTER M                  1    0.09%
> U+0050      P          LATIN CAPITAL LETTER P                  1    0.09%
> U+0054      T          LATIN CAPITAL LETTER T                  1    0.09%
> U+0059      Y          LATIN CAPITAL LETTER Y                  1    0.09%
> U+1F1F8     🇸          REGIONAL INDICATOR SYMBOL LETTER S      1    0.09%
> U+1F1FA     🇺          REGIONAL INDICATOR SYMBOL LETTER U      1    0.09%
> Total                                                       1176  100.00%
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
>

--
Michael A. Norton, B.A. Cinema, M.P.A.
My Cinema Home: http://www.NortonsNook.com

"All great actors are mere mathematical masters of speech and the human body."
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From wjgo_10009 at btinternet.com Sat Mar 28 07:48:17 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 28 Mar 2015 12:48:17 +0000 (GMT) Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <20150327180725.GA9968@math.berkeley.edu> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> Message-ID: <26071683.19632.1427546897670.JavaMail.defaultUser@defaultHost> >> If a two word phrase were to be typeset within a plain text file then each letter of the two words would need to have an instance of the COMBINING ITALICIZER after each letter of the word. Would one add an instance after the space character that is between the words? >> 27 March 2015 > Guys, it is just a 4 days wait. Then we can discuss the last question in depth until a consensus is reached! My question was because in the days of handset metal type, the spaces used with an italic font were usuallly the same as were used with a roman font. The only metal type font that I met that had its own special angled spaces was Palace Script. So would one do exactly as with metal type or would one use the COMBINING ITALICIZER after every character for completeness? William Overington 28 March 2015 From wjgo_10009 at btinternet.com Sat Mar 28 08:10:55 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 28 Mar 2015 13:10:55 +0000 (GMT) Subject: Plain text (from Re: Avoidance variants) Message-ID: <15115948.20960.1427548255474.JavaMail.defaultUser@defaultHost> Italic could be implemented using an OpenType font. Map the roman glyphs. Do not map the italic glyphs. Use glyph substitution rules in the font to access the italic glyphs if and only if a two character sequence occurs where the second of the two characters is the COMBINING ITALICIZER character. 
William Overington 28 March 2015 ----Original message---- >From : doug at ewellic.org Date : 27/03/2015 - 16:21 (GMTST) To : unicode at unicode.org Subject : Re: Plain text (from Re: Avoidance variants) So one of the concerns I have is the implication that "interesting" styling comprises: 1. bold 2. italic If formatting characters were encoded to support these two styling options, right away there would be calls to expand the set with: 3. underlining 4. strikeout 5. superscript 6. subscript 7. font face 8. font size 9. font color 10. highlighting 11. character spacing 12. line spacing 13. margins 14. page size 15. all the other styling options that word processors provide The proposed, simplified solution doesn't scale. It's the same as one of the concerns I have with encoding localizable sentences as characters. There aren't 20 or 50 or 100 sentences that people might want localized, but crores of them. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? _______________________________________________ Unicode mailing list Unicode at unicode.org http://unicode.org/mailman/listinfo/unicode From wjgo_10009 at btinternet.com Sat Mar 28 08:51:12 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 28 Mar 2015 13:51:12 +0000 (GMT) Subject: Localizable sentences (from Re: Plain text (from Re: Avoidance variants)) Message-ID: <25273834.23423.1427550672412.JavaMail.defaultUser@defaultHost> Doug Ewell wrote as follows. > It's the same as one of the concerns I have with encoding localizable sentences as characters. There aren't 20 or 50 or 100 sentences that people might want localized, but crores of them. Well, that seems like an example of a status point moving along a path upon the surface of a cusp catastrophe manifold in a mathematical model of decision making produced as an application of catastrophe theory. 
One instant there is a desire by people for zero of them to be encoded, then the next instant there is the desire by people for crores of them to be encoded. Please read the following post, where the quoted sentence is one originally from Doug. http://www.unicode.org/mail-arch/unicode-ml/y2011-m01/0103.html Yes, implementing localizable sentences technology may well be modelled by catastrophe theory. Whatever was said then and whatever is said now, there may be a sudden change in what people want and localizable sentences technology would become implemented quite quickly. I cannot implement it on my own, and maybe if implementation takes place I would play little or indeed no part in implementing it, yet maybe one day it will become implemented. I am hoping that it will be implemented in time for the 2020 Olympiad, including the Cultural Olympiad. William Overington 28 March 2015 From michaelanortonster at gmail.com Sat Mar 28 10:50:22 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Sat, 28 Mar 2015 11:50:22 -0400 Subject: Localizable sentences (from Re: Plain text (from Re: Avoidance variants)) In-Reply-To: <25273834.23423.1427550672412.JavaMail.defaultUser@defaultHost> References: <25273834.23423.1427550672412.JavaMail.defaultUser@defaultHost> Message-ID: Re: ListServ, is there any way for me to deprecate conversations I am not privvy to from the start of them, having just joined yesterday? On Sat, Mar 28, 2015 at 9:51 AM, William_J_G Overington < wjgo_10009 at btinternet.com> wrote: > Doug Ewell wrote as follows. > > > > It's the same as one of the concerns I have with encoding localizable > sentences as characters. > There aren't 20 or 50 or 100 sentences that people might want localized, > but crores of them. > > > Well, that seems like an example of a status point moving along a path > upon the surface of a cusp catastrophe manifold in a mathematical model of > decision making produced as an application of catastrophe theory. 
> > > One instant there is a desire by people for zero of them to be encoded, > then the next instant there is the desire by people for crores of them to > be encoded. > > > Please read the following post, where the quoted sentence is one > originally from Doug. > > > http://www.unicode.org/mail-arch/unicode-ml/y2011-m01/0103.html > > > Yes, implementing localizable sentences technology may well be modelled by > catastrophe theory. Whatever was said then and whatever is said now, there > may be a sudden change in what people want and localizable sentences > technology would become implemented quite quickly. > > > I cannot implement it on my own, and maybe if implementation takes place I > would play little or indeed no part in implementing it, yet maybe one day > it will become implemented. > > > I am hoping that it will be implemented in time for the 2020 Olympiad, > including the Cultural Olympiad. > > > William Overington > > > 28 March 2015 > > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sat Mar 28 11:52:35 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 28 Mar 2015 10:52:35 -0600 Subject: Usage stats? In-Reply-To: References: <20150327143119.665a7a7059d7ee80bb4d670165c8327d.40466a238f.wbe@email03.secureserver.net> Message-ID: <7701AC8C44144C0294EE485C5F8999EE@DougEwell> Michael Norton wrote: > Thanks Doug. I did not know there exists a representative sample of > the world's text. :) There is not, which was the point. Thanks for reposting a private message back to the list, by the way. ?? > Your frequency chart is great. The average char appearance is 2.91%. 
> Only 34% from your list exceed 10% of it. Therefore, U+0020 is the > elephant in the room (ie. 15%.05% is far > 2.91%). In fact, it's > almost >50% greater than the next most-appearing character. Words in English are separated by spaces, and the average English word is about 5 letters long. It follows that English text will contain a lot of spaces. You can eyeball this. > Only 34% from your list exceed 10% of the average percentile (2.9%). > > This is serendipitously common (eg. the Earth:Moon albedo ratio is > .36). A relationship about motion and other natural properties and > charactetristics among the local texts begin to emerge. Right. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From michaelanortonster at gmail.com Sat Mar 28 11:57:32 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Sat, 28 Mar 2015 12:57:32 -0400 Subject: Usage stats? In-Reply-To: <7701AC8C44144C0294EE485C5F8999EE@DougEwell> References: <20150327143119.665a7a7059d7ee80bb4d670165c8327d.40466a238f.wbe@email03.secureserver.net> <7701AC8C44144C0294EE485C5F8999EE@DougEwell> Message-ID: You needn't teach me about English sir, I am a writer most of the time. Science, however, has no shortage in the need for great teachers. There are even English people here in America short on the science out of England! Anyway, the pattern checks out on a web page I ran this morning in a Unicode counter this membership provided me. Equilibrium is of great importance in electromagnetism when considering whether or not work needs to be done in a given scenario. On Sat, Mar 28, 2015 at 12:52 PM, Doug Ewell wrote: > Michael Norton wrote: > > Thanks Doug. I did not know there exists a representative sample of >> the world's text. :) >> > > There is not, which was the point. > > Thanks for reposting a private message back to the list, by the way. [image: > ??] > > Your frequency chart is great. The average char appearance is 2.91%. >> Only 34% from your list exceed 10% of it. 
Therefore, U+0020 is the >> elephant in the room (ie. 15%.05% is far > 2.91%). In fact, it's >> almost >50% greater than the next most-appearing character. >> > > Words in English are separated by spaces, and the average English word is > about 5 letters long. It follows that English text will contain a lot of > spaces. You can eyeball this. > > Only 34% from your list exceed 10% of the average percentile (2.9%). >> >> This is serendipitously common (eg. the Earth:Moon albedo ratio is >> .36). A relationship about motion and other natural properties and >> charactetristics among the local texts begin to emerge. >> > > Right. > > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f4a2.png Type: image/png Size: 1504 bytes Desc: not available URL: From michaelanortonster at gmail.com Sat Mar 28 10:33:38 2015 From: michaelanortonster at gmail.com (Michael Norton) Date: Sat, 28 Mar 2015 11:33:38 -0400 Subject: Principia Announcement: QMx! Message-ID: With the virtue of the Unicode mission statement, hereby it is announced a new field of Quantum Magnetics, of which I am "the Father" (I need this credit for my bank account purposes only) begat by the vectorization of James Clerk Maxwell, determinism, and Sir Tim Berners-Lee. All specialists and professionals in the existing Arts, Science, Tech, and Engineering fields are welcome. A list of hypothesized jobs (there are probably equivalents already to some of these) to sprout from Qmx inexhaustively is as follows: Tech Oncologist Nurse Magneticist Magnetic Engineer Professor of Magnetospherics E-magnetism Architect Mag-Quant Translational Magnetist Visual Magnifier ... 
As for character usage rates, I've looked at another web page and the U+0020 (blank space) ratio stabilizes again at about >50% then the next most-popular character. For web docs as opposed traditional docs, I would like to see the differentials next with regard to internal and external hypertext characters. Stefan Trost's software is helping out a great deal with that. Thanks (Stefan &) Tom Gewecki for that! Phillips & Grant's 1975 & 1990 *Electromagnetism* asserts often along with their comprehensive description of Maxwell's equations that the magnetic field differentials resulting as a cause of motional and induced electronics are mostly *negligible*; however it appears that the opportunity with the growth of the web & Internet since then brings forth the update to Newton's *Principia. * I've got a lot of data and writings so if anyone here is interested in book-writing and has time to collaborate, it would be fun to put it all together with you. Best & cheers, -- Michael A. Norton, B.A. Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From asmus-inc at ix.netcom.com Sat Mar 28 13:14:47 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 28 Mar 2015 11:14:47 -0700 Subject: Plain text (from Re: Avoidance variants) In-Reply-To: <26071683.19632.1427546897670.JavaMail.defaultUser@defaultHost> References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <26071683.19632.1427546897670.JavaMail.defaultUser@defaultHost> Message-ID: <5516EF97.1050405@ix.netcom.com> On 3/28/2015 5:48 AM, William_J_G Overington wrote: >>> If a two word phrase were to be typeset within a plain text file then each letter of the two words would need to have an instance of the COMBINING ITALICIZER after each letter of the word. Would one add an instance after the space character that is between the words? >>> 27 March 2015 >> Guys, it is just a 4 days wait. Then we can discuss the last question in depth until a consensus is reached! > My question was because in the days of handset metal type, the spaces used with an italic font were usuallly the same as were used with a roman font. > > The only metal type font that I met that had its own special angled spaces was Palace Script. > > So would one do exactly as with metal type or would one use the COMBINING ITALICIZER after every character for completeness? > > The only logical conclusion is that that this should be applied to line feeds, tabs and other control codes as well. Just because digitally encoded text include some entities that were not present in hot metal typography is not reason to treat them as second class citizens. 
A./

From public at khwilliamson.com Sat Mar 28 13:42:45 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Sat, 28 Mar 2015 12:42:45 -0600
Subject: Problems in google and yahoo searches
Message-ID: <5516F625.60608@khwilliamson.com>

I had thought Tangut was going to be in Unicode 8, but the beta files didn't include it, so I tried simply searching on "tangut" from http://www.unicode.org/search/

Only bing showed a match on the pipeline page which had the answer.

Recently I wanted to review the email correspondence from when properties whose names began with "Is" were proposed, but I didn't find anything in that regard, except something in the document registry referring to the controversy.

It seems to me that there is something wrong with the searching of the Unicode site. Is there anything Unicode can do to enhance this?

From michaelanortonster at gmail.com Sat Mar 28 13:43:49 2015
From: michaelanortonster at gmail.com (Michael Norton)
Date: Sat, 28 Mar 2015 14:43:49 -0400
Subject: Usage stats?
In-Reply-To: <7381ABF5-3D91-4627-A676-B3E3A83DC9E1@icu-project.org>
References: <55158E46.5010004@att.net> <7381ABF5-3D91-4627-A676-B3E3A83DC9E1@icu-project.org>
Message-ID:

& another: the universe is the concert hall; its elements the instruments; we're but the resonance of its winds, strings, and voices' echoes.

On Sat, Mar 28, 2015 at 12:21 AM, Steven R. Loomis wrote:

> Here's an analogy: it's more of a piano factory than a concert hall.
>
> S
>
> Sent from our iPhone.
>
> On Mar 27, 2015, at 1:16 PM, Michael Norton <
> michaelanortonster at gmail.com> wrote:
>
> (I know this is way too simplistic a response but it is kind of like
> giving everyone an invisible cloak and an invisible dagger and not telling
> them what a cloak and dagger is for [cutting butter & keeping warm]).
>
> On Fri, Mar 27, 2015 at 3:57 PM, Michael Norton <
> michaelanortonster at gmail.com> wrote:
>
> Why wouldn't Unicode itself have it?
>>
>

-- Michael A. Norton, B.A.
Cinema, M.P.A. My Cinema Home: http://www.NortonsNook.com "All great actors are mere mathematical masters of speech and the human body." -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sat Mar 28 15:05:43 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 28 Mar 2015 14:05:43 -0600 Subject: Meroitic cursive fractions numerical values Message-ID: <55170997.70109@khwilliamson.com> In the 8.0 Beta files, some numerical values are not reduced to their lowest forms. Is there a compelling reason that 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; is not written as 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;1/2;N;;;;; given that there is also a 109BD;MEROITIC CURSIVE FRACTION ONE HALF;No;0;R;;;;1/2;N;;;;; Aren't the numeric values of U+109FB and U+109BD the same? Existing software that looks at the numeric values of characters is written expecting that rational numbers will have been reduced to their lowest form. From asmus-inc at ix.netcom.com Sat Mar 28 15:25:32 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sat, 28 Mar 2015 13:25:32 -0700 Subject: Meroitic cursive fractions numerical values In-Reply-To: <55170997.70109@khwilliamson.com> References: <55170997.70109@khwilliamson.com> Message-ID: <55170E3C.4080109@ix.netcom.com> Unless there is a value in documenting the value of the numerator and denominator, in which case this should be prominently explained in the documentation. Or is that written down somewhere already? A./ On 3/28/2015 1:05 PM, Karl Williamson wrote: > In the 8.0 Beta files, some numerical values are not reduced to their > lowest forms. 
Is there a compelling reason that > > 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; > > is not written as > > 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;1/2;N;;;;; > > given that there is also a > > 109BD;MEROITIC CURSIVE FRACTION ONE HALF;No;0;R;;;;1/2;N;;;;; > > Aren't the numeric values of U+109FB and U+109BD the same? > > Existing software that looks at the numeric values of characters is > written expecting that rational numbers will have been reduced to > their lowest form. > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From public at khwilliamson.com Sat Mar 28 15:37:25 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sat, 28 Mar 2015 14:37:25 -0600 Subject: Meroitic cursive fractions numerical values In-Reply-To: <55170E3C.4080109@ix.netcom.com> References: <55170997.70109@khwilliamson.com> <55170E3C.4080109@ix.netcom.com> Message-ID: <55171105.2010203@khwilliamson.com> On 03/28/2015 02:25 PM, Asmus Freytag (t) wrote: > Unless there is a value in documenting the value of the numerator and > denominator, in which case this should be prominently explained in the > documentation. It seems to me that the character name provides sufficient documentation of the numerator and denominator Or is that written down somewhere already? > > A./ > > > > On 3/28/2015 1:05 PM, Karl Williamson wrote: >> In the 8.0 Beta files, some numerical values are not reduced to their >> lowest forms. Is there a compelling reason that >> >> 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; >> >> is not written as >> >> 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;1/2;N;;;;; >> >> given that there is also a >> >> 109BD;MEROITIC CURSIVE FRACTION ONE HALF;No;0;R;;;;1/2;N;;;;; >> >> Aren't the numeric values of U+109FB and U+109BD the same? 
>> >> Existing software that looks at the numeric values of characters is >> written expecting that rational numbers will have been reduced to >> their lowest form. >> _______________________________________________ >> Unicode mailing list >> Unicode at unicode.org >> http://unicode.org/mailman/listinfo/unicode >> > > From richard.wordingham at ntlworld.com Sat Mar 28 18:29:21 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Sat, 28 Mar 2015 23:29:21 +0000 Subject: Usage stats? In-Reply-To: <20150328005956.33afb1c5@JRWUBU2> References: <55158E46.5010004@att.net> <29A51D0A-47D6-4892-8B99-B5DF59FE275D@mitre.org> <20150328005956.33afb1c5@JRWUBU2> Message-ID: <20150328232921.0a47bef1@JRWUBU2> On Sat, 28 Mar 2015 00:59:56 +0000 Richard Wordingham wrote: > On Fri, 27 Mar 2015 16:27:26 -0400 > Michael Norton wrote: > > > Easy example: what's the code for [blank space] U+020 across all > > language sets of Unicode? Is it the same ie: 100%? I've seen a claim from a normally reliable source that U+0020 is extremely rare in Thai or Japanese text. It does occur in Japanese text, though quite possibly as an error for IDEOGRAPHIC SPACE. In Thai, U+0020 is an extremely common and prescribed punctuation mark. It is reliably used as a clause and sentence separator, and is also used to delimit names and also numbers composed of digits. In newspaper columns, it occurs in most lines, and in books there are usually several to the line. The other common punctuation marks in serious material are the abbreviation mark U+002E FULL STOP (especially for initialisms) and the list item separator U+002C COMMA. Quotation marks, exclamation marks and ellipses occur in fictional dialogue with pretty much the same meaning as in English. Richard. 
From verdy_p at wanadoo.fr Sat Mar 28 18:46:46 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 29 Mar 2015 00:46:46 +0100 Subject: Localizable sentences (from Re: Plain text (from Re: Avoidance variants)) In-Reply-To: <25273834.23423.1427550672412.JavaMail.defaultUser@defaultHost> References: <25273834.23423.1427550672412.JavaMail.defaultUser@defaultHost> Message-ID: 2015-03-28 14:51 GMT+01:00 William_J_G Overington : > Doug Ewell wrote as follows. > > > It's the same as one of the concerns I have with encoding localizable > sentences as characters. > There aren't 20 or 50 or 100 sentences that people might want localized, > but crores of them. > > > Well, that seems like an example of a status point moving along a path > upon the surface of a cusp catastrophe manifold in a mathematical model of > decision making produced as an application of catastrophe theory. > > > One instant there is a desire by people for zero of them to be encoded, > then the next instant there is the desire by people for crores of them to > be encoded. > You don't need Unicode encoding to encode localizable sentences. There are various ways to do it. Notably using URNs (or URLs to URN resolvers) where you can create your registry of entities to encode (use the encoding scheme you wish, you'll probably use unique identifiers such as UUIDs or a digital hash such as SHA1 of an initial sentence in any language) or XML paths (within some XML DOM scheme). These URNs can then be resolved either as images/icons (in various formats), or as plaintext (encoded in standard Unicode)., or rich text (such as HTML), or as glyph identifiers in some custom symbolic fonts. If you're not able to create this registry, don't even think that Unicode will host it (unless part of it demonstrates its usefulness for inclusion in CLDR by wishes of CLDR contributors and votes in the CLDR TC). 
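As a rough illustration of the registry idea described above, here is a minimal Python sketch that derives an identifier for a sentence either from a SHA-1 digest or from a name-based UUID, and resolves it to plain text per language. The `urn:example:sentence:` prefix, the function names, and the toy registry are all assumptions made up for this sketch; nothing here is an existing registry or a proposed standard.

```python
import hashlib
import unicodedata
import uuid

# Hypothetical prefix for illustration only; it is not a registered URN namespace.
URN_PREFIX = "urn:example:sentence:"


def sentence_urn_sha1(sentence):
    """Identify a sentence by the SHA-1 digest of its NFC-normalized UTF-8 bytes."""
    data = unicodedata.normalize("NFC", sentence).encode("utf-8")
    return URN_PREFIX + "sha1:" + hashlib.sha1(data).hexdigest()


def sentence_urn_uuid(sentence):
    """Identify a sentence by a name-based (version 5) UUID."""
    name = URN_PREFIX + unicodedata.normalize("NFC", sentence)
    return "urn:uuid:" + str(uuid.uuid5(uuid.NAMESPACE_URL, name))


# Toy registry: one identifier, resolved to plain text in two languages.
SEED = "Where is the railway station?"
REGISTRY = {
    sentence_urn_sha1(SEED): {
        "en": "Where is the railway station?",
        "fr": "Où est la gare ?",
    }
}


def resolve(urn, lang):
    """Return the plain-text rendering of a registered sentence."""
    return REGISTRY[urn][lang]


print(resolve(sentence_urn_sha1(SEED), "fr"))
```

The identifier only names the sentence; which rendering comes back (plain text, HTML, an icon) would be up to the resolver, which is the point being made about not needing new character codes for this.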
Then only later, may be, some parts of CLDR data (or of some external data in a registry shared publicly and with a sizeable and open community of users that wish to extend the open use of their conventional encoding, such as stenographic notations, or chemistry symbols; or emoticons/emojis) may be encoded into plain-text Unicode (such as mathematics symbols which carry thir own stylistic requirements for their admissible glyphs, but more importantly their identity as *maths* symbols distinct from standard letters with too variable forms). Unicode is definitely not the first place to do your homework : you **first** need to create or find a community of interest and start documenting your processes and conventions in your own collaboration workspace and make sure it is open to collaboration and that this workspace has a measurable usage (outside yourself only). Absolutely nothing you constantly propose in this list has demonstrated this minimum requirement. All what you propose recurrently here is then out of topic. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Sat Mar 28 19:22:33 2015 From: kenwhistler at att.net (Ken Whistler) Date: Sat, 28 Mar 2015 17:22:33 -0700 Subject: Meroitic cursive fractions numerical values In-Reply-To: <55170997.70109@khwilliamson.com> References: <55170997.70109@khwilliamson.com> Message-ID: <551745C9.3050300@att.net> On 3/28/2015 1:05 PM, Karl Williamson wrote: > In the 8.0 Beta files, some numerical values are not reduced to their > lowest forms. Is there a compelling reason that > > 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;; > > is not written as > > 109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;1/2;N;;;;; Well, obviously you might not consider it a "compelling" reason, but the numeric values were written that way in the original proposal (L2/12-206, June 6, 2012). 
Nobody said anything about rational numbers expressed as fractions being required to be in lowest form, and the entries were just carried forward into the drafts of Unicode 8.0 UnicodeData.txt for beta review.

>
> given that there is also a
>
> 109BD;MEROITIC CURSIVE FRACTION ONE HALF;No;0;R;;;;1/2;N;;;;;
>
> Aren't the numeric values of U+109FB and U+109BD the same?

Of course.

>
> Existing software that looks at the numeric values of characters is
> written expecting that rational numbers will have been reduced to
> their lowest form.

Well, not all existing software, obviously, as the tools used to generate the derived data files didn't complain, and produced the correct results for these Meroitic fractions:

http://www.unicode.org/Public/8.0.0/ucd/extracted/DerivedNumericValues-8.0.0d10.txt

And there is nothing in the documentation of the Numeric_Value property (see UAX #44) that currently *requires* only an irreducible fraction (or an integer) in the field. (See also DerivedNumericValues.txt, which is silent on this.)

You can always provide beta feedback requesting that the relevant fractions be changed to their lowest forms, for review by the May UTC meeting. Personally, I wouldn't object to a change like that, as I don't see any particular didactic value to expressing the fractional values with precisely the same numerator and denominator as the character form implies, if it isn't mathematically necessary.

On the other hand, I would be loath to make this a mandatory *requirement* of the Numeric_Value field, as that would then add yet another baroque invariant on the UCD data, and would imply yet more elaborate testing to verify for each release that a new invariant we imposed on ourselves was not somehow violated in the new data for the UCD. The set of invariants currently maintained is already bordering on impossible for any one participant in the data maintenance to understand.
The other drawback of piling on invariants is that the UTC has been bitten by them in the past when something new comes up that wasn't anticipated. This particular requirement might be innocuous and safe -- but why tempt the fates?

--Ken

From verdy_p at wanadoo.fr Sat Mar 28 19:33:20 2015
From: verdy_p at wanadoo.fr (verdy_p)
Date: Sun, 29 Mar 2015 01:33:20 +0100 (CET)
Subject: No subject
Message-ID: <1281291083.24117.1427589200805.JavaMail.www@wwinf1k18>

[Note: message resent using another domain. Visibly the Unicode mailing list rejects as spam all emails posted from Gmail's webmail, and containing all relevant tracking mime headers and regularly signed by Google and my proven identity].

2015-03-28 12:30 GMT+01:00 Michael Norton :

>
> Thanks Doug. I did not know there exists a representative sample of the world's text. :)
> I do know that 400 years ago there were about 10,000 languages; now there are about 6,500.
> Time flies!
>
> Your frequency chart is great. The average char appearance is 2.91%. Only 34% from your list exceed 10% of it.
> Therefore, U+0020 is the elephant in the room (ie. 15.05% is far > 2.91%).
> In fact, it's almost >50% greater than the next most-appearing character.
>
> So from the two frequency lists you've given me (my email and yours) we begin to see some patterns emerge.
> Provided prior data and observation, most useful patterns prevail over other more obscure ones
> and present a provocative opportunity for webbers out there...
>
> While this is probably out of context for most of the 700 Unicode members, I can report that it's good news.

A long time ago I learned a "word" (or is it an acronym? it's not really an abbreviation by itself even if it is pronounceable) used by French cryptanalysts (using simple encryption schemes by substitution): "ESARTINULOC" (some older sources gave "ESANTIRULO"), which is the ordered list of the most frequently used basic letters in French (ignoring case and diacritic differences).
It's also used implicitly by gamers (e.g. playing or composing crosswords, or playing games such as Scrabble(TM), where the top letters of the list have lower scoring values, different between French Scrabble and English Scrabble).

That "word" is slightly different in English, or in the limited "global" counting Doug did (over an extremely limited set of source texts); but of course in French the SPACE would also lead the list before that "word" (but that does not enter into account for crosswords or Scrabble, even in languages that don't use spaces for word separation).

More accurate statistics may be found using statistics collected by databases with plain-text search capabilities (in the structure of their index), provided they correctly track the language used and their data concerns a large enough set of domains (e.g. statistics of plain-text search engines for each **localized** edition of Wikipedia, Wiktionary, or Wikisource). If you want "global" statistics it will be more difficult (Wikimedia Commons is insufficiently translated, with a too wide presence of English), but what you may do is to estimate the rate of usage for each main language (or macrolanguage) and weight the statistics collected for each language to return an estimated "global" frequency list.

But be careful, each language has its own set of collation rules such that letters that are considered as having the same primary weight in one language are distinguished and counted separately in some other language: you may find that a source "ü" or "ä" had its rate actually computed as "UE" or "AE" in German, but only as "U" or "A" in English or French, and this will not allow you to correctly estimate the global frequency rates of "U", "A" and "E".
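The collation caveat above can be made concrete with a small Python sketch. It counts base letters in the same text under two folding rules: one that simply drops diacritics, and one that first expands German umlauts to the base letter plus "E". The `GERMAN_EXPANSIONS` table and the helper names are assumptions for illustration only; they are not taken from CLDR or from any tool mentioned in this thread.

```python
import unicodedata
from collections import Counter

# Hypothetical folding rule for illustration: umlaut -> base letter + "E".
GERMAN_EXPANSIONS = {"Ä": "AE", "Ö": "OE", "Ü": "UE"}


def strip_marks(text):
    """Fold to base letters by dropping combining marks (e.g. 'Ü' -> 'U')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_counts(text, expansions=None):
    """Count A-Z after case folding, optional expansions, and mark stripping."""
    upper = unicodedata.normalize("NFC", text).upper()
    for source, target in (expansions or {}).items():
        upper = upper.replace(source, target)
    folded = strip_marks(upper)
    return Counter(ch for ch in folded if "A" <= ch <= "Z")


sample = "für Bären"
plain = letter_counts(sample)                      # counted as FUR BAREN
german = letter_counts(sample, GERMAN_EXPANSIONS)  # counted as FUER BAEREN

print(plain["E"], german["E"])
```

The same input yields a different count for "E" depending on the rule, so per-language tables built with different rules are not directly comparable letter by letter.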
A simple linear mathematical transform (scalar products of usage rates of languages and usage rates of letters per language) would not work: the global usage rate of "E" would be underestimated where it also represents the German umlaut, and both "U" and "A" would be overestimated...

From andrewcwest at gmail.com Sun Mar 29 04:41:17 2015
From: andrewcwest at gmail.com (Andrew West)
Date: Sun, 29 Mar 2015 10:41:17 +0100
Subject: Meroitic cursive fractions numerical values
In-Reply-To: <55170997.70109@khwilliamson.com>
References: <55170997.70109@khwilliamson.com>
Message-ID:

On 28 March 2015 at 20:05, Karl Williamson wrote:
>
> Existing software that looks at the numeric values of characters is written
> expecting that rational numbers will have been reduced to their lowest form.

That seems to be a rather rash statement. I have software (BabelPad) which parses the numeric values of characters for numeric sorting purposes, and it parses "6/12" for MEROITIC CURSIVE FRACTION SIX TWELFTHS as 0.5. Personally I find it hard to imagine how you could write software that accepts "6/12" as input and is unable to come up with the answer of a half.

I would say that fractions should not be reduced to their lowest form in the Unicode data as some people may need to order fractions by numerator or denominator, and reducing to lowest form could break the expectations of some software. Having said that, I note that the numeric value of one character has been reduced in the Unicode data: U+2189 VULGAR FRACTION ZERO THIRDS is given the numeric value of "0" rather than "0/3".
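To make the two uses concrete, here is a minimal Python sketch (not BabelPad's code, which is not shown in this thread) that reads the numeric field from the two UnicodeData.txt records quoted earlier, keeps the numerator and denominator as written, and also computes the reduced value. The helper names are made up for the example; the field position (index 8 after splitting on ";") matches the quoted records.

```python
from fractions import Fraction

# The two records quoted in this thread from the Unicode 8.0 beta UnicodeData.txt.
RECORDS = [
    "109FB;MEROITIC CURSIVE FRACTION SIX TWELFTHS;No;0;R;;;;6/12;N;;;;;",
    "109BD;MEROITIC CURSIVE FRACTION ONE HALF;No;0;R;;;;1/2;N;;;;;",
]


def numeric_field(record):
    """Return the numeric field (index 8) of a UnicodeData.txt record."""
    return record.split(";")[8]


def as_written(field):
    """Return (numerator, denominator) exactly as spelled in the field."""
    num, _, den = field.partition("/")
    return int(num), int(den) if den else 1


def value(field):
    """Return the mathematical value; Fraction reduces to lowest terms."""
    return Fraction(*as_written(field))


for record in RECORDS:
    field = numeric_field(record)
    print(record.split(";")[0], as_written(field), value(field))
```

Both records evaluate to the same value, while `as_written` still distinguishes (6, 12) from (1, 2) for anyone who wants to order by numerator or denominator.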
Andrew

From verdy_p at wanadoo.fr Sun Mar 29 06:16:33 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 29 Mar 2015 13:16:33 +0200
Subject: Meroitic cursive fractions numerical values
In-Reply-To:
References: <55170997.70109@khwilliamson.com>
Message-ID:

How would you note the numeric value property of the mathematical pi symbol, if you use "0.5", assuming that it should be written as a single decimal value without using any operator? You can't, because there's an infinite number of decimals, unless you explicitly say that the numeric property is limited to the precision of an IEEE 64-bit "double" floating point value (or 80-bit "long double" supported natively by x86 processors).

So you have to imagine that the numeric value property is effectively a mathematical expression using some conventional set of mathematical symbols (in which case the numeric value property of the pi symbol should be the symbol itself).

In that case, writing "6/12" or "1/2" is fully equivalent, mathematically, as this property is a mathematical expression.

Now that property should have a syntax defined. The problem being that for complex expressions there are several mathematical notations, the most common used in plaintext being TeX (except that it does not just note the expression itself but its presentation and layout).

Could Unicode define a basic plaintext syntax for a subset of mathematical expressions that are useful to parse the "numeric value" field?

It would of course contain the syntax for numbers (using all decimal digits from various scripts, but ignoring the localized conventions for decimal separators, reduced to just the ASCII dot, and the grouping separators, reduced to none), restricting the use of unnecessary whitespace in that field and reducing the use of unnecessary leading zeroes (or trailing zeroes in decimal parts); it would contain the subset of symbolic constants encoded in Unicode as symbolic constants (such as pi, e, i).
It would not contain any symbolic constant directly expressible with others. It could potentially contain superscript digits used for exponents. And of course it would contain the common set of arithmetic operators (+, the ASCII "MINUS-HYPHEN" or mathematical MINUS, ? or the ASCII ASTERISK, / or ??, ^ for noting exponentiation, and parentheses), or algebric operators (such as?). It would not include special operators (such as ?) that can't be evaluated to a single number in a single dimensional numerical body (so we limit us to the body of complex numbers ?). Further extensions would include some common functions such as core trigonometric and hyperbolic functions (sine, cosine, tangent, cotangent) and their inverse, and logarithms. That syntax would not specify if those expressions are effectively evaluatable such as 0/0 (it's up to implementations to check this according to their own numeric domain) as the syntax does not specify the numeric domain (body or ring?) in which it will be evaluated (for example 1/0 is valid in some rings where all member numbers are invertible, including zero), and it will not assume that "-1" is necessarily different from "+1" (they are equivalent in Z/2Z which just contains two members: 0 and 1, and where "2" or "4" are also equal "0") or the precision of numbers ("1/100" could be equal to "0" in an integer domain). This could be the base for defining a basic set of expressions that many programming languages could support in their syntax, using the precision they want or can support (even if their native syntax use other similar notations with simple substitution rules. 
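The "same parser, different evaluator" idea can be sketched in a few lines of Python. This is only an illustration restricted to the "n" and "n/d" forms that actually occur in the numeric field discussed in this thread; the function names are made up, and the modular evaluator assumes Python 3.8 or later for `pow` with a negative exponent.

```python
from fractions import Fraction


def parse_ratio(text):
    """Parse 'n' or 'n/d' into (n, d) without evaluating anything."""
    num, _, den = text.partition("/")
    return int(num), int(den) if den else 1


def eval_rational(text):
    """Evaluate in the rationals: exact, reduced to lowest terms."""
    n, d = parse_ratio(text)
    return Fraction(n, d)


def eval_integer(text):
    """Evaluate in the integers using floor division."""
    n, d = parse_ratio(text)
    return n // d


def eval_mod(text, modulus):
    """Evaluate in Z/mZ; raises ValueError if d has no inverse there."""
    n, d = parse_ratio(text)
    return (n * pow(d, -1, modulus)) % modulus


print(eval_rational("6/12"))                 # exact rational value
print(eval_integer("1/100"))                 # collapses to 0 in the integers
print(eval_mod("-1", 2), eval_mod("+1", 2))  # equal in Z/2Z
```

The parsed pair is identical in every case; only the evaluator decides whether "1/100" is a small rational, zero, or undefined.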
For this reason, it seems more natural to avoid reducing fractions in the numeric property value and to keep them in their natural form: "6/12" NOT reduced to "1/2", and "0/3" NOT reduced to "0" (because reducing may incorrectly assume a subset of a linear numeric field). Let the implementation define its own numeric domain and decide whether these expressions are evaluatable in that domain: the parser will be the same; only the evaluator will differ, as it depends entirely on the numeric domain.

2015-03-29 11:41 GMT+02:00 Andrew West :

> On 28 March 2015 at 20:05, Karl Williamson wrote:
> >
> > Existing software that looks at the numeric values of characters is written
> > expecting that rational numbers will have been reduced to their lowest form.
>
> That seems to be a rather rash statement. I have software (BabelPad) which parses the numeric values of characters for numeric sorting purposes, and it parses "6/12" for MEROITIC CURSIVE FRACTION SIX TWELFTHS as 0.5. Personally I find it hard to imagine how you could write software that accepts "6/12" as input and is unable to come up with the answer of a half.
>
> I would say that fractions should not be reduced to their lowest form in the Unicode data as some people may need to order fractions by numerator or denominator, and reducing to lowest form could break the expectations of some software. Having said that, I note that the numeric value of one character has been reduced in the Unicode data: U+2189 VULGAR FRACTION ZERO THIRDS is given the numeric value of "0" rather than "0/3".
>
> Andrew
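The "same parser, different evaluator" idea in the message above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `NumericValue`, `parse_numeric_value` and the `evaluate_*` helpers are hypothetical, the parser handles only the integer and "n/d" forms discussed in this thread (not the wider expression syntax being proposed), and `pow(x, -1, n)` needs Python 3.8 or later.

```python
from fractions import Fraction
from typing import NamedTuple


class NumericValue(NamedTuple):
    """An unreduced numeric value as written in the data, e.g. 6/12."""
    numerator: int
    denominator: int  # 1 when the field is a plain integer


def parse_numeric_value(field: str) -> NumericValue:
    """Parse "6/12", "-1/2" or "7" without reducing the fraction."""
    numerator_text, slash, denominator_text = field.strip().partition("/")
    numerator = int(numerator_text)
    denominator = int(denominator_text) if slash else 1
    return NumericValue(numerator, denominator)


def evaluate_rational(value: NumericValue) -> Fraction:
    """Exact rational arithmetic: 6/12 and 1/2 compare equal."""
    return Fraction(value.numerator, value.denominator)


def evaluate_float(value: NumericValue) -> float:
    """Binary floating point, as in the 0.5 mentioned for 6/12."""
    return value.numerator / value.denominator


def evaluate_integer(value: NumericValue) -> int:
    """Integer domain using floor division: 1/100 collapses to 0."""
    return value.numerator // value.denominator


def evaluate_modular(value: NumericValue, modulus: int) -> int:
    """Z/nZ: multiply by the modular inverse of the denominator.

    pow(x, -1, n) raises ValueError when x is not invertible mod n.
    """
    inverse = pow(value.denominator, -1, modulus)
    return (value.numerator * inverse) % modulus


if __name__ == "__main__":
    six_twelfths = parse_numeric_value("6/12")
    print(six_twelfths)                     # numerator and denominator kept as written
    print(evaluate_rational(six_twelfths))  # exact value
    print(evaluate_float(six_twelfths))     # floating-point value
    print(evaluate_integer(parse_numeric_value("1/100")))
    print(evaluate_modular(parse_numeric_value("-1"), 2))
```

Because the parsed value keeps numerator and denominator separately, ordering by either one (the use case Andrew West mentions) stays possible, for example with `sorted(values, key=lambda v: v.denominator)`. Reduction happens only inside an evaluator that chooses to reduce.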
From johnnyfarraj at yahoo.com Sat Mar 28 22:21:39 2015
From: johnnyfarraj at yahoo.com (Johnny Farraj)
Date: Sun, 29 Mar 2015 03:21:39 +0000 (UTC)
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
Message-ID: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com>

Dear unicode list members,

I wish to get feedback about a new character submission proposal.

Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:

266D ♭ MUSIC FLAT SIGN
266F ♯ MUSIC SHARP SIGN

while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:

1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
1D12C 𝄬 MUSICAL SYMBOL FLAT UP
1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
1D130 𝄰 MUSICAL SYMBOL SHARP UP
1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
1D133 𝄳 MUSICAL SYMBOL QUARTER TONE FLAT

I am proposing the addition of 2 new characters to the Musical Symbols table:

- the half-flat sign (lower note by a quarter tone)
- the half-sharp sign (raise note by a quarter tone)

These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.

I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English. I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.

I welcome any feedback on this proposal.

thanks

Johnny Farraj
johnnyfarraj at yahoo.com

[Attachment scrubbed: half-flat sign.png (image/png, 3617 bytes)]
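One way to check which of the accidentals listed in the message above already have code points is to query a Unicode character database. The sketch below uses Python's standard `unicodedata` module. The helper names `describe` and `find_accidentals` are illustrative only, and the result of the block scan depends on the Unicode version bundled with the Python build in use.

```python
import unicodedata

# Code points quoted in the message above.
LISTED = [0x266D, 0x266F, 0x1D12A, 0x1D12B, 0x1D12C, 0x1D12D,
          0x1D130, 0x1D131, 0x1D132, 0x1D133]


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME', or mark the code point as unassigned."""
    name = unicodedata.name(chr(code_point), "<unassigned>")
    return f"U+{code_point:04X} {name}"


def find_accidentals(first: int = 0x1D100, last: int = 0x1D1FF):
    """List named characters in a range whose names mention FLAT or SHARP."""
    hits = []
    for code_point in range(first, last + 1):
        name = unicodedata.name(chr(code_point), "")
        if "FLAT" in name or "SHARP" in name:
            hits.append((code_point, name))
    return hits


if __name__ == "__main__":
    for code_point in LISTED:
        print(describe(code_point))
    print()
    # Scan the Musical Symbols block for anything else that looks like an accidental.
    for code_point, name in find_accidentals():
        print(f"U+{code_point:04X} {name}")
```

A name search like this only shows what is encoded and under which name. It says nothing about the glyph shown in the code charts, which is the point the rest of the thread turns on.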
[Attachment scrubbed: half-sharp sign.png (image/png, 2754 bytes)]

From everson at evertype.com Sun Mar 29 11:49:20 2015
From: everson at evertype.com (Michael Everson)
Date: Sun, 29 Mar 2015 18:49:20 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com>
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>

Johnny,

I’m interested in working with you and Sami on this.

These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music. The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.

I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.

Michael Everson

On 29 Mar 2015, at 05:21, Johnny Farraj wrote:

> Dear unicode list members,
>
> I wish to get feedback about a new character submission proposal.
>
> Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
>
> 266D ♭ MUSIC FLAT SIGN
> 266F ♯ MUSIC SHARP SIGN
>
> while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
>
> 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
> 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
> 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
> 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
> 1D130 𝄰 MUSICAL SYMBOL SHARP UP
> 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
> 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
> 1D133 𝄳
MUSICAL SYMBOL QUARTER TONE FLAT
>
> I am proposing the addition of 2 new characters to the Musical Symbols table:
>
> - the half-flat sign (lower note by a quarter tone)
> - the half-sharp sign (raise note by a quarter tone)
>
> These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
>
> I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
> I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
>
> I welcome any feedback on this proposal.
>
> thanks
>
> Johnny Farraj
> johnnyfarraj at yahoo.com

From johnnyfarraj at yahoo.com Sun Mar 29 12:53:59 2015
From: johnnyfarraj at yahoo.com (Johnny Farraj)
Date: Sun, 29 Mar 2015 13:53:59 -0400
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References:
Message-ID: <4458F47E-B607-4AF6-A51D-07BFCA39BCEC@yahoo.com>

Michael,

Thanks for the swift response, and your interest.
Your collaboration is greatly appreciated.
Do you have any experience in submitting new Unicode character proposals?
And/or with creating the reference copy of a symbol in the format required?
I'm happy to do the bulk of the work for the submission as long as someone guides me.

Johnny

> On Mar 29, 2015, at 1:34 PM, sami shumays wrote:
>
> Wow thanks for the swift response, Michael! Johnny and I thought we were going to have to do some more aggressive lobbying to get this through, but we're delighted that there is already some interest!
>
> Here's another reference, my latest article in Music Theory Spectrum on maqam:
> http://maqamlessons.com/analysis/media/MaqamAnalysisAPrimer_MTS3502_Shumays2013.pdf
> Where you can see pretty standard usage of these symbols, identical to what Johnny sent you.
I know MTS has an interest in publishing more on Arabic music because they reached out to me to do a few peer reviews over the last few years. I was planning on reaching out to them to solicit support for Johnny's proposal, but if you're willing to go ahead without it then I won't bother.
>
> -Sami
>
> Sent from my Verizon Wireless 4G LTE Smartphone
>
> -------- Original message --------
> From: Michael Everson
> Date: 03/29/2015 12:49 PM (GMT-05:00)
> To: unicode at unicode.org
> Cc: Sami Shumays, Johnny Farraj
> Subject: Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
>
> Johnny,
>
> I’m interested in working with you and Sami on this.
>
> These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music. The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.
>
> I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.
>
> Michael Everson
>
> On 29 Mar 2015, at 05:21, Johnny Farraj wrote:
>
> > Dear unicode list members,
> >
> > I wish to get feedback about a new character submission proposal.
> >
> > Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
> >
> > 266D ♭ MUSIC FLAT SIGN
> > 266F ♯ MUSIC SHARP SIGN
> >
> > while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
> >
> > 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
> > 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
> > 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
> > 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
> > 1D130 𝄰 MUSICAL SYMBOL SHARP UP
> > 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
> > 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
> > 1D133 𝄳
MUSICAL SYMBOL QUARTER TONE FLAT
> >
> > I am proposing the addition of 2 new characters to the Musical Symbols table:
> >
> > - the half-flat sign (lower note by a quarter tone)
> > - the half-sharp sign (raise note by a quarter tone)
> >
> > These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
> >
> > I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
> > I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
> >
> > I welcome any feedback on this proposal.
> >
> > thanks
> >
> > Johnny Farraj
> > johnnyfarraj at yahoo.com

From johnnyfarraj at yahoo.com Sun Mar 29 13:03:16 2015
From: johnnyfarraj at yahoo.com (Johnny Farraj)
Date: Sun, 29 Mar 2015 14:03:16 -0400
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To: <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
Message-ID: <097A5DB4-46A6-47C7-9435-7524C09DBB0E@yahoo.com>

Actually I see the half-flat sign (flat sign with a slash) here in this Wikipedia page

http://en.m.wikipedia.org/wiki/Flat_%28music%29

And the half-sharp sign in this page

http://en.m.wikipedia.org/wiki/Sharp_(music)

These are exactly the signs I was proposing. I wonder if they're already Unicode characters and I missed them?

Johnny

> On Mar 29, 2015, at 12:49 PM, Michael Everson wrote:
>
> Johnny,
>
> I’m interested in working with you and Sami on this.
>
> These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music.
The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.
>
> I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.
>
> Michael Everson
>
>> On 29 Mar 2015, at 05:21, Johnny Farraj wrote:
>>
>> Dear unicode list members,
>>
>> I wish to get feedback about a new character submission proposal.
>>
>> Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
>>
>> 266D ♭ MUSIC FLAT SIGN
>> 266F ♯ MUSIC SHARP SIGN
>>
>> while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
>>
>> 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
>> 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
>> 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
>> 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
>> 1D130 𝄰 MUSICAL SYMBOL SHARP UP
>> 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
>> 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
>> 1D133 𝄳 MUSICAL SYMBOL QUARTER TONE FLAT
>>
>> I am proposing the addition of 2 new characters to the Musical Symbols table:
>>
>> - the half-flat sign (lower note by a quarter tone)
>> - the half-sharp sign (raise note by a quarter tone)
>>
>> These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
>>
>> I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
>> I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
>>
>> I welcome any feedback on this proposal.
>>
>> thanks
>>
>> Johnny Farraj
>> johnnyfarraj at yahoo.com

From jcb+unicode at inf.ed.ac.uk Sun Mar 29 14:27:23 2015
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Sun, 29 Mar 2015 20:27:23 +0100 (BST)
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
References: <4458F47E-B607-4AF6-A51D-07BFCA39BCEC@yahoo.com>
Message-ID:

On 2015-03-29, Johnny Farraj wrote:
> Michael,
> Thanks for the swift response, and your interest.
> Your collaboration is greatly appreciated.
> Do you have any experience in submitting new Unicode character proposals?
> And/or with creating the reference copy of a symbol in the format required?

I think Michael should print this out and frame it, as a modern equivalent of the slave muttering "memento mori" ...

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From johnnyfarraj at yahoo.com Sun Mar 29 14:37:00 2015
From: johnnyfarraj at yahoo.com (Johnny Farraj)
Date: Sun, 29 Mar 2015 15:37:00 -0400
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To: <097A5DB4-46A6-47C7-9435-7524C09DBB0E@yahoo.com>
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com> <097A5DB4-46A6-47C7-9435-7524C09DBB0E@yahoo.com>
Message-ID: <71349063-B84A-421A-AAA9-3BE2167C57FE@yahoo.com>

Thank you Tom for pointing this out, the 2 signs on Wikipedia are images not Unicode characters:

So we still need this proposal.
Johnny

> On Mar 29, 2015, at 2:03 PM, Johnny Farraj wrote:
>
> Actually I see the half-flat sign (flat sign with a slash) here in this Wikipedia page
>
> http://en.m.wikipedia.org/wiki/Flat_%28music%29
>
> And the half-sharp sign in this page
>
> http://en.m.wikipedia.org/wiki/Sharp_(music)
>
> These are exactly the signs I was proposing. I wonder if they're already Unicode characters and I missed them?
>
> Johnny
>
>> On Mar 29, 2015, at 12:49 PM, Michael Everson wrote:
>>
>> Johnny,
>>
>> I’m interested in working with you and Sami on this.
>>
>> These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music. The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.
>>
>> I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.
>>
>> Michael Everson
>>
>>> On 29 Mar 2015, at 05:21, Johnny Farraj wrote:
>>>
>>> Dear unicode list members,
>>>
>>> I wish to get feedback about a new character submission proposal.
>>>
>>> Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
>>>
>>> 266D ♭ MUSIC FLAT SIGN
>>> 266F ♯ MUSIC SHARP SIGN
>>>
>>> while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
>>>
>>> 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
>>> 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
>>> 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
>>> 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
>>> 1D130 𝄰 MUSICAL SYMBOL SHARP UP
>>> 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
>>> 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
>>> 1D133 𝄳
MUSICAL SYMBOL QUARTER TONE FLAT
>>>
>>> I am proposing the addition of 2 new characters to the Musical Symbols table:
>>>
>>> - the half-flat sign (lower note by a quarter tone)
>>> - the half-sharp sign (raise note by a quarter tone)
>>>
>>> These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
>>>
>>> I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
>>> I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
>>>
>>> I welcome any feedback on this proposal.
>>>
>>> thanks
>>>
>>> Johnny Farraj
>>> johnnyfarraj at yahoo.com

[Attachments scrubbed: image.png (image/png, 16384 bytes), image.png (image/png, 12288 bytes)]

From gwalla at gmail.com Sun Mar 29 15:02:46 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Sun, 29 Mar 2015 13:02:46 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To: <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
Message-ID:

Wouldn't it be easier just to change the example glyphs for U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP and U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT? The ones currently in the charts do not appear to be in common use.
The most common symbol for the quarter tone flat, from what I've gathered, is a reversed flat sign. Some composers use the flat with stroke. One potential complication: AIUI the Arel-Ezgi-Uzdilek system for notating Turkish music, which divides each whole tone into nine koma, uses both, along with a few altered sharps.

On Sunday, March 29, 2015, Michael Everson wrote:

> Johnny,
>
> I’m interested in working with you and Sami on this.
>
> These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music. The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.
>
> I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.
>
> Michael Everson
>
> On 29 Mar 2015, at 05:21, Johnny Farraj wrote:
>
> > Dear unicode list members,
> >
> > I wish to get feedback about a new character submission proposal.
> >
> > Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
> >
> > 266D ♭ MUSIC FLAT SIGN
> > 266F ♯ MUSIC SHARP SIGN
> >
> > while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
> >
> > 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
> > 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
> > 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
> > 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
> > 1D130 𝄰 MUSICAL SYMBOL SHARP UP
> > 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
> > 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
> > 1D133 𝄳
MUSICAL SYMBOL QUARTER TONE FLAT
> >
> > I am proposing the addition of 2 new characters to the Musical Symbols table:
> >
> > - the half-flat sign (lower note by a quarter tone)
> > - the half-sharp sign (raise note by a quarter tone)
> >
> > These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
> >
> > I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
> > I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
> >
> > I welcome any feedback on this proposal.
> >
> > thanks
> >
> > Johnny Farraj
> > johnnyfarraj at yahoo.com

From everson at evertype.com Sun Mar 29 15:50:44 2015
From: everson at evertype.com (Michael Everson)
Date: Sun, 29 Mar 2015 22:50:44 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To: <4458F47E-B607-4AF6-A51D-07BFCA39BCEC@yahoo.com>
References: <4458F47E-B607-4AF6-A51D-07BFCA39BCEC@yahoo.com>
Message-ID:

On 29 Mar 2015, at 19:53, Johnny Farraj wrote:

> Michael,
> Thanks for the swift response, and your interest.
> Your collaboration is greatly appreciated.

My pleasure.

> Do you have any experience in submitting new Unicode character proposals? And/or with creating the reference copy of a symbol in the format required?

A little.
See http://www.evertype.com/formal.html :-)

Michael Everson

From gwalla at gmail.com Sun Mar 29 16:29:49 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Sun, 29 Mar 2015 14:29:49 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References:
Message-ID:

Right, I was just pointing out that Turkish music is a potential complication if changing the glyph for MUSICAL SYMBOL QUARTER TONE FLAT.

Here's how I understand it:
Arabic music - uses the flat-with-stroke exclusively as a quarter tone flat
Western quarter tone music - uses the reversed flat and flat-with-stroke interchangeably as a quarter tone flat, but the reversed flat is more common
Turkish music - uses both the reversed flat and flat-with-stroke contrastively (neither, strictly speaking, as a quarter tone flat since quarter tones do not exist in Turkish music)

On Sun, Mar 29, 2015 at 1:41 PM, sami shumays wrote:
> Just one comment: the reversed flat is not commonly used in Arabic notation, it is primarily a Turkish symbol. The symbols Johnny is proposing are important so that we can have easy access to the symbols appropriate to our music.
>
> Though Arabic and Turkish music systems share many characteristics, they are not one unified system. And Johnny and I are not experts in the Turkish system, although we have familiarity with it, so experts in Turkish music would need to weigh in regarding any additional symbols needed for their music. For Arabic Music notation, the two Johnny proposes would be sufficient.
>
> -Sami
>
> Sent from my Verizon Wireless 4G LTE Smartphone
>
> -------- Original message --------
> From: Garth Wallace
> Date: 03/29/2015 4:02 PM (GMT-05:00)
> To: Michael Everson
> Cc: unicode at unicode.org, Johnny Farraj, Sami Shumays
> Subject: Re: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
>
> Wouldn't it be easier just to change the example glyphs for U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP and U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT? The ones currently in the charts do not appear to be in common use.
>
> The most common symbol for the quarter tone flat, from what I've gathered, is a reversed flat sign. Some composers use the flat with stroke. One potential complication: AIUI the Arel-Ezgi-Uzdilek system for notating Turkish music, which divides each whole tone into nine koma, uses both, along with a few altered sharps.
>
> On Sunday, March 29, 2015, Michael Everson wrote:
>
> Johnny,
>
> I’m interested in working with you and Sami on this.
>
> These two characters are often referred to as quarter sharp and quarter flat as well. The symbols are also widely used outside Arabic music. The western classical tradition from the 20th century on is full of them. They're not obscure symbols really. Musicians with even a moderate interest in contemporary music are aware of them.
>
> I’m travelling in Sweden working on Blissymbols at the moment, but when I get back home on Friday and can consult some of my reference works I’ll get in touch with you. It shouldn’t take long to put something together.
>
> Michael Everson
>
> On 29 Mar 2015, at 05:21, Johnny Farraj wrote:
>
>> Dear unicode list members,
>>
>> I wish to get feedback about a new character submission proposal.
>>
>> Currently the Miscellaneous Symbols table (2600-26FF) includes the following characters:
>>
>> 266D ♭ MUSIC FLAT SIGN
>> 266F ♯
MUSIC SHARP SIGN
>>
>> while the Musical Symbols table (1D100 - 1D1FF) includes the following characters:
>>
>> 1D12A 𝄪 MUSICAL SYMBOL DOUBLE SHARP
>> 1D12B 𝄫 MUSICAL SYMBOL DOUBLE FLAT
>> 1D12C 𝄬 MUSICAL SYMBOL FLAT UP
>> 1D12D 𝄭 MUSICAL SYMBOL FLAT DOWN
>> 1D130 𝄰 MUSICAL SYMBOL SHARP UP
>> 1D131 𝄱 MUSICAL SYMBOL SHARP DOWN
>> 1D132 𝄲 MUSICAL SYMBOL QUARTER TONE SHARP
>> 1D133 𝄳 MUSICAL SYMBOL QUARTER TONE FLAT
>>
>> I am proposing the addition of 2 new characters to the Musical Symbols table:
>>
>> - the half-flat sign (lower note by a quarter tone)
>> - the half-sharp sign (raise note by a quarter tone)
>>
>> These are widely used in Arabic music notation, and they express intervals that are multiples of quarter tones.
>>
>> I am the primary sponsor of this proposal. As far as my credentials, I am the owner of http://maqamworld.com, the most widely used online resource on Arabic music theory, in English.
>> I can also enlist the support of many academics in the ethnomusicology field, who specialize in Arabic music.
>>
>> I welcome any feedback on this proposal.
>>
>> thanks
>>
>> Johnny Farraj
>> johnnyfarraj at yahoo.com

From everson at evertype.com Sun Mar 29 16:39:09 2015
From: everson at evertype.com (Michael Everson)
Date: Sun, 29 Mar 2015 23:39:09 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
Message-ID:

On 29 Mar 2015, at 22:02, Garth Wallace wrote:

> Wouldn't it be easier just to change the example glyphs for U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP and U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT? The ones currently in the charts do not appear to be in common use.
It would be better to let the symbols be used as intended to be drawn. Documenting such widely varied “glyph variation” would not end up serving the user community, I think.

> The most common symbol for the quarter tone flat, from what I've gathered, is a reversed flat sign. Some composers use the flat with stroke. One potential complication: AIUI the Arel-Ezgi-Uzdilek system for notating Turkish music, which divides each whole tone into nine koma, uses both, along with a few altered sharps.

Have you references?

This topic may be a little obscure for the general list; we can take it offline.

Michael Everson * http://www.evertype.com/

From everson at evertype.com Sun Mar 29 16:40:56 2015
From: everson at evertype.com (Michael Everson)
Date: Sun, 29 Mar 2015 23:40:56 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References:
Message-ID: <109CE573-657C-45AE-92F7-5789E1366A72@evertype.com>

On 29 Mar 2015, at 22:41, sami shumays wrote:

> Just one comment: the reversed flat is not commonly used in Arabic notation,

But is it used?

> it is primarily a Turkish symbol. The symbols Johnny is proposing are important so that we can have easy access to the symbols appropriate to our music.
>
> Though Arabic and Turkish music systems share many characteristics, they are not one unified system. And Johnny and I are not experts in the Turkish system, although we have familiarity with it, so experts in Turkish music would need to weigh in regarding any additional symbols needed for their music. For Arabic Music notation, the two Johnny proposes would be sufficient.

This is why we should have detailed discussion about these and related characters.
Michael Everson * http://www.evertype.com/

From everson at evertype.com Sun Mar 29 16:42:15 2015
From: everson at evertype.com (Michael Everson)
Date: Sun, 29 Mar 2015 23:42:15 +0200
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References:
Message-ID:

On 29 Mar 2015, at 23:29, Garth Wallace wrote:

> Right, I was just pointing out that Turkish music is a potential complication if changing the glyph for MUSICAL SYMBOL QUARTER TONE FLAT.
>
> Here's how I understand it:
> Arabic music - uses the flat-with-stroke exclusively as a quarter tone flat
> Western quarter tone music - uses the reversed flat and flat-with-stroke interchangeably as a quarter tone flat, but the reversed flat is more common
> Turkish music - uses both the reversed flat and flat-with-stroke contrastively (neither, strictly speaking, as a quarter tone flat since quarter tones do not exist in Turkish music)

That’s quite some variety. There are also the three-quarter flat and sharp in Western music to consider. I’ll be able to dig into this after I get back to Ireland from Sweden on Friday.

Michael Everson * http://www.evertype.com/

From wl at gnu.org Sun Mar 29 17:07:38 2015
From: wl at gnu.org (Werner LEMBERG)
Date: Mon, 30 Mar 2015 00:07:38 +0200 (CEST)
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References:
Message-ID: <20150330.000738.23342035.wl@gnu.org>

> That’s quite some variety. There are also the three-quarter flat and sharp in Western music to consider. I’ll be able to dig into this after I get back to Ireland from Sweden on Friday.

You should check the Standard Music Font Layout (SmuFL) for details; it also has a freely available font that covers it.
http://www.smufl.org

The recent version of the specification can be found at

http://www.smufl.org/files/smufl-1.12.pdf

Werner

From asmus-inc at ix.netcom.com Sun Mar 29 17:47:59 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Sun, 29 Mar 2015 15:47:59 -0700
Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp
In-Reply-To:
References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com>
Message-ID: <5518811F.9090708@ix.netcom.com>

On 3/29/2015 2:39 PM, Michael Everson wrote:
> On 29 Mar 2015, at 22:02, Garth Wallace wrote:
>
>> Wouldn't it be easier just to change the example glyphs for U+1D132 MUSICAL SYMBOL QUARTER TONE SHARP and U+1D133 MUSICAL SYMBOL QUARTER TONE FLAT? The ones currently in the charts do not appear to be in common use.
> It would be better to let the symbols be used as intended to be drawn. Documenting such widely varied “glyph variation” would not end up serving the user community, I think.

Glyph variations are fine as long as there is
a) never a contrasting usage in the same context
b) a common practice of alternating presentations w/o change of meaning
c) a common understanding that the details of presentation are stylistic

In this case, one or more of these conditions appear not to be met.

A./

>
>> The most common symbol for the quarter tone flat, from what I've gathered, is a reversed flat sign. Some composers use the flat with stroke. One potential complication: AIUI the Arel-Ezgi-Uzdilek system for notating Turkish music, which divides each whole tone into nine koma, uses both, along with a few altered sharps.
> Have you references?
>
> This topic may be a little obscure for the general list; we can take it offline.
> > Michael Everson * http://www.evertype.com/ From asmus-inc at ix.netcom.com Sun Mar 29 17:49:34 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Sun, 29 Mar 2015 15:49:34 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330.000738.23342035.wl@gnu.org> References: <20150330.000738.23342035.wl@gnu.org> Message-ID: <5518817E.5070506@ix.netcom.com> It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. A./ On 3/29/2015 3:07 PM, Werner LEMBERG wrote: >> That's quite some variety. There are also the three-quarter flat and >> sharp in Western music to consider. I'll be able to dig into this >> after I get back to Ireland from Sweden on Friday. > You should check the Standard Music Font Layout (SMuFL) for details; > it also has a freely available font that covers it. > > http://www.smufl.org > > The recent version of the specification can be found at > > http://www.smufl.org/files/smufl-1.12.pdf > > Werner From everson at evertype.com Sun Mar 29 17:51:48 2015 From: everson at evertype.com (Michael Everson) Date: Mon, 30 Mar 2015 00:51:48 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <5518817E.5070506@ix.netcom.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> Message-ID: On 30 Mar 2015, at 00:49, Asmus Freytag (t) wrote: > It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. Yup.
Michael Everson * http://www.evertype.com/ From mark at kli.org Sun Mar 29 18:01:06 2015 From: mark at kli.org (Mark E. Shoulson) Date: Sun, 29 Mar 2015 19:01:06 -0400 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> Message-ID: <55188432.3060602@kli.org> On 03/29/2015 06:51 PM, Michael Everson wrote: > On 30 Mar 2015, at 00:49, Asmus Freytag (t) wrote: > >> It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. > Yup. Just read through (most of) the smufl reference... Yes, I think we've finally found something to do with one of those empty planes that have been lying around... (or at least part of one) ~mark From haberg-1 at telia.com Sun Mar 29 18:14:05 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Mon, 30 Mar 2015 01:14:05 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330.000738.23342035.wl@gnu.org> References: <20150330.000738.23342035.wl@gnu.org> Message-ID: <815CC8B4-7065-40AD-9903-014891F5B438@telia.com> > On 30 Mar 2015, at 00:07, Werner LEMBERG wrote: > You should check the Standard Music Font Layout (SmuFL) for details; > it also has a freely available font that covers it. > > http://www.smufl.org > > The recent version of the specification can be found at > > http://www.smufl.org/files/smufl-1.12.pdf It works with LilyPond , using some OpenLilyPond extensions. 
Also, there is a SMuFL glyph browsing page here: http://www.smufl.org/version/latest/ From haberg-1 at telia.com Sun Mar 29 18:21:58 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Mon, 30 Mar 2015 01:21:58 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: References: <1704421146.575190.1427599299769.JavaMail.yahoo@mail.yahoo.com> <36440F04-1E6E-4EB5-8C8D-29256770B6F5@evertype.com> Message-ID: > On 29 Mar 2015, at 22:02, Garth Wallace wrote: > The most common symbol for the quarter tone flat, from what I've gathered, is a reversed flat sign. Some composers use the flat with stroke. One potential complication: AIUI the Arel-Ezgi-Uzdilek system for notating Turkish music, which divides each whole tone into nine koma, uses both, along with a few altered sharps. Some of the Turkish systems are discussed by Ozan Yarman, "A Comparative Evaluation of Pitch Notations in Turkish Makam Music". From public at khwilliamson.com Mon Mar 30 14:38:54 2015 From: public at khwilliamson.com (Karl Williamson) Date: Mon, 30 Mar 2015 13:38:54 -0600 Subject: Meroitic cursive fractions numerical values In-Reply-To: References: <55170997.70109@khwilliamson.com> Message-ID: <5519A64E.10707@khwilliamson.com> On 03/29/2015 03:41 AM, Andrew West wrote: > On 28 March 2015 at 20:05, Karl Williamson wrote: >> >> Existing software that looks at the numeric values of characters is written >> expecting that rational numbers will have been reduced to their lowest form. > > That seems to be a rather rash statement. I have software (BabelPad) > which parses the numeric values of characters for numeric sorting > purposes, and it parses "6/12" for MEROITIC CURSIVE FRACTION SIX > TWELFTHS as 0.5. Personally I find it hard to imagine how you could > write software that accepts "6/12" as input and is unable to come up > with the answer of a half. The statement is not rash, as it is simply a statement of objective fact.
I am the maintainer of software that fails with beta 8.0 due to this change. And it has nothing to do with not being able to do arithmetic division; your assumption was wrong. The software essentially creates a database of Unicode properties for regular expression pattern matching, so that someone can say /\p{Numeric_Value=0.5}/ and quickly determine if the matched string contains a code point with that characteristic. Because the database is copied as-is to many different computers with different word sizes and different floating point implementations, it can't do the division ahead of time because of the inherent fuzziness of floating point numbers. It solves this the same way Unicode has, by leaving rational numbers in their original precisely specified format. Thus it creates a table for the property-value combination of Numeric_Value and 1/2, taking the UCD value as-is. Prior to beta 8, the UCD came with all fractions already reduced. It would not occur to someone with a mainly mathematical or computer science background that the input data would come otherwise, as the mathematical convention is to specify in irreducible terms, even though this isn't promised by Unicode, so of course there is no code to handle the new case. The code thus creates a second table for the property-value combination of Numeric_Value and 6/12, which causes problems. It's a small matter to add code to reduce the UCD-specified rational numbers, but it's just one more complication to have to deal with along with the many that the UCD already presents, and if there is not a good reason the data for these new characters is specified contrary to mathematical convention, then the data should be changed instead of having to code around it. > > I would say that fractions should not be reduced to their lowest form > in the Unicode data as some people may need to order fractions by > numerator or denominator, and reducing to lowest form could break the > expectations of some software.
Having said that, I note that the > numeric value of one character has been reduced in the Unicode data: > U+2189 VULGAR FRACTION ZERO THIRDS is given the numeric value of "0" > rather than "0/3". So there is some precedent for reducing. > > Andrew From haberg-1 at telia.com Mon Mar 30 15:54:12 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Mon, 30 Mar 2015 22:54:12 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <5518817E.5070506@ix.netcom.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> Message-ID: <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> > On 30 Mar 2015, at 00:49, Asmus Freytag (t) wrote: > > It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. There is a similar issue to that of the math symbols, namely, one might add some which are not actually in use currently, added for future completeness: Persian music notation uses two microtonal accidentals: the lowering koron and the raising sori. http://www.smufl.org/version/latest/range/persianAccidentals/ Then intervals which will result in combinations of these are performed but not currently notated. This would be similar to the standard accidentals: koron-koron, koron-sori, and sori-sori. In addition, these accidentals are not exact quartertones, which means that a lowering of the sori interval is not a koron, and raising a koron interval is not a sori. The reason they are not present in Persian notation is that one usually does not transpose, but it is easy to do that in modern music computer engraving programs, so it might be nice to have them for that reason.
The same problem arises in Arab music notation, but possibly one might use some already present Western microtonal accidentals there. > On 3/29/2015 3:07 PM, Werner LEMBERG wrote: >>> That's quite some variety. There are also the three-quarter flat and >>> sharp in Western music to consider. I'll be able to dig into this >>> after I get back to Ireland from Sweden on Friday. >> You should check the Standard Music Font Layout (SMuFL) for details; >> it also has a freely available font that covers it. >> >> http://www.smufl.org >> >> The recent version of the specification can be found at >> >> http://www.smufl.org/files/smufl-1.12.pdf >> >> Werner >> From johnnyfarraj at yahoo.com Mon Mar 30 16:13:48 2015 From: johnnyfarraj at yahoo.com (Johnny Farraj) Date: Mon, 30 Mar 2015 17:13:48 -0400 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> Message-ID: <0D30E373-6A48-4595-9018-6D3C98341DE2@yahoo.com> >The same problem arises in Arab music notation, Hi Hans, I'm not sure what you mean by that statement? The Arabic half-flat and half-sharp symbols do not mean exact quartertones, but that's understood by Arabic music performers as the exact intonation is then learned by ear. That fact does not make them impractical to use, as they are used extensively in Arabic music notation.
Johnny > On Mar 30, 2015, at 4:54 PM, Hans Aberg wrote: > > The same problem arises in Arab music notation, From haberg-1 at telia.com Mon Mar 30 16:32:34 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Mon, 30 Mar 2015 23:32:34 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <0D30E373-6A48-4595-9018-6D3C98341DE2@yahoo.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> <0D30E373-6A48-4595-9018-6D3C98341DE2@yahoo.com> Message-ID: > On 30 Mar 2015, at 23:13, Johnny Farraj wrote: > > >> The same problem arises in Arab music notation, > > Hi Hans, > > I'm not sure what you mean by that statement? > > The Arabic half-flat and half-sharp symbols do not mean exact quartertones, but that's understood by Arabic music performers as the exact intonation is then learned by ear. That fact does not make them impractical to use, as they are used extensively in Arabic music notation. It has to do with the transposable staff notation system: if, for example, Maqam Sikah as here http://www.maqamworld.com/maqamat/sikah.html is transposed to key E, then one will see an accidental on the F that raises by the same amount as the lowering Arab microtonal accidentals. It is also performed in Sikah Baladi according to my pitch measurements of the examples on your site (the link above), similar to Turkish music, but with larger microtonal offsets.
> Johnny > >> On Mar 30, 2015, at 4:54 PM, Hans Aberg wrote: >> >> The same problem arises in Arab music notation, From doug at ewellic.org Mon Mar 30 17:06:05 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 30 Mar 2015 15:06:05 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp Message-ID: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> Johnny Farraj wrote: > The Arabic half-flat and half-sharp symbols do not mean exact > quartertones, but that's understood by Arabic music performers as the > exact intonation is then learned by ear. That fact does not make them > impractical to use, as they are used extensively in Arabic music > notation. This is true for Western (European) music notation as well. Nothing about the classical written notation indicates which tuning system is to be used -- equal, just, Pythagorean, etc. -- and for many pieces the choice is left to the performer, but the notation is equally valid for all. -- Doug Ewell | http://ewellic.org | Thornton, CO From johnnyfarraj at yahoo.com Mon Mar 30 17:48:40 2015 From: johnnyfarraj at yahoo.com (Johnny Farraj) Date: Mon, 30 Mar 2015 18:48:40 -0400 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> References: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> Message-ID: <108AE5EB-BEFF-46AE-9431-1B7019748BD3@yahoo.com> That's a good point. I was thinking in the confines of equal temperament. Johnny > On Mar 30, 2015, at 6:06 PM, "Doug Ewell" wrote: > > Johnny Farraj wrote: > >> The Arabic half-flat and half-sharp symbols do not mean exact >> quartertones, but that's understood by Arabic music performers as the >> exact intonation is then learned by ear.
That fact does not make them >> impractical to use, as they are used extensively in Arabic music >> notation. > > This is true for Western (European) music notation as well. Nothing > about the classical written notation indicates which tuning system is to > be used -- equal, just, Pythagorean, etc. -- and for many pieces the > choice is left to the performer, but the notation is equally valid for > all. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO From haberg-1 at telia.com Mon Mar 30 18:10:52 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Tue, 31 Mar 2015 01:10:52 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> References: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> Message-ID: > On 31 Mar 2015, at 00:06, Doug Ewell wrote: > > Johnny Farraj wrote: > >> The Arabic half-flat and half-sharp symbols do not mean exact >> quartertones, but that's understood by Arabic music performers as the >> exact intonation is then learned by ear. That fact does not make them >> impractical to use, as they are used extensively in Arabic music >> notation. > > This is true for Western (European) music notation as well. Nothing > about the classical written notation indicates which tuning system is to > be used -- equal, just, Pythagorean, etc. -- and for many pieces the > choice is left to the performer, but the notation is equally valid for > all. This is a good point: In CPP (Common Practice Period) music, for variable pitch instruments doing harmony, the staff system expresses the Pythagorean tuning, which the strings are tuned to, but the harmony rules encourage adapting to 5-limit Just Intonation.
But chord pivoting in some common harmony sequences makes it impossible to play JI without further pitch adaptation - microtonalists call this "comma pumps", as the amount may be a syntonic comma 81/80. For fixed pitch instruments, one has in the past experimented with a plethora of tunings. A method for tuning a piano accurately to E12 was developed first in the beginning of the 20th century according to one source, though the concept dates back to Ancient Greece, and E12-like tunings have been used on lutes for centuries. From haberg-1 at telia.com Mon Mar 30 18:28:09 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Tue, 31 Mar 2015 01:28:09 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <108AE5EB-BEFF-46AE-9431-1B7019748BD3@yahoo.com> References: <20150330150605.665a7a7059d7ee80bb4d670165c8327d.c4792ce755.wbe@email03.secureserver.net> <108AE5EB-BEFF-46AE-9431-1B7019748BD3@yahoo.com> Message-ID: <7B789297-E053-4CEB-A381-87E0BFD3A14A@telia.com> > On 31 Mar 2015, at 00:48, Johnny Farraj wrote: > > That's a good point. I was thinking in the confines of equal temperament. The basis for music from the Middle Ages, from the West down to Persia at least, is the Pythagorean tuning. Then in Western art music, CPP (Common Practice Period), one adds 5-limit Just Intonation, whereas in Turkish, Arab and Persian music, one adds a microtonal interval. The equal temperament E12 is a Pythagorean tuning approximation: if one computes continued fractions of log_2 3/2, to find approximations of the pure 5th 3/2 in equal temperaments, one gets the convergents 7/12, 24/41, 31/53, 179/306, ... Here, the 7/12 is the pure 5th on note 7 in E12. The 31/53, the pure 5th on note 31 in E53, is used in the AEU (Arel-Ezgi-Uzdilek) description of Turkish music. It is very close to the Pythagorean tuning - there is not much point in going higher up in ETs.
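The convergents quoted above can be checked with a few lines of Python. This is only a minimal sketch, not anything taken from the thread: the function name `convergents` is made up for illustration, and it assumes ordinary double-precision floats are accurate enough for the first handful of terms of the expansion of log_2(3/2).

```python
from fractions import Fraction
from math import floor, log2


def convergents(x, count):
    """Return the first `count` continued-fraction convergents of x.

    Hypothetical helper for illustration; relies on float precision,
    so only the first several convergents should be trusted.
    """
    h_prev, h = 1, floor(x)  # numerators h_{-1}, h_0
    k_prev, k = 0, 1         # denominators k_{-1}, k_0
    result = [Fraction(h, k)]
    frac = x - floor(x)
    while len(result) < count and frac > 0:
        x = 1 / frac
        a = floor(x)          # next partial quotient
        frac = x - a
        # Standard recurrence: h_n = a*h_{n-1} + h_{n-2}, same for k.
        h_prev, h = h, a * h + h_prev
        k_prev, k = k, a * k + k_prev
        result.append(Fraction(h, k))
    return result


# Size of the pure fifth 3/2 measured in octaves.
fifth = log2(3 / 2)
for c in convergents(fifth, 8):
    print(c, "-> equal temperament with", c.denominator, "steps per octave")
```

The denominators of the later convergents (12, 41, 53, 306) are the step counts of the equal temperaments mentioned in the message, and each numerator is the step on which the approximated fifth falls.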
>> On Mar 30, 2015, at 6:06 PM, "Doug Ewell" wrote: >> >> Johnny Farraj wrote: >> >>> The Arabic half-flat and half-sharp symbols do not mean exact >>> quartertones, but that's understood by Arabic music performers as the >>> exact intonation is then learned by ear. That fact does not make them >>> impractical to use, as they are used extensively in Arabic music >>> notation. >> >> This is true for Western (European) music notation as well. Nothing >> about the classical written notation indicates which tuning system is to >> be used -- equal, just, Pythagorean, etc. -- and for many pieces the >> choice is left to the performer, but the notation is equally valid for >> all. From asmus-inc at ix.netcom.com Mon Mar 30 22:09:04 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Mon, 30 Mar 2015 20:09:04 -0700 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> Message-ID: <551A0FD0.7050101@ix.netcom.com> On 3/30/2015 1:54 PM, Hans Aberg wrote: >> On 30 Mar 2015, at 00:49, Asmus Freytag (t) wrote: >> >> It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. > There is a similar issue to that of the math symbols, namely, one might add some which are not actually in use currently, That is not what I had intended to imply, rather that the collection of music symbols encoded today might have left out a considerable number of lesser-used (not: never-used) symbols. The emphasis was on making sure that attested symbols don't dribble in; the recommendation would be to make a concerted effort to collect and process them in the largest chunks that are reasonably feasible. 
> added for future completeness: > > Persian music notation uses two microtonal accidentals: the lowering koron and the raising sori. > http://www.smufl.org/version/latest/range/persianAccidentals/ > > Then intervals which will result in combinations of these are performed but not currently notated. This would be similar to the standard accidentals: koron-koron, koron-sori, and sori-sori. > > In addition, these accidentals are not exact quartertones, which means that a lowering of the sori interval is not a koron, and raising a koron interval is not a sori. The reason they are not present in Persian notation is that one usually does not transpose, but it is easy to do that in modern music computer engraving programs, so it might be nice to have them for that reason. I could see additions like that if "modern music computer engraving program" suppliers were making a representation that this would solve a real or closely anticipated problem (and that they and the community of their users would expect to be able to use such symbols). A mere observation that these might "logically" exist, I would find much less compelling. > > The same problem arises in Arab music notation, but possibly one might use some already present Western microtonal accidentals there. Again, if there's a consensus by practitioners that such notation is required, that would be one thing; mere speculation about them being logically necessary doesn't carry the same weight, because we know of many examples where actual practice ended up with illogical notation. A./ > >> On 3/29/2015 3:07 PM, Werner LEMBERG wrote: >>>> That's quite some variety. There are also the three-quarter flat and >>>> sharp in Western music to consider. I'll be able to dig into this >>>> after I get back to Ireland from Sweden on Friday. >>> You should check the Standard Music Font Layout (SMuFL) for details; >>> it also has a freely available font that covers it.
>>> >>> http://www.smufl.org >>> >>> The recent version of the specification can be found at >>> >>> http://www.smufl.org/files/smufl-1.12.pdf >>> >>> Werner >>> > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode From haberg-1 at telia.com Tue Mar 31 03:46:47 2015 From: haberg-1 at telia.com (Hans Aberg) Date: Tue, 31 Mar 2015 10:46:47 +0200 Subject: preliminary proposal: New Unicode characters for Arabic music half-flat and half-sharp In-Reply-To: <551A0FD0.7050101@ix.netcom.com> References: <20150330.000738.23342035.wl@gnu.org> <5518817E.5070506@ix.netcom.com> <166E4AB3-FF22-42F2-A831-0292CB7FCCBF@telia.com> <551A0FD0.7050101@ix.netcom.com> Message-ID: <1617A384-5A92-4ED6-82D7-E745AE7B2543@telia.com> > On 31 Mar 2015, at 05:09, Asmus Freytag (t) wrote: > > On 3/30/2015 1:54 PM, Hans Aberg wrote: >>> On 30 Mar 2015, at 00:49, Asmus Freytag (t) wrote: >>> >>> It would be worth to bring the collection of music symbols up to a more comprehensive set in one go, rather than to do it piecemeal. >> There is a similar issue to that of the math symbols, namely, one might add some which are not actually in use currently, > > That is not what I had intended to imply, rather that the collection of music symbols encoded today might have left out a considerable number of lesser-used (not: never-used) symbols. > > The emphasis was on making sure that attested symbols don't dribble in; the recommendation would be to make a concerted effort to collect and process them in the largest chunks that are reasonably feasible. One might reserve space (code point positions) in unclear cases, that might be filled in in the future. >> added for future completeness: >> >> Persian music notation uses two microtonal accidentals: the lowering koron and the raising sori. 
>> http://www.smufl.org/version/latest/range/persianAccidentals/ >> >> Then intervals which will result in combinations of these are performed but not currently notated. This would be similar to the standard accidentals: koron-koron, koron-sori, and sori-sori. >> >> In addition, these accidentals are not exact quartertones, which means that a lowering of the sori interval is not a koron, and raising a koron interval is not a sori. The reason they are not present in Persian notation is that one usually does not transpose, but it is easy to do that in modern music computer engraving programs, so it might be nice to have them for that reason. > > I could see additions like that if "modern music computer engraving program" suppliers were making a representation that this would solve a real or closely anticipated problem (and that they and the community of their users would expect to be able to use such symbols). > > A mere observation that these might "logically" exist, I would find much less compelling. I have written C++11 code for an arbitrary number of generators. The traditional staff uses two generators, for example the major and minor seconds (traditionally the perfect fifth and octave). The Turkish, Persian and Arab music requires three generators. In Persian music, Hormoz Farhat, "The Dastgah Concept in Persian Music", uses a neutral second for that description. >> The same problem arises in Arab music notation, but possibly one might use some already present Western microtonal accidentals there. > > Again, if there's a consensus by practitioners that such notation is required, that would be one thing; mere speculation about them being logically necessary doesn't carry the same weight, because we know of many examples where actual practice ended up with illogical notation. Right. That's why I cc'ed some experts. Personally, I prefer the Extended Helmholtz-Ellis JI Pitch Notation, which does not have those limitations.
http://www.marcsabat.com/pdfs/notation.pdf From doug at ewellic.org Tue Mar 31 12:30:58 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 31 Mar 2015 10:30:58 -0700 Subject: Meroitic cursive fractions numerical values Message-ID: <20150331103058.665a7a7059d7ee80bb4d670165c8327d.f6b0d19fa7.wbe@email03.secureserver.net> Karl Williamson wrote: > It's a small matter to add code to reduce the UCD-specified rational > numbers, but it's just one more complication to have to deal with > along with the many that the UCD already presents, and if there is not > a good reason the data for these new characters is specified contrary > to mathematical convention, then the data should be changed instead of > having to code around it. UAX #44, Section 5.9.1 says: | For all numeric properties, and for properties such as | Unicode_Radical_Stroke which are constructed from combinations of | numeric values, use loose matching rule UAX44-LM1 when comparing | property values. | | UAX44-LM1. Apply numeric equivalences. | • "01.00" is equivalent to "1". | • "1.666667" in the UCD is a repeating fraction, and equivalent to | "10/6" or "5/3". This strongly suggests that the implementation should be changed, not to match the data, but to match the specification. -- Doug Ewell | http://ewellic.org | Thornton, CO
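For exact values, the numeric equivalence described in UAX44-LM1 can be had without any floating point by keying the property table on a reduced rational. The following is a minimal Python sketch under stated assumptions, not the implementation discussed in the thread: `numeric_key`, `build_table` and the `SAMPLE` data are hypothetical names invented for illustration, and the character names serve only as labels. It does not attempt the "1.666667" repeating-fraction case, which would need a rounding rule that is not shown here.

```python
from collections import defaultdict
from fractions import Fraction


def numeric_key(value: str) -> Fraction:
    """Map a numeric value string such as "6/12", "0.5" or "01.00"
    to an exact, fully reduced rational (hypothetical helper)."""
    return Fraction(value)


def build_table(numeric_values):
    """Group character names by reduced numeric value, so that
    "1/2" and "6/12" share one entry instead of producing two."""
    table = defaultdict(list)
    for name, value in numeric_values.items():
        table[numeric_key(value)].append(name)
    return table


# Illustrative input only; the value strings mimic what a data file
# might contain before and after the change discussed above.
SAMPLE = {
    "VULGAR FRACTION ONE HALF": "1/2",
    "MEROITIC CURSIVE FRACTION SIX TWELFTHS": "6/12",
}

table = build_table(SAMPLE)
# A query written as 0.5, 1/2 or 6/12 reaches the same entry.
print(table[numeric_key("0.5")])
```

Because `Fraction` stores integers, the resulting keys are identical on every platform, which sidesteps the floating point portability concern raised earlier in the thread.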