From mark at kli.org Tue Oct 7 19:23:39 2014
From: mark at kli.org (Mark E. Shoulson)
Date: Tue, 07 Oct 2014 20:23:39 -0400
Subject: And what happened to...
Message-ID: <5434840B.9010809@kli.org>

An HTML attachment was scrubbed...
URL:

From rscook at wenlin.com Tue Oct 7 20:43:13 2014
From: rscook at wenlin.com (Richard Cook)
Date: Tue, 7 Oct 2014 18:43:13 -0700
Subject: Biang,was: And what happened to...
In-Reply-To: <5434840B.9010809@kli.org>
References: <5434840B.9010809@kli.org>
Message-ID: <30A57EDE-B72F-4A4B-83FC-038C2F1F57CC@wenlin.com>

On Oct 7, 2014, at 5:23 PM, Mark E. Shoulson wrote:
>
> The infamous Biang-Biang Noodle

Mark,

You seem to know as much as anyone about biang.

All I can say is, biang is attested in tones 2, 4 and 1, and enshrined
(along with a glyph variant) in Wenlin CDL PUA at U+E999, with 51 or 57
strokes (your stroke count may vary).

Yes, I just happened to remember the code point and trivia.

If you'd like to see the CDL let me know ...

-Richard

From andrewcwest at gmail.com Wed Oct 8 03:09:26 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 8 Oct 2014 09:09:26 +0100
Subject: And what happened to...
In-Reply-To: <5434840B.9010809@kli.org>
References: <5434840B.9010809@kli.org>
Message-ID:

On 8 October 2014 01:23, Mark E. Shoulson wrote:
>
> The other thing I wanted to ask about has, sure enough, disappeared.
> It's the only Han character I'm following. The infamous Biang-Biang
> Noodle character, discussed at
> http://en.wikipedia.org/wiki/Biangbiang_noodles
> The WP page said it was scheduled for Extension E (I know it says
> Extension F now: I changed it), which has already been passed, so I
> looked through the IRG web site and read up on a bunch of discussion
> tracing its fate.

The character was part of Unicode's Urgently Needed Characters (UNC)
submission of 19 characters to IRG in 2013
(http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg40/IRGN1936_UTC_UNC.zip),
but other IRG members had concerns about biang and some other
characters in the submission, and so Unicode's UNC resubmission in 2014
(http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg42/IRGN2005_UTC_UNC.zip)
was reduced to five characters, with all controversial or questioned
characters removed. Those five characters are currently scheduled for
inclusion in Unicode 9.0, but biang remains in limbo.

On the other hand, I am pleased to see that yet another two variants of
the character 邊 biān (for which there are already 21 Ideographic
Variation Sequences defined) are scheduled for encoding at 2DD84 and
2DD85 (http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4637.pdf).

The situation is highly unsatisfactory, but in my opinion the whole CJK
encoding process is highly unsatisfactory.

Andrew

From johannes at bergerhausen.com Wed Oct 8 12:14:10 2014
From: johannes at bergerhausen.com (Johannes Bergerhausen)
Date: Wed, 8 Oct 2014 19:14:10 +0200
Subject: decodeunicode jp
Message-ID:

Dear list,

I am happy to announce the Japanese version of the decodeunicode book:
www.amazon.co.jp/gp/product/4327377368/
at Kenkyusha, publishing house for dictionaries since 1907.

From the original German/English version there are some copies left:
www.amazon.com/Decodeunicode-Siri-Poarangan-Johannes-Bergerhausen/dp/3874398137/

Best regards,
Johannes Bergerhausen, Hochschule Mainz, Germany

From johannes at bergerhausen.com Wed Oct 8 13:46:00 2014
From: johannes at bergerhausen.com (Johannes Bergerhausen)
Date: Wed, 8 Oct 2014 20:46:00 +0200
Subject: Unicode Version 7.0 - Complete Text of the Core Specification Published
In-Reply-To: <543572A1.9040507@unicode.org>
References: <543572A1.9040507@unicode.org>
Message-ID:

? a thousand pages (without code charts).
Very impressive. Congratulations!

From naz at gassiep.com Thu Oct 9 06:41:01 2014
From: naz at gassiep.com (Naz Gassiep)
Date: Thu, 09 Oct 2014 22:41:01 +1100
Subject: Proposals for Arabic honorifics
Message-ID: <5436744D.6030303@gassiep.com>

Hi there,
I was wondering how I can help on a particular issue.

Currently, the Unicode spec has two Arabic honorifics, being U+FDFA and
U+FDFB. There are also miscellaneous other phrases and formal marks.
When authoring documents, I use the two (U+FDFA and U+FDFB) and then
write out the various others that are needed in full manually. This
leads to an inconsistent looking document, and my inner perfectionist
grimaces every time.

I note that there are proposals to add a wider range of Arabic
honorifics as are commonly used. Proposals L2/14-147 and L2/14-152
would add a wide range of honorifics that are used extensively in texts
that contain Arabic names of historical significance. The proposals do
contain some examples, however I could provide more extensive examples
in historical and contemporary texts. I have some experience using
these characters and phrases, and would greatly like to see them
included in the spec.

Is there any way that I can help this process by providing examples of
the real world usage of these honorifics and the frequency with which
they are used? I would like to see these characters included not only
to ease publication of Arabic material, but also to provide consistency
in the way that this class of phrases is handled.

Best regards,
- Naz.

From pdm42 at cam.ac.uk Thu Oct 9 09:21:37 2014
From: pdm42 at cam.ac.uk (P.D. Myers)
Date: Thu, 09 Oct 2014 15:21:37 +0100
Subject: Christian Palestinian Aramaic
Message-ID: <8ad6702159ced5d55403dd837faaa034@cam.ac.uk>

Hello all,

The Unicode manual, p. 384
(http://www.unicode.org/versions/Unicode7.0.0/ch09.pdf) states:

"Christian Palestinian Aramaic. Manuscripts of this dialect employ a
script that is akin to Estrangela. It can be considered a subcategory
of Estrangela."

However, I am working on a CPA font developed for a team who have been
transcribing a CPA palimpsest which has several more joining characters
than are found in Syriac scripts.

Examples: in the palimpsest both waw (ܘ) and hey (ܗ) are double joining
characters, whereas in Serto these letters are only right joining.

1. Is it possible, using OpenType tables in FontLab 5, to produce a
font that renders this behaviour on desktop software? When I script the
tables as standard Syriac features, then these letters do not join in
word processing software (Word, Pages, or Mellel). However, if I script
the tables without using the preset Syriac features, I can get all
letters to join together, but then some mission-critical functions in
word processing software are not available (for example, the
zero-width-joining character does not force a joining ligature; this is
needed to force joining characters next to punctuation marks used for
text-critical purposes). The documentation for FL5 has led me to the
conclusion that this is an insurmountable problem, as the
join-behaviour is fixed by the Unicode standard (in other words, I
can't treat a Unicode character as double joining, unless it is defined
as double joining by the standard). Am I correct in this conclusion?

2. Is there a case to be made here for CPA to be given its own Unicode
block?

Kind regards,
Pete Myers

--
Rev Peter D. Myers
PhD Candidate Cambridge University, Hebrew transcription in Greek script
Faculty of Asian and Middle Eastern Studies, Wolfson College
pdm42 at cam.ac.uk
07930 22 22 17
revpetemyers.com

From roozbeh at unicode.org Thu Oct 9 11:12:26 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Thu, 9 Oct 2014 09:12:26 -0700
Subject: Proposals for Arabic honorifics
In-Reply-To: <5436744D.6030303@gassiep.com>
References: <5436744D.6030303@gassiep.com>
Message-ID:

Hi, I'm the author of L2/14-147. Nice to see you interested in the topic.

It would be great to have more samples of the usage of these
characters. Feel free to send the samples you can find to me or as a
separate proposal document for the UTC, whichever way you prefer.

From roozbeh at unicode.org Thu Oct 9 11:24:50 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Thu, 9 Oct 2014 09:24:50 -0700
Subject: Christian Palestinian Aramaic
In-Reply-To: <8ad6702159ced5d55403dd837faaa034@cam.ac.uk>
References: <8ad6702159ced5d55403dd837faaa034@cam.ac.uk>
Message-ID:

Two things:

1. You should be able to get the behavior you want through 'calt' or
'dlig' OpenType features. You'd need to add more lookups to your font
than a typical Syriac font, but that's to be expected, considering that
you are working with slightly atypical material.

2. Based on the exact situation, we can encode more characters in the
Syriac block or change the properties of existing characters. It'd be
great if you could write a document for the UTC with samples and
explanation, so we can figure it out at the next meeting.

From pdm42 at cam.ac.uk Mon Oct 13 04:23:13 2014
From: pdm42 at cam.ac.uk (P.D. Myers)
Date: Mon, 13 Oct 2014 10:23:13 +0100
Subject: Proposal form and docs problems
Message-ID: <2eb6e4d9493c4df0b878ce0c0aa4626e@cam.ac.uk>

Hello all,

The weblinks for submission form to the UTC
http://std.dkuug.dk/JTC1/SC2/WG2/docs/summaryform.html and for the
principles and procedures for submission
http://std.dkuug.dk/JTC1/SC2/WG2/docs/principles.html are broken (for
me at least).

I found these links on this site:
http://www.unicode.org/pending/proposals.html

Kind regards,
Pete Myers

--
Rev Peter D. Myers
PhD Candidate Cambridge University, Hebrew transcription in Greek script
Faculty of Asian and Middle Eastern Studies, Wolfson College
pdm42 at cam.ac.uk
07930 22 22 17
revpetemyers.com

From shervinafshar at gmail.com Mon Oct 13 09:38:04 2014
From: shervinafshar at gmail.com (Shervin Afshar)
Date: Mon, 13 Oct 2014 07:38:04 -0700
Subject: Proposal form and docs problems
In-Reply-To: <2eb6e4d9493c4df0b878ce0c0aa4626e@cam.ac.uk>
References: <2eb6e4d9493c4df0b878ce0c0aa4626e@cam.ac.uk>
Message-ID:

Seems like `std.dkuug.dk` server is down. So the links are not
necessarily dead links.

Shervin

From jf at colson.eu Mon Oct 13 16:23:22 2014
From: jf at colson.eu (Jean-François Colson)
Date: Mon, 13 Oct 2014 23:23:22 +0200
Subject: Bliss?
Message-ID: <543C42CA.20009@colson.eu>

Hello

I've found a 16-year-old proposal for Blissymbolics
( http://www.evertype.com/standards/iso10646/pdf/bliss.pdf ) but
nothing more recent.

Was that script rejected?
Was it forgotten?
Are there any technical difficulties related to that proposal?

Thx

Jean-François Colson

From markus.icu at gmail.com Mon Oct 13 16:46:26 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 13 Oct 2014 14:46:26 -0700
Subject: Bliss?
In-Reply-To: <543C42CA.20009@colson.eu>
References: <543C42CA.20009@colson.eu>
Message-ID:

On Mon, Oct 13, 2014 at 2:23 PM, Jean-François Colson wrote:
> I've found a 16-year-old proposal for Blissymbolics (
> http://www.evertype.com/standards/iso10646/pdf/bliss.pdf ) but nothing more
> recent. Was that script rejected? Was it forgotten? Are there any technical
> difficulties related to that proposal?

http://www.unicode.org/pending/pending.html#initial_and_exploratory

Proposals in Initial and Exploratory Stage

The scripts in this stage have had preliminary proposals for encoding
submitted to the Unicode Technical Committee and/or ISO/IEC
JTC1/SC2/WG2, but these proposals are not yet complete, and further
information is required in order to evaluate them, so that they may
progress toward encoding. They may not yet have undergone technical
review, either for lack of relevant expertise or simply because the
material itself is exploratory in nature.

Review Input Requested: For these proposals, the UTC is seeking expert
feedback to assist in completing the proposals to the level where a
well-formed encoding can be technically evaluated, and where there can
be reasonable assurance that at least the basic repertoire is presented
concisely and completely in a manner consistent with the encoding
practices of the committees. Expert reviewers of these scripts may be
able to work with the proposers by contacting the Script Encoding
Initiative.

...
Blissymbolics
...

markus

From everson at evertype.com Mon Oct 13 17:22:22 2014
From: everson at evertype.com (Michael Everson)
Date: Mon, 13 Oct 2014 23:22:22 +0100
Subject: Bliss?
In-Reply-To:
References: <543C42CA.20009@colson.eu>
Message-ID: <93E71F7B-B78A-4382-B6A5-DB37A21ABDAB@evertype.com>

Marcus, that was ill-informed.
No reason to give to Jean-François a generic FAQ entry.

Better to describe the UTC discussion about the script. Their chief
concern at that time was that the user community were still creating
new characters. The user community has been working on this in the
intervening years, and Bliss is much closer to maturity for encoding.

Michael Everson * http://www.evertype.com/

From jf at colson.eu Mon Oct 13 19:47:52 2014
From: jf at colson.eu (Jean-François Colson)
Date: Tue, 14 Oct 2014 02:47:52 +0200
Subject: Bliss?
In-Reply-To: <93E71F7B-B78A-4382-B6A5-DB37A21ABDAB@evertype.com>
References: <543C42CA.20009@colson.eu> <93E71F7B-B78A-4382-B6A5-DB37A21ABDAB@evertype.com>
Message-ID: <543C72B8.5040004@colson.eu>

On 14/10/14 00:22, Michael Everson wrote:
> Marcus, that was ill-informed. No reason to give to Jean-François a
> generic FAQ entry.
>
> Better to describe the UTC discussion about the script. Their chief
> concern at that time was that the user community were still creating
> new characters. The user community has been working on this in the
> intervening years, and Bliss is much closer to maturity for encoding.
>
> Michael Everson * http://www.evertype.com/
>

The script wasn't mature enough. Does that mean that the shape of many
characters changed in the following years?

You say their chief concern was that the user community were still
creating new characters. But isn't that inherent to most living
ideographic scripts? Aren't there new characters that appear here and
there in Chinese/Japanese/Korean?

From markus.icu at gmail.com Mon Oct 13 20:32:30 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 13 Oct 2014 18:32:30 -0700
Subject: Bliss?
In-Reply-To: <543C72B8.5040004@colson.eu>
References: <543C42CA.20009@colson.eu> <93E71F7B-B78A-4382-B6A5-DB37A21ABDAB@evertype.com> <543C72B8.5040004@colson.eu>
Message-ID:

As Michael said, I don't have information.
But I found this which might help:
http://en.wikipedia.org/wiki/Blissymbols#Towards_the_international_standardization_of_the_script

markus

From eliz at gnu.org Tue Oct 14 06:48:56 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Tue, 14 Oct 2014 14:48:56 +0300
Subject: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
Message-ID: <83tx366fdj.fsf@gnu.org>

Hi,

One of the test cases in BidiCharacterTest.txt seems to me to
contradict the description of the rules N0 through N2 of the UBA. Or
maybe I'm missing something. Here are the details.

The test case in question, on line 114 of BidiCharacterTest.txt, is as
follows:

0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029;1;1;2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1;16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

The first field, up to the 1st semicolon, is the sequence of characters
given by their Unicode codepoints, in the logical order. Translated
into readable text, it looks like this:

  a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17

where I inserted blanks between every 2 characters, for better
readability, and added position numbers.

The next field of the test case data, whose value is 1, specifies that
the paragraph direction is RTL, i.e. the embedding level is 1.

Let me now present the application of N0 through N2, as I understand
them, to this text. (Since there are no explicit directional codes
here, and no weak characters, we can skip all the rules before N0.)

The results of identifying bracket pairs, per BD16, sorted by the
position of the opening bracket, are as follows:

  2 and 17
  3 and 9
  7 and 8
  11 and 15
  13 and 14

Applying N0, we see that:

  . The pair 2-17 encloses 'א', which matches the embedding direction,
    so N0b instructs to resolve this pair as matching the embedding
    direction, i.e. R.

  . The pair 3-9 encloses 'b', whose direction is opposite to the
    embedding direction, and has 'a' before the opening bracket, so
    N0c1 says we should resolve this pair as L, the direction opposite
    to the embedding one.

  . The pair 7-8 encloses no strong characters, so it is left as is.

  . The pair 11-15 encloses 'c' and is preceded by 'b', so N0c1 again
    says to resolve this pair as L.

  . The pair 13-14 encloses no strong characters, so is left alone.

Therefore, the result after N0 is this:

  a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
  L  R  L  N  L  N  N  N  L  N  L  L  N  N  L  R  R

Applying N1, we then obtain the following result:

  a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
  L  R  L  L  L  L  L  L  L  L  L  L  L  L  L  R  R

There are no neutrals left, so N2 doesn't need to be applied.

Now I2 gives the following resolved levels:

  a  (  (  {  b  ⚀  [  ]  )  }  [  c  [  ]  ]  א  )
  2  1  2  2  2  2  2  2  2  2  2  2  2  2  2  1  1

However, BidiCharacterTest.txt gives a different sequence of resolved
levels:

  2  1  1  1  2  1  1  1  1  1  1  2  1  1  1  1  1

Could someone please point out what am I missing or doing incorrectly?

Thanks in advance.

From doug at ewellic.org Tue Oct 14 11:06:50 2014
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 14 Oct 2014 09:06:50 -0700
Subject: Bliss?
Message-ID: <20141014090650.665a7a7059d7ee80bb4d670165c8327d.ce071a2522.wbe@email03.secureserver.net>

Markus Scherer wrote:

> As Michael said, I don't have information. But I found this which
> might help:
> http://en.wikipedia.org/wiki/Blissymbols#Towards_the_international_standardization_of_the_script

Statements in the linked article such as the following (not written by
Markus) always trouble me:

"The proposed encoding does not use the lexical encoding model used in
the existing ISO-IR/169 registered character set, but instead applies
the Unicode and ISO character-glyph model to the Bliss-character model
already adopted by BCI, since this would significantly reduce the
number of needed characters."

since my understanding has always been that the reasons behind the
character-glyph model go much deeper than reducing the number of
encoded characters.

--
Doug Ewell | Thornton, CO, USA | http://ewellic.org

From andrewcwest at gmail.com Tue Oct 14 11:59:58 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Tue, 14 Oct 2014 17:59:58 +0100
Subject: Bliss?
In-Reply-To: <20141014090650.665a7a7059d7ee80bb4d670165c8327d.ce071a2522.wbe@email03.secureserver.net>
References: <20141014090650.665a7a7059d7ee80bb4d670165c8327d.ce071a2522.wbe@email03.secureserver.net>
Message-ID:

On 14 October 2014 17:06, Doug Ewell wrote:
>
> Statements in the linked article such as the following (not written by
> Markus) always trouble me:

Gosh, I wonder who it could have been?

https://en.wikipedia.org/w/index.php?title=Blissymbols&diff=331226727&oldid=331223779

Andrew

From everson at evertype.com Tue Oct 14 12:26:54 2014
From: everson at evertype.com (Michael Everson)
Date: Tue, 14 Oct 2014 18:26:54 +0100
Subject: Bliss?
In-Reply-To:
References: <20141014090650.665a7a7059d7ee80bb4d670165c8327d.ce071a2522.wbe@email03.secureserver.net>
Message-ID: <349F01E4-E1B5-4325-BEF7-2DB70E856526@evertype.com>

On 14 Oct 2014, at 17:59, Andrew West wrote:

> On 14 October 2014 17:06, Doug Ewell wrote:
>>
>> Statements in the linked article such as the following (not written by
>> Markus) always trouble me:
>
> Gosh, I wonder who it could have been?
>
> https://en.wikipedia.org/w/index.php?title=Blissymbols&diff=331226727&oldid=331223779

Oof.

Folks, I'm a member of the BC-UK committee and have been working with
BCI for years to ready Bliss for encoding. Work proceeds apace.

Michael Everson * http://www.evertype.com/

From eliz at gnu.org Tue Oct 14 14:56:42 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Tue, 14 Oct 2014 22:56:42 +0300
Subject: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
In-Reply-To: <106c2afc4aaa45d1bb42006f88d3e0ee@BN1PR03MB139.namprd03.prod.outlook.com>
References: <83tx366fdj.fsf@gnu.org> <106c2afc4aaa45d1bb42006f88d3e0ee@BN1PR03MB139.namprd03.prod.outlook.com>
Message-ID: <83mw8y5ssl.fsf@gnu.org>

> From: "Andrew Glass (WINDOWS)"
> Date: Tue, 14 Oct 2014 18:07:24 +0000
>
> The difference is that N0 is applied per bracket pair and the result of the
> resolution of one bracket pair may impact the resolution of other bracket pairs
> in the same isolating run sequence. So in your example:
>
> * 2-17 is resolved to R as you say.
>
> * Since 2-17 is now R and not neutral, the resolution of 3-9 is R because the
> check for context finds the opening parenthesis at 2 (now R) before the a at 1.
> Therefore 2-17 is R under N0c2.

But there's nothing about this in the UAX#9 language! How did you
arrive at this dependency, using just what the UBA says?

> The proposed update attempts to make this clearer in the intro to 3.3.5:
>
> http://www.unicode.org/reports/tr9/tr9-32.html#N0
>
> Note that this rule is applied based on the current bidirectional character
> type of each paired bracket and not the original type, as this could have
> changed under X6.
>
> Perhaps this should be emended to include that N0 can also update the type for
> subsequent tests under N0, which is the case here.

There's a big difference between X6 and N0. X6 is about the explicit
override, and is applied before N0. Your interpretation makes N0 a
recursive rule, something that is not even hinted at by the UBA spec.

> Currently N0 states:
>
> N0. Process bracket pairs in an isolating run sequence sequentially in the
> logical order of the text positions of the opening paired brackets using the
> logic given below.
>
> Example 1 illustrates a similar case in that the neutral ! resolves to R
> because of the bracket resolution to R rather than the context between two Ls.
> This of course takes place in N1 and not N0 as in the example you ask about.

Of course! And so Example 1 is very different from what we are
discussing, because each stage of the algorithm is applied to the
results of the previous stage. But there's no other place, AFAICS,
where the same stage is applied recursively. So I really don't see
how this interpretation could be gleaned from the UBA description.

Thanks for explaining, but it is really frustrating to find out about
these untold subtleties at this late stage. (And yes, I've read the
proposed changes in tr9-32.html, and not even they say anything about
this.) How can we be sure that your interpretation is indeed correct,
if it is not even hinted anywhere?

From ken.whistler at sap.com Tue Oct 14 17:14:02 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Tue, 14 Oct 2014 22:14:02 +0000
Subject: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
In-Reply-To: <83mw8y5ssl.fsf@gnu.org>
References: <83tx366fdj.fsf@gnu.org> <106c2afc4aaa45d1bb42006f88d3e0ee@BN1PR03MB139.namprd03.prod.outlook.com> <83mw8y5ssl.fsf@gnu.org>
Message-ID:

Eli asked in response to Andrew:

> > * Since 2-17 is now R and not neutral, the resolution of 3-9 is R because the
> > check for context finds the opening parenthesis at 2 (now R) before the a
> > at 1.
> > Therefore 2-17 is R under N0c2.
>
> But there's nothing about this in the UAX#9 language! How did you
> arrive at this dependency, using just what the UBA says?

See below.

> > Perhaps this should be emended to include that N0 can also update the
> > type for
> > subsequent tests under N0, which is the case here.
>
> There's a big difference between X6 and N0. X6 is about the explicit
> override, and is applied before N0. Your interpretation makes N0 a
> recursive rule, something that is not even hinted at by the UBA spec.

I disagree that this makes N0 a "recursive" rule. It is a rule with
repeatedly applicable subparts. And like nearly all the rules in the
UBA (except ones which explicitly state that they apply to *original*
Bidi_Class values, which thus have to be stored across the life of the
processing of the string in question), all rules apply to the *current*
Bidi_Class values of the examined context. In this sense, the UBA, for
most rules, operates as a set of "change and forget" steps.

Thus in the case of N0, if you are processing a sequential list of
bracket pairs, you just process each pair, one at a time, and it sees
as its input whatever the *current* state is -- which may be (and often
is) changed by the last step. What you do *not* need to do for N0 is
preserve the starting state when N0 was initiated, and independently
check each bracket pair against *that* array of Bidi_Class values while
you are busy setting them to new values.

>
> Of course! And so Example 1 is very different from what we are
> discussing, because each stage of the algorithm is applied to the
> results of the previous stage. But there's no other place, AFAICS,
> where the same stage is applied recursively. So I really don't see
> how this interpretation could be gleaned from the UBA description.

I agree that this could (and should) be made more explicit, as it is
apparent that people can run into problems of interpretation here.

An examination of the functioning of the N0 rule in the bidi reference
implementations could, however, also be used to help explain what is
intended here. For example, in the particular test case in question,
the bidiref C implementation can have its debug diagnostics cranked up,
and you find:

Trace: Entering br_UBA_ResolveEN [W7]
Current State: 13
  Text:        0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029
  Bidi_Class:     L   ON   ON   ON    L   ON   ON   ON   ON   ON   ON    L   ON   ON   ON    R   ON
  Levels:         1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  Runs:        ?
Trace: Exiting br_SortPairList
Pair list: {1,16} {2,8} {6,7} {10,14} {12,13}
Debug: Strong direction e between brackets
Debug: Strong direction o between brackets
Debug: No strong direction between brackets
Debug: Strong direction o between brackets
Debug: No strong direction between brackets
Current State: 14
  Text:        0061 0028 0028 007B 0062 2680 005B 005D 0029 007D 005B 0063 005B 005D 005D 05D0 0029
  Bidi_Class:     L    R    R   ON    L   ON   ON   ON    R   ON    R    L   ON   ON    R    R    R
  Levels:         1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1
  Runs:

Which is the clue needed to track down how the interpretation based on
comparing Bidi_Class values retained from the initiation of rule N0 is
incorrect.

--Ken

>
> Thanks for explaining, but it is really frustrating to find out about
> these untold subtleties at this late stage. (And yes, I've read the
> proposed changes in tr9-32.html, and not even they say anything about
> this.) How can we be sure that your interpretation is indeed correct,
> if it is not even hinted anywhere?
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eliz at gnu.org Wed Oct 15 00:36:36 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 15 Oct 2014 08:36:36 +0300
Subject: Bidi Parenthesis Algorithm and BidiCharacterTest.txt
In-Reply-To:
References: <83tx366fdj.fsf@gnu.org> <106c2afc4aaa45d1bb42006f88d3e0ee@BN1PR03MB139.namprd03.prod.outlook.com> <83mw8y5ssl.fsf@gnu.org>
Message-ID: <83iojl6gij.fsf@gnu.org>

> From: "Whistler, Ken"
> Date: Tue, 14 Oct 2014 22:14:02 +0000
> Cc: "Whistler, Ken" ,
> "unicode at unicode.org"
>
> I disagree that this makes N0 a "recursive" rule. It is a rule with repeatedly
> applicable subparts. And like nearly all the rules in the UBA (except ones
> which explicitly state that they apply to *original* Bidi_Class values,
> which thus have to be stored across the life of the processing of
> the string in question), all rules apply to the *current* Bidi_Class
> values of the examined context.
Can you point out where this is stated in the UBA? According to my reading of the UBA, only W7 could qualify as something similar to the "recursive" interpretation of N0. All the other rules are either defined in a way that the "recursion" cannot happen (because the conditions for applying the rule disappear after it is applied once), or explicitly speak about a sequence of similar characters whose bidi types are modified in the same manner. > Trace: Exiting br_SortPairList > Pair list: {1,16} {2,8} {6,7} {10,14} {12,13} > Debug: Strong direction e between brackets > Debug: Strong direction o between brackets > Debug: No strong direction between brackets > Debug: Strong direction o between brackets > Debug: No strong direction between brackets This doesn't explain _why_ the decision was that the direction between brackets was one or the other. Which is at the core of the issue at hand. So this debugging output doesn't really help here. In any case, when designing an implementation, one normally expects to read some formal requirements, not learn those requirements from another implementation. Anyway, I'm glad we all agree that, once again, the new additions to the UBA, and the BPA-related ones in particular, are not described well enough to avoid misinterpretations and misunderstanding such as this one, and that the language should be improved and clarified, hopefully sooner rather than later. I've just lost 20 hours of work due to that. From ken.whistler at sap.com Wed Oct 15 13:55:12 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 15 Oct 2014 18:55:12 +0000 Subject: Bidi Parenthesis Algorithm and BidiCharacterTest.txt In-Reply-To: <83iojl6gij.fsf@gnu.org> References: <83tx366fdj.fsf@gnu.org> <106c2afc4aaa45d1bb42006f88d3e0ee@BN1PR03MB139.namprd03.prod.outlook.com> <83mw8y5ssl.fsf@gnu.org> <83iojl6gij.fsf@gnu.org> Message-ID: > > I disagree that this makes N0 a "recursive" rule. It is a rule with repeatedly > > applicable subparts. 
And like nearly all the rules in the UBA (except ones > > which explicitly state that they apply to *original* Bidi_Class values, > > which thus have to be stored across the life of the processing of > > the string in question), all rules apply to the *current* Bidi_Class > > values of the examined context. > > Can you point out where this is stated in the UBA? It isn't explicitly stated, as I think is evident from folks agreeing that further clarification of the text would be helpful. And I am not an author/editor of UAX #9 -- the complaints about lack of clarity need to go to them, preferably through explicit text improvement suggestions provided as feedback on PRI #279: http://www.unicode.org/review/pri279/ And I see that Laurentiu Iancu has already helpfully summarized this text issue and already provided some explicit feedback there -- so I think the upcoming UTC is covered for that. I did, however, write a reference implementation for the UBA, including the bidi bracket pairing for UBA 6.3 -- and reading the UBA spec to do so (without looking at anybody else's implementation), I came to the same conclusion regarding the processing of Bidi_Class values in rule N0 as the other author of a reference implementation did. > > According to my reading of the UBA, only W7 could qualify as something > similar to the "recursive" interpretation of N0. All the other rules > are either defined in a way that the "recursion" cannot happen > (because the conditions for applying the rule disappear after it is > applied once), or explicitly speak about a sequence of similar > characters whose bidi types are modified in the same manner. 
> > > Trace: Exiting br_SortPairList > > Pair list: {1,16} {2,8} {6,7} {10,14} {12,13} > > Debug: Strong direction e between brackets > > Debug: Strong direction o between brackets > > Debug: No strong direction between brackets > > Debug: Strong direction o between brackets > > Debug: No strong direction between brackets > > This doesn't explain _why_ the decision was that the direction between > brackets was one or the other. Which is at the core of the issue at > hand. So this debugging output doesn't really help here. Well, as author of the code that produced that debug output, I can agree that the debug output doesn't explain *why* it made the decisions it did -- I didn't think it was necessary. But I can see that this is a confusing part of the algorithm, and it would be fairly simple to further enhance the debugging output the reference implementation provides in future revisions, so I will endeavor to do so. In the meantime, however, the source code for that reference implementation is posted and is easily available. The relevant part of the rule processing we are talking about can be found in brrule.c. http://www.unicode.org/Public/PROGRAMS/BidiReferenceC/6.3.0/source/brrule.c The function br_ResolvePairEmbeddingLevels() (line 4368), reads down the sorted pair list, and processes each pair in sequence, calling the function br_ResolveOnePair() (line 4235). And if you examine the code in br_ResolveOnePair, you can see that it simply searches the string between the bracket pair for a *current* Bidi_Class value that counts as a strong value, and then, if necessary, searches back to find a *current* Bidi_Class value to the left of the left bracket pair that counts as a strong value. And depending on the results of those searches, it then resolves the Bidi_Class of the bracket pair itself. 
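As an illustration only (this is not the bidiref code), the pair-by-pair processing described above can be sketched in Python. The names resolve_pairs and strong_of, the use_current switch, and the sos parameter are hypothetical. The sketch assumes a single isolating run sequence and an already computed pair list, and it covers only the bracket-resolution part of N0. The input is the test case from the trace earlier in this thread.

```python
def strong_of(bidi_class):
    """Map a Bidi_Class to 'L', 'R', or None. N0 treats EN and AN as R."""
    if bidi_class == "L":
        return "L"
    if bidi_class in ("R", "EN", "AN"):
        return "R"
    return None


def resolve_pairs(classes, pairs, embedding_dir, sos, use_current=True):
    """Resolve bracket pairs one at a time, in order of the opening bracket.

    use_current=True  : every lookup sees changes made by earlier pairs.
    use_current=False : lookups use a frozen copy of the starting state
                        (the reading that the trace shows to be incorrect).
    """
    result = list(classes)
    lookup = result if use_current else list(classes)
    opposite = "L" if embedding_dir == "R" else "R"
    for opener, closer in sorted(pairs):
        inside = {strong_of(lookup[i]) for i in range(opener + 1, closer)}
        if embedding_dir in inside:
            new_class = embedding_dir
        elif opposite in inside:
            # Search backwards for the first strong type; fall back to sos.
            preceding = sos
            for i in range(opener - 1, -1, -1):
                strong = strong_of(lookup[i])
                if strong is not None:
                    preceding = strong
                    break
            new_class = opposite if preceding == opposite else embedding_dir
        else:
            continue  # no strong type inside: leave the pair unchanged
        result[opener] = result[closer] = new_class
    return result


# Data copied from "Current State: 13" in the trace (all levels are 1, so e = R).
classes = "L ON ON ON L ON ON ON ON ON ON L ON ON ON R ON".split()
pairs = [(1, 16), (2, 8), (6, 7), (10, 14), (12, 13)]

updated = resolve_pairs(classes, pairs, embedding_dir="R", sos="R")
print(" ".join(updated))
# Matches "Current State: 14" in the trace:
# L R R ON L ON ON ON R ON R L ON ON R R R
```

With use_current=False the pairs {2,8} and {10,14} come out as L instead of R, because the backward search then still sees the opening bracket at position 1 and the closing bracket at position 8 as ON and runs on until it reaches an L.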
Following that logic, then, it is pretty clear that the behavior of each successive call to br_ResolveOnePair could, in principle, depend on checking a Bidi_Class value that had been changed by a prior call to br_ResolveOnePair. So yes, multiple, successive calls that depend on the results of the prior rule subpart.

>
> In any case, when designing an implementation, one normally expects to
> read some formal requirements, not learn those requirements from
> another implementation.

The UBA has *always* been a difficult algorithm to write a clear and complete specification for. One of the reasons why several of us, over the years, have done the work to write *reference* implementations for it is so that examination of the exact behavior of those coded implementations for the various odd edge cases of the algorithm can be referred to in instances (just like this one) where the implications of a specific attempt to write out the specification of a rule in English in UAX #9 might still have some lingering ambiguity in it. And in turn, the experience of the writers of implementations has often come back around and led to suggestions to improve the wording of the specification, precisely because questions arose as to what the correct choices were for edge case behavior.

>
> Anyway, I'm glad we all agree that, once again, the new additions to
> the UBA, and the BPA-related ones in particular, are not described
> well enough to avoid misinterpretations and misunderstanding such as
> this one, and that the language should be improved and clarified,
> hopefully sooner rather than later. I've just lost 20 hours of work
> due to that.

I'm sure that the editors of UAX #9 will take this on board, and let's hope that some clearer wording can be agreed on.
--Ken From cewcathar at hotmail.com Sun Oct 19 13:32:22 2014 From: cewcathar at hotmail.com (CE Whitehead) Date: Sun, 19 Oct 2014 14:32:22 -0400 Subject: Proposed Update UAX #9, Unicode Bidirectional Algorithm Message-ID: Here are my final comments (which I've also submitted to the feedback page) on TR9, http://www.unicode.org/reports/tr9/tr9-32.html#BD11 (3.1.2), as well as on sections 3.3 and 4.3. These are mostly grammar/proofreading nits, but the one on 4.3 is important to fix. Also I made an error in my previous comments (September 30) on 3.1.3, on the algorithm for BDI 16 -- the original text is correct: "If the current stack element is at the bottom of the stack, and the values match, meaning the two characters form a bracket pair, then Append the text position in the current stack element together with the text position of the closing paired bracket to the list. Pop the stack through the current stack element inclusively. Else, if the current stack element is not at the bottom of the stack, advance it to the next element deeper in the stack and go back to step 2." {COMMENT: leave as is; my error} The other proofreading comment I made on September 30 should remain. * * * 3.1.2 BD11 algorithm "Initialize a counter to one. Scan the text following the embedding initiator: At an isolate initiator, skip past the matching PDI, or if there is no matching PDI, to the end of the paragraph. At the end of a paragraph, or at a PDI that matches an isolate initiator before the embedding initiator, stop: the embedding initiator has no matching PDF. At an embedding initiator, increment the counter. At a PDF, decrement the counter. If its new value is zero, stop: this is the matching PDF." {COMMENT: a nitpick: in the second bullet you say "at a PDI that matches an isolate initiator before the embedding initiator" -- this use of "before" is confusing to me; you don't mean that you reach the pdi before reaching the embedding initiator. 
This can't be the case as you are scanning the text following the embedding initiator; to me the wording is not right; I would change it to: "that matches an isolating initiator that occurred outside/before/prior to the embedding initiator"}

=> "Initialize a counter to one. Scan the text following the embedding initiator: At an isolate initiator, skip past the matching PDI, or if there is no matching PDI, to the end of the paragraph. At the end of a paragraph, or at a PDI that matches an isolate initiator that occurred prior to the embedding initiator, stop: the embedding initiator has no matching PDF. At an embedding initiator, increment the counter. At a PDF, decrement the counter. If its new value is zero, stop: this is the matching PDF."

* * *

3.3.2 "Explicit Embeddings", Rule X2, 1st par, last bullet

"With each RLE, perform the following steps: Otherwise, this is an overflow RLE. If the overflow isolate count is zero, increment the overflow embedding count by one. Leave all other variables unchanged."

{COMMENT: INSERT HERE FOR CLARITY => "Otherwise this overflow RLE is within the scope of an overflow isolate initiator, so do nothing."}

* * *

Rule X3, first par, last bullet

"Otherwise, this is an overflow LRE. If the overflow isolate count is zero, increment the overflow embedding count by one. Leave all other variables unchanged."

{COMMENT: INSERT HERE FOR CLARITY => "Otherwise this overflow LRE is within the scope of an overflow isolate initiator, so do nothing."}

{QUESTION: So the embeddings that are done in an overflow isolate are only terminated by the overflow isolate terminator, I gather? No need to reply but my correction only makes sense if this is true.}

* * *

3.3.2 "Explicit Levels and Directions", "Terminating Isolates", X6A, third bullet, then 2nd sub-bullet:

"While the directional isolate status of the last entry on the stack is false, pop the last entry from the directional status stack.
(This terminates the scope of those valid embedding initiators within the scope of the matched isolate initiator whose scopes have not been terminated by a matching PDF, and which thus lack a matching PDF. Given that the valid isolate count is non-zero, the directional status stack must contain an entry with directional isolate status true before this step, and thus after this step the last entry on the stack will indeed have a true directional isolate status, i.e. represent the scope of the matched isolate initiator. This cannot be the stack's first entry, which always belongs to the paragraph level and has a false directional status, so there is at least one more entry before it on the stack.)" {COMMENT: again, the use of "before"and "after" is confusing; the entry that the "directional isolate status" set to "true" was PLACED before this step but I would not say that "the stack contains it before this step"; to me that is sort of comparing "apples and oranges" -- comparing a directional isolate status entry to a step; but this may be nitpicking but I found this tough to read} => "While the directional isolate status of the last entry on the stack is false, pop the last entry from the directional status stack. (This terminates the scope of those valid embedding initiators within the scope of the matched isolate initiator whose scopes have not been terminated by a matching PDF, and which thus lack a matching PDF. Given that the valid isolate count is non-zero, the directional status stack must contain an entry with directional isolate status true; [this entry must have been placed prior the PDI], and thus, once all false entries are popped, the last entry on the stack will indeed have a true directional isolate status, i.e. represent the scope of the matched isolate initiator. 
This cannot be the stack's first entry, which always belongs to the paragraph level and has a false directional status, so there is at least one more entry before it on the stack.)" * * * 3.3.5 "Resolving Neutral and Isolate Formatting Types", N0, 2nd bullet, section c "Otherwise, if there is a strong type it must be opposite the embedding direction. Therefore, test for an established context with a preceding strong type by checking backwards before the opening paired bracket until the first strong type (L, R, or sos) is found." {COMMENT: would it be better to say, "by checking backwards within the isolating run in which the bracket pair occurs"? You do mean to check just within the current isolating run, I believe. is this correct?} => ? "Otherwise, if there is a strong type it must be opposite the embedding direction. Therefore, test for an established context with a preceding strong type by checking backwards from the opening paired bracket until the first strong type (L, R, or sos) is found. If there is no strong type within the isolating run sequence where the bracket pair occurs, then set the bracket pair to the embedding direction." * * * X6A, last bullet, last sub-bullet "If the entry's directional override status is not neutral, reset the current character type from PDI to L if the override status is left-to-right, and to R if the override status is right-to-left." {Just nitpicking; it's usually clearer to start an "if-then" clause with "if" than it is to start it with "then" but you can ignore this suggestion} =>? "If the entry's directional override status is not neutral, then, if the override status is left-to-right, reset the current character type from PDI to L; set it to R if the override status is right-to-left." * * * There is one typo you do need to fix: 4.3 "Higher-level Protocols" "Certain characters that do not have the Bidi_Mirrored property can also be depicted by a mirrored glyph in specialized contexts. 
Such contexts include, but are not limited to, historic scripts and associated punctuation, private-use characters, and characters in mathematical expressions. (See Section 6, Mirroring.) These characters are those that fit at least one of the following conditions:"

{COMMENT: you mean "section 7", which is what the link goes to.}

=> "Certain characters that do not have the Bidi_Mirrored property can also be depicted by a mirrored glyph in specialized contexts. Such contexts include, but are not limited to, historic scripts and associated punctuation, private-use characters, and characters in mathematical expressions. (See Section 7, Mirroring.) These characters are those that fit at least one of the following conditions:"

* * *

* * *

Best,

--
C. E. Whitehead
cewcathar at hotmail.com

From mark at macchiato.com Wed Oct 22 02:27:40 2014
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 22 Oct 2014 09:27:40 +0200
Subject: fonts for U7.0 scripts
Message-ID:

I'm looking for freely downloadable TTF fonts for any of the following. I'd appreciate links to sites for any of these:

1. Bassa_Vah
2. Duployan
3. Grantha
4. Khojki
5. Khudawadi
6. Mahajani
7. Mende_Kikakui
8. Modi
9. Mro
10. Nabataean
11. Old_Permic
12. Palmyrene
13. Pau_Cin_Hau
14. Tirhuta
15. Warang_Citi

Coverage doesn't need to be complete, and the font doesn't need to support shaping (these are just for charts / illustrations).

Mark

From as at signographie.de Wed Oct 22 04:36:32 2014
From: as at signographie.de (Andreas Stötzner)
Date: Wed, 22 Oct 2014 11:36:32 +0200
Subject: fonts for U7.0 scripts
In-Reply-To:
References:
Message-ID: <56053D7C-0932-46DD-81C5-8DFD66E845FC@signographie.de>

On 22.10.2014 at 09:27, Mark Davis ☕️ wrote:

>
> Bassa_Vah
> Duployan
> Grantha
> Khojki
> Khudawadi
> Mahajani
> Mende_Kikakui
> Modi
> Mro
> Nabataean
> Old_Permic
> Palmyrene
> Pau_Cin_Hau
> Tirhuta
> Warang_Citi

You're asking for quite a lot -- for nothing.

best,
Andreas Stötzner.
(font producer)

_______________________________________________________________________________
Andreas Stötzner Gestaltung Signographie Fontentwicklung
Haus des Buches
Gerichtsweg 28, Raum 434
04103 Leipzig
0176-86823396
http://stoetzner-gestaltung.prosite.com

From andrewcwest at gmail.com Wed Oct 22 04:50:30 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 22 Oct 2014 10:50:30 +0100
Subject: fonts for U7.0 scripts
In-Reply-To:
References:
Message-ID:

On 22 October 2014 08:27, Mark Davis ☕️ wrote:
> I'm looking for freely downloadable TTF fonts for any of the following. I'd
> appreciate links to sites for any of these:
>
> Bassa_Vah
> Duployan
> Grantha
> Khojki
> Khudawadi
> Mahajani
> Mende_Kikakui
> Modi
> Mro
> Nabataean
> Old_Permic
> Palmyrene
> Pau_Cin_Hau
> Tirhuta
> Warang_Citi

Was the encoding of any of these scripts funded by the Script Encoding Initiative? According to the SEI (http://www.linguistics.berkeley.edu/sei/help.html) "Funding is used primarily for the creation of proposals on a per-project basis and for fonts. Fonts will be made available over the Unicode website and will be available for free distribution but cannot be bundled with commercial products."

Although I have to say that I cannot see anywhere on the Unicode website that provides fonts for SEI-funded scripts.
Andrew From ishida at w3.org Wed Oct 22 06:57:21 2014 From: ishida at w3.org (Richard Ishida) Date: Wed, 22 Oct 2014 12:57:21 +0100 Subject: fonts for U7.0 scripts In-Reply-To: References: Message-ID: <54479BA1.3050206@w3.org> ScriptSource has links to fonts, and you may find some there. For instance, I immediately found three Bassa_Vah fonts, two of which appear to be free, one of which costs only $19. There's also a freeware font for Grantha. I didn't search further. (Fwiw, you can find the right ScriptSource pages quickly by going to http://rishida.net/scriptlinks and selecting the script. Look near the bottom of the list that appears for the direct link.) ri On 22/10/2014 08:27, Mark Davis ?? wrote: > I'm looking for freely downloadable TTF fonts for any of the following. > I'd appreciate links to sites for any of these: > > 1. Bassa_Vah > 2. Duployan > 3. Grantha > 4. Khojki > 5. Khudawadi > 6. Mahajani > 7. Mende_Kikakui > 8. Modi > 9. Mro > 10. Nabataean > 11. Old_Permic > 12. Palmyrene > 13. Pau_Cin_Hau > 14. Tirhuta > 15. Warang_Citi > > Coverage doesn't need to be complete, and > ?the font > doesn't need to support shaping (these are just for charts / > illustrations). > > Mark > > ////// > > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > From samjnaa at gmail.com Wed Oct 22 07:08:23 2014 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 22 Oct 2014 17:38:23 +0530 Subject: fonts for U7.0 scripts In-Reply-To: <54479BA1.3050206@w3.org> References: <54479BA1.3050206@w3.org> Message-ID: The Grantha link is broken. The site no longer exists. I have contacted the original author. Will post here once he replies. -- Shriramana Sharma ???????????? ???????????? From dwanders at sonic.net Wed Oct 22 08:48:00 2014 From: dwanders at sonic.net (Deborah W. 
Anderson) Date: Wed, 22 Oct 2014 06:48:00 -0700 Subject: fonts for U7.0 scripts In-Reply-To: References: Message-ID: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> Dear Andrew, Most of the scripts listed below did come via Script Encoding Initiative (SEI), you are correct. The intent of SEI was to work on proposals and provide fonts but, to date, the focus of the work has been almost exclusively on getting scripts into Unicode and not on the creation of distributable fonts. I will modify the wording on the webpage accordingly. Ideally, I would like to have free fonts made available via SEI, but it hasn't been possible due to funding constraints. In the future, I plan to work closely with ScriptSource (and other projects that make free fonts available), and will encourage the creation and submission of free fonts to such projects, though at this point SEI doesn't have the funding itself to pay for such work, unfortunately. Debbie Anderson -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew West Sent: Wednesday, October 22, 2014 2:51 AM To: Mark Davis ?? Cc: Unicode Public Subject: Re: fonts for U7.0 scripts On 22 October 2014 08:27, Mark Davis ?? wrote: > I'm looking for freely downloadable TTF fonts for any of the > following. I'd appreciate links to sites for any of these: > > Bassa_Vah > Duployan > Grantha > Khojki > Khudawadi > Mahajani > Mende_Kikakui > Modi > Mro > Nabataean > Old_Permic > Palmyrene > Pau_Cin_Hau > Tirhuta > Warang_Citi Was the encoding of any of these scripts funded by the Script Encoding Initiative? According to the SEI (http://www.linguistics.berkeley.edu/sei/help.html) "Funding is used primarily for the creation of proposals on a per-project basis and for fonts. Fonts will be made available over the Unicode website and will be available for free distribution but cannot be bundled with commercial products." 
Although I have to say that I cannot see anywhere on the Unicode website that provides fonts for SEI-funded scripts.

Andrew

From eliz at gnu.org Wed Oct 22 10:53:27 2014
From: eliz at gnu.org (Eli Zaretskii)
Date: Wed, 22 Oct 2014 18:53:27 +0300
Subject: Limits in UBA
Message-ID: <83ppdk85jc.fsf@gnu.org>

Hi,

I have 2 questions related to the Unicode Bidirectional Algorithm, both regarding limits on certain aspects of the UBA.

First, I'd like to ask about the 127 entries of the directional status stack; it had 63 entries in the version of the UBA before Unicode 6.3. Where and why are such deep embeddings/isolates needed? Does anyone know of practical examples of text that requires such a depth? I personally never saw a situation where one or 2 embeddings/overrides were not enough. This is a far cry from the UAX#9 numbers. Implementing such a deep stack requires memory-management solutions that are non-trivial, and add complexity to an already complex algorithm, but if I implement only a small fraction of that, I cannot claim bidirectional conformity. So I wonder if there's a practical justification for such a deep UBA stack.

The second question is about the stack required for implementing the BPA resolution of brackets, as described in BD16 and N0. The UBA doesn't place any limits on the depth of that stack. This means that text with a large enough number of opening bracket characters and no closing brackets could exhaust the entire memory space of an application. What is the implementation supposed to do in this situation? Crashing or exiting with a fatal error code is clearly inappropriate in some applications. Is it even reasonable not to have any limits for this stack?

Thanks in advance for any insights.
From eliz at gnu.org Wed Oct 22 13:20:28 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Wed, 22 Oct 2014 21:20:28 +0300 Subject: Limits in UBA In-Reply-To: References: <83ppdk85jc.fsf@gnu.org> Message-ID: <83d29k7yqb.fsf@gnu.org> > From: "Andrew Glass (WINDOWS)" > Date: Wed, 22 Oct 2014 17:57:52 +0000 Thanks for responding. > Embeddings are common in generated text. The guiding principle, is seemingly, when in doubt wrap the string in an embedding. At the UTC, we heard, that this can lead to very deep stacks - but I've never actually seen one with more than 63 levels - but that is not my topic here. I'd appreciate some pointers to such texts, if they are publicly accessible. I'd be very interested to see why such deep embeddings are necessary. In Emacs, we do use embeddings and overrides in a few places in text we generate, for example, to make sure information about a character displayed by a specialized command doesn't get jumbled due to that character's bidi class. But we never needed more than one, maximum 2 levels. Most of the cases can be resolved by using LRM or RLM. > The BPA is not as subject to the extremes of generated text, and therefore brackets should follow a natural limit such that it is possible for a human to parse and track the bracketed levels. As such, the max depth is going to be quite low in normal text. Most cases of the BPA involve one pair. Nested pairs beyond three become quite artificial - except in languages such as LISP. However, supporting correct display of Bidi LISP code is not a goal of the BPA. I'm not sure what the maximum depth used by the test data is - I think that is the best current guide unless we introduce something. The test data doesn't have more than 3 nested levels, I think. For Emacs, I limited the BPA stack at 1024 levels, which is probably way too much, but it was cheap, so I saw no reason forcing an arbitrary lower limit. 
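A capped bracket stack of the kind mentioned above can be sketched as follows. This is an illustration under stated assumptions, not text from UAX #9: the names find_bracket_pairs and max_depth are made up, the bracket table is a three-pair subset rather than the real BidiBrackets.txt data, the check that a bracket still has Bidi_Class ON is omitted, and stopping the pairing once the cap is reached is just one possible fallback.

```python
# Tiny illustrative subset; a real implementation would load BidiBrackets.txt.
BRACKETS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(BRACKETS.values())


def find_bracket_pairs(text, max_depth=1024):
    """Return sorted (opener, closer) positions, using a stack capped at max_depth."""
    stack = []  # entries: (expected closing bracket, position of the opener)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in BRACKETS:
            if len(stack) >= max_depth:
                # Assumed fallback: give up on further pairing, keep what we have.
                break
            stack.append((BRACKETS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack down for a matching opener.
            for depth in range(len(stack) - 1, -1, -1):
                if stack[depth][0] == ch:
                    pairs.append((stack[depth][1], pos))
                    del stack[depth:]  # pop through the matched element
                    break
            # No match anywhere in the stack: ignore this closing bracket.
    pairs.sort()
    return pairs


# The test string from the trace earlier in this thread (0061 0028 0028 007B ...).
sample = "a(({b\u2680[])}[c[]]\u05d0)"
print(find_bracket_pairs(sample))

# An input made only of opening brackets cannot grow the stack past the cap.
print(find_bracket_pairs("(" * 100000))
```

Whatever fallback is chosen, the point of the cap is that memory use stays bounded by max_depth entries regardless of the input.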
As for Lisp and similar languages, since the BPA in otherwise all-L2R text is equivalent to "normal" resolution of neutrals per N1 and N2, I simply bypass the BPA in that case -- because N1/N2 processing is much cheaper in the Emacs case. So Lisp is not the case that worries me. But I do wonder why there's absolutely no guidance in the UBA regarding this issue, which in practice every implementor will probably bump into. Thanks. From ken.whistler at sap.com Wed Oct 22 14:18:38 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 22 Oct 2014 19:18:38 +0000 Subject: Limits in UBA In-Reply-To: <83d29k7yqb.fsf@gnu.org> References: <83ppdk85jc.fsf@gnu.org> <83d29k7yqb.fsf@gnu.org> Message-ID: Eli, > > Embeddings are common in generated text. The guiding principle, is > seemingly, when in doubt wrap the string in an embedding. At the UTC, we > heard, that this can lead to very deep stacks - but I've never actually seen > one with more than 63 levels - but that is not my topic here. > > I'd appreciate some pointers to such texts, if they are publicly > accessible. I'd be very interested to see why such deep embeddings > are necessary. They aren't necessary for human-generated text. There is no normal human text reading case for them. But as Andrew indicated, the problem arises from the potential for automated injection of text wrapped in an embedding. There is no expectation that any of that would actually be readable text in most cases. But on the other hand, the generated text could end up in logs or other text stores which, in turn, could end up processed by some text rendering for display in a window somewhere. You don't then want an arbitrarily low limit for handling embeddings in the UBA to suddenly crap out the display: that just leads to bug reports and a lot of confused thrashing up and down the customer support chain. An example I could think of off the top of my head might involve some complicated database application working with Arabic data. 
If the mechanism generating some automated queries was automatically encapsulating string literals in the "where db103.tbl246.col27='blah'" qualifiers *and* the query was encapsulating each full "select xxx" statement *and* the query was using nested subqueries, then if the generation of the query ended up nesting 32 subqueries (which can occur, although it might not be good practice), then you would already have bumped over the prior 63 level embedding limit for UBA. With *big* database applications, where installations may have thousands of tables, with thousands of partitions, and multiple terabytes of data, automated generation of very large and complicated SQL queries is common. And while the database itself doesn't care about UBA or display order when parsing and compiling such queries, the SQL text can be and *is* routinely logged. And the worry by the UTC is that when such logged generated text might include encapsulated embedded chunks, you don't want UBA per se to be introducing limits that cause failures when there might be a use case to display such text for diagnostics, for example. I don't happen to *know* of a particular example of such text to point you to, but that kind of thing is the relevant use scenario. --Ken From andrewcwest at gmail.com Wed Oct 22 14:29:00 2014 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 22 Oct 2014 20:29:00 +0100 Subject: fonts for U7.0 scripts In-Reply-To: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> Message-ID: Debbie, Thanks for the explanation. 
I just wonder, in order to get a script accepted for encoding the proposer has to provide a font for the Unicode/10646 code charts, so creating a font (that is at least good enough for the code charts even if it does not have full shaping behaviour) is an essential part of the proposal process, so if the SEI is funding someone to research/write a proposal is not the funding provided by SEI at least indirectly funding the creation of a font, and if so should not the font be made freely available at the end of the project? Andrew On 22 October 2014 14:48, Deborah W. Anderson wrote: > Dear Andrew, > Most of the scripts listed below did come via Script Encoding Initiative (SEI), you are correct. > > The intent of SEI was to work on proposals and provide fonts but, to date, the focus of the work has been almost exclusively on getting scripts into Unicode and not on the creation of distributable fonts. I will modify the wording on the webpage accordingly. > > Ideally, I would like to have free fonts made available via SEI, but it hasn't been possible due to funding constraints. In the future, I plan to work closely with ScriptSource (and other projects that make free fonts available), and will encourage the creation and submission of free fonts to such projects, though at this point SEI doesn't have the funding itself to pay for such work, unfortunately. > > Debbie Anderson > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew West > Sent: Wednesday, October 22, 2014 2:51 AM > To: Mark Davis ?? > Cc: Unicode Public > Subject: Re: fonts for U7.0 scripts > > On 22 October 2014 08:27, Mark Davis ?? wrote: >> I'm looking for freely downloadable TTF fonts for any of the >> following. 
I'd appreciate links to sites for any of these: >> >> Bassa_Vah >> Duployan >> Grantha >> Khojki >> Khudawadi >> Mahajani >> Mende_Kikakui >> Modi >> Mro >> Nabataean >> Old_Permic >> Palmyrene >> Pau_Cin_Hau >> Tirhuta >> Warang_Citi > > Was the encoding of any of these scripts funded by the Script Encoding Initiative? According to the SEI > (http://www.linguistics.berkeley.edu/sei/help.html) "Funding is used primarily for the creation of proposals on a per-project basis and for fonts. Fonts will be made available over the Unicode website and will be available for free distribution but cannot be bundled with commercial products." > > Although I have to say that I cannot see anywhere on the Unicode website that provides fonts for SEI-funded scripts. > > Andrew > > _______________________________________________ > Unicode mailing list > Unicode at unicode.org > http://unicode.org/mailman/listinfo/unicode > > > --- > This email is free from viruses and malware because avast! Antivirus protection is active. > http://www.avast.com > From ken.whistler at sap.com Wed Oct 22 14:42:06 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Wed, 22 Oct 2014 19:42:06 +0000 Subject: Limits in UBA In-Reply-To: <83d29k7yqb.fsf@gnu.org> References: <83ppdk85jc.fsf@gnu.org> <83d29k7yqb.fsf@gnu.org> Message-ID: Eli, I think you are correct that the BidiCharacterTest.txt data currently does not go beyond 3 nesting levels for testing the BPA part of UBA. I agree with Andrew that that is reasonable guide to the normal limit of meaningful bracket embeddings one might find in text. However, I don't think it is safe to assume that 3 is the deepest that the conformance test data would ever have in it. 
Unlike the bidi format control embeddings, which are hard to visualize and involve special input or programming, it is *easy* for people to generate strings with deeply embedded bracket pairs: ((((((((((((((((((((((((((((((((((((((99))))))))))))))))))))))))))))))))))))) So it might make sense to add test cases with data like that to BidiCharacterTest. In such cases, fallback behavior when hitting the implementation limit is presumably o.k., but it is advisable to check implementations to ensure that they don't actually fall over if they *do* hit their limit. In the C BidiRef reference implementation I wrote, the limit I picked was simply half the maximum string length it would process, on the assumption that the worst case it would have to deal with would be a string consisting of *nothing but* bracket pairs. If supporting 1024 bracket pair levels is "cheap" for Emacs support, that seems like a defensible limit choice to me. --Ken > > The BPA is not as subject to the extremes of generated text, and therefore > brackets should follow a natural limit such that it is possible for a human to > parse and track the bracketed levels. As such, the max depth is going to be > quite low in normal text. Most cases of the BPA involve one pair. Nested > pairs beyond three become quite artificial - except in languages such as LISP. > However, supporting correct display of Bidi LISP code is not a goal of the > BPA. I'm not sure what the maximum depth used by the test data is - I think > that is the best current guide unless we introduce something. > > The test data doesn't have more than 3 nested levels, I think. > > For Emacs, I limited the BPA stack at 1024 levels, which is probably > way too much, but it was cheap, so I saw no reason forcing an > arbitrary lower limit. From dwanders at sonic.net Wed Oct 22 16:13:58 2014 From: dwanders at sonic.net (Deborah W.
Anderson) Date: Wed, 22 Oct 2014 14:13:58 -0700 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> Message-ID: <009d01cfee3d$18a0a270$49e1e750$@sonic.net> Dear Andrew, It is true that proposals require a font to create the code charts, but I was careful in my comments to say SEI doesn't currently fund creation of "distributable" fonts. Fonts for proposals are usually very basic, and often partly auto-generated by font editing software, usually with an ASCII cmap. They appear marginally OK in the code chart, and while they are acceptable for talking about charts or to use as examples in papers about the script, they are typically not acceptable for most purposes that contain running text, including publication in printed form, e.g., in books. Just because someone develops a basic set of outlines for a script proposal doesn't necessarily mean (a) that they have done any work to make their font "useful" for anything else and (b) that they have licensed, or will license, their font for public use. (They don't sign up for that automatically when doing a proposal, and it has not really been budgeted into any proposals, so far.) At the moment, SEI is severely budget-constrained, and proposal authors are not earning much doing proposal work. The more work put in for purposes beyond the proposal itself, the lower their hourly income. And as John Hudson or Ken Lunde can probably attest, good font development is labor intensive. In sum, it would take additional resources for a developer to do work on a font to make it acceptable for distribution. However, like Andrew Glass, I commend the work on Noto fonts, which is a way to help make free working fonts available. With best wishes, Debbie -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew West Sent: Wednesday, October 22, 2014 12:29 PM To: Deborah W.
Anderson Cc: Mark Davis ☕️; Unicode Public Subject: Re: fonts for U7.0 scripts Debbie, Thanks for the explanation. I just wonder: in order to get a script accepted for encoding, the proposer has to provide a font for the Unicode/10646 code charts, so creating a font (one that is at least good enough for the code charts, even if it does not have full shaping behaviour) is an essential part of the proposal process. So if the SEI is funding someone to research/write a proposal, is not the funding provided by SEI at least indirectly funding the creation of a font? And if so, should not the font be made freely available at the end of the project? Andrew On 22 October 2014 14:48, Deborah W. Anderson wrote: > Dear Andrew, > Most of the scripts listed below did come via Script Encoding Initiative (SEI), you are correct. > > The intent of SEI was to work on proposals and provide fonts but, to date, the focus of the work has been almost exclusively on getting scripts into Unicode and not on the creation of distributable fonts. I will modify the wording on the webpage accordingly. > > Ideally, I would like to have free fonts made available via SEI, but it hasn't been possible due to funding constraints. In the future, I plan to work closely with ScriptSource (and other projects that make free fonts available), and will encourage the creation and submission of free fonts to such projects, though at this point SEI doesn't have the funding itself to pay for such work, unfortunately. > > Debbie Anderson > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew > West > Sent: Wednesday, October 22, 2014 2:51 AM > To: Mark Davis ☕️ > Cc: Unicode Public > Subject: Re: fonts for U7.0 scripts > > On 22 October 2014 08:27, Mark Davis ☕️ wrote: >> I'm looking for freely downloadable TTF fonts for any of the >> following.
I'd appreciate links to sites for any of these: >> >> Bassa_Vah >> Duployan >> Grantha >> Khojki >> Khudawadi >> Mahajani >> Mende_Kikakui >> Modi >> Mro >> Nabataean >> Old_Permic >> Palmyrene >> Pau_Cin_Hau >> Tirhuta >> Warang_Citi > > Was the encoding of any of these scripts funded by the Script Encoding > Initiative? According to the SEI > (http://www.linguistics.berkeley.edu/sei/help.html) "Funding is used primarily for the creation of proposals on a per-project basis and for fonts. Fonts will be made available over the Unicode website and will be available for free distribution but cannot be bundled with commercial products." > > Although I have to say that I cannot see anywhere on the Unicode website that provides fonts for SEI-funded scripts. > > Andrew From asmusf at ix.netcom.com Wed Oct 22 17:58:07 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Wed, 22 Oct 2014 15:58:07 -0700 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> Message-ID: <5448367F.90801@ix.netcom.com> On 10/22/2014 12:29 PM, Andrew West wrote: > should not the font be made freely available at the end of > the project? The policy requires that a license is given to produce the charts and related documents. No more, no less. This allows people and organizations to donate a free license for use by the editors, but otherwise seek for commercial distribution of their work.
In other words, they retain all the rights to their intellectual property that are not strictly required for the encoding process. Nothing prevents people from putting their fonts in the public domain, if they so desire, but that can't be a requirement of the character encoding process. Debbie might approach people who have provided chart fonts with a query as to whether they might like to issue a broader license, or to list their fonts with sites that distribute free fonts - in some cases they might be motivated as this might increase the chance that their script is used and implemented. But we need to be very clear that this would be highly voluntary. A./ From eliz at gnu.org Wed Oct 22 21:41:06 2014 From: eliz at gnu.org (Eli Zaretskii) Date: Thu, 23 Oct 2014 05:41:06 +0300 Subject: Limits in UBA In-Reply-To: References: <83ppdk85jc.fsf@gnu.org> <83d29k7yqb.fsf@gnu.org> Message-ID: <834muv8q4d.fsf@gnu.org> > From: "Whistler, Ken" > CC: "unicode at unicode.org" , "Whistler, Ken" > > Date: Wed, 22 Oct 2014 19:18:38 +0000 > Accept-Language: en-US > > > I'd appreciate some pointers to such texts, if they are publicly > > accessible. I'd be very interested to see why such deep embeddings > > are necessary. > > They aren't necessary for human-generated text. There is no normal human text > reading case for them. But if humans aren't going to read that text, the embeddings aren't necessary at all, because programs read and process text in logical order anyway. Bidi reordering is a display-time feature, meant for human consumption. > An example I could think of off the top of my head might involve some > complicated database application working with Arabic data. Again, if the query is to be submitted to a program, there should not be a need for embeddings at all.
> And while the database itself doesn't care about UBA or display > order when parsing and compiling such queries, the SQL text can be > and *is* routinely logged. And the worry by the UTC is that when > such logged generated text might include encapsulated embedded > chunks, you don't want UBA per se to be introducing limits that > cause failures when there might be a use case to display such text > for diagnostics, for example. I don't happen to *know* of a > particular example of such text to point you to, but that kind of > thing is the relevant use scenario. Still, the number 63 or 127 sounds arbitrary, and unnecessarily large to me. From andrewcwest at gmail.com Thu Oct 23 03:25:11 2014 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 23 Oct 2014 09:25:11 +0100 Subject: fonts for U7.0 scripts In-Reply-To: <5448367F.90801@ix.netcom.com> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> <5448367F.90801@ix.netcom.com> Message-ID: On 22 October 2014 23:58, Asmus Freytag wrote: > > Nothing prevents people to put their fonts in the public domain, if they so > desire, but that can't be a requirement of the character encoding process. I never said or implied that making the font freely available should be a requirement of the character encoding process (although I personally think it ought to be). I said that if the production of the font was funded by the SEI then it should be made freely available, and I think that is what donors to the SEI would expect, certainly based on the text I quoted earlier which had been on the SEI web site for many years before Debbie removed it yesterday. 
Andrew From andrewcwest at gmail.com Thu Oct 23 03:46:50 2014 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 23 Oct 2014 09:46:50 +0100 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net> Message-ID: On 22 October 2014 21:47, Andrew Glass (WINDOWS) wrote: > > I think that distributing fonts that are known to be deficient in shaping does not address needs > other than reproducing code charts and suppressing tofu. Moreover, such fonts can > mislead users into thinking that a script is supported when we know that more work remains > to be done. When work appears to be complete to someone that can't read a script, then the > motivation to address the remaining issues to support that script is undermined. There can also > be other negative consequences. I think that making a set of character only fonts available would > be against the interests of the SEI and Unicode. Well, not all scripts have complex rendering behaviour, so for some scripts the code chart font mapped to the correct Unicode code points is all that is needed. Even for fonts with deficient rendering behaviour or which are mapped to ASCII or PUA code points, if the font was released under the SIL Open Font license or an equivalent free license then people could use it as the basis for a fully functional Unicode font. > In this respect, I think the effort of the Noto project to include shaping support for complex > scripts is commendable. I hope that the current gaps in Noto will soon be filled by suitable fonts > so that the need to release 'chart-only' fonts is removed. I'm a great fan of the Noto project, but as Mark's original question indicates Noto does not supply a solution for newly encoded scripts, and I very much dislike the idea of Google having a monopoly on supplying free fonts for minor and historic scripts.
A code chart font, released under a free license such as the SIL OFL (with any necessary limitations clearly stated) is far and away better than leaving people puzzling over little square boxes for years. Andrew From cannona at fireantproductions.com Thu Oct 23 11:54:16 2014 From: cannona at fireantproductions.com (Aaron Cannon) Date: Thu, 23 Oct 2014 11:54:16 -0500 Subject: Question about a Normalization test Message-ID: Hi all, from the latest version of the standard, on line 16977 of the normalization tests, I am a bit confused by the NFC form. It appears incorrect to me. Here's the line, sans comment: 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062; Just looking at column 2, which according to the comments at the top is the NFC form: 0061 05AE 0305 0300 0315 0062: This, however, does not appear to be in NFC form. The first character, and the second or third characters do not compose. However, the first and fourth (0061 and 0300) do, composing to 00E0. Since there are no further compositions, the normalized form should be 00E0 05AE 0305 0315 0062 What am I missing? Thanks in advance for your help! Aaron From petercon at microsoft.com Thu Oct 23 13:03:59 2014 From: petercon at microsoft.com (Peter Constable) Date: Thu, 23 Oct 2014 18:03:59 +0000 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

Message-ID: <4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> I think Debbie's position is entirely reasonable. Sure, having useful fonts in the public domain soon after standardization would be great. But publishing fonts created for the purpose of chart production may lead to all kinds of problems if they are not truly functional, Unicode-conformant fonts - which is not necessarily a product of SEI-funded proposal work. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew West Sent: Thursday, October 23, 2014 1:47 AM To: Andrew Glass (WINDOWS) Cc: Unicode Public Subject: Re: fonts for U7.0 scripts On 22 October 2014 21:47, Andrew Glass (WINDOWS) wrote: > > I think that distributing fonts that are known to be deficient in > shaping does not address needs other than reproducing code charts and > suppressing tofu. Moreover, such fonts can mislead users > into thinking that a script is supported when we know that more work > remains to be done. When work appears to be complete to someone that > can't read a script, then the motivation to address the remaining > issues to support that script is undermined. There can also be other negative consequences. I think that making a set of character only fonts available would be against the interests of the SEI and Unicode. Well, not all scripts have complex rendering behaviour, so for some scripts the code chart font mapped to the correct Unicode code points is all that is needed. Even for fonts with deficient rendering behaviour or which are mapped to ASCII or PUA code points, if the font was released under the SIL Open Font license or an equivalent free license then people could use it as the basis for a fully functional Unicode font. > In this respect, I think the effort of the Noto project to include > shaping support for complex scripts is commendable.
I hope that the > current gaps in Noto will soon be filled by suitable fonts so that the need to release 'chart-only' fonts is removed. I'm a great fan of the Noto project, but as Mark's original question indicates Noto does not supply a solution for newly encoded scripts, and I very much dislike the idea of Google having a monopoly on supplying free fonts for minor and historic scripts. A code chart font, released under a free license such as the SIL OFL (with any necessary limitations clearly stated) is far and away better than leaving people puzzling over little square boxes for years. Andrew From mark at macchiato.com Thu Oct 23 13:06:59 2014 From: mark at macchiato.com (Mark Davis ☕️) Date: Thu, 23 Oct 2014 20:06:59 +0200 Subject: Question about a Normalization test In-Reply-To: References: Message-ID: On Thu, Oct 23, 2014 at 6:54 PM, Aaron Cannon < cannona at fireantproductions.com> wrote: > 0061 05AE 0305 0300 0315 0062 http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5Cu0061+%5Cu05AE+%5Cu0305+%5Cu0300+%5Cu0315+%5Cu0062&g=ccc 0305 and 0300 have the same ccc, so the first one blocks the second. http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G49576 The older spec is shorter, although not as precise: http://www.unicode.org/reports/tr15/tr15-29.html#Specification Mark *Il meglio è l'inimico del bene* From tom at bluesky.org Thu Oct 23 13:13:45 2014 From: tom at bluesky.org (Tom Gewecke) Date: Thu, 23 Oct 2014 11:13:45 -0700 Subject: fonts for U7.0 scripts In-Reply-To: <4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: On Oct 23, 2014, at 11:03 AM, Peter Constable wrote: > I think Debbie's position is entirely reasonable. Sure, having useful fonts in the public domain soon after standardization would be great. But publishing fonts created for the purpose of chart production may lead to all kinds of problems if they are not truly functional, Unicode-conformant fonts - which is not necessarily a product of SEI-funded proposal work. How about even having just the glyphs in the Unicode.org charts being in the public domain? Am I correct that this is currently not the case? From ken.whistler at sap.com Thu Oct 23 13:15:00 2014 From: ken.whistler at sap.com (Whistler, Ken) Date: Thu, 23 Oct 2014 18:15:00 +0000 Subject: Question about a Normalization test In-Reply-To: References: Message-ID: Aaron Cannon asked: > Hi all, from the latest version of the standard, on line 16977 of the > normalization tests, I am a bit confused by the NFC form. It appears > incorrect to me. Here's the line, sans comment: > > 0061 0305 0315 0300 05AE 0062;0061 05AE 0305 0300 0315 0062;0061 05AE > 0305 0300 0315 0062;0061 05AE 0305 0300 0315 0062;0061 05AE 0305 0300 > 0315 0062; > > Just looking at column 2, which according to the comments at the top > is the NFC form: > > 0061 05AE 0305 0300 0315 0062: > > This, however, does not appear to be in NFC form. > > The first character, and the second or third characters do not > compose. However, the first and fourth (0061 and 0300) do, composing > to 00E0. > > Since there are no further compositions, the normalized form should be > 00E0 05AE 0305 0315 0062 > > What am I missing? >
Input is:
Code points: 0061 0305 0315 0300 05AE 0062
Ccc:            0  230  232  230  228    0
Output of canonical reordering is:
Code points: 0061 05AE 0305 0300 0315 0062
Ccc:            0  228  230  230  232    0
Next step is to start from 0061 and test each successive combining mark, looking for composition candidates.
0061 does not compose with 05AE.
0061 does not compose with 0305.
0061 *could* compose with 0300 (00E0 = 0061 + 0300), *but* 0300 is *blocked* from 0061 by the intervening combining mark 0305 with the *same* ccc value as 0300. So the composition does not occur.
0061 does not compose with 0315.
The next character is 0062, ccc=0, a starter, so we are done.
For the relevant definitions, see: http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G50628 and scroll down a couple pages to D115 on p. 139. Test cases like this are included in NormalizationTest.txt precisely to ensure that implementations are correctly detecting these sequences where composition is blocked. --Ken From cannona at fireantproductions.com Thu Oct 23 13:22:28 2014 From: cannona at fireantproductions.com (Aaron Cannon) Date: Thu, 23 Oct 2014 13:22:28 -0500 Subject: Question about a Normalization test In-Reply-To: References: Message-ID: On 10/23/14, Whistler, Ken wrote: > Test cases like this are included in NormalizationTest.txt precisely > to ensure that implementations are correctly detecting these > sequences where composition is blocked. And I am indeed glad that they are, as I completely missed this small but critical detail. Thanks so much all! Aaron From emuller at adobe.com Thu Oct 23 13:31:28 2014 From: emuller at adobe.com (Eric Muller) Date: Thu, 23 Oct 2014 20:31:28 +0200 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>
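The blocking rule described in the "Question about a Normalization test" thread can be checked with a few lines of Python. This is only a sketch: it assumes that the `unicodedata` module in the Python standard library follows the same normalization algorithm as the version of the standard discussed here, and the helper name `show` is made up for illustration.

```python
import unicodedata

def show(label, text):
    """Print each code point of text with its canonical combining class (ccc)."""
    cells = ["%04X(ccc=%d)" % (ord(ch), unicodedata.combining(ch)) for ch in text]
    print(label, " ".join(cells))

# Column 1 of the test line quoted in the thread.
source = "\u0061\u0305\u0315\u0300\u05AE\u0062"
nfd = unicodedata.normalize("NFD", source)
nfc = unicodedata.normalize("NFC", source)

show("source:", source)
show("NFD:   ", nfd)
show("NFC:   ", nfc)

# Column 2 of the test line: 0300 stays separate because 0305, which has
# the same ccc, sits between it and the starter 0061.
expected = "\u0061\u05AE\u0305\u0300\u0315\u0062"
blocked = (nfc == expected) and ("\u00E0" not in nfc)

# With nothing in between, 0061 + 0300 composes to 00E0.
unblocked = unicodedata.normalize("NFC", "\u0061\u0300")
print("blocked:", blocked, "| unblocked result: %04X" % ord(unblocked))
```

Printing the ccc next to each code point makes the blocking pair easy to spot, since the two marks with equal ccc end up adjacent after canonical reordering.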

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: <54494980.8090607@adobe.com> > How about even having just the glyphs in the Unicode.org charts being in the public domain? Very easy to achieve: 1. Ask the owner of the font how much money he wants to part with his property. 2. Write a check for the corresponding amount. 3. You are now the owner, you can put the font in the public domain. Eric. From everson at evertype.com Thu Oct 23 13:33:44 2014 From: everson at evertype.com (Michael Everson) Date: Thu, 23 Oct 2014 19:33:44 +0100 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: <1BEC4E32-1214-4605-B6B2-BC3E3F8071A5@evertype.com> On 23 Oct 2014, at 19:13, Tom Gewecke wrote: > How about even having just the glyphs in the Unicode.org charts being in the public domain? Am I correct that this is currently not the case? If only one were independently wealthy. Michael Everson * http://www.evertype.com/ From as at signographie.de Thu Oct 23 14:08:17 2014 From: as at signographie.de (Andreas Stötzner) Date: Thu, 23 Oct 2014 21:08:17 +0200 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

Message-ID: <0F9E0290-6D04-4F40-A4D5-645BF0108196@signographie.de> Am 23.10.2014 um 10:46 schrieb Andrew West: > A code chart > font, released under a free license such as the SIL OFL (with any > necessary limitations clearly stated) is far and away better than > leaving people puzzling over little square boxes for years. what are you mourning about? If you need a certain font just go down and hire a font designer. Then you'll get anything you want, as usual. When you're not in the position to commission such work yourself, convince third parties to fund the work which is needed to be done. I am a font producer and I have provided fonts/sets of glyphs for code chart purpose twice (traffic signs, old Albanian). In doing so I have participated in funding the encoding process. Other parties have funded SEI font work for encoding. No more, no less, as Asmus Freytag puts it precisely. That is one thing. But supplying fonts for editorial use to someone else is another thing. That cannot be the task of standardization bodies. Am 23.10.2014 um 00:58 schrieb Asmus Freytag: > Nothing prevents people from putting their fonts in the public domain, if they so desire, but that can't be a requirement of the character encoding process. Absolutely right. best regards, Andreas Stötzner. _______________________________________________________________________________ Andreas Stötzner Gestaltung Signographie Fontentwicklung Haus des Buches Gerichtsweg 28, Raum 434 04103 Leipzig 0176-86823396 http://stoetzner-gestaltung.prosite.com From samjnaa at gmail.com Thu Oct 23 18:47:32 2014 From: samjnaa at gmail.com (Shriramana Sharma) Date: Fri, 24 Oct 2014 05:17:32 +0530 Subject: fonts for U7.0 scripts In-Reply-To: <4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: On Thursday, October 23, 2014, Peter Constable wrote: > . But publishing fonts created for the purpose of chart production may > lead to all kinds of problems if they are not truly functional, > Unicode-conformant fonts - > Dear Peter, Can you clarify what "all kinds of problems" you foresee? -- Shriramana Sharma From petercon at microsoft.com Thu Oct 23 20:02:20 2014 From: petercon at microsoft.com (Peter Constable) Date: Fri, 24 Oct 2014 01:02:20 +0000 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: Sure: People find a font that isn't a truly functional, Unicode-conformant font for script X and... - They try using it, find it doesn't display text as expected, and conclude that Unicode doesn't work for their script - The font has glyphs mapped from ASCII characters; they try typing and it seems to display their text as desired, so they start generating content. Now we have data interop problems. - The font kinda works, but not perfectly. They decide that they can fix it by just changing some of the glyphs to certain presentation forms and by adding certain other glyphs on some unused code positions. Then they start generating content. Now we have data interop problems. The last scenario is really similar to the serious problems we have now for Myanmar. IOW, this isn't just hypothetical. Peter From: Shriramana Sharma [mailto:samjnaa at gmail.com] Sent: Thursday, October 23, 2014 4:48 PM To: Peter Constable Cc: Andrew West; Andrew Glass (WINDOWS); Unicode Public Subject: Re: fonts for U7.0 scripts On Thursday, October 23, 2014, Peter Constable wrote: . But publishing fonts created for the purpose of chart production may lead to all kinds of problems if they are not truly functional, Unicode-conformant fonts - Dear Peter, Can you clarify what "all kinds of problems" you foresee? -- Shriramana Sharma From asmusf at ix.netcom.com Thu Oct 23 20:21:03 2014 From: asmusf at ix.netcom.com (Asmus Freytag) Date: Thu, 23 Oct 2014 18:21:03 -0700 Subject: fonts for U7.0 scripts In-Reply-To: References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> Message-ID: <5449A97F.2070109@ix.netcom.com> Peter is correct. The only fonts that should be released to the public are those that are Unicode encoded and have the correct shaping tables. Unlike the public, the code chart editors for Unicode have tools that can correctly handle not only ASCII-hacked fonts as well as PUA-assigned fonts, but also fonts that use the "wrong" Unicode encoding (because they were designed for an earlier draft with different code point assignments). These tools ignore all shaping tables, so the lack of such tables isn't an issue. The documents created by the code charts editors are not editable in the normal sense, so they can be published without causing problems, like establishing a de-facto encoding. They don't contain running text in these fonts, so there isn't an issue with search - the searchable contents are all character names, annotations etc in Latin letters and digits. Releasing such fonts to the public would establish a de-facto non-sanctioned encoding, because people could create (and interchange) running text using them. A./ On 10/23/2014 6:02 PM, Peter Constable wrote: > > Sure: People find a font that isn't a truly functional, > Unicode-conformant font for script X and... > > -They try using it, find it doesn't display text as expected, and > conclude that Unicode doesn't work for their script > > -The font has glyphs mapped from ASCII characters; they try typing and > it seems to display their text as desired, so they start generating > content. Now we have data interop problems. > > -The font kinda works, but not perfectly. They decide that they can > fix it by just changing some of the glyphs to certain presentation > forms and by adding certain other glyphs on some unused code > positions. Then they start generating content. Now we have data > interop problems.
> > The last scenario is really similar to the serious problems we have > now for Myanmar. IOW, this isn't just hypothetical. > > Peter > > *From:*Shriramana Sharma [mailto:samjnaa at gmail.com] > *Sent:* Thursday, October 23, 2014 4:48 PM > *To:* Peter Constable > *Cc:* Andrew West; Andrew Glass (WINDOWS); Unicode Public > *Subject:* Re: fonts for U7.0 scripts > > On Thursday, October 23, 2014, Peter Constable wrote: > > . But publishing fonts created for the purpose of chart production > may lead to all kinds of problems if they are not truly > functional, Unicode-conformant fonts - > > Dear Peter, > > Can you clarify what "all kinds of problems" you foresee? > > > > -- > Shriramana Sharma From duerst at it.aoyama.ac.jp Fri Oct 24 03:17:10 2014 From: duerst at it.aoyama.ac.jp ("Martin J. Dürst") Date: Fri, 24 Oct 2014 17:17:10 +0900 Subject: Code charts and code points (was: Re: fonts for U7.0 scripts) In-Reply-To: <5449A97F.2070109@ix.netcom.com> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> Message-ID: <544A0B06.9000506@it.aoyama.ac.jp> On 2014/10/24 10:21, Asmus Freytag wrote: > Peter is correct. > > The only fonts that should be released to the public are those that are > Unicode encoded and have the correct shaping tables. > > Unlike the public, the code chart editors for Unicode have tools that > can correctly handle not only ASCII-hacked fonts as well as PUA-assigned > fonts, but also fonts that use the "wrong" Unicode encoding (because > they were designed for an earlier draft with different code point > assignments). These tools ignore all shaping tables, so the lack of such > tables isn't an issue. > > The documents created by the code charts editors are not editable in the > normal sense, so they can be published without causing problems, like > establishing a de-facto encoding. They don't contain running text in > these fonts, so there isn't an issue with search - the searchable > contents are all character names, annotations etc in Latin letters and > digits. > > Releasing such fonts to the public would establish a de-facto > non-sanctioned encoding, because people could create (and interchange) > running text using them. Hello Asmus, The code charts are published as PDFs. In general, text in PDFs can be copypasted elsewhere. Is there something in place that makes sure that "wrong" Unicode encodings for glyphs published in code charts don't leak elsewhere? Regards, Martin. From jkorpela at cs.tut.fi Fri Oct 24 06:51:10 2014 From: jkorpela at cs.tut.fi (Jukka K. Korpela) Date: Fri, 24 Oct 2014 14:51:10 +0300 Subject: Code charts and code points In-Reply-To: <544A0B06.9000506@it.aoyama.ac.jp> References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp>
Message-ID: <544A3D2E.4020200@cs.tut.fi>

2014-10-24 11:17, "Martin J. Dürst" wrote:

> The code charts are published as PDFs. In general, text in PDFs can be
> copypasted elsewhere. Is there something in place that makes sure that
> "wrong" Unicode encodings for glyphs published in code charts don't leak
> elsewhere?

It seems that there isn't. Whether this is serious is a different issue.

I tested with the arbitrarily chosen Ornamental Dingbats block, with the chart http://www.unicode.org/charts/PDF/Unicode-7.0/U70-1F780.pdf
Opening it in Adobe Reader XI on Win 7, I was able to select the characters with the mouse and copy and paste them to a text editor, BabelPad. It shows most of them as just boxes, identified with the correct Unicode numbers; this is the expected behavior when the editor has no suitable font at its disposal. But instead of U+1F67C VERY HEAVY SOLIDUS and U+1F67D VERY HEAVY REVERSE SOLIDUS, it shows "/" and "\", identified as U+002F SOLIDUS and U+005C REVERSE SOLIDUS.

So apparently the font designer had placed the glyphs as assigned to SOLIDUS and REVERSE SOLIDUS, which is understandable. But this means that when the characters in the code charts are copied and pasted, or otherwise accessed at the character level, they are wrong characters.

I think it is imaginable that someone wants to copy a block of characters from the code charts, as a handy way of getting them for inspection, e.g. for testing how some particular software renders them using some particular font(s). I would expect some confusion if some of the pasted characters (code points) then turned out to be wrong.
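For anyone who wants to repeat this kind of check without eyeballing the output in an editor, here is a minimal sketch in Python. It only assumes that the pasted text is available as a string; the function name `out_of_block` and the sample string are hypothetical, and the range U+1F650..U+1F67F is the Ornamental Dingbats block. It lists every pasted character whose code point falls outside the block the chart was supposed to show.

```python
import unicodedata

# Ornamental Dingbats block, U+1F650..U+1F67F.
ORNAMENTAL_DINGBATS = (0x1F650, 0x1F67F)


def out_of_block(pasted, first, last):
    """Return (index, 'U+XXXX', name) for each non-space character outside first..last."""
    strays = []
    for index, ch in enumerate(pasted):
        if ch.isspace():
            continue  # whitespace between pasted glyphs is not interesting
        cp = ord(ch)
        if not first <= cp <= last:
            # Fall back to a placeholder if this Python's database lacks the name.
            name = unicodedata.name(ch, "<no name available>")
            strays.append((index, f"U+{cp:04X}", name))
    return strays


# Hypothetical paste: one character from the block, then the two substitutes
# reported above (SOLIDUS and REVERSE SOLIDUS).
sample = "\U0001F67B/\\"
for stray in out_of_block(sample, *ORNAMENTAL_DINGBATS):
    print(stray)
```

Such a check can only flag code points outside the expected range; it cannot tell whether a substitution happened in the font or in the PDF embedding.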
Yucca

From samjnaa at gmail.com Fri Oct 24 07:05:12 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Fri, 24 Oct 2014 17:35:12 +0530
Subject: Code charts and code points (was: Re: fonts for U7.0 scripts)
In-Reply-To: <544A0B06.9000506@it.aoyama.ac.jp>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp>
Message-ID:

Hi Martin. If you haven't noticed it before, opening Unicode charts in PDF readers has something like "SECURED" at the top, i.o.w. the charts are sorta DRM-protected. So you can't copy-paste the characters. Heck you can't even copy-paste the character *names*!

--
Shriramana Sharma

From andrewcwest at gmail.com Fri Oct 24 07:10:00 2014
From: andrewcwest at gmail.com (Andrew West)
Date: Fri, 24 Oct 2014 13:10:00 +0100
Subject: Code charts and code points (was: Re: fonts for U7.0 scripts)
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp>
Message-ID:

On 24 October 2014 13:05, Shriramana Sharma wrote:

> Hi Martin. If you haven't noticed it before, opening Unicode charts in
> PDF readers has something like "SECURED" on the top i.o.w. the charts
> are sorta DRM-protected. So you can't copy-paste the characters. Heck
> you can't even copy-paste the character *names*!

You can copy just fine with the Foxit PDF reader. Like Jukka, I tried randomly copying a number of PDF code charts (from http://www.unicode.org/charts/) and I couldn't find any which were not using the correct Unicode code points (maybe there are some, but I gave up before I found them).

Andrew

From samjnaa at gmail.com Fri Oct 24 07:11:09 2014
From: samjnaa at gmail.com (Shriramana Sharma)
Date: Fri, 24 Oct 2014 17:41:09 +0530
Subject: Code charts and code points (was: Re: fonts for U7.0 scripts)
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp>
Message-ID:

Looks like I spoke too soon -- I seem to distinctly remember this behaviour from earlier versions (or am I misremembering?!!!) but Unicode 7.0 charts don't seem to be this way... And SMP glyphs seem to be mapped to PUA chars. Not really ideal...

BTW for older versions, individual blockwise charts don't seem to be available. I would presume the total size of individual blockwise charts isn't too much higher than the single http://www.unicode.org/Public/6.2.0/charts/ and there aren't all that many Unicode versions anyway -- can we have individual block charts in the archive rather than having to download 92 MB please? (In Asia the broadband speeds aren't all that fast...)

--
Shriramana Sharma

From jkorpela at cs.tut.fi Fri Oct 24 07:15:49 2014
From: jkorpela at cs.tut.fi (Jukka K. Korpela)
Date: Fri, 24 Oct 2014 15:15:49 +0300
Subject: Code charts and code points
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp>
Message-ID: <544A42F5.7080709@cs.tut.fi>

2014-10-24 15:05, Shriramana Sharma wrote:

> Hi Martin. If you haven't noticed it before, opening Unicode charts in
> PDF readers has something like "SECURED" on the top i.o.w. the charts
> are sorta DRM-protected. So you can't copy-paste the characters. Heck
> you can't even copy-paste the character *names*!

"SECURED" means that there are *some* protections. As I wrote in my earlier message, I had no difficulties in copying characters from a chart. It is flagged as "SECURED", but looking at its properties in Adobe Reader (Ctrl+D), I see copying as allowed but e.g. commenting as disallowed.

The following was copied directly from a chart into this message, so copying the characters and the names is surely possible (though not necessarily in every program):

1F650 🙐 NORTH WEST POINTING LEAF

Yucca

From Tom at bluesky.org Fri Oct 24 08:28:44 2014
From: Tom at bluesky.org (Tom Gewecke)
Date: Fri, 24 Oct 2014 06:28:44 -0700
Subject: fonts for U7.0 scripts
In-Reply-To: <54494980.8090607@adobe.com>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <54494980.8090607@adobe.com>
Message-ID:

On Oct 23, 2014, at 11:31 AM, Eric Muller wrote:

>> How about even having just the glyphs in the Unicode.org charts being in the public domain?
>
> Very easy to achieve:
>
> 1. Ask the owner of the font how much money he wants to part with his property.
> 2. Write a check for the corresponding amount.
> 3. You are now the owner, you can put the font in the public domain.

You are right, of course, but I was thinking of uses other than to make fonts. It seems a bit odd to me sometimes that there is no guaranteed public domain example for characters.

If someone wants to publish and sell a book in which they say something like "This is how Unicode suggests that character U+XXXX is supposed to look:" and then they copy the glyph from the Unicode chart, as I understand it they are violating copyright unless they get permission from the author of the font that was used for the chart. Or if they wanted to use one of the emoji characters from the charts on a public sign.

Is that correct?

From petercon at microsoft.com Fri Oct 24 10:18:24 2014
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 24 Oct 2014 15:18:24 +0000
Subject: fonts for U7.0 scripts
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <54494980.8090607@adobe.com>
Message-ID: <6a87559ce060495d8ca7f87c00a6d07c@CY1PR0301MB0698.namprd03.prod.outlook.com>

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Tom Gewecke

> If someone wants to publish and sell a book in which they say something like
> "This is how Unicode suggests that character U+XXXX is supposed to look:"

Well, since the intent of the codes is to give indication of what the character identity is and _not_ to say how the character _should_ look, then it's a good thing if Unicode isn't authors to make such statements.

Peter

From michel at suignard.com Fri Oct 24 11:01:31 2014
From: michel at suignard.com (Michel Suignard)
Date: Fri, 24 Oct 2014 16:01:31 +0000
Subject: Code charts and code points
In-Reply-To: <544A3D2E.4020200@cs.tut.fi>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp> <544A3D2E.4020200@cs.tut.fi>
Message-ID: <6896b328b78843b2a3325937088f0d79@CO1PR02MB157.namprd02.prod.outlook.com>

I know for a fact (because I did it and just verified) that the font used for those codes uses the real UCS code. The conversion happens in the PDF embedding magic. I could look into it, but I have no easy way to debug the Adobe Distiller path here. Apparently when you get off the beaten path for new characters, the preservation of code points in copy and paste operations is not bulletproof.

Michel

From asmusf at ix.netcom.com Fri Oct 24 11:21:04 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 24 Oct 2014 09:21:04 -0700
Subject: Code charts and code points
In-Reply-To: <544A3D2E.4020200@cs.tut.fi>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp> <544A3D2E.4020200@cs.tut.fi>
Message-ID: <544A7C70.6070409@ix.netcom.com>

On 10/24/2014 4:51 AM, Jukka K. Korpela wrote:

> 2014-10-24 11:17, "Martin J. Dürst" wrote:
>
>> The code charts are published as PDFs. In general, text in PDFs can be
>> copypasted elsewhere. Is there something in place that makes sure that
>> "wrong" Unicode encodings for glyphs published in code charts don't leak
>> elsewhere?
>
> It seems that there isn't. Whether this is serious is a different issue.

I posit that it is mostly an inconvenience.

I understand that most fonts used nowadays are encoded correctly, anyway. But there are exceptions and where they are unavoidable for chart production, getting a chart to display correctly trumps copy&paste. Also, the situation is never static, each version uses a different set of fonts.

In either case, it's not the same issue as creating (and exchanging) running text.

A./

From asmusf at ix.netcom.com Fri Oct 24 11:28:59 2014
From: asmusf at ix.netcom.com (Asmus Freytag)
Date: Fri, 24 Oct 2014 09:28:59 -0700
Subject: Code charts and code points
In-Reply-To: <6896b328b78843b2a3325937088f0d79@CO1PR02MB157.namprd02.prod.outlook.com>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp> <544A3D2E.4020200@cs.tut.fi> <6896b328b78843b2a3325937088f0d79@CO1PR02MB157.namprd02.prod.outlook.com>
Message-ID: <544A7E4B.9010704@ix.netcom.com>

On 10/24/2014 9:01 AM, Michel Suignard wrote:

> I know for a fact (because I did it and just verified) that the font used for those codes uses the real UCS code. The conversion happens in the PDF embedding magic. I could look into it, but I have no easy way to debug the Adobe Distiller path here. Apparently when you get off the beaten path for new characters, the preservation of code points in copy and paste operations is not bulletproof.

And this is presumably true in general, and the code substitutions would then be "random", meaning that they do not establish an alternate encoding for exchange purposes.

That is different from releasing ASCII-hacked or PUA fonts directly, because they do establish alternate encodings and documents in them can be exchanged if viewed with the same fonts.

A./

From ken.whistler at sap.com Fri Oct 24 13:26:12 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Fri, 24 Oct 2014 18:26:12 +0000
Subject: Code charts and code points
In-Reply-To: <544A3D2E.4020200@cs.tut.fi>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <5449A97F.2070109@ix.netcom.com> <544A0B06.9000506@it.aoyama.ac.jp> <544A3D2E.4020200@cs.tut.fi>
Message-ID:

> I think it is imaginable that someone wants to copy a block of
> characters from the code charts, as a handy way of getting them for
> inspection, e.g. for testing how some particular software renders them
> using some particular font(s). I would expect some confusion then if you
> had partly got all wrong characters (code points).

It is imaginable, but people who fiddle with extraction from PDF, particularly for cutting edge encodings and for large sets where they have no guarantee about how the embeddings work for the PDF, should also expect frustration to go with their confusion.

If people want such examples of blocks for inspection, they should instead be using:

http://www.unicode.org/charts/nameslist/

where Mark Davis' tools have created HTML versions of blocks and names list. People can grab as much of that as they want. It is all HTML, is easy to cut and paste, and is guaranteed to have only correct code points. You don't get fonts, of course -- but that is the point. You can use any of that material to test your own installation of fonts.

--Ken

From Tom at bluesky.org Fri Oct 24 13:27:05 2014
From: Tom at bluesky.org (Tom Gewecke)
Date: Fri, 24 Oct 2014 11:27:05 -0700
Subject: fonts for U7.0 scripts
In-Reply-To: <6a87559ce060495d8ca7f87c00a6d07c@CY1PR0301MB0698.namprd03.prod.outlook.com>
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <54494980.8090607@adobe.com> <6a87559ce060495d8ca7f87c00a6d07c@CY1PR0301MB0698.namprd03.prod.outlook.com>
Message-ID:

On Oct 24, 2014, at 8:18 AM, Peter Constable wrote:

> From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Tom Gewecke
>
>> If someone wants to publish and sell a book in which they say something like
>> "This is how Unicode suggests that character U+XXXX is supposed to look:"
>
> Well, since the intent of the codes is to give indication of what the character identity is and _not_ to say how the character _should_ look, then it's a good thing if Unicode isn't authors to make such statements.

I probably didn't express myself clearly before. Even if the book simply says "The charts published by Unicode.org indicate that the following would be a representative glyph for the Character U+XXXX", it seems that you would need permission to copy the glyph. I wonder if that is necessary.

From petercon at microsoft.com Fri Oct 24 17:30:49 2014
From: petercon at microsoft.com (Peter Constable)
Date: Fri, 24 Oct 2014 22:30:49 +0000
Subject: fonts for U7.0 scripts
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <54494980.8090607@adobe.com> <6a87559ce060495d8ca7f87c00a6d07c@CY1PR0301MB0698.namprd03.prod.outlook.com>
Message-ID: <6330a5913975451e8f4f72b8943b1093@CY1PR0301MB0698.namprd03.prod.outlook.com>

Have you tried checking what the Unicode Terms of Use has to say about all this?

Let me help: here's the Terms of Use page:

http://www.unicode.org/copyright.html

Regarding online code charts, it says, "The online code charts carry specific restrictions." If you load any of the code chart PDFs, there's a copyright notice that says this:

    Terms of Use

    You may freely use these code charts for personal or internal business uses only. You may not incorporate them either wholly or in part into any product or publication, or otherwise distribute them without express written permission from the Unicode Consortium. However, you may provide links to these charts.

    The fonts and font data used in production of these code charts may NOT be extracted, or used in any other way in any product or publication, without permission or license granted by the typeface owner(s).

    The Unicode Consortium is not liable for errors or omissions in this file or the standard itself. Information on characters added to the Unicode Standard since the publication of the most recent version of the Unicode Standard, as well as on characters currently being considered for addition to the Unicode Standard can be found on the Unicode web site.

Anyone publishing a book and taking content from some other source is probably going to (or should) contact the owner of that content to get permission. The Unicode Consortium regularly receives requests for permission to use content.
Peter

From ken.whistler at sap.com Fri Oct 24 18:43:59 2014
From: ken.whistler at sap.com (Whistler, Ken)
Date: Fri, 24 Oct 2014 23:43:59 +0000
Subject: fonts for U7.0 scripts
In-Reply-To:
References: <001e01cfedfe$cc0a58e0$641f0aa0$@sonic.net>

<4d16c7e0b8ca4a1fb131206e813bcbdc@CY1PR0301MB0698.namprd03.prod.outlook.com> <54494980.8090607@adobe.com> <6a87559ce060495d8ca7f87c00a6d07c@CY1PR0301MB0698.namprd03.prod.outlook.com>
Message-ID:

Tom Gewecke wondered:

> it seems that you would
> need permission to copy the glyph. I wonder if that is necessary.

To follow on from Peter Constable's response, it comes down to the actual scenario at hand and precisely what one means by "copy the glyph".

Scenario 1

I want to use an example chart (or part of a chart or part of a names list) in my forthcoming textbook on Unicode Algorithms for Squaring the Circle.

Correct action: Contact the Unicode office with details and request permission to reprint said chart (or whatever the exact content is) in your book.

In this case you are "copying the glyph" as part of an extended chunk of content intended for republication.

Scenario 2

I want to cite a sentence from the Unicode Standard which includes some glyph from the charts for my blog post, The True Dirt on Unicode.

Correct action: Feel free. This would fall under fair use.

In this case you are "copying the glyph" for incidental use in a quoted mention.

Scenario 3

I want to use a representative glyph from the Unicode charts to inform my own font design, so I am sure that I am not incorrectly mixing up CUNEIFORM SIGN LAK-449 TIMES PAP PLUS PAP PLUS LU3 with CUNEIFORM SIGN LAK-648 TIMES PAP PLUS PAP PLUS LU3.

Correct action: Feel free. This is what the representative glyphs in the charts are for.

In this case you are "copying the glyph" by reference to its distinctive features, for a new glyph design.

Scenario 4

I want to crack the security on the PDF of the charts and steal the glyph drawing instructions out of the font, so I don't have to do the work myself or pay for a font.

Correct action: Examine your motives and your ethics. This is *never* allowed by the license attached to the charts.

In this case you are simply "absconding with the glyph", thereby stealing someone else's IP.

--Ken

From doug at ewellic.org Sat Oct 25 13:10:29 2014
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 25 Oct 2014 12:10:29 -0600
Subject: fonts for U7.0 scripts
Message-ID:

Peter Constable replied to Shriramana Sharma:

>> Can you clarify what "all kinds of problems" you foresee?
>
> Sure: People find a font that isn't a truly functional,
> Unicode-conformant font for script X and...
>
> - They try using it, find it doesn't display text as expected, and
>   conclude that Unicode doesn't work for their script

Wasn't this, in fact, one of the scenarios (considering rendering engines as well as fonts) that led some people to conclude that Unicode "didn't work" for Tamil, and to lobby for years for a glyph-based encoding?

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell

From lang.support at gmail.com Mon Oct 27 00:36:24 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Mon, 27 Oct 2014 16:36:24 +1100
Subject: Western Cham in Akhar Jawi
Message-ID:

Hi all,

When Western Cham is written in the Arabic script, there is regional variation in the Arabic characters used. Two varieties I am looking at use a character that I can't see in the Unicode charts, although I may have missed it.

The character is an alef with three dots above (with the dots pointing upwards); see the attached images.

Has anyone come across this character used in other contexts?

Andrew

--
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: u-circumflex.jpg
Type: image/jpeg
Size: 22156 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: u-circumflex-e.jpg
Type: image/jpeg
Size: 23667 bytes
Desc: not available
URL:

From roozbeh at unicode.org Mon Oct 27 10:26:22 2014
From: roozbeh at unicode.org (Roozbeh Pournader)
Date: Mon, 27 Oct 2014 08:26:22 -0700
Subject: Western Cham in Akhar Jawi
In-Reply-To:
References:
Message-ID:

This is the first time I'm seeing the character. I suggest writing a Unicode proposal.

From lang.support at gmail.com Mon Oct 27 19:04:44 2014
From: lang.support at gmail.com (Andrew Cunningham)
Date: Tue, 28 Oct 2014 11:04:44 +1100
Subject: Western Cham in Akhar Jawi
In-Reply-To:
References:
Message-ID:

Thanks Roozbeh,

I will most likely write a proposal; at the moment I am still mapping character usage to see if other unencoded characters pop up.

I am also doing the same for the Western Cham script; some of the more recent reforms (within the past 10 years) in Cambodia don't appear to be encoded.

Andrew

--
Andrew Cunningham
Project Manager, Research and Development
(Social and Digital Inclusion)
Public Libraries and Community Engagement
State Library of Victoria
328 Swanston Street
Melbourne VIC 3000
Australia

Ph: +61-3-8664-7430
Mobile: 0459 806 589
Email: acunningham at slv.vic.gov.au
lang.support at gmail.com

http://www.openroad.net.au/
http://www.mylanguage.gov.au/
http://www.slv.vic.gov.au/

From jknappen at web.de Fri Oct 31 08:20:31 2014
From: jknappen at web.de (=?UTF-8?Q?=22J=C3=B6rg_Knappen=22?=)
Date: Fri, 31 Oct 2014 14:20:31 +0100
Subject: Looking for a standard on historical countries
Message-ID:

An HTML attachment was scrubbed...
URL:

From markus.icu at gmail.com Fri Oct 31 10:29:31 2014
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 31 Oct 2014 08:29:31 -0700
Subject: Looking for a standard on historical countries
In-Reply-To:
References:
Message-ID:

On Fri, Oct 31, 2014 at 6:20 AM, "Jörg Knappen" wrote:

> Is anyone here aware of a standard or a de facto standard for names
> or codes of historical countries? For the requirement I have in mind, all
> countries where there was a printing press would be optimal coverage,
> anything going beyond 1974 (ISO 3166-3) will be better than nothing.
I agree that that would be useful, but I am not aware of any such standard or reliable source of data.

This question might be more successful on the cldr-users mailing list where people are more likely to think about region codes and display names. (http://www.unicode.org/consortium/distlist.html)

markus

From verdy_p at wanadoo.fr Fri Oct 31 14:43:19 2014
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 31 Oct 2014 20:43:19 +0100
Subject: Looking for a standard on historical countries
In-Reply-To:
References:

Message-ID:

How is this related to Unicode?

Maybe it's associated with CLDR for the former regional classification of languages, but I doubt this will ever lead to any standardization of historic data, which should remain as is, unchanged, in the old sources, for which there are no longer any active maintainers, just interested people (basically historians, who may comment on them as they wish or invent new terminology for analysts and archivists).

And there's no limit, but as proofs are disappearing there will be lots of political issues with conflicting countries, even from before countries were internationally regulated (before the creation of the League of Nations and later the United Nations), because they only existed by temporary mutual agreements or were the result of wars (and even in that case, most conquered areas were not fully controlled by the theoretical rulers).

Additionally, maps severely lacked modern precision, and names were not standardized at all even in the same language, or within the same local population, depending on contexts of use or the kind of people using them (ecclesiastic institutions, states, parliaments, kings/queens/emperors or their vassals, judges, merchants, farmers...).

It is already difficult to build maps for today's countries. There's in fact no rule in geography (every rule has its own exceptions, including when we just count today's countries standardized by ISO, and people still disagree about what is a country given the various forms of government).