From A.Schappo at lboro.ac.uk Mon Jun 1 06:29:46 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Mon, 1 Jun 2015 11:29:46 +0000
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: References: Message-ID: <5685F2CF-041E-4B67-ACF8-CD8CDEE79F21@lboro.ac.uk>

On 30 May 2015, at 01:20, gfb hjjhjh wrote:

2. Are combining characters like U+20DD intended to work with all different types of characters, or is this a problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear separate in most fonts I use, but if I change the Hiragana Yu to a conventional = sign or some Latin character, most fonts are at least somewhat able to put them together. Or is there any better/alternative representation in Unicode that can show Japanese Hiragana Yu in a circle?

Japanese Hiragana Letter Yu + Combining Enclosing Circle works fine for me using TextEdit on OS X.

André Schappo
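For concreteness, the sequence under discussion is simply the base letter followed by the combining mark; whether the circle actually encloses the base glyph is entirely up to the font and renderer. A minimal Python sketch (illustrative only):

    # HIRAGANA LETTER YU (U+3086) followed by COMBINING ENCLOSING CIRCLE (U+20DD)
    yu_circled = "\u3086\u20DD"
    # The same combining mark on an ASCII base, which, as noted above,
    # more fonts manage to enclose.
    eq_circled = "=\u20DD"
    print(yu_circled, eq_circled)

As for a precomposed alternative: U+32F4 CIRCLED KATAKANA YU exists for the katakana form, but to my knowledge there is no precomposed circled hiragana yu, so the combining-mark sequence (or a higher-level mechanism) is the only plain-text representation.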
From jsbien at mimuw.edu.pl Mon Jun 1 06:49:48 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Mon, 01 Jun 2015 13:49:48 +0200
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: (David Starner's message of "Mon, 01 Jun 2015 01:29:27 +0000")
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
Message-ID: <864mmrejhf.fsf_-_@mimuw.edu.pl>

On Mon, Jun 01 2015 at 3:29 CEST, prosfilaes at gmail.com writes:

> On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:
>> The proposal makes me curious about past and present Unicode policy, e.g. would it be accepted if submitted now.
>
> Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf

The document's author states:

Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).

Is this true?

On the other hand, according to Wikipedia

http://en.wikipedia.org/wiki/Saanich_dialect

in 2014 there were "about 5" native speakers of the language.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From prosfilaes at gmail.com Mon Jun 1 07:05:34 2015
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 01 Jun 2015 12:05:34 +0000
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: <864mmrejhf.fsf_-_@mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl> <864mmrejhf.fsf_-_@mimuw.edu.pl>
Message-ID:

On Mon, Jun 1, 2015 at 4:49 AM Janusz S. Bień wrote:

> The document's author states:
>
> Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).
>
> Is this true?

As far as I know it's still true. Overlay diacritics don't work well, so they're pretty much ignored in encoding new characters.

> On the other hand, according to Wikipedia
>
> http://en.wikipedia.org/wiki/Saanich_dialect
>
> in 2014 there were "about 5" native speakers of the language.

It's what you get when you stock the committee who chooses what characters to encode with linguists. In the most general case, there is text in that language, and someone will want to digitize it.

From eik at iki.fi Mon Jun 1 12:07:24 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Mon, 1 Jun 2015 20:07:24 +0300
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: <864mmrejhf.fsf_-_@mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl> <864mmrejhf.fsf_-_@mimuw.edu.pl>
Message-ID: <001501d09c8d$6ea3ba50$4beb2ef0$@fi>

Please note that overlaid diacritics are not used in decomposition of characters in the Unicode Standard, unless they are used for the indication of negation of mathematical rules (see TUS 7.0, sections 7.9 Combining Marks and 2.12 Equivalent Sequences).

Sincerely, Erkki I. Kolehmainen

-----Alkuperäinen viesti-----
Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Janusz S. "Bien"
Lähetetty: 1. kesäkuuta 2015 14:50
Vastaanottaja: David Starner
Kopio: unicode at unicode.org
Aihe: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)

On Mon, Jun 01 2015 at 3:29 CEST, prosfilaes at gmail.com writes:

> On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:
>> The proposal makes me curious about past and present Unicode policy, e.g. would it be accepted if submitted now.
>
> Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf

The document's author states:

Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).

Is this true?

On the other hand, according to Wikipedia

http://en.wikipedia.org/wiki/Saanich_dialect

in 2014 there were "about 5" native speakers of the language.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
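As a quick illustration of that distinction (a minimal sketch using Python's unicodedata module): letters encoded with overlaid strokes carry no canonical decomposition, whereas letters with above or below diacritics decompose as usual.

    import unicodedata

    # A letter with an overlaid stroke has no canonical decomposition...
    print(unicodedata.decomposition("\u023C"))    # LATIN SMALL LETTER C WITH STROKE -> ''
    # ...whereas a letter with an above diacritic decomposes normally.
    print(unicodedata.decomposition("\u00E9"))    # LATIN SMALL LETTER E WITH ACUTE -> '0065 0301'
    print(unicodedata.normalize("NFD", "\u00E9")) # 'e' + U+0301
    print(unicodedata.normalize("NFD", "\u023C")) # unchanged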
From public at khwilliamson.com Mon Jun 1 13:23:20 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 01 Jun 2015 12:23:20 -0600
Subject: The Oral History Of The Poop Emoji
Message-ID: <556CA318.5060705@khwilliamson.com>

https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

From mark at macchiato.com Mon Jun 1 13:57:44 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 1 Jun 2015 20:57:44 +0200
Subject: The Oral History Of The Poop Emoji
In-Reply-To: <556CA318.5060705@khwilliamson.com>
References: <556CA318.5060705@khwilliamson.com>
Message-ID:

One of many on http://unicode.org/press/emoji.html

Mark

*« Il meglio è l'inimico del bene »*

On Mon, Jun 1, 2015 at 8:23 PM, Karl Williamson wrote:
> https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

From doug at ewellic.org Mon Jun 1 17:42:00 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 01 Jun 2015 15:42:00 -0700
Subject: The Oral History Of The Poop Emoji
Message-ID: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>

I agree with one of the commenters that certain words just should not be used together in headlines.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue Jun 2 00:07:58 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 2 Jun 2015 07:07:58 +0200
Subject: The Oral History Of The Poop Emoji
In-Reply-To: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>
References: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>
Message-ID:

Article de "merde" ? (Not an insult: it is a genuine French word, appropriate to the subject.) Bon appétit ! (if you think about orality...)

2015-06-02 0:42 GMT+02:00 Doug Ewell :
> I agree with one of the commenters that certain words just should not be used together in headlines.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue Jun 2 01:01:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 2 Jun 2015 08:01:25 +0200
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>
References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>
Message-ID:

2015-06-01 1:33 GMT+02:00 Chris :
> Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That's why a standard would be useful.
>
> And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn't necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets.

That's not what I described.
I spoke about using a MIME-compatible private charset identifier, and how such private identifier can be made reasonnably unique by binding it to a domain name or URI. If you had read more carefully I also said that it was absolutely not necessary to dereference that URL: there are many XML schemas binding their namespaces to a URI which is itself not a webpage or to any downloadable DTD or XML schema or XML stylesheet. Google and Microsoft are using this a lot in lots of schemas (which are not described and documented at this URL if they are documented). The URI by itself is just an identifier, it becomes a webpage only when you use it in a web page with an href attribute to create an hyperlink, or to perform some query to a service returning some data. An identifier for a private charset does not need to perform any request to be usable by itself, we just have the identifier which is sufficient by itself. The URI can be also only a base URI for a collection of resources (whose URLs start by this base URI, with conventional extensions appended to get the character properties, or a font; but the best way is to embed this data in your document, in some header or footer, if your document using the private charset is not part of a collection of docs using the same private charset) In that case, you don't need a new UTF: UTF-8 remains usable and you can map your private charset to standard PUAs (and/or to "hacked" characters) according to the private charset needs. The charset indicated in your document (by some meta header) should be sufficient to avoid collisions with other private conventions, it will define the scope of your private charset as the document itself, which will then be interchangeable (and possibly mixable with other documents with some renumbering if there a collisions of assignments between two distinct private charsets: in the document header; add to the charset identifier the range of PUAs which is used, then with two documents colling on this range, you can reencode one automatically by creating a compound charset with subranges of PUAs remapped differently to other ranges). -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Tue Jun 2 04:01:01 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 2 Jun 2015 10:01:01 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> Message-ID: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows. ---- Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: 3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels 3h here local glyph 3 is being used 3z7r means this is local glyph 3 being defined, though not used, at the start of the document as 7 red pixels More than one local glyph could be defined at the start of the document, as desired. ---- This would mean that use of such a glyph within the document would be by just using the quite short base character followed by tag characters sequence using the h request. This would enable document editing to be easier to accomplish. 
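As a minimal sketch of how such a run-length sequence could be decoded (purely illustrative; the single-letter colour codes are only the ones proposed in this thread, not anything standardized), a reader for a sequence such as 7r5y-3b (7 red pixels, 5 yellow pixels, new line, 3 blue pixels, as in the original message quoted below) might look like this in Python:

    # Decodes the compact pixel syntax discussed in this thread:
    # "7r5y-3b" = 7 red pixels, 5 yellow pixels, new line, 3 blue pixels.
    COLOURS = {"k": "black", "r": "red", "y": "yellow",
               "g": "green", "b": "blue", "w": "white"}

    def decode(spec):
        rows, count = [[]], ""
        for ch in spec:
            if ch.isdigit():
                count += ch            # accumulate the pixel count
            elif ch == "-":
                rows.append([])        # next line request
            elif ch in COLOURS:
                rows[-1].extend([COLOURS[ch]] * int(count or "1"))
                count = ""             # a bare colour letter means one pixel
        return rows

    print(decode("7r5y-3b"))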
---- A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. ---- May I mention something that I forgot to mention earlier please? When only one pixel of a particular colour is being specified, it can be specified using just the code for the colour. For example, for 1 red pixel please use r on its own, there is no need to use 1r though 1r should be made to work just in case anyone does use that format. There was a time when I used to use the FORTH programming language and this format of first inputting the number then the operator is based on the way that the FORTH programming language works. William Overington 2 June 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 27/05/2015 - 17:26 (GMTST) To : unicode at unicode.org Subject : Tag characters and in-line graphics (from Tag characters) Tag characters and in-line graphics (from Tag characters) This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice. The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications. The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character. The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider for each example that the characters listed are each the tag version of the character used here and that they all as a group follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer. Examples of displays: Each example is left to right along the line then lines down the page from upper to lower. 7r means 7 pixels red 7r5y means 7 pixels red then 5 pixels yellow 7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels blue Examples of colours available: k black n brown r red o orange y yellow g green (0, 255, 0) b blue m magenta e grey w white c cyan p pink d dark grey i light grey (thus avoiding using lowercase l so as to avoid confusion with figure 1) f deeper green (foliage colour) (0, 128, 0) Next line request: - moves to the next line Local palette requests: 192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64) 7,2u means 7 pixels using local palette colour 2 Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: 3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels 3h here local glyph 3 is being used The above is for bitmaps. It would be possible to use a similar technique to specify a vector glyph as used in fontmaking using on-curve and off-curve points specified as X, Y coordinates together with N for on-curve and F for off-curve. There would need to be a few other commands so as to specify places in the tag character stream where definition of a contour starts and so as to separate the definitions of the glyphs for a colour font and so on. 
This could be made OpenType compatible so that a received glyph could be added into a font. Please feel free to suggest improvements. One improvement could be as to how to build a Unicode code point into a picture so that a font could be transmitted. William Overington 27 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Jun 2 04:40:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 2 Jun 2015 11:40:18 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: Once again no ! Unicode is a standard for encoding characters, not for encoding some syntaxic element of a glyph definition ! Your project is out of scope. You still want to reinvent the wheel. For creating syntax, define it within a language (which does not need new characters (you're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes: they are just like mathematic symbols). Programming languages and data languages (Javascript, XML, JOSN, HTML...) and their syntax are encoded themselves in plain text documents using standard characters) and don't need new characters, APL being an exception only because computers or keyboards were produced to facilitate the input (those that don't have such keyboards used specific editors or the APL runtime envitonment that offer an input method for entering programs in this APL input mode). Anf again you want the chicken before the egg: have you only ever read the encoding policy ? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. Even for currency symbols (which are an exception to the demonstrated use, only because once they are created they are extremely rapidly needed by lot of people, in fact most people of a region as large as a country, and many other countries that will reference or use it it). But even in this case, what is encoded is the character itself, not the glyph or new characters used to defined the glyph ! Can you stop proposing out of topic subjects like this on this list ? You are not speaking about Unicode or characters. Another list will be more appropriate. You help no one here because all you want is to change radically the goals of TUS. 2015-06-02 11:01 GMT+02:00 William_J_G Overington : > Perhaps the solution to at least some of the various issues that have been > discussed in this thread is to define a tag letter z as a code within the > local glyph memory requests, as follows. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Tue Jun 2 05:37:06 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 2 Jun 2015 11:37:06 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> Responding to Philippe Verdy: > Nothing has been published. It has been published. 
It is published in this thread for discussion prior to a possible submission to the Unicode Technical Committee, which could take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document.

http://www.unicode.org/reports/tr51/tr51-2.html

Direct link to 8 Longer Term Solutions:

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

William Overington

2 June 2015

From jcb+unicode at inf.ed.ac.uk Tue Jun 2 06:45:31 2015
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Tue, 2 Jun 2015 12:45:31 +0100
Subject: Tag characters and in-line graphics (from Tag characters)
References: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost>
Message-ID:

On 2015-06-02, William_J_G Overington wrote:
> take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document.
> http://www.unicode.org/reports/tr51/tr51-2.html

That section does not raise a problem. It says what the solution to the emoji problem is: namely that people who want to embed graphics in text should fix their protocols to allow it, instead of subverting Unicode to do it.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From kenwhistler at att.net Tue Jun 2 09:38:30 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 02 Jun 2015 07:38:30 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
Message-ID: <556DBFE6.3060800@att.net>

On 6/2/2015 2:01 AM, William_J_G Overington wrote:
> Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:

Um, that technology already exists. It is called a "font".

> A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character.

And that mechanism has also already been defined. It is called a "cmap":

http://www.microsoft.com/typography/otspec/cmap.htm

--Ken

From jsbien at mimuw.edu.pl Tue Jun 2 14:38:47 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Tue, 02 Jun 2015 21:38:47 +0200
Subject: reversed Polish-hook o
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com>
Message-ID: <863829gat4.fsf@mimuw.edu.pl>

I've just noticed the comment quoted in the subject in the description of 'LATIN SMALL LETTER TURNED DELTA' (U+018D) and I'm intrigued how it got into the standard.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf
>
> (found by googling sencoten site:unicode.org)

I tried to google for the relevant document on both unicode.org and std.dkuug.dk, but without any success.

Actually I intend to look up the history of all the Polonica in Unicode and I would very much appreciate your advice on the best way to search for such information.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
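For reference, the formal character name is easy to confirm programmatically (a minimal sketch); the "reversed Polish-hook o" note itself is, as far as I know, one of the informative aliases published in the NamesList.txt annotations that accompany the code charts, so the version history of NamesList.txt and the L2 document register are probably the best places to search.

    import unicodedata
    print(unicodedata.name("\u018D"))   # LATIN SMALL LETTER TURNED DELTA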
From idou747 at gmail.com Tue Jun 2 17:55:27 2015
From: idou747 at gmail.com (Chris)
Date: Wed, 3 Jun 2015 08:55:27 +1000
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
Message-ID:

I was asking why the glyphs for right arrow → are inconsistent in many sources, through a couple of iterations of Unicode. Perhaps I might observe that one of the reasons is that there is no technical link between the code and the glyph. I can't realistically write a display engine that goes to unicode.org or wherever and dynamically finds the right standard glyph for unknown codes. This is also manifest in my seeing empty squares for characters my platform doesn't know about. This isn't the case with XML, where I can send someone a random XML document and there is a standard way to go out there on the internet and check whether that XML is conformant. Why shouldn't there be a standard way to go out on the net and find the canonical glyph for a code? If there was, then non-standard glyphs would fall out of that technology naturally.

So people are talking about all these technologies that are out there (HTML5, cmap, fonts and so forth), but there is no standard way to construct a list of "characters", some of which might be non-standard, and be able to embed that ANYWHERE one might reasonably expect characters, have it processed in a normal way as characters, and have it sent anywhere and understood.

As you point out, "The UCS will not encode characters without a demonstrated usage." But there are use cases for characters that don't meet the UCS's criteria for a worldwide standard, but are necessary for more specific use cases, like specialised regional, business, or domain-specific situations. My question is: given that Unicode can't realistically (and doesn't aim to) encode every possible symbol in the world, why shouldn't there be an EXTENSIBLE method for encoding, so that people don't have to totally rearchitect their computing universe because they want ONE non-standard character in their documents?

Right now, what happens if you have a domain or locale requirement for a special character? Most likely you suffer without it, because even though you could get it to render in some situations (like hand-coding some IMGs into your web site), you just know you won't be able to realistically input it into emails, word documents, spreadsheets, and whatever other random applications on a daily basis.

What I'm asking is: is it really beyond the Unicode Consortium's scope, and/or would it really be a redundant technology, to, for example, define a UTF-64 coding format, where 32 bits allow 4 billion businesses and individuals to define their own character sets (each of up to 4 billion characters), and then have standard places on the internet (similar to DNS lookup servers) that can provide anyone with glyphs and fonts for them?

Right now, yes there are cmaps, but no standard way to combine characters from different encodings. No standard way to find the cmap for an unknown encoding. There is HTML5, but that doesn't produce something that is recognisable as a list of characters that can be processed as such. (If there is an IMG in text, is it a "character"
or an illustration in the text? How can you refer to a particular set of characters without having your own web server? How you render that text bigger, with the standard reference glyph without manually searching the internet where to find it? There is a host of problems here). All these problems look unsolved to me, and they also look like encoding technology problems to me too. What other consortium is out there are working on character encoding problems? > On 2 Jun 2015, at 7:40 pm, Philippe Verdy wrote: > > Once again no ! Unicode is a standard for encoding characters, not for encoding some syntaxic element of a glyph definition ! > > Your project is out of scope. You still want to reinvent the wheel. > > For creating syntax, define it within a language (which does not need new characters (you're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes: they are just like mathematic symbols). Programming languages and data languages (Javascript, XML, JOSN, HTML...) and their syntax are encoded themselves in plain text documents using standard characters) and don't need new characters, APL being an exception only because computers or keyboards were produced to facilitate the input (those that don't have such keyboards used specific editors or the APL runtime envitonment that offer an input method for entering programs in this APL input mode). > > Anf again you want the chicken before the egg: have you only ever read the encoding policy ? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. > > Even for currency symbols (which are an exception to the demonstrated use, only because once they are created they are extremely rapidly needed by lot of people, in fact most people of a region as large as a country, and many other countries that will reference or use it it). But even in this case, what is encoded is the character itself, not the glyph or new characters used to defined the glyph ! > > Can you stop proposing out of topic subjects like this on this list ? You are not speaking about Unicode or characters. Another list will be more appropriate. You help no one here because all you want is to change radically the goals of TUS. > > 2015-06-02 11:01 GMT+02:00 William_J_G Overington >: > Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows. -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Jun 2 20:09:09 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 3 Jun 2015 10:09:09 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: <556E53B5.4080404@it.aoyama.ac.jp> On 2015/06/03 07:55, Chris wrote: > As you point out, "The UCS will not encode characters without a demonstrated usage.?. 
But there are use cases for characters that don?t meet UCS?s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations. > My question is, given that unicode can?t realistically (and doesn?t aim to) encode every possible symbol in the world, why shouldn?t there be an EXTENSIBLE method for encoding, so that people don?t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's were everybody is converging to. From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline" text may be something similar. The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. Up to now, such as third level hasn't emerged, among else because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols that gain reasonable popularity, so every time somebody has a "real good use case" for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts. No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon. Regards, Martin. 
From duerst at it.aoyama.ac.jp Tue Jun 2 20:22:52 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 3 Jun 2015 10:22:52 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1432867044809.9dc7c15b@Nodemailer> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> Message-ID: <556E56EC.8010402@it.aoyama.ac.jp> On 2015/05/29 11:37, John wrote: > If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. > would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will be dealing with your case in a very efficient way. > Given that its been agreed that private use ranges are a good thing, That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > and given that we can agree that exchanging data is a good thing, Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. > maybe something should bring those two things together. Just a thought. Just a 'non sequitur'. Regards, Martin. From idou747 at gmail.com Tue Jun 2 20:50:19 2015 From: idou747 at gmail.com (Chris) Date: Wed, 3 Jun 2015 11:50:19 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556E53B5.4080404@it.aoyama.ac.jp> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> <556E53B5.4080404@it.aoyama.ac.jp> Message-ID: Martin, you seem to be labouring under the impression that HTML5 is a substitute for character encoding. If it is, why do we need unicode? We could just have documents laden with On 3 Jun 2015, at 11:09 am, Martin J. D?rst wrote: > > On 2015/06/03 07:55, Chris wrote: > >> As you point out, "The UCS will not encode characters without a demonstrated usage.?. But there are use cases for characters that don?t meet UCS?s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. > > Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations. > >> My question is, given that unicode can?t realistically (and doesn?t aim to) encode every possible symbol in the world, why shouldn?t there be an EXTENSIBLE method for encoding, so that people don?t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? > > As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: > > Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). 
So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. > > Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's were everybody is converging to. > > From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline" text may be something similar. > > The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. > > Up to now, such as third level hasn't emerged, among else because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols that gain reasonable popularity, so every time somebody has a "real good use case" for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts. > > No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon. > > Regards, Martin. From idou747 at gmail.com Tue Jun 2 21:09:17 2015 From: idou747 at gmail.com (Chris) Date: Wed, 3 Jun 2015 12:09:17 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556E56EC.8010402@it.aoyama.ac.jp> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: > On 3 Jun 2015, at 11:22 am, Martin J. D?rst wrote: > > On 2015/05/29 11:37, John wrote: > >> If I had a large document that reused a particular character thousands of times, > > Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. If you have a daughter, look at her Facebook messenger, and then get back to me. >> would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? > > If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will be dealing with your case in a very efficient way You can?t ask the entire computing universe to compress everything all the time. And that is what your comment amounts to. Because the whole point under discussion is how can we encode stuff such that you can hope to universally move it around between different documents, formats, applications, input fields and platforms without any massage. > Given that its been agreed that private use ranges are a good thing, > > That's not agreed upon. 
I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? >> and given that we can agree that exchanging data is a good thing, > > Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. The point is a universally recognised way. Of course you, me or anybody could design many good ways to solve any problem we might come up with. That doesn?t mean it will interoperate with anybody else though. > >> maybe something should bring those two things together. Just a thought. > > Just a 'non sequitur'. > > Regards, Martin. From verdy_p at wanadoo.fr Wed Jun 3 00:42:31 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 07:42:31 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> References: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> Message-ID: No, nothing about what you propose, which is to encode graphics directly with a custom syntax using specific Unicode characters for this syntax itself. There's no such statement in the UTR, even for "longer term". What is proposed instead is a way to *reference* (not "define") graphics. For the rest, you need a rich-text format to embed graphics (using the syntax of this rich-text format, such as HTML), but this syntax remains out of scope of Unicode which will not standardize any graphic format, or any language by its syntax. Even for CLDR, you will use some JSON or XML rich-text format to create references, or embed some small graphics. But CLDR is NOT part of the Unicode Standard itself, and does not encode new characters (and I've not seen the CLDR requesing additions in the UCS for its own use, instead it uses its own assignments for PUAs where needed, als also for its own private locale tags for internal references within the CLDR data itself). 2015-06-02 12:37 GMT+02:00 William_J_G Overington : > Responding to Philippe Verdy: > > > Nothing has been published. > > It has been published. It is published in this thread for discussion prior > to a possible submission to the Unicode Technical Committee that could > take place if people on this mailing list feel that it is a good solution > to the problem raised in section 8 of the following document. > > http://www.unicode.org/reports/tr51/tr51-2.html > > Direct link to > > 8 Longer Term Solutions > > http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term > > > William Overington > > 2 June 2015 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Wed Jun 3 03:26:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 3 Jun 2015 09:26:05 +0100 (BST) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) Message-ID: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) >> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. For example, http://forum.high-logic.com/viewtopic.php?f=10&t=2957 http://forum.high-logic.com/viewtopic.php?f=10&t=2672 William Overington 3 June 2015 From frederic.grosshans at gmail.com Wed Jun 3 04:28:32 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Wed, 03 Jun 2015 11:28:32 +0200 Subject: reversed Polish-hook o In-Reply-To: <863829gat4.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <863829gat4.fsf@mimuw.edu.pl> Message-ID: <556EC8C0.1060907@gmail.com> An HTML attachment was scrubbed... URL: From idou747 at gmail.com Wed Jun 3 06:38:02 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 04:38:02 -0700 (PDT) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Message-ID: <1433331480845.7b37573e@Nodemailer> Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL. ? Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wrote: > Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) >>> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). >> They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? > Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. 
> I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. > For example, > http://forum.high-logic.com/viewtopic.php?f=10&t=2957 > http://forum.high-logic.com/viewtopic.php?f=10&t=2672 > William Overington > 3 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 08:03:30 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 15:03:30 +0200 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <1433331480845.7b37573e@Nodemailer> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> <1433331480845.7b37573e@Nodemailer> Message-ID: This possibly fails because William possibly forgot to embed his font in the document itself (or Serif PagePlus forgets to do it when it creates the PDF document, and refuses to embed glyphs from the font that are bound to Unicode PUAs when it creates the embeded font). However no such problem when creating PDFs with MS Office, or via the Adobe Acrobat "printer" driver or other printer drivers generating PDF files, including Google Cloud Print). So this could be a misuse of Serif PagePlus when creating the PDF (I don't know this software, may be there are options set up that ells it to not embed fonts from a list of fonts that the recipient is supposed to have installed locally, to save storage space for the document, byt evoiding such embedding). Another reason may be that the font is marked as "not embeddable" within its exposed properties. Another reason may be that John tries to open the document with a software that does not handle embedded fonts, or that ignores it to use only the fonts preinstalled by John in his preferences. And in such case the result depends only on fonts preinstalled on his local system (that does not include the fonts created by William), or his software is setup to use exclusively a specific local "Unicode" font for all PUAs. (Softwares that behaved in this bad way was old versions of Internet Explorer, due to limitation of his text renderers, however this should not happen with PDFs, provided you have used a correct plugion version for displaying PDF in the browser : if this fails in the browser, download the document and view it with Adobe Reader instead of view the plugin: there are many PDF plugins on markets that do not support essential features and just built to display PDF containing scanned bitmaps, but with very poor support of text or vector graphics, or tuned specifically to change the document for another device or paper format). Without citing which softwares are used (and which PDF in the list does not load correctly), it is difficult to tell, but for me I have no problems with a few docs I saw created by William. So: NO F = NO FAIL for me. 2015-06-03 13:38 GMT+02:00 John : > Yep, I clicked on your document and saw an empty square where your > character should be. > > F = FAIL. > > ? > Chris > > > On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > >> Private Use Area in Use (from Tag characters and in-line graphics (from >> Tag characters)) >> >> >> >> That's not agreed upon. 
I'd say that the general agreement is that the >> private ranges are of limited usefulness for some very limited use cases >> (such as designing encodings for new scripts). >> >> >> > They are of limited usefulness precisely because it is pathologically >> hard to make use of them in their current state of technological evolution. >> If they were easy to make use of, people would be using them all the time. >> I?d bet good money that if you surveyed a lot of applications where custom >> characters are being used, they are not using private use ranges. Now why >> would that be? >> >> >> Actually, I have used Private Use Area characters a lot, and, once I had >> got used to them, I found them incredibly straightforward to use. >> >> >> I have made fonts that include Private Use Area encodings using the >> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >> both to produce PDF documents and PNG graphics, as needed for my particular >> project at the time. >> >> >> For example, >> >> >> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >> >> >> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >> >> >> William Overington >> >> >> >> >> 3 June 2015 >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 08:20:14 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 15:20:14 +0200 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> <1433331480845.7b37573e@Nodemailer> Message-ID: Note that copy-pasting from a PDF to another document is very tricky, the PDF format requires that embedded fonts use precise glyph naming conventions to map glyphs back to characters, otherwise the Unicode characters sequences associated to a glyph (or multiple glyphs if they are ligatured or in complex layouts or with uncommon decorations, or rendered on a non uniform background, or with glyphs filled with pattern, such as labels over a photograph or cartographic map) will not be recognized. This remark about PDFs is also applicable to PostScript documents. Some PDF readers in that case attempt to perform some OCR (plus dictionary lookups to fix mis readings) for common glyph forms, but will almost always fail if the glyphs are too specific such as when they include swashes, ligatures, or unknown scripts and scripts with complex layouts (such as the invented script created by William for noting sentences with specific "characters" with new glyphs, and a specific syntax and specific layout rules. In other casesn the PDF reader will jsut put in the clipboard only a bitmap for the selection, and it will be another software that will attempt to interpret the bitmap with OCR. The glyph naming conventions are documented in PDF specifications, but many PDF creators do not follow these rules, and copying text from these PDFs fails 2015-06-03 15:03 GMT+02:00 Philippe Verdy : > This possibly fails because William possibly forgot to embed his font in > the document itself (or Serif PagePlus forgets to do it when it creates the > PDF document, and refuses to embed glyphs from the font that are bound to > Unicode PUAs when it creates the embeded font). However no such problem > when creating PDFs with MS Office, or via the Adobe Acrobat "printer" > driver or other printer drivers generating PDF files, including Google > Cloud Print). 
> > So this could be a misuse of Serif PagePlus when creating the PDF (I don't > know this software, may be there are options set up that ells it to not > embed fonts from a list of fonts that the recipient is supposed to have > installed locally, to save storage space for the document, byt evoiding > such embedding). Another reason may be that the font is marked as "not > embeddable" within its exposed properties. > > Another reason may be that John tries to open the document with a software > that does not handle embedded fonts, or that ignores it to use only the > fonts preinstalled by John in his preferences. And in such case the result > depends only on fonts preinstalled on his local system (that does not > include the fonts created by William), or his software is setup to use > exclusively a specific local "Unicode" font for all PUAs. > > (Softwares that behaved in this bad way was old versions of Internet > Explorer, due to limitation of his text renderers, however this should not > happen with PDFs, provided you have used a correct plugion version for > displaying PDF in the browser : if this fails in the browser, download the > document and view it with Adobe Reader instead of view the plugin: there > are many PDF plugins on markets that do not support essential features and > just built to display PDF containing scanned bitmaps, but with very poor > support of text or vector graphics, or tuned specifically to change the > document for another device or paper format). > > Without citing which softwares are used (and which PDF in the list does > not load correctly), it is difficult to tell, but for me I have no problems > with a few docs I saw created by William. So: > > NO F = NO FAIL for me. > > 2015-06-03 13:38 GMT+02:00 John : > >> Yep, I clicked on your document and saw an empty square where your >> character should be. >> >> F = FAIL. >> >> ? >> Chris >> >> >> On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < >> wjgo_10009 at btinternet.com> wrote: >> >>> Private Use Area in Use (from Tag characters and in-line graphics (from >>> Tag characters)) >>> >>> >>> >> That's not agreed upon. I'd say that the general agreement is that >>> the private ranges are of limited usefulness for some very limited use >>> cases (such as designing encodings for new scripts). >>> >>> >>> > They are of limited usefulness precisely because it is pathologically >>> hard to make use of them in their current state of technological evolution. >>> If they were easy to make use of, people would be using them all the time. >>> I?d bet good money that if you surveyed a lot of applications where custom >>> characters are being used, they are not using private use ranges. Now why >>> would that be? >>> >>> >>> Actually, I have used Private Use Area characters a lot, and, once I had >>> got used to them, I found them incredibly straightforward to use. >>> >>> >>> I have made fonts that include Private Use Area encodings using the >>> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >>> both to produce PDF documents and PNG graphics, as needed for my particular >>> project at the time. >>> >>> >>> For example, >>> >>> >>> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >>> >>> >>> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >>> >>> >>> William Overington >>> >>> >>> >>> >>> 3 June 2015 >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Wed Jun 3 08:24:04 2015 From: prosfilaes at gmail.com (David Starner) Date: Wed, 03 Jun 2015 13:24:04 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: Chris wrote: > There is no way to compare 2 HTML elements and know they are talking about the same character That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ Note that even in Unicode, the set ? ? ? ? s S ? may be considered the same character or up to seven different characters, depending on case-folding, canonization and accent dropping. > Similarly, there is no way to search or index html elements. If a HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. On Tue, Jun 2, 2015 at 7:11 PM Chris wrote: > You can?t ask the entire computing universe to compress everything all the > time. Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Jun 3 08:53:34 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 3 Jun 2015 14:53:34 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: <1565119.42918.1433339614901.JavaMail.defaultUser@defaultHost> Earlier in this thread, on 2 June 2015, I wrote as follows: > A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. I have now thought of a mechanism to use. Please imagine the base character followed by a sequence of tag characters, the tag characters here represented by ordinary letters and digits. Here is an example of the mechanism for defining the glyph for U+E702 in a particular document as 7 red pixels. HE702U7r The tag H character switches to hexadecimal input mode, then there are as many tag characters as necessary to express in hexadecimal notation the code point of the character for which the definition is being made, then there is a tag U character to action the definition and go out of hexadecimal input mode. The tag 7r is to express 7 red pixels. In practice the number of tag characters after the tag U character might be around 200, the above tag 7r is just a minimal example so as to explain the concept. ---- While posting, may I mention please one other matter? Previously I mentioned using tag R, tag G and tag B is defining colours. 
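(To make the mechanism just described concrete, here is a toy decoder for sequences of the HE702U7r kind. It follows the description above: the tag characters are stood in for by ordinary ASCII letters and digits, tag H opens hexadecimal input mode, tag U actions the definition, and the run-length colour tags after it, such as 7r for seven red pixels, are read in order. The extra colour letter g used in the demo is an assumption made purely for this illustration.)

    # Toy decoder for the sketched glyph-definition sequence, e.g. "HE702U7r".
    # Tag characters are represented here by plain ASCII, as in the example above.
    import re

    def decode(sequence):
        match = re.fullmatch(r"H([0-9A-Fa-f]+)U(.*)", sequence)
        if not match:
            raise ValueError("expected H<hex code point>U<glyph data>")
        code_point = int(match.group(1), 16)        # character being defined
        runs = [(int(count), colour)                 # e.g. (7, 'r') = 7 red pixels
                for count, colour in re.findall(r"(\d+)([a-z])", match.group(2))]
        return code_point, runs

    print(decode("HE702U7r3g"))   # (59138, [(7, 'r'), (3, 'g')]), i.e. U+E702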
I now add tag A into that colour definition so as to define opacity, that is, what is sometimes called transparency, where 0 means totally transparent and 255 means totally opaque. If no value is stated for A then it should be presumed to have a value of 255, so that the default situation is to define opaque colours. ---- I feel that the information in this thread is now a good basis for assessing whether this suggested format could be a useful open source system with good interoperability potential that could usefully be submitted to the Unicode Technical Committee. William Overington 3 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 09:04:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 16:04:34 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: Compression is even more important today on mobile networks: mobile apps are very verbose over the net, and you can easily end up paying for the extra volume. In addition, mobile networks are frequently much slower than advertised; even if you pay the extra subscription to get 3G/4G, you depend on the antennas and on the number of people around you. In my home, 3G/4G in fact does not work at all, and this is the case in many places around my city, even though they are sold as having full coverage. For example, just downloading an application or updating it is simply impossible: I have to be at home, connected to my Wifi router, and when its internet link fails (this happens sometimes for several hours) I have extremely slow connections on 3G/4G, which is also overcrowded at the same time and only delivers 2G speeds. Lots of people frequently have to put up with low bandwidth on mobile networks, independently of the price they paid for their subscription. So compressing data is still extremely important (even for texts or for the smallest web requests). Thankfully, compression is now part of the web transport, but this does not mean that apps need not learn to represent their interchanged data efficiently and develop less verbose protocols and APIs. There are more and more people using mobile networks now than fixed landline internet accesses (or home wifi routers connected to them), and even for the latter, fiber access is still just for a minority of people in dense areas; the others don't get more than a handful of megabit/s on their DSL access. If you look at worldwide internet connections, a large majority of people don't get more than 2 megabit/s: this is enough for reading/sending SMS, phone calls, or exchanging emails, but not if you need frequent updates to your apps, your apps are too verbose, and there are too many apps in the background. Many people cannot view videos on their mobile access, or only with very poor quality if they view them "live" (and they cannot download them slowly either, due to lack of storage space on their mobile device, so videos have to remain short in total volume and duration). So I disagree: compression is absolutely needed, even more today than it was in the past, when mobile Internet accesses were still for a minority. Mobile networks are not really faster today (their bandwidth does not double every three years like the local performance of devices!)
But with this extra local performance you can support more complex compression schemes that require more CPU/GPU power, which is no longer a bottleneck; the real bottleneck is the effectively available bandwidth of the mobile network (smaller than the connection bandwidth, because that bandwidth is shared... and expensive). 2015-06-03 15:24 GMT+02:00 David Starner : > Chris wrote: > > There is no way to compare 2 HTML elements and know they are talking > about the same character > > That's because character identity is a hard problem. Is the emoji TIGER > the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? > > > http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ > > Note that even in Unicode, the set ? ? ? ? s S ? may be considered the > same character or up to seven different characters, depending on > case-folding, canonization and accent dropping. > > > Similarly, there is no way to search or index html elements. If a HTML > document contained an image of a particular custom character, there would > be no way to ask google or whatever to find all the documents with that > character. Different documents would represent it differently. > > You can index links to images. If two documents represent it differently, > then I go back to the above; we can't know that they're the same thing. > > On Tue, Jun 2, 2015 at 7:11 PM Chris wrote: > >> You can't ask the entire computing universe to compress everything all >> the time. > > > Anytime we care about how much space text takes up, it should be > compressed. It compresses very well. On the other hand, it's rare that > anyone cares anymore; what's a few hundred kilobytes between friends? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 3 09:56:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 03 Jun 2015 07:56:33 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> Chris wrote: > Right now, what happens if you have a domain or locale requirement for > a special character? That's what the PUA is for. Assign a PUA code point to your special character, create a font which implements the PUA character, create a brief "private agreement" which states that this code point refers to that character and which mentions the font, put the private agreement on the web, and publish your document with a reference to the agreement. For most non-professionals, creating the font is the tricky part. Also see Section 23.5 of TUS. Note that I am disagreeing with Martin about the PUA being useful only as a scratch area for standardization. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Wed Jun 3 10:14:39 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 03 Jun 2015 08:14:39 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150603081439.665a7a7059d7ee80bb4d670165c8327d.bec9174c59.wbe@email03.secureserver.net> Chris wrote: > Why shouldn't there be a standard way to go out on the net and find > the canonical glyph for a code? Because there isn't one. Glyphs are suggestions, meant to convey the identity of the character. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
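(Doug's recipe is also easy to support programmatically, since the three Private Use Area ranges are fixed by the standard. Below is a minimal sketch that flags code points falling under some private agreement; U+E702 is simply reused from the earlier example in this thread.)

    # Flag code points in one of Unicode's three Private Use Area ranges,
    # i.e. characters whose interpretation depends entirely on a private agreement.
    PUA_RANGES = [
        (0xE000, 0xF8FF),        # Basic Multilingual Plane PUA
        (0xF0000, 0xFFFFD),      # Supplementary PUA-A (plane 15)
        (0x100000, 0x10FFFD),    # Supplementary PUA-B (plane 16)
    ]

    def is_private_use(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

    for ch in "Example: \uE702":
        if is_private_use(ch):
            print(f"U+{ord(ch):04X} is a Private Use character; see the private agreement.")

(The same test can also be written as unicodedata.category(ch) == 'Co'.)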
From idou747 at gmail.com Wed Jun 3 19:17:34 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 17:17:34 -0700 (PDT) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: References: Message-ID: <1433377053793.5f2c25d8@Nodemailer> I don?t use old software, I use up to date versions of everything on a Mac. Very standard setup.? There?s a lot of links there. Maybe they do work in PDFs, but they certainly don?t work in the browser, and they don?t work when I click the txt files. Basically what you?re saying is that PDFs have a way to make this work. so what? Unless we are proposing that everything in the universe be PDF, this doesn?t really help. There should be a standard way to put custom characters anywhere that characters belong and have things ?just work?. Clearly right now things don?t just work. And without even bothering to try I know if I tried cutting and pasting from those PDFs into somewhere else, it won?t work. ? Chris On Wed, Jun 3, 2015 at 11:20 PM, Philippe Verdy wrote: > Note that copy-pasting from a PDF to another document is very tricky, the > PDF format requires that embedded fonts use precise glyph naming > conventions to map glyphs back to characters, otherwise the Unicode > characters sequences associated to a glyph (or multiple glyphs if they are > ligatured or in complex layouts or with uncommon decorations, or rendered > on a non uniform background, or with glyphs filled with pattern, such as > labels over a photograph or cartographic map) will not be recognized. This > remark about PDFs is also applicable to PostScript documents. > Some PDF readers in that case attempt to perform some OCR (plus dictionary > lookups to fix mis readings) for common glyph forms, but will almost always > fail if the glyphs are too specific such as when they include swashes, > ligatures, or unknown scripts and scripts with complex layouts (such as the > invented script created by William for noting sentences with specific > "characters" with new glyphs, and a specific syntax and specific layout > rules. In other casesn the PDF reader will jsut put in the clipboard only a > bitmap for the selection, and it will be another software that will attempt > to interpret the bitmap with OCR. > The glyph naming conventions are documented in PDF specifications, but many > PDF creators do not follow these rules, and copying text from these PDFs > fails > 2015-06-03 15:03 GMT+02:00 Philippe Verdy : >> This possibly fails because William possibly forgot to embed his font in >> the document itself (or Serif PagePlus forgets to do it when it creates the >> PDF document, and refuses to embed glyphs from the font that are bound to >> Unicode PUAs when it creates the embeded font). However no such problem >> when creating PDFs with MS Office, or via the Adobe Acrobat "printer" >> driver or other printer drivers generating PDF files, including Google >> Cloud Print). >> >> So this could be a misuse of Serif PagePlus when creating the PDF (I don't >> know this software, may be there are options set up that ells it to not >> embed fonts from a list of fonts that the recipient is supposed to have >> installed locally, to save storage space for the document, byt evoiding >> such embedding). Another reason may be that the font is marked as "not >> embeddable" within its exposed properties. 
>> >> Another reason may be that John tries to open the document with a software >> that does not handle embedded fonts, or that ignores it to use only the >> fonts preinstalled by John in his preferences. And in such case the result >> depends only on fonts preinstalled on his local system (that does not >> include the fonts created by William), or his software is setup to use >> exclusively a specific local "Unicode" font for all PUAs. >> >> (Softwares that behaved in this bad way was old versions of Internet >> Explorer, due to limitation of his text renderers, however this should not >> happen with PDFs, provided you have used a correct plugion version for >> displaying PDF in the browser : if this fails in the browser, download the >> document and view it with Adobe Reader instead of view the plugin: there >> are many PDF plugins on markets that do not support essential features and >> just built to display PDF containing scanned bitmaps, but with very poor >> support of text or vector graphics, or tuned specifically to change the >> document for another device or paper format). >> >> Without citing which softwares are used (and which PDF in the list does >> not load correctly), it is difficult to tell, but for me I have no problems >> with a few docs I saw created by William. So: >> >> NO F = NO FAIL for me. >> >> 2015-06-03 13:38 GMT+02:00 John : >> >>> Yep, I clicked on your document and saw an empty square where your >>> character should be. >>> >>> F = FAIL. >>> >>> ? >>> Chris >>> >>> >>> On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < >>> wjgo_10009 at btinternet.com> wrote: >>> >>>> Private Use Area in Use (from Tag characters and in-line graphics (from >>>> Tag characters)) >>>> >>>> >>>> >> That's not agreed upon. I'd say that the general agreement is that >>>> the private ranges are of limited usefulness for some very limited use >>>> cases (such as designing encodings for new scripts). >>>> >>>> >>>> > They are of limited usefulness precisely because it is pathologically >>>> hard to make use of them in their current state of technological evolution. >>>> If they were easy to make use of, people would be using them all the time. >>>> I?d bet good money that if you surveyed a lot of applications where custom >>>> characters are being used, they are not using private use ranges. Now why >>>> would that be? >>>> >>>> >>>> Actually, I have used Private Use Area characters a lot, and, once I had >>>> got used to them, I found them incredibly straightforward to use. >>>> >>>> >>>> I have made fonts that include Private Use Area encodings using the >>>> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >>>> both to produce PDF documents and PNG graphics, as needed for my particular >>>> project at the time. >>>> >>>> >>>> For example, >>>> >>>> >>>> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >>>> >>>> >>>> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >>>> >>>> >>>> William Overington >>>> >>>> >>>> >>>> >>>> 3 June 2015 >>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From idou747 at gmail.com Wed Jun 3 19:21:00 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 17:21:00 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> Message-ID: <1433377259559.1a60883d@Nodemailer> So what you?re saying is that the current situation where you see an empty square ? for unknown characters is better than seeing something useful? ? Chris On Thu, Jun 4, 2015 at 12:59 AM, Doug Ewell wrote: > Chris wrote: >> Right now, what happens if you have a domain or locale requirement for >> a special character? > That's what the PUA is for. Assign a PUA code point to your special > character, create a font which implements the PUA character, create a > brief "private agreement" which states that this code point refers to > that character and which mentions the font, put the private agreement on > the web, and publish your document with a reference to the agreement. > For most non-professionals, creating the font is the tricky part. > Also see Section 23.5 of TUS. > Note that I am disagreeing with Martin about the PUA being useful only > as a scratch area for standardization. > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Wed Jun 3 19:46:26 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 10:46:26 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> > On 3 Jun 2015, at 11:24 pm, David Starner wrote: > > Chris wrote: > > There is no way to compare 2 HTML elements and know they are talking about the same character > > That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? I personally think emoji should have one, single definitive representation for this exact reason. The subtley of different emotion between one happy face and another can be miles apart. Emoji are a little different to other symbols in that respect. Symbols that are purely symbolic can be changed as much as you like as long as they are recognisable. Emoji have too many shades of meaning for allowing change. Both of these scenarios are an argument that there should be custom characters with at least one official representation. Emoji because you don?t really want variation. Symbols because if you don?t have a local representation, then something is better than nothing. If you don?t have a local Snow Flake for example, any old snow flake will be fine. This is not a hard problem at all. Is one tony the tiger the same as another? The community interested in tony the tiger can make decisions like that. But having made that decision there needs to be a way for generic computer programs that don?t know about that community to do reasonable things with tony the tiger characters. > > You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. You can?t know because they?re images. That?s my exact point. 
Anybody talking about HTML5 and images as a solution to custom characters is not proposing a valid solution. > > On Tue, Jun 2, 2015 at 7:11 PM Chris > wrote: > You can?t ask the entire computing universe to compress everything all the time. > > Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends? You compress things when they are on the move. Between computers and as you are writing it to a file. But you can?t compress generically while it is in memory. You can?t iterate over compressed bits. You can?t process them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Wed Jun 3 19:57:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 03 Jun 2015 17:57:45 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <1433377053793.5f2c25d8@Nodemailer> References: <1433377053793.5f2c25d8@Nodemailer> Message-ID: <556FA289.7070703@att.net> On 6/3/2015 5:17 PM, John wrote: > > > > > so what? > > There should be a standard way to put custom characters anywhere that > characters belong and have things ?just work?. > > Well, that's the rub, isn't it? We (in IT) are still working pretty dang hard on the simpler problem, to wit: There should be a way to put *standard characters* anywhere that characters belong and have things "just work". And even *that* is a hard problem that has taken over 25 years -- and is still a work in progress. What you are asking for is not much removed from: There should be a *standard *way to put "*stuff-I-just-made-up*" anywhere that characters belong and have things "just work". See, the first barrier to getting anywhere with this goal is to get everybody concerned with text in IT (or perhaps even worse, all the hundreds of millions of people who *use* characters in their devices) to agree what a "custom character" is. And if the rollicking "discussions" underway about emoji have taught us much of anything, it includes the fact that people do *not* all agree about what characters are or what should be a candidate for "just working" -- or even what "just work" might mean for them, in any case. So before declaring that your position is self-evidently correct about how things should just work, it might be a good idea to put some real thought into how one would define and standardize the concept of a "custom character" sufficiently precisely that there would be a snowball's chance in hell that all the implementations of text out there would a) know what it was, b) know how it should display and render, c) know how it should be input, stored, and transmitted and d) know how it should be interpreted universally. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Jun 3 19:59:21 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 00:59:21 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: On Wed, Jun 3, 2015 at 5:46 PM Chris wrote: > > I personally think emoji should have one, single definitive representation > for this exact reason. > Then you want an image. I don't see what's hard about that. 
> The community interested in tony the tiger can make decisions like that. > That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one a decision that most communities won't bother trying to make. > You can?t know because they?re images. > You can't know because the only obvious equivalence relation is exact image identity. You can?t iterate over compressed bits. You can?t process them. Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War in Peace in the on-CPU cache. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 20:27:27 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 4 Jun 2015 03:27:27 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: 2015-06-04 2:59 GMT+02:00 David Starner : > You can?t iterate over compressed bits. You can?t process them. >> > > Why not? In any language I know of that has iterators, there would be no > problem writing one that iterates over compressed input. If you need to > mutate them, that is hard in compressed formats, but a new CPU can store > War in Peace in the on-CPU cache. > You're right, today the CPU is no longer the bottleneck, which is now * the speed of long buses and communcaition links, with their limited (and costly) bandwidth as this is a shared medium used by more and more people but requiring mssive infrastures, or physical constraints even on the fastest serial buses, both implying transmission roundtrip times (limiting random access, which is a severe problem now that we have to access to extremely large volumes of data distributed over multiple devices or over a full network * the storage capacity for the fastest storage medium (such as flash memory, which is the only option for mobile devices, but also the most expensive). In both cases you need compression (the second bottleneck on storage volumes will fade out in a few years, but not the bandwidth constraints). It really pays now to use compression schemes (even the most complex ones such as those used to transmit live video: locally a CPU or GPU will easily handle the compression scheme. 
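(David's earlier remark that iterating over compressed input is unremarkable is easy to demonstrate: Python's gzip module exposes a compressed file as an ordinary text stream, so character-by-character processing never needs an uncompressed copy on disk. A small sketch, with war_and_peace.txt.gz as a hypothetical input file.)

    # Iterate over the characters of a gzip-compressed text file; decompression
    # happens transparently, block by block, as the stream is read.
    import gzip

    counts = {}
    with gzip.open("war_and_peace.txt.gz", "rt", encoding="utf-8") as stream:
        for line in stream:          # yields ordinary str lines
            for ch in line:
                counts[ch] = counts.get(ch, 0) + 1

    print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])  # ten most common characters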
Research on compression schemes is really not finished; it has never been as active as it is today, including for text, because of the explosion in data volumes, even if the volume of text is now largely overwhelmed by the volume of images, video and audio. (You can't compute a lot of things from audio/image/video data sources; we still need text to give semantics to these media, from which you can derive data or perform searches. There is still a lot to do in handling images and speech audio and detecting some semantics in them, but you won't get as much information from audio/video as can be represented by text: OCR, for example, is a very heuristic process that produces lots of false guesses, still far more than human brains make across the broad range of variations that we call "cultures"; computers are still very poor at recognizing cultures with as many variations as those we recognize through social interactions and years of education and *personal* experience.) -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 3 21:22:22 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 3 Jun 2015 20:22:22 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1433377259559.1a60883d@Nodemailer> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: Chris (John) wrote: > So what you're saying is that the current situation where you see an > empty square ? for unknown characters is better than seeing something > useful? No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From idou747 at gmail.com Thu Jun 4 02:43:48 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 17:43:48 +1000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <556FA289.7070703@att.net> References: <1433377053793.5f2c25d8@Nodemailer> <556FA289.7070703@att.net> Message-ID: > > Well, that's the rub, isn't it? > > We (in IT) are still working pretty dang hard on the simpler problem, to wit: > > There should be a way to put standard characters anywhere that characters belong > and have things "just work". > > And even *that* is a hard problem that has taken over 25 years -- and is still a work in > progress. Unicode is 2 things. (1) A binary format: the technology bit. (2) The social part: agreeing what the characters should be. (1) is, relatively speaking, super easy. Roughly speaking, 16 bit unique numbers in a row. (2) is hard because coming to an agreement is hard. What I'm saying is that we can totally bypass (2) for many use cases if people had the power to make their own characters. Yes, it is hard to meet in committee and agree on stuff. Don't force people to do that. You do that by putting more work into (1), and less hand-wringing about (2). > See, the first barrier to getting anywhere with this goal is to get everybody concerned > with text in IT (or perhaps even worse, all the hundreds of millions of people who > *use* characters in their devices) to agree what a "custom character" is. There is no need for such a thing. Everybody knows roughly what the concept of a custom character is. What is needed is the technology to do it so that everyone can seamlessly enjoy it.
> And if > the rollicking "discussions" underway about emoji have taught us much of anything, > it includes the fact that people do *not* all agree about what characters are or > what should be a candidate for "just working" -- or even what "just work" might > mean for them, in any case. That?s because you?re immersed in (2), which is a different kind of problem. You don?t have to agree on details if everybody has the power to create new characters. > So before declaring that your position is self-evidently correct about how things > should just work, it might be a good idea to put some real thought into how > one would define and standardize the concept of a "custom character" sufficiently > precisely that there would be a snowball's chance in hell that all the implementations > of text out there would a) know what it was, b) know how it should display and > render, c) know how it should be input, stored, and transmitted and d) know how it > should be interpreted universally. I already gave several possible implementation suggestions. I?ll repeat one of them again merely to illustrate that it is possible. Characters are 64 bit. 32 bits are stripped off as the ?character set provider ID?. That is sent to one of many canonical servers akin to DNS servers to find the URL owner of those characters. At that location you?d find a number of representations of the character whether TrueType, vector graphics, bitmaps or whatever. The rendering engine would download the representation and display it to the user. All without the user having to know anything about character sets, custom fonts or whatever. So you come across character 12340000000017. The OS asks charset server who owns charset 1234. They reply ?facebook.com/charsets?. The OS asks facebook.com/charsets for facebook.com/charsets/17/truetype/pointsize12 representation. All this happens invisible to the user. Of course if it is already cached on their machine, then it wouldn?t happen. -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 02:57:33 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 17:57:33 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: <4831F779-0B20-4B54-A85F-40308EEA4D57@gmail.com> > On 4 Jun 2015, at 10:59 am, David Starner wrote: > > On Wed, Jun 3, 2015 at 5:46 PM Chris > wrote: > > I personally think emoji should have one, single definitive representation for this exact reason. > > Then you want an image. I don't see what's hard about that. I already explained why an image and/or HTML5 is not a character. I?ll repeat again. And the world of characters is not limited to emoji. 1. HTML5 doesn?t separate one particular representation (font, size, etc) from the actual meaning of the character. So you can?t paste it somewhere and expect to increase its point size or change its font. 2. It?s highly inefficient in space to drop multi-kilobyte strings into a document to represent one character. 3. The entire design of HTML has nothing to do with characters. So there is no way to process a string of characters interspersed with HTML elements and know which of those elements are a ?character?. 
This makes programatic manipulation impossible, and means most computer applications simply will not allow HTML in scenarios where they expect a list of ?characters?. 4. There is no way to compare 2 HTML elements and know they are talking about the same character. I could put some HTML representation of a character in my document, you could put a different one in, and there would absolutely no way to know that they are the same character. Even if we are in the same community and agree on the existence of this character. 5. Similarly, there is no way to search or index html elements. If a HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. HTML is a rendering technology. It makes things LOOK a particular way, without actually ENCODING anything about it. The only part of of HTML that is actually searchable in a deterministic fashion is the part that is encoded - the unicode part. > > The community interested in tony the tiger can make decisions like that. > > That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one a decision that most communities won't bother trying to make. Apparently the world makes decisions all the time without meeting in committee. Strange but true. It?s called making a decision. Facebook have created a lot of emoji characters without consulting any committee and it seems to work fine, albeit restricted to the facebook universe because of a lack of a standard. > > > You can?t know because they?re images. > > You can't know because the only obvious equivalence relation is exact image identity. Because? there is no standard!! If facebook wants to define 2 emoji images, maybe one is bigger than the other, and yet basically the same, to mean the same thing, then that would be their choice. Since I expect they have a lot of smart people working there, I expect it would work rather well. Just like Microsoft issues courier fonts in different point sizes and we all feel they have made that work fairly well. You seem to be arguing the nonsense position that if someone for example, made a snowflake glyph slightly different to the unicode official one, that it is wrong. That of course is nonsense. People can make sensible decisions about this without the unicode committee. > > You can?t iterate over compressed bits. You can?t process them. > > Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War in Peace in the on-CPU cache. You can?t do it because no standard library, programming language, or operating system is set up to iterate over characters of compressed data. So if you want to shift compressed bits around in your app, it will take an awful lot of work, and the bits won?t be recognised by anyone else. Now if someone wants to define the next version of unicode to be a compressed format, and every platform supports that with standard libraries, computer languages etc, then fine that could work. Yet again I point out, lots of things MIGHT be possible in the real world IF that is how a standard is formulated. But all the chatter about this or that technology is pie in the sky without that standard. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From idou747 at gmail.com Thu Jun 4 03:03:12 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 18:03:12 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: > > No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). People with their iphones and ipads and so forth don?t want to have ?private agreements?, they don?t want to ?install character sets?. The want it to ?just work?. I wish Steve Jobs was here to give this lecture. I highly doubt actually that it is even possible to install a private character set font on an iphone such that it would be available to all applications. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. From wjgo_10009 at btinternet.com Thu Jun 4 03:46:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 4 Jun 2015 09:46:05 +0100 (BST) Subject: Custom characters (was: Re: Private Use Area in Use) Message-ID: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Chris expressed an idea, hypothetically starting: > Characters are 64 bit. The following posts might be helpful. http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0277.html http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0307.html For 64 bits, or somewhere in that region, maybe just a few bits less, a longer sequence of high surrogate characters followed by a low surrogate character could possibly be used. I did also find the following post. http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0256.html I thought that I would mention it, though I cannot quite at the moment understand the issue. William Overington 4 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 08:04:49 2015 From: idou747 at gmail.com (John) Date: Thu, 04 Jun 2015 06:04:49 -0700 (PDT) Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Message-ID: <1433423088288.48975bc8@Nodemailer> It occurs to me that the existing DNS system was designed to map 32bit numbers to domain names. So a hypothetical UTF64 format, with 32 bits of provider ID could be co-opted into the DNS system under a different record domain (Similar to how there is A records, and MX records, there could be UTF records.) Then all that would need defining would be some kind of directory hierarchy convention. Like /codepoint-number/font-type/font-size or whatever, and rendering engines could automatically lookup DNS, download from the web site via HTTP the font or bitmap or whatever, and seamlessly show you the right character. 
It wouldn?t be overly hard to implement, and a format without headers like this one, in the same general style as UTF-16 and UTF-32, wouldn?t upset the normal programming style of working with characters, so programming languages and existing apps wouldn?t have that much difficulty in upgrading. Mostly just a matter of upgrading the character size. I think this stuff could be relatively easy to define and standardise. You could basically define the entire technology in 1 A4 document. People have just got to want it badly enough to agree on it, and give it the imprimatur of the consortium. ? Chris On Thu, Jun 4, 2015 at 6:49 PM, William_J_G Overington wrote: > Chris expressed an idea, hypothetically starting: >> Characters are 64 bit. > The following posts might be helpful. > http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0277.html > http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0307.html > For 64 bits, or somewhere in that region, maybe just a few bits less, a longer sequence of high surrogate characters followed by a low surrogate character could possibly be used. > I did also find the following post. > http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0256.html > I thought that I would mention it, though I cannot quite at the moment understand the issue. > William Overington > 4 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Jun 4 09:39:27 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 14:39:27 +0000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <1433423088288.48975bc8@Nodemailer> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: On Thu, Jun 4, 2015 at 6:09 AM John wrote: > Mostly just a matter of upgrading the character size. Which totally blows any concern with text size out of the water. Using 30 bytes to define certain very rare characters and 1 byte to define ASCII is way better then using 8 bytes to define all characters. I think this stuff could be relatively easy to define and standardise. You > could basically define the entire technology in 1 A4 document. People have > just got to want it badly enough to agree on it, and give it the imprimatur > of the consortium. > > Then define it. It doesn't need Unicode involved at all, unless nobody really wants it enough to use it without it getting tossed into the Unicode package. -------------- next part -------------- An HTML attachment was scrubbed... URL: From parker at parkerhiggins.net Thu Jun 4 11:43:20 2015 From: parker at parkerhiggins.net (Parker Higgins) Date: Thu, 4 Jun 2015 09:43:20 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <1433377053793.5f2c25d8@Nodemailer> <556FA289.7070703@att.net> Message-ID: On Thu, Jun 4, 2015 at 12:43 AM, Chris wrote: > > Characters are 64 bit. 32 bits are stripped off as the ?character set > provider ID?. That is sent to one of many canonical servers akin to DNS > servers to find the URL owner of those characters. At that location you?d > find a number of representations of the character whether TrueType, vector > graphics, bitmaps or whatever. The rendering engine would download the > representation and display it to the user. All without the user having to > know anything about character sets, custom fonts or whatever. > > So you come across character 12340000000017. The OS asks charset server > who owns charset 1234. 
They reply ?facebook.com/charsets?. The OS asks > facebook.com/charsets for facebook.com/charsets/17/truetype/pointsize12 > representation. > > All this happens invisible to the user. Of course if it is already cached > on their machine, then it wouldn?t happen. > Just in case you haven't considered this, there are LOTS of circumstances where this could be a problem from a user's perspective, or even abused by the provider. We've already moved largely from automatically displaying *images* of remote origin in email for privacy concerns?I don't really need Facebook (in your example, but substitute for an abusive spouse or a repressive government if it makes you feel better) knowing when I am reading plaintext documents on my own local machine. Thanks, Parker -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Jun 4 14:30:31 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 04 Jun 2015 12:30:31 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Message-ID: <5570A757.1010301@ix.netcom.com> On 6/4/2015 1:46 AM, William_J_G Overington wrote: > I thought that I would mention it, though I cannot quite at the moment > understand the issue. I'm long past where I'm sure I understand what the issue is. :) A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Jun 4 14:36:26 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2015 20:36:26 +0100 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: <20150604203626.64f88aa9@JRWUBU2> On Thu, 04 Jun 2015 14:39:27 +0000 David Starner wrote: > On Thu, Jun 4, 2015 at 6:09 AM John wrote: > > > Mostly just a matter of upgrading the character size. > > > Which totally blows any concern with text size out of the water. > Using 30 bytes to define certain very rare characters and 1 byte to > define ASCII is way better then using 8 bytes to define all > characters. The character size can be increased to 64 bits in such a way that no new surrogates are required, current UTF-8 text remains UTF-8, current UTF-16 text remains UTF-16 and current UTF-32 remains UTF-32, the extended UTF-8 still has 8-bit code units, the extended UTF-16 still has 16-bit units, and the extended UTF-32 still has 32-bit code units. In fact, the character size can be made unbounded. The trick is to extend UTF-8 indefinitely, and then for UTF-16 and UTF-32 repeat the idea of the UTF-8 scheme using sequences of two or more low surrogates (or two or more high surrogates - one must chose) much as UTF-8 uses bytes. Tom Bishop publicised the idea. Richard. 
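(The UTF-8 half of that trick is just the familiar prefix-code pattern allowed to continue past four bytes. The sketch below only illustrates that general idea; it is not the specific scheme Tom Bishop published, the sequences it emits beyond U+10FFFF are of course not valid UTF-8, and it stops at seven-byte sequences, 36 bits of payload, rather than being unbounded.)

    # Generalised UTF-8-style encoder: the same prefix code as UTF-8, allowed to
    # continue past the 4-byte / U+10FFFF limit. Sequences longer than 4 bytes
    # are NOT valid UTF-8; this only shows how the pattern extends.
    def encode_extended(value):
        if value < 0x80:                       # 1 byte: 0xxxxxxx, plain ASCII
            return bytes([value])
        n = 1
        while True:
            n += 1                             # try an n-byte sequence
            payload_bits = 6 * (n - 1) + (7 - n)
            if value < (1 << payload_bits):
                break
            if n == 7:
                raise ValueError("value too large for this 7-byte sketch")
        out = []
        v = value
        for _ in range(n - 1):                 # continuation bytes 10xxxxxx, low bits first
            out.append(0x80 | (v & 0x3F))
            v >>= 6
        out.append(((0xFF << (8 - n)) & 0xFF) | v)   # lead byte: n ones, a zero, top bits
        return bytes(reversed(out))

    print(encode_extended(0x41).hex())         # 41             -- unchanged ASCII
    print(encode_extended(0x1F600).hex())      # f09f9880       -- still ordinary UTF-8
    print(encode_extended(0x200000000).hex())  # fe888080808080 -- 7 bytes, past U+10FFFF

(The UTF-16 and UTF-32 extensions Richard describes would apply the same pattern to sequences of surrogates or of 32-bit code units.)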
From frederic.grosshans at gmail.com Thu Jun 4 15:05:33 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Thu, 04 Jun 2015 20:05:33 +0000 Subject: Another take on the English apostrophe in Unicode Message-ID: An interesting argument for U+02BC MODIFIER LETTER APOSTROPHE as English apostrophe : https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ Fr?d?ric -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Thu Jun 4 16:34:27 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 4 Jun 2015 14:34:27 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Looks all wrong to me. "don?t" is a contraction of two words, it is not one word. English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawai?ian ?Okina .) You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Jun 4 20:31:09 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 01:31:09 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote: > "don?t" is a contraction of two words, it is not one word. > But as he points out, it's not a contraction of don and t; it is, at best, a contraction of do and n't. It's eliding, not punctuating. In the comments, he also brings up the examples of "Don?t you mind?" being okay but not *"Do not you mind?", and "fo?c?sle". > You can't use simple regular expressions to find word boundaries. Who uses _simple_ regular expressions? You can't use any code to reliably find word boundaries in English, and that's a problem. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Jun 4 21:01:56 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 4 Jun 2015 19:01:56 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the work ack-ack isn't decomposable into words, or even morphemes, "ack" and "ack". Leo On Thu, Jun 4, 2015 at 6:31 PM, David Starner wrote: > On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer > wrote: > >> "don?t" is a contraction of two words, it is not one word. >> > > But as he points out, it's not a contraction of don and t; it is, at best, > a contraction of do and n't. It's eliding, not punctuating. In the > comments, he also brings up the examples of "Don?t you mind?" being okay > but not *"Do not you mind?", and "fo?c?sle". > > > You can't use simple regular expressions to find word boundaries. > > Who uses _simple_ regular expressions? You can't use any code to reliably > find word boundaries in English, and that's a problem. 
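(The practical difference between the two candidates is easy to see with a deliberately naive tokenizer: U+2019 has General_Category Pf, final punctuation, while U+02BC is Lm, a modifier letter, so only the latter survives inside a \w+ "word". The behaviour shown is Python's re module; a UAX #29 word-boundary implementation treats an apostrophe between letters more carefully, which is Markus's point about not relying on simple regular expressions.)

    # Contrast the two candidate apostrophes under a naive \w+ tokenizer.
    import re
    import unicodedata

    for apostrophe in ("\u2019", "\u02BC"):
        word = "don" + apostrophe + "t"
        print(
            f"U+{ord(apostrophe):04X}",
            unicodedata.category(apostrophe),   # 'Pf' for U+2019, 'Lm' for U+02BC
            re.findall(r"\w+", word),           # ['don', 't'] vs ['donʼt']
        )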
> -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 22:26:33 2015 From: idou747 at gmail.com (Chris) Date: Fri, 5 Jun 2015 13:26:33 +1000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: <1A29B678-92FC-42E4-9B00-4C0F0078112C@gmail.com> > > I think this stuff could be relatively easy to define and standardise. You could basically define the entire technology in 1 A4 document. People have just got to want it badly enough to agree on it, and give it the imprimatur of the consortium. > > Then define it. It doesn't need Unicode involved at all, unless nobody really wants it enough to use it without it getting tossed into the Unicode package. That?s like saying that nobody really wanted anything Unicode published because they could have done it themselves. That?s what the anti-custom character arguments around here claim, so why not disband? The problem at hand is that everybody out there who does have some kind of requirement IS defining their own proprietary solution, which is different to everybody else?s solution. Even on this very thread people can?t decide if the right way to address this is PUAs and custom character maps, or HTML5 snippets, and we?ve had a few other suggestions too! > I don't really need Facebook (in your example, but substitute for an abusive spouse or a repressive government if it makes you feel better) knowing when I am reading plaintext documents on my own local machine. Well? I would think in the vast majority of circumstances there would be no downloading involved. A typical scenario would be you use a Facebook app, or access a Facebook web site for example. That would cause downloading all the associated custom characters. Then you might do something like copy your text into say Microsoft word. No downloading because it?s already on your machine. I would anticipate most apps would choose to, if appropriate, cache them in their file format. So if you send the word document to someone else they also would have no downloading. Maybe then if that person decided to cut the characters out of that document into another app, maybe increase the font size or something, maybe then a download would be required. OK, so what about in that situation, the user takes some action that results in the rendering engine finding an unknown character? I can think of a lot of ways to address that and solve privacy. Here is one possibility. All unknown characters are rendered like this: Then when you click on the character, the OS?s font engine will locate and download it, and display it to the user. So the user had the choice, leave them unrendered, or download. Pretty simple for the user to learn, and gives them the choice. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: attachment.jpeg Type: image/jpeg Size: 3026 bytes Desc: not available URL: From prosfilaes at gmail.com Thu Jun 4 23:25:52 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 04:25:52 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. 
On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis wrote: > Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for > example, the work ack-ack isn't decomposable into words, or even morphemes, > "ack" and "ack". > > Leo > > On Thu, Jun 4, 2015 at 6:31 PM, David Starner > wrote: > >> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer >> wrote: >> >>> "don?t" is a contraction of two words, it is not one word. >>> >> >> But as he points out, it's not a contraction of don and t; it is, at >> best, a contraction of do and n't. It's eliding, not punctuating. In the >> comments, he also brings up the examples of "Don?t you mind?" being okay >> but not *"Do not you mind?", and "fo?c?sle". >> >> > You can't use simple regular expressions to find word boundaries. >> >> Who uses _simple_ regular expressions? You can't use any code to reliably >> find word boundaries in English, and that's a problem. >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Fri Jun 5 01:01:53 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 4 Jun 2015 23:01:53 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: > Hyphens generally make multiple words into one anyway. There's not really > multiple hyphens the way there's separate quotes and apostrophes. > Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. Leo > On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis wrote: > >> Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, >> for example, the work ack-ack isn't decomposable into words, or even >> morphemes, "ack" and "ack". >> >> Leo >> >> On Thu, Jun 4, 2015 at 6:31 PM, David Starner >> wrote: >> >>> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer >>> wrote: >>> >>>> "don?t" is a contraction of two words, it is not one word. >>>> >>> >>> But as he points out, it's not a contraction of don and t; it is, at >>> best, a contraction of do and n't. It's eliding, not punctuating. In the >>> comments, he also brings up the examples of "Don?t you mind?" being okay >>> but not *"Do not you mind?", and "fo?c?sle". >>> >>> > You can't use simple regular expressions to find word boundaries. >>> >>> Who uses _simple_ regular expressions? You can't use any code to >>> reliably find word boundaries in English, and that's a problem. >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 01:58:24 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 23:58:24 -0700 Subject: Another take on the English apostrophe in Unicode Message-ID: On June 4, 2015, at 11:01 PM, Leo Broukhis wrote: > > >On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: > >Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. > >Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Fri Jun 5 02:16:07 2015 From: leob at mailcom.com (Leo Broukhis) Date: Fri, 5 Jun 2015 00:16:07 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: > But the point was that treating hyphens as parts of words is not generally a wrong thing. That brings us back to my original question: where's MODIFIER LETTER HYPHEN, then? A word is a sequence of letters, isn't it? :) I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Leo On Thu, Jun 4, 2015 at 11:58 PM, David Starner wrote: > On June 4, 2015, at 11:01 PM, Leo Broukhis wrote: > >> >> >>On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: >> >>Hyphens generally make multiple words into one anyway. There's not really >> multiple hyphens the way there's separate quotes and apostrophes. >> >>Generally, but not always, just as apostrophes aren't always at a >> contracted word boundary. There is only one hyphen because no language >> (AFAIK) claims it as part of its alphabet. > > But the point was that treating hyphens as parts of words is not generally a > wrong thing. There is one generally consistent rule for hyphens. When > apostrophes and quotes are conflated, there is no one generally acceptable > rule. From qsjn4ukr at gmail.com Fri Jun 5 04:43:49 2015 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Fri, 5 Jun 2015 12:43:49 +0300 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: The conflict is between linguists and programmers. In plain text apostrophe is a punctuation used instead letters (unreadable, one or more) or as separator for avoid connecting letters into ligature or syllable, between parts of composite word as well as inside the simple word, or finally, as quotation mark. Yes it is ambiguous! It is. It just is! Linguists say "It is. We see that. We know that". And programmers say "That's wrong! We can't understand that". Just are you so stupid if you can't! Modifier letter apostrophe is a letter that used as itself and means itself (ejective sound e.g.) only. Don't use it else. It just make more confusion. From wjgo_10009 at btinternet.com Fri Jun 5 04:48:01 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 10:48:01 +0100 (BST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> Markus Scherer wrote: > How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a "show in colour mode" where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively. If people want this facility, maybe it could become published in a Unicode Technical Report so that standardization and interoperability could be achieved. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Fri Jun 5 04:49:14 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 5 Jun 2015 18:49:14 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: <5571709A.4010801@it.aoyama.ac.jp> On 2015/06/04 17:03, Chris wrote: > I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution. I'm not sure he would come to the same conclusion as you. > This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. > > Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. You are right that it would be strictly technically possible. Not only that, it has been so for 10 or 20 years. As an example, in 1996 at the WWW Conference in Paris I was participating in a workshop on internationalization for the Web, and by chance I was sitting between the participant from Adobe and the participant from Microsoft. These were the main companies working on font technology at that time, and I asked them how small it would be possible to make a font for a single character using their technologies (the purpose of such a font, as people on this thread should be able to guess, would be as part of a solution to exchange single, "user-defined" characters). I don't even remember their answers. The important thing here that the idea, and the technology, have been around for a long time. So why didn't it take on? Maybe the demand is just not as big as some contributors on this list claim. Also, maybe while the technology itself isn't rocket science, the responsible people at the relevant companies have enough experience with technology deployment to hold back. To give an example of why the deployment aspect is important, there were various Web-like hypertext technologies around when the Web took off in the 1990. One of them was called HyperG. It was technologically 'better' than the Web, in that it avoided broken links. But it was much more difficult to deploy, and so it is forgotten, whereas the Web took off. Regards, Martin. From asmus-inc at ix.netcom.com Fri Jun 5 05:46:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 05 Jun 2015 03:46:10 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <5571709A.4010801@it.aoyama.ac.jp> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> <5571709A.4010801@it.aoyama.ac.jp> Message-ID: <55717DF2.4030705@ix.netcom.com> On 6/4/2015 17:03 , "Chris" wrote: > This whole discussion is about the fact that it would be technically > possible to have private character sets and private agreements that > your OS downloads without the user being aware of it. 
The sticky issues are not the questions of how to make available fonts or images for use by the OS. Instead, they concern the fact that any such a model violates some pretty basic guarantees of plain text that the entire net infrastructure relies on. There are very obvious security issues. The start with tracking; every time you access a custom code point, that fact potentially results in a trackable interaction. This problem affects even the "sticker" solution that people are hoping for for emoji. (On my system, no external resources are displayed when I first open any message, and there is a reason for that). Beyond tracking, and beyond stickers (that is pictures that look like pictures) a generalized custom character set would allow "text" that is no longer really stable. You would be able to deliver identical e-mails to people that display differently, because when you serve the custom fonts, you would be able to customize what you deliver under the same custom character set designator. While this would be a wonderful way to circumvent censorship (other than the "man in the middle" version), you would likewise seriously undermine the ability to filter unwanted or undesirable texts, because the custom character set engine might recognize when a request comes from a filter and not the end user. (Just the other day, I came across a hacked website that responded differently to search engined than to live users, making the hack effective for one and invisible to the other. Custom character sets would seem to just add to the hackers' arsenal here). Finally, custom character sets sound like a great idea when thinking of an extension of an existing character set. But that's not where the issues are. The issues come in when you use the same technology to provide aliases for existing code points or for other custom characters. Aliasing undermines the ability to do search (or any other content-focused processing, from sorting to spell-check). At that point, the circle closes. When Unicode was created, the alternative then was ISO 2022, which was a standard that addressed the issue of how to switch among (albeit pre-defined) character sets to achieve, in principle, coverage equal to the union of these character sets. Unicode was created to address two main deficiencies of that situation. Unification addressed the aliasing issue, so that code points were no longer "opaque" but could be interpreted by software (other than display), which was the second big drawback of the patchwork of character sets. A processing model for opaque code points is possible to define, but it isn't very practical and in the late eighties people had had enough were glad to be quit of it. Seen from this perspective, the discussion about custom character sets presents itself as a giant step backward, undermining the very advances that underlie the rapid acceptance and spread of Unicode. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Fri Jun 5 06:20:33 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 12:20:33 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <55717DF2.4030705@ix.netcom.com> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> <5571709A.4010801@it.aoyama.ac.jp> <55717DF2.4030705@ix.netcom.com> Message-ID: <31341075.26231.1433503233504.JavaMail.defaultUser@defaultHost> Asmus Freytag wrote about security issues. This is interesting reading and I have learned a lot from the post about various security issues. Whilst the post is in this thread and follows from a post in this thread, the topic has seemed to moved to the Custom characters thread. I note that what you write about seems to me that it would not apply to my suggestion in my original post: is that correct? http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html Also the following two posts. http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html Whilst the ideas raised by Chris are interesting, they do seem to be distinctly different from what I suggested. So, for clarity, do you regard my suggested format as having any security issues, and if so, what please? I know that some people have opined that my suggested format is out of scope for Unicode, yet the scope of Unicode is what the Unicode Technical Committee decides is the scope of Unicode, and my suggested format does provide a way to include custom glyphs within a Unicode plain text document by using the new base character followed by tag characters method. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 08:15:09 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 13:15:09 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis wrote: > I agree that conflating apostrophes and quotes is a source of > problems, however, existence of the MODIFIER LETTER [same glyph as > used for English contractions] in Unicode is a coincidence which > should not have an effect on usage of apostrophes in English. Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely then creating a new one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Jun 5 08:51:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 14:51:31 +0100 (BST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> References: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> Message-ID: <13990923.38289.1433512291110.JavaMail.defaultUser@defaultHost> Markus Scherer wrote: >> How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? 
I replied: > Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a "show in colour mode" where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? I am wondering whether some existing software packages might be able to be used for the character inputting part using customized keyboard short cuts. https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts I realize that the cyan and red colours cannot be done at present, yet I have now thought of the alternative for now of being able to test what is in the text by using a special version of an open source font where there are distinctive glyphs one from the other for the two characters. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 09:13:04 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 14:13:04 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR wrote: > The conflict is between linguists and programmers. No, it's not. > Yes it is ambiguous! > It is. It just is! Linguists say "It is. We see that. We know that". > "Now you programmers find some way to deal with that so you can produce useful corpuses for linguistic work." Which is what this is all about, is producing good linguistic interpretations of plain text, for, among others, linguists whose supply of scanned text has exceeded their ability to hand-process it. > Modifier letter apostrophe is a letter that used as itself and means > itself (ejective sound e.g.) only. Don't use it else. It just make > more confusion. > If you don't know what language a text is in, you can't tell what sounds letters make. Adding this character to English's repertoire won't change that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From KalvesmakiJ at doaks.org Fri Jun 5 09:26:50 2015 From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel) Date: Fri, 5 Jun 2015 14:26:50 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I don?t have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N?T."Language59, no. 3 (1983): 502?513. It?s nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435 From mark at macchiato.com Fri Jun 5 09:47:15 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 5 Jun 2015 16:47:15 +0200 Subject: =?UTF-8?B?aHR0cDovL+KciPCfjrDwn5K4Lndz?= Message-ID: -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Fri Jun 5 09:48:09 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 5 Jun 2015 16:48:09 +0200 Subject: =?UTF-8?B?UmU6IGh0dHA6Ly/inIjwn46w8J+SuC53cw==?= In-Reply-To: References: Message-ID: Whoops, sent too soon. A surprise: http://?????.ws Mark *? 
Il meglio ? l?inimico del bene ?* On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ?? wrote: > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Jun 5 10:36:27 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 05 Jun 2015 08:36:27 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> I wrote, crumpled up, and threw away about three different responses. I thought about ISO 2022 and about accessing the web for every PUA character, as Asmus mentioned, and about the size of the user base, as Martin mentioned. I thought about character properties and about ephemerality. I didn't think of the spoofing implications that Asmus described, which would affect both the automatic PUA font download and the inline drawing language. Either of these could be used to spell out, let's say, "paypal.com" rather convincingly and with minimal effort. I might have more experience with the PUA than many list members, having transcribed the 27,000-word "Alice's Adventures in Wonderland" into my constructed alphabet two years ago, in a PUA encoding, so that Michael Everson could publish it in book form. One of the many learning experiences of this project was finding out which software tools play nicely with the PUA and which don't. Some tools "just worked" while others would not give acceptable results with any amount of effort. At no point, however, did I suppose that a font with my alphabet, or any of the jillions of others that have been invented "during a boring day in class" (see Omniglot for tons of examples), should be silently downloaded to a user's computer, consuming bandwidth and disk space, without her knowledge. That's practically malware. Maybe I'm just not enough of a Distinguished Visionary to understand how insanely great this would be (unfortunately, celebrity name-dropping doesn't work with me). Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like "strongly discouraged" and "not interoperable" even in the presence of an agreement. Given this, and given that no system I'm aware of magically downloads fonts for *regularly encoded characters* (I still have no font for Arabic math symbols), I personally would not expect Unicode to perform a 180 on this. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Fri Jun 5 10:40:37 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 05 Jun 2015 08:40:37 -0700 Subject: Another take on the English apostrophe in Unicode Message-ID: <20150605084037.665a7a7059d7ee80bb4d670165c8327d.08a0959f19.wbe@email03.secureserver.net> QSJN 4 UKR wrote: > And programmers say "That's wrong! We can't understand that". Just are > you so stupid if you can't! You know, we really aren't all like that. Some of us actually try to meet user needs. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From daniel.buenzli at erratique.ch Fri Jun 5 10:48:13 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 5 Jun 2015 16:48:13 +0100 Subject: ucd beta, stable filenames Message-ID: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Hello, Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like http://www.unicode.org/Public/beta/ and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it hard for implementers to automate file downloads for testing with the beta. Thanks, Daniel From daniel.buenzli at erratique.ch Fri Jun 5 10:53:44 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 5 Jun 2015 16:53:44 +0100 Subject: ucd beta, stable filenames In-Reply-To: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> References: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Message-ID: <60241C8F9FA14A49B5D230EF3900DC4D@erratique.ch> Le vendredi, 5 juin 2015 ? 16:48, Daniel B?nzli a ?crit : > and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). Or both with and without the suffix of course. Daniel From john at mitre.org Fri Jun 5 12:29:10 2015 From: john at mitre.org (John D. Burger) Date: Fri, 5 Jun 2015 13:29:10 -0400 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: > On Jun 4, 2015, at 17:34 , Markus Scherer wrote: > > Looks all wrong to me. > > "don?t" is a contraction of two words, it is not one word. Yes it is. Is "keyboard" two words? How about "newspaper"? If "don't" is two words, please tell me what two words make up "won't"? (Hint, neither of them is "will".) Linguistically, "don't" and friends pass all the diagnostics that indicate they're single words. - John Burger > English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawai?ian ?Okina.) > > You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. > > Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. > > If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? > > markus From mark at kli.org Fri Jun 5 17:32:08 2015 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 05 Jun 2015 18:32:08 -0400 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> References: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> Message-ID: <55722368.40703@kli.org> On 06/05/2015 11:36 AM, Doug Ewell wrote: > At no point, however, did I suppose that a font with my alphabet, or any > of the jillions of others that have been invented "during a boring day > in class" (see Omniglot for tons of examples), should be silently > downloaded to a user's computer, consuming bandwidth and disk space, > without her knowledge. That's practically malware. Maybe I'm just not > enough of a Distinguished Visionary to understand how insanely great > this would be (unfortunately, celebrity name-dropping doesn't work with > me). 
> > Unicode has stated consistently for at least 23 years that it would not > ever standardize PUA usage, and over the years some UTC members have > used terms like "strongly discouraged" and "not interoperable" even in > the presence of an agreement. Given this, and given that no system I'm > aware of magically downloads fonts for *regularly encoded characters* (I > still have no font for Arabic math symbols), I personally would not > expect Unicode to perform a 180 on this.
Isn't this what webfonts are all about? You specify a font in the stylesheet, give it a URL, and your browser goes and downloads it and displays the text in it. That seems to me to be a perfectly reasonable use of this sort of "evil font trick" in the PUA (and who knows, even in encoded text? No, I can think of some Bad Things that could result). There isn't anything to stop you from making a page with webfonts that looks like it says one thing but when you copy/paste the text it's something completely different. I should do that someday, just for demonstration purposes... ~mark
From eric.muller at efele.net Fri Jun 5 20:06:06 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 05 Jun 2015 18:06:06 -0700 Subject: ucd beta, stable filenames In-Reply-To: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> References: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Message-ID: <5572477E.6000009@efele.net>
On 6/5/2015 8:48 AM, Daniel Bünzli wrote: > Hello, > > Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like > > http://www.unicode.org/Public/beta/ > > and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it hard for implementers to automate file downloads for testing with the beta. > > +1000 Eric.
From eric.muller at efele.net Fri Jun 5 20:08:02 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 05 Jun 2015 18:08:02 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <557247F2.9050902@efele.net>
On 6/5/2015 10:29 AM, John D. Burger wrote: > Linguistically, "don't" and friends pass all the diagnostics that indicate they're single words. If I am not mistaken, the French "pomme de terre" also passes the diagnostics. So we need a new space character. Eric.
From wjgo_10009 at btinternet.com Sat Jun 6 09:37:28 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 6 Jun 2015 15:37:28 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <3037447.25167.1433601448238.JavaMail.defaultUser@defaultHost>
Doug Ewell wrote: > Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like "strongly discouraged" and "not interoperable" even in the presence of an agreement.
I know that Doug and many others on this mailing list will well understand the following already, yet I feel that it is helpful to emphasise that Unicode does standardize PUA (Private Use Area) usage to the extent that Unicode standardizes which code points are designated as being in the three Private Use Areas and some default properties, such as being left to right. So, whilst Unicode does not standardize which glyphs are used for each code point in any situation, Unicode does standardize the infrastructure so that the Private Use Area can be successfully used.
So if, say, a much larger code space were needed wherein end users could among themselves agree how assignments could be made, it would not be unreasonable for Unicode to define the underlying infrastructure. There is a precedent in the way that the alt.* newsgroup hierarchy was incorporated into the Usenet email newsgroups in the time before the world wide web was invented. A person wishing to start a new alt.* newsgroup could post to alt.config and there was discussion for around a week, often with useful advice as to what name to have for the new newsgroup and the new newsgroup was then started. Regular Usenet newsgroups had a long process of votes to get a new newsgroup started, yet the alt.* newsgroups were different, allowing someone to start a new newsgroup on his or her own initiative. That was a very useful facility. William Overington 6 June 2015 From doug at ewellic.org Sun Jun 7 11:39:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 7 Jun 2015 10:39:38 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: Message-ID: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> "Mark E. Shoulson" wrote: > Isn't this what webfonts are all about? You specify a font in the > stylesheet, give it a URL, and your browser goes and downloads it and > displays the text in it. That's great if you have a stylesheet, a URL, and a browser. HTML is fancy text, and pretty much implies some sort of online connection. I thought we were talking about plain text, and apologize if we weren't or if that important detail was not clear. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun Jun 7 22:36:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 8 Jun 2015 05:36:38 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> References: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> Message-ID: 2015-06-07 18:39 GMT+02:00 Doug Ewell : > "Mark E. Shoulson" wrote: > > Isn't this what webfonts are all about? You specify a font in the >> stylesheet, give it a URL, and your browser goes and downloads it and >> displays the text in it. >> > > That's great if you have a stylesheet, a URL, and a browser. HTML is fancy > text, and pretty much implies some sort of online connection. Everything in HTML is embeddable in a standalone document, including graphics. HTML does not imply any online connection. HTML is independant of HTTP or other transports. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gilbert.lozano at gmail.com Mon Jun 8 14:59:50 2015 From: gilbert.lozano at gmail.com (Gilbert Lozano) Date: Mon, 8 Jun 2015 15:59:50 -0400 Subject: Small (minuscule) Message-ID: Can someone help me find the code for the small (minuscule) p with macron above? Many thanks, Gilbert Lozano -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Mon Jun 8 15:40:36 2015 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Mon, 08 Jun 2015 22:40:36 +0200 Subject: Small (minuscule) In-Reply-To: References: Message-ID: On Mon, 08 Jun 2015 21:59:50 +0200, Gilbert Lozano wrote: > Can someone help me find the code for the small (minuscule) p with macron above? U+0070: p U+0304: combining macron Put those two characters after each other and you get: p?. 
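As a minimal Python sketch of that answer (standard library only; the variable name is mine), the sequence is U+0070 followed by U+0304, and it has to stay decomposed, since the Unicode Character Database defines no precomposed "p with macron":

    import unicodedata

    # "p" followed by U+0304 COMBINING MACRON renders as p with a macron above.
    p_macron = "p\u0304"
    print([unicodedata.name(c) for c in p_macron])
    # ['LATIN SMALL LETTER P', 'COMBINING MACRON']

    # "a" + U+0304 composes to the precomposed U+0101 under NFC ...
    print(len(unicodedata.normalize("NFC", "a\u0304")))  # 1
    # ... but there is no precomposed "p with macron", so the pair stays as is.
    print(len(unicodedata.normalize("NFC", p_macron)))   # 2

Fonts with proper mark positioning will center the macron over the p; fonts without it may show the two side by side.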
From wjgo_10009 at btinternet.com Tue Jun 9 03:17:26 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 9 Jun 2015 09:17:26 +0100 (BST) Subject: Small (minuscule) In-Reply-To: References: Message-ID: <24946800.8782.1433837846291.JavaMail.defaultUser@defaultHost> Just in case this will help. Years ago I made a font that included a small p with a macron, the glyph for the small p with a macron being located in the plane 0 Private Use Area at U+E727. The font is Quest text. It is available free from the following web page. http://www.users.globalnet.co.uk/~ngo/fonts.htm http://forum.high-logic.com/viewtopic.php?f=10&t=682 The glyph is one of a number for special characters and ligatures in the font. Please note specifically that this is not the official Unicode encoding for the character. I simply mention this font just in case you are wanting to get a print out quickly for some reason. William Overington 9 June 2015 ----Original message---- >From : gilbert.lozano at gmail.com Date : 08/06/2015 - 20:59 (GMTST) To : unicode at unicode.org Subject : Small (minuscule) Can someone help me find the code for the small (minuscule) p with macron above? Many thanks, Gilbert Lozano -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: p_macron.png Type: image/png Size: 6326 bytes Desc: not available URL: From pandey at umich.edu Tue Jun 9 17:07:19 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 15:07:19 -0700 Subject: Accessing the WG2 document register Message-ID: Hello all, I learned today that the WG2 document register is not publicly accessible. This means that I, as a proposal author, have no means of accessing the documents that I contribute. Can someone associated with WG2 or anyone else in the know please tell me why these documents are under lock and key? All the best, Anshuman From pandey at umich.edu Tue Jun 9 18:11:26 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 19:11:26 -0400 Subject: Accessing the WG2 document register In-Reply-To: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> References: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> Message-ID: <29A7ACF0-789D-4B28-BD45-77A7C06568E8@umich.edu> Hi Ken, > On Jun 9, 2015, at 6:38 PM, Ken Lunde wrote: > > Welcome to ISO. ? I think I'll skip that party. ?? I've already started to add copyright statements to my proposals. Now I'll add another statement that says: "This document is intended for encoding the XYZ script in The Unicode Standard. If it and its contents are appropriated for encoding XYZ in ISO 10646, then ISO must make this document openly and publicly accessible to all." Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... All the best, Anshu From pandey at umich.edu Tue Jun 9 18:26:16 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 19:26:16 -0400 Subject: Accessing the WG2 document register In-Reply-To: <5577745B.2000100@htpassport.com> References: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> <29A7ACF0-789D-4B28-BD45-77A7C06568E8@umich.edu> <5577745B.2000100@htpassport.com> Message-ID: Shervin, > On Jun 9, 2015, at 7:18 PM, Shervin Afshar wrote: > > Anshuman Pandey observed: > > > Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... > > Hear, hear! 
I really wanted to punctuate my statement with a STAG emoji, or REINDEER at the very least. But, the closest thing I found was ??. Pragmatically on the dot, but unforch not semantically... Anshu From mailinglists at ngalt.com Tue Jun 9 18:46:06 2015 From: mailinglists at ngalt.com (Nathan Sharfi) Date: Tue, 9 Jun 2015 16:46:06 -0700 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Message-ID: <97C666FC-FFC0-42FC-BCAA-F5E01F93BE15@ngalt.com> > On Jun 3, 2015, at 1:26 AM, William_J_G Overington wrote: > > Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) > > >>> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > > >> They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? > > > Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. That's nice; I've found some persistent annoyances when I use PUA codepoints. A while back I learned Quikscript, an alternate English orthography. Since May 2013, my blog's been in Quikscript using PUA codepoints. I've also joined the Shavian mailing list, sent e-mails in Shavian, and wrote an "I'm switching my Quikscript blog to Shavian" blog post in Shavian for April Fool's Day. To do all this typing, I made both Quikscript and Shavian keyboard layouts for OS X, as well as a Quikscript font. All of my Quikscript stuff is linked to from https://www.frogorbits.com/qs/ if you're interested. I'm something of a Johnny-come-lately to Shavian, so I've only used it in the SMP with fonts others have made. So, how much nicer is dealing with Shavian? - The Keyboard Viewer and input-source preview know what font to use for each key for Shavian; Quikscript keyboard layouts display boxes for the letters because there's no way for the system to guess which font to use for a particular codepoint. - Double-tapping a Shavian word in my browser will select the word; double-tapping a Quikscript word will select just one letter. - Internet Explorer will happily break Quikscript text in the middle of a word; Shavian gets broken at word boundaries just like English. While IE's behavior is unlike other browsers' and Not What I Want, I can't fault the IE team; I could be using PUA code points for a language that doesn't use spaces much, like Japanese. - I can read and write Shavian posts on Twitter on the desktop in a reasonable font for both Shavian and other scripts; if I wanted to do the same in Quikscript, I'd have to have a custom user-supplied stylesheet to override Twitter's own font suggestions. - Scripts already in Unicode attract the attention of talented completionist organizations that PUA communities generally can't attract beforehand. Everson Mono, Noto, and Segoe UI Historic (as of Windows 10) ? all great typefaces ? support Shavian and not Quikscript. 
This tends to be because: - I could have multiple fonts that have wildly differing meanings and glyphs mapped to the same code point; the OS can't guess which I might mean. - All the information that the OS needs to detect word breaks is in character properties data supplied by the Consortium and handled by the OS. ~ ~ ~ Specialists like us might be able to put up with these things, but we can't control everything about the reading and writing experience online unless we're all resigned to taking pictures of handwritten text. From samjnaa at gmail.com Tue Jun 9 21:18:26 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 07:48:26 +0530 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: On 6/10/15, Anshuman Pandey wrote: > I learned today that the WG2 document register is not publicly > accessible. Seems that the page http://std.dkuug.dk/jtc1/sc2/wg2/ or the repo it points to ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/ haven't been updated after 2014-10-29. At least there should be a notice saying this is no longer the active register, if this is being maintained for historical purposes! > This means that I, as a proposal author, have no means of > accessing the documents that I contribute. But why would you want to do that? I suppose everyone who submits Unicode proposals would have their own copies of their documents, and certainly the ISO doesn't modify the contents of any of these documents. > I've already started to add copyright statements to my proposals. Now I'll > add another statement that says: "This document is intended for encoding > the XYZ script in The Unicode Standard. If it and its contents are > appropriated for encoding XYZ in ISO 10646, then ISO must make this document > openly and publicly accessible to all." Hm -- I'd be interested to see how they respond. Re your wording: 1) "appropriated"? 2) Unicode and ISO 10646 are only nominally two different standards and effectively (i.e. apart from all those procedural details) the same, no? Now does the UTC still require us proposal authors to forward our docs to WG2 after UTC approval? I fail to see the point in that if whatever is part of Unicode is going to become part of ISO 10646, except for that if by closing its doors to proposal authors, the ISO is going to communicate only with the UTC, then the UTC would have to take upon itself the onus of forwarding all proposals to the ISO saying -- I'm sure the UTC doesn't want that. -- Shriramana Sharma ???????????? ???????????? From wjgo_10009 at btinternet.com Wed Jun 10 02:35:17 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 08:35:17 +0100 (BST) Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <5882613.5526.1433921717427.JavaMail.defaultUser@defaultHost> > This means that I, as a proposal author, have no means of accessing the documents that I contribute. I sent in a document some years ago and it was not even allowed to go into the list for discussion. It was said that it was out of scope. It is not clear whether people on the committees that discuss submitted documents were even aware that it had been submitted. I have submitted documents to the Unicode Technical Committee and some have been added to the list and one has been not added as it was said to be out of scope: however it was passed to the Chair of another Unicode Committee and it was considered. 
William Overington 10 June 2015 ----Original message---- >From : pandey at umich.edu Date : 09/06/2015 - 23:07 (GMTST) To : unicore at unicode.org Cc : unicode at unicode.org Subject : Accessing the WG2 document register Hello all, I learned today that the WG2 document register is not publicly accessible. This means that I, as a proposal author, have no means of accessing the documents that I contribute. Can someone associated with WG2 or anyone else in the know please tell me why these documents are under lock and key? All the best, Anshuman From wjgo_10009 at btinternet.com Wed Jun 10 03:25:19 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 09:25:19 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> > Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard. The fact that Unicode Inc. provides a valuable public service in making documents and encoding charts freely available to all who access the www.unicode.org website is not in any way the same as the provenance that ISO has of being recognised by governments around the world as providing standards for technological matters. I am not a lawyer, yet as I understand it, the underlying theory of standards work is that it is a legally permitted exception to a general legal prohibition of businesses meeting together to decide and agree what will be applied in industrial activity. Thus, for example, it is fine for businesses to agree that one particular code point will be used for the symbol for the Indian Rupee, as that helps consumers in that a message between computers of different brands can be passed and read successfully. Yet, for example, it is not permitted for businesses to meet together to decide that all computers will be in a grey plastic box, as that hinders choice for consumers. William Overington 10 June 2015 From jsbien at mimuw.edu.pl Wed Jun 10 04:07:32 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Wed, 10 Jun 2015 11:07:32 +0200 Subject: Accessing the WG2 document register In-Reply-To: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> Message-ID: <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> Quote/Cytat - William_J_G Overington (Wed 10 Jun 2015 10:25:19 AM CEST): >> Remind me why Unicode is still taking ISO to the dance? Sometimes >> going stag has its benefits... > > > As I understand it, Unicode Inc. is a recognised guest of ISO in > participating in ISO producing an International Standard. Cf. http://www.unicode.org/L2/L2014/14286-wg2-liaison.pdf Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
From pandey at umich.edu Wed Jun 10 04:19:02 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 05:19:02 -0400 Subject: Accessing the WG2 document register In-Reply-To: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> Message-ID: <7970040E-6547-4ECE-9B86-4DCB09D20C6A@umich.edu>
On Jun 10, 2015, at 4:25 AM, William_J_G Overington wrote: >> Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... > > > As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard.
Does Unicode need ISO to exist? Or does ISO need Unicode?
> The fact that Unicode Inc. provides a valuable public service in making documents and encoding charts freely available to all who access the www.unicode.org website is not in any way the same as the provenance that ISO has of being recognised by governments around the world as providing standards for technological matters.
ISO is a profit-making business. I worked on an ISO standard for the transliteration of Indic scripts two decades ago and I have yet to see the published standard. Back then I couldn't afford to buy the document and ISO didn't have the heart to give me a copy as a contributor. So, to this day, I have yet to see the official standard that I helped to develop. ISO needs to function as a non-profit organization with open access to all of its activities and publications.
> I am not a lawyer, yet as I understand it, the underlying theory of standards work is that it is a legally permitted exception to a general legal prohibition of businesses meeting together to decide and agree what will be applied in industrial activity.
And so ISO functions by relying upon contributions made by the public without granting either authorship or compensation to those who actually build their standards. And now they want to claim ownership of contributed documents...
> Thus, for example, it is fine for businesses to agree that one particular code point will be used for the symbol for the Indian Rupee, as that helps consumers in that a message between computers of different brands can be passed and read successfully.
This can be done without ISO...
> Yet, for example, it is not permitted for businesses to meet together to decide that all computers will be in a grey plastic box, as that hinders choice for consumers.
Who exactly is imposing these restrictions? Restriction of choice is an issue for political economy, not standards bodies.
All the best, Anshuman
From pandey at umich.edu Wed Jun 10 04:49:13 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 05:49:13 -0400 Subject: Accessing the WG2 document register In-Reply-To: <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> Message-ID:
> On Jun 10, 2015, at 5:07 AM, Janusz S. Bien wrote: > > Quote/Cytat - William_J_G Overington (Wed 10 Jun 2015 10:25:19 AM CEST): > >>> Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... >> >> >> As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard. > > Cf.
http://www.unicode.org/L2/L2014/14286-wg2-liaison.pdf This document provides further evidence of the irrelevance of ISO in the Unicode world. Deference. Janusz, what was your intention in providing a link to this document? All the best, Anshuman From pandey at umich.edu Wed Jun 10 05:01:44 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 06:01:44 -0400 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From eik at iki.fi Wed Jun 10 06:51:13 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Wed, 10 Jun 2015 14:51:13 +0300 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <001101d0a373$c06ab6d0$41402470$@fi> Andrew! I honestly believe that Michel as the WG2 Convener has little choice but to follow the JTC1 rules - and work actively to change them (hopefully having to spend less time on this than the years Mike had to spend to achieve the publicly available status for WG2-originated standards). Actually, I believe that a feasible solution would be to make Unicode a JTC1 PAS (Publicly Available Specification) submitter, and thus give the chance for the ISO/IEC JTC1/SC2 National Bodies to vote on the approval of TUS as an ISO standard. IRG (with possibly a somewhat expanded role, could/should still work under SC2 and co-operate with Unicode). Anshuman, I'd recommend that you withdraw your request to withdraw your contributions, because that would be of no help to the user communities involved. Sincerely Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel / Fax (by arr.): +358943682643 -----Alkuper?inen viesti----- L?hett?j?: Unicore [mailto:unicore-bounces at unicode.org] Puolesta Andrew West L?hetetty: 10. kes?kuuta 2015 12:18 Vastaanottaja: Anshuman Pandey Kopio: UnicoRe List; unicode Unicode Discussion Aihe: Re: Accessing the WG2 document register In the LiveLink system some document types are open and some document types are restricted, and you can see this in the SC2 document registry where some documents have a key icon against them and some do not. In the case of the WG2 document registry which is what Anshu is referring to, the list of documents is not even visible unless you are logged on to the system, which I believe to be completely unacceptable, and something I have questioned Michel about on several occasions. But even if the list of documents was to be visible to the public, they would all be password protected because of their document type ("Contributions"). I have suggested to Michel that a simple workaround would be to change the document type to one that is open to the public, even if the document type would not accurately reflect what sort of documents they are. 
The new restrictive rules for committee participation and document access have been forced on the committees by JTC1 (see JTC1 N12468 -- not publicly available, but there is a Google cache of the document if you search), and has caused considerable consternation among experts on the WG2 committee as well as in some national bodies. If you follow the new rules to the letter then WG2 is not allowed to even accept contributions from individuals who are not members of the relevant committee, which is quite ridiculous, and a severe handicap to many JTC1 working groups. I know that the BSI (representing the UK) is very unhappy with the restrictions on who can submit and access documents, and I hope (with little expectation) that the issue of document access will be raised at the next JTC1 plenary, and the rules changed. But in the meantime the rules are alienating experts such as Anshu, which is a great shame. Andrew On 9 June 2015 at 23:07, Anshuman Pandey wrote: > Hello all, > > I learned today that the WG2 document register is not publicly > accessible. This means that I, as a proposal author, have no means of > accessing the documents that I contribute. > > Can someone associated with WG2 or anyone else in the know please tell > me why these documents are under lock and key? > > All the best, > Anshuman From wjgo_10009 at btinternet.com Wed Jun 10 07:33:32 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 13:33:32 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> As I am not on the Unicore list, just the public mailing list, I am only picking up bits of what is going on. However, I make the following observations. I followed the link to http://linguistics.berkeley.edu/~pandey/ and from there, having looked at some of the items on that page, to http://unicode.org/conference/bulldog.html where there are some very nice things said about you. > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > http://linguistics.berkeley.edu/~pandey/ > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. The problem is that if you withdraw your contributions, then Unicode will not be as good as it otherwise would have been. May I ask you to reconsider please? You have made a very effective protest in that it has caused people to wonder what is going on. Whether your protest will have any effect on changing the rules is not yet known. Yet even if it has no effect at all on the rules, if you allow your contributions to stand there will be people who are not yet born who will benefit from your contributions. So, will you reconsider please? William Overington 10 June 2015 ----Original message---- >From : pandey at umich.edu Date : 10/06/2015 - 11:01 (GMTST) To : babelstone at gmail.com Cc : unicore at unicode.org, unicode at unicode.org Subject : Re: Accessing the WG2 document register Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. 
A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From everson at evertype.com Wed Jun 10 07:46:43 2015 From: everson at evertype.com (Michael Everson) Date: Wed, 10 Jun 2015 13:46:43 +0100 Subject: Accessing the WG2 document register In-Reply-To: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <624263BD-D22C-4CBE-A448-18AEBEF7DDC0@evertype.com> Anshu, This level of idealism does nobody any good. On 10 Jun 2015, at 11:01, Anshuman Pandey wrote: > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > Michael Everson * http://www.evertype.com/ From samjnaa at gmail.com Wed Jun 10 10:09:06 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 20:39:06 +0530 Subject: Accessing the WG2 document register In-Reply-To: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: On 6/10/15, Anshuman Pandey wrote: > withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A > list of the contributions that I withdraw is given at: > http://linguistics.berkeley.edu/~pandey/ > ... > Whoever has the task of coordinating with ISO, is that you Michel?, please > withdraw all of my contributions. Since a lot of currently encoded scripts owe their encoding to you, it seems the stability policy makes it impossible for your to withdraw *all* of your contributions. Frankly, while you *are* making a point by raising the issue, I don't think this is so serious a problem for you to consider such a drastic step. The ISO hasn't claimed "ownership" of your document, as you mention in another mail. They merely restrict public access to it. Your document is publicly available in another (probably better maintained, thanks to Rick) place -- so where's your worry? I agree that the ISO should have the courtesy to accord contributors special status, but such big organizations are often steeped in bureaucracy, and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... -- Shriramana Sharma ???????????? ???????????? From costello at mitre.org Wed Jun 10 10:10:28 2015 From: costello at mitre.org (Costello, Roger L.) Date: Wed, 10 Jun 2015 15:10:28 +0000 Subject: Unicode Expert's way of Writing Data Specifications? Message-ID: Hi Folks, I seek recommendations from the Unicode experts on how to write data specifications that are precise, from a Unicode perspective. Let's take an example. A (fictitious) data specification says this: The name of the airplane's flight path must take this form: FLTPATH xx, where xx = two digits. Even as a non-expert in Unicode I can see impreciseness: 1. What are the codepoints of these symbols: FLTPATH? Presumably you mean U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048. 2. What are the range of codepoints for the two digits? 
Presumably you mean U+0030 - U+0039. Here is a revised version of the data specification: The name of the airplane's flight path must take this form: FLTPATH (U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048) xx, where xx = two digits in the range U+0030 - U+0039. Is that revised version precise, from a Unicode expert's perspective? Is there a better way of phrasing it, so that it is more readable? As it stands, reading it is kind of a bumpy ride. /Roger From doug at ewellic.org Wed Jun 10 10:50:55 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 10 Jun 2015 08:50:55 -0700 Subject: Unicode Expert's way of Writing Data =?UTF-8?Q?Specifications=3F?= Message-ID: <20150610085055.665a7a7059d7ee80bb4d670165c8327d.5e1a87a700.wbe@email03.secureserver.net> Costello, Roger L. wrote: > 1. What are the codepoints of these symbols: FLTPATH? Presumably you > mean U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048. I would specify, in prose or ABNF, that all keywords are encoded as Basic Latin characters (or Basic Latin plus Latin-1, or whatever range is desired). This would then apply to all subsequent specifications that deal with keywords, so there should be no need to specify U+xxxx code points in each one. If you use ABNF to specify the syntax, you can take advantage of keywords like ALPHA and DIGIT in the core rules (RFC 5234, Section B.1), which are predefined to be Basic Latin. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From john at tiro.ca Wed Jun 10 11:42:14 2015 From: john at tiro.ca (John Hudson) Date: Wed, 10 Jun 2015 09:42:14 -0700 Subject: Accessing the WG2 document register In-Reply-To: References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <557868E6.60307@tiro.ca> Anshu, I simply treat WG2 as a bureaucratic exercise bolted onto the actual work that Unicode does. In 20 years, I have never once had occasion to refer to ISO 10646, while I refer to Unicode every day. When I visit clients, none of them talk about implementing ISO 10646; they all talk about implementing Unicode. My recommendation is simply to ignore WG2 and act as if it doesn't exist. It already might as well not, and with its policies is only likely to become more and more irrelevant. JH -- John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com Getting Spiekermann to not like Helvetica is like training a cat to stay out of water. But I'm impressed that people know who to ask when they want to ask someone to not like Helvetica. That's progress. -- David Berlow From wjgo_10009 at btinternet.com Wed Jun 10 11:56:35 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 17:56:35 +0100 (BST) Subject: Accessing the WG2 document register In-Reply-To: References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> > ..., and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... Hmm. I opine that ..., and while bureaucracies are commonly known to seem unconcerned as to individual feelings, they are seldom outright malicious of intent, I feel... would be better, as that would not associate a disability with lack of concern for the feelings of others. Some readers might like to search for the word blind in the following web page. 
http://www.publications.parliament.uk/pa/cm201516/cmhansrd/cm150609/debtext/150609-0001.htm William Overington 10 June 2015 From samjnaa at gmail.com Wed Jun 10 12:11:41 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 22:41:41 +0530 Subject: Accessing the WG2 document register In-Reply-To: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> Message-ID: On 6/10/15, William_J_G Overington wrote: >> ..., and while bureaucracies are commonly known to seem blind > to individual feelings, they are seldom outright malicious of intent, > I feel... > > ..., and while bureaucracies are commonly known to seem unconcerned as to > individual feelings, they are seldom outright malicious of intent, > I feel... > > would be better, as that would not associate a disability with lack of > concern for the feelings of others. While English grammatical debates are out of scope for this list, please note that in my mail the word "blind" only stands in the place of your "unconcerned" and not "unconcerned to individual feelings"... -- Shriramana Sharma ???????????? ???????????? From michel at suignard.com Wed Jun 10 12:45:17 2015 From: michel at suignard.com (Michel Suignard) Date: Wed, 10 Jun 2015 17:45:17 +0000 Subject: Accessing the WG2 document register In-Reply-To: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> Message-ID: This is turning into bureaucrat bashing, and those of you interested in that topic should turn your focus to another new mail thread with a different title that I can safely ignore. Concerning access to WG2 documents, I am (as the new WG2 convenor and ongoing project editor for 10646) very unimpressed by the new ISO policies concerning access to documents, which make the WG repository even less accessible than its parent (SC) repository. And they now require ISO Global Directory credentials to get meaningful access to anything within the ISO document system. There are ways for national bodies to nominate experts to have access to the WG, but it is cumbersome. Even I, despite my dual role, had initially no access to the ballots I was creating! I had to be creative to get access. For documents that need to be accessible to both UTC and WG2 I have suggested a new mechanism by which UTC contributions (such as Anshu's) can be referenced by link using simple catch-all WG2 documents (typically done by the UTC liaison or Debbie Anderson). I will always post documents directly if it is the author's wish, but it is not necessary, as long as the UTC link is open and stable (no problem there). I am also considering creating a mirror site of the new WG2 directory under the Unicode server, but it would have to be password protected (the password can be simple and easy to find). I have no intention of withdrawing anything from the old WG2 website (it is now in archive mode); doing so would create awkward situations for repertoires that have been adopted. For the new site, I would respectfully ask Anshu to reconsider; this is not helping my task but instead making it even more complicated (if that's possible). Concerning the usefulness of 10646, please understand that there is still a large portion of constituencies (especially in Asia) that can only contribute to an ISO-blessed entity.
Most of the CJK work, and the yet-to-be-encoded Asian minority repertoires, can only be done by joint work between the UTC and ISO. It is a rather American-centric idea to think that you can totally ignore 10646, especially if you do business in China, Japan, or Korea. Unicode and the UTC are very Bay Area centric (mostly for financial reasons, because no one will fund meetings overseas), but it does create an impediment for other constituencies to participate. ISO, although imperfect, offers these constituencies a voice. For example, the Ideographic Rapporteur Group (IRG) under WG2 is the group where CJK content can be either updated or augmented. Furthermore, many folks in Europe still cherish the additional forum that ISO provides. Unicode officers (of whom I am also one) are looking at ways to improve the situation on their side by creating more direct communication with the IRG and Asian constituencies, but it is a complicated process. For most of the Unicode crowd it is not even on their radar (unless you deal with Asian scripts), but don't think all of you can totally ignore ISO at this stage. Luckily for you, some of us carry most of the burden of that complicated situation, so that you can do your work in simpler ways. Best Michel From everson at evertype.com Wed Jun 10 14:45:20 2015 From: everson at evertype.com (Michael Everson) Date: Wed, 10 Jun 2015 20:45:20 +0100 Subject: Accessing the WG2 document register In-Reply-To: <55785C69.3050305@htpassport.com> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <55785C69.3050305@htpassport.com> Message-ID: On 10 Jun 2015, at 16:48, Shervin Afshar wrote: > From: Shriramana Sharma > >> The ISO hasn't claimed "ownership" of your document, as you >> mention in another mail. They merely restrict public access to it. > > This is no justification and we should not trivialize an organizational behavior which is not acceptable in this day and age of open access and collaboration. We should also not trash the whole idea of collaboration because people at a higher level in ISO have made poor decisions which are not the fault of the relevant technical committee (SC2). >> I agree that the ISO should have the courtesy to accord contributors special status, but such big organizations are often steeped in bureaucracy, and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... > > These seem to me as reasons why ISO is of little relevance to Unicode going forward. The SC2/UTC relationship is important because corporate and commercial concerns are not the only concerns worth taking into account. We cooperate and collaborate, and it's not right to pretend that only the UTC has valuable input into the UCS. > I don't think the concern here is malicious intent; it's rather the bloated bureaucracy of such organizations which makes it virtually impossible to have that "courtesy" you are talking about for individual contributors. The bureaucracy hasn't changed in size. Some specific decisions were taken at a high level about document distribution and participation. Those weren't useful for our line of work. Not at all, and none of us in SC2 or WG2 are defending those. Maybe those procedures work well for some sorts of standards; I couldn't say. But I don't think it damns the whole ISO process forever, either.
All the best, Michael Everson From tclancy at mozilla.com Wed Jun 10 16:10:43 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Wed, 10 Jun 2015 17:10:43 -0400 Subject: Another take on the English apostrophe in Unicode Message-ID: On 4/Jun/2015 14:34 PM, Markus Scherer wrote: > > Looks all wrong to me. > Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. > You can't use simple regular expressions to find word boundaries. That's > why we have UAX #29. > And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. > Confusion between apostrophe and quoting -- blame the scribe who came up > with the ambiguous use, not the people who gave it a number. > I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. > English is taught as that squiggle being punctuation, not a letter. > I think we need to make a distinction between the colloquial usage of the word "punctuation" and the Unicode general category "punctuation", which has specific technical implications. I somewhat wish that Unicode had a separate category for "Things that look like punctuation but behave like letters", which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. > "don't" is a contraction of two words, it is not one word. > This is utter nonsense. Should my spell-checker recognise "hasn't" as a valid word? Or should it consider "hasn't" to be the word "hasn" followed by the word "t", and then flag both of them as spelling errors? Is "fo'c'sle" the three separate words "fo", "c", and "sle"? The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. > If anything, Unicode might have made a mistake in encoding two of these > that look identical. How are normal users supposed to find both U+2019 and > U+02BC on their keyboards, and how are they supposed to deal with > incorrect > usage? > Yeah, and there are fonts where I can't tell the difference between capital I and lower-case l. But my spell-checker will underline a word where I erroneously use an I instead of an l, and I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL:
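A quick way to see the word-boundary effect Ted describes is to compare how the same contraction segments when the apostrophe is U+0027, U+2019, or U+02BC. The following is only a rough Python 3 sketch (it uses the regex pattern \w+ as a crude stand-in for UAX #29 word segmentation, not an implementation of it), but it illustrates why the choice of code point matters to segmentation code:

    # Rough illustration only: \w+ approximates word segmentation; it is not UAX #29.
    import re
    import unicodedata

    for ch in ("\u0027", "\u2019", "\u02BC"):
        word = "don" + ch + "t"
        tokens = re.findall(r"\w+", word)
        print("U+%04X %s: category=%s, tokens=%r"
              % (ord(ch), unicodedata.name(ch), unicodedata.category(ch), tokens))

    # Expected output:
    #   U+0027 APOSTROPHE: category=Po, tokens=['don', 't']
    #   U+2019 RIGHT SINGLE QUOTATION MARK: category=Pf, tokens=['don', 't']
    #   U+02BC MODIFIER LETTER APOSTROPHE: category=Lm, tokens=['donʼt']

Only U+02BC, whose general category is Lm (modifier letter), keeps the contraction together, which is exactly the behaviour Ted argues the English apostrophe should have.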
From tclancy at mozilla.com Wed Jun 10 17:51:45 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Wed, 10 Jun 2015 18:51:45 -0400 Subject: Another take on the English apostrophe in Unicode Message-ID: On 4/Jun/2015 19:01, Leo Broukhis wrote: > > Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for > example, the word ack-ack isn't decomposable into words, or even > morphemes, > "ack" and "ack". > I do think that U+2010 (HYPHEN) is miscategorised. I think it should have General Category = Pc, not Pd. (That is, hyphens are connectors, not dashes.) That would make it a "word" character. Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning it can occur in the middle of numbers or letters). UAX #29 says that U+2010 deliberately does *not* have Word Break = MidNumLet, though an implementation may treat it as if it did. (UAX #29 doesn't give any reasons for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have Word Break = MidNumLet, due to its history of being used as a dash or minus sign, but U+2010 should never be used as a dash or minus sign, so I don't see the problem.) But luckily, the miscategorisation of U+2010 hasn't led to any pressing practical problems, unlike the misuse of U+2019 for the apostrophe. - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 10 23:37:28 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 06:37:28 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <557247F2.9050902@efele.net> References: <557247F2.9050902@efele.net> Message-ID: The French "pomme de terre" ("potato" in English; the informal French synonym is "patate") is a single lemma in dictionaries, but it is still 3 separate words (only the first one takes the plural mark); it is not considered a "nom composé" (so there are no hyphens). And they are separated by standard spaces (that are breakable, and expansible/compressible like all others in case of justified text)... The lemma is still recognized if there is extra punctuation in the middle, such as: « pomme » de terre. We don't need any new space character. What you want is to insert markup to exhibit the structure of sentences for grouping words semantically or grammatically. But nobody, including grammarians, will use this "new" space; what they'll use is in fact some additional symbols or presentation features (enclosing boxes, braces above or below, colors...) if they want to exhibit it on top of the standard text. 2015-06-06 3:08 GMT+02:00 Eric Muller : > On 6/5/2015 10:29 AM, John D. Burger wrote: > >> Linguistically, "don't" and friends pass all the diagnostics that >> indicate they're single words. >> > > If I am not mistaken, the french "pomme de terre" also passes the > diagnostics. So we need a new space character. > > Eric. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 11 00:17:11 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 07:17:11 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: The ASCII punctuation characters have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just exhibits their visual role. "Pd" (dash) is then appropriate for the ASCII hyphen-minus.
You can't really tell from the character alone if it is a punctuation mark or a minus sign. If it is a minus sign you can re-encode it better using the more specific mathematical minus sign. Otherwise, even if it is not a minus sign, it can be: - a connector between words in compound words (hyphen) - a trailing mark at the end of lines indicating that a word has been broken in the middle (but remember that I asked previously for another character for that role, because this word-breaking hyphen is not necessarily a horizontal hyphen: in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in "pocket books" with very narrow columns and minimized spacing) - a bullet leading items in a vertical list (this should be an en dash, followed by some spacing) - a punctuation mark (not necessarily at the beginning of a line) marking the change of person speaking (very common in literature, notably in theatre). As a connector between words, there's a demonstrated need for differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), and long hyphens between two separate names that are joined (for example in proper names after marriage; there's an example in France, where INSEE encodes it for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). ---- Still nobody has replied to my past comment (about 1 month ago) about the various forms of the word-breaking hyphen / line-wrapping symbol: * I'm not speaking about the SHY control, but about the real character whose glyph appears when SHY is materialized at the end of lines (and which should be neither a minus nor an en dash, but also not the same as the orthographic hyphen used between words in a compound word). * This character can also be found (and is needed) for breaking long mathematical formulas, and must be clearly distinct from the regular minus. * This character is also needed for rendering long lines of programming code or textual data (it is something that must not be entered in programs but that must be rendered, because these programs or codes have significant line breaks: the glyph indicates that the following rendered line break is to be discarded). Not all programming languages have a syntax allowing the use of an escape before the line break (such escaping varies; it may be a backslash in C/C++, or an underscore in Basic, but in data dumps such as CSV files, it is impossible to note such an escape in the data language itself, and we need to render some specific glyph). * This character is absolutely needed when rendering on a static medium (i.e. printing or broadcasting); for a dynamic medium (such as personal displays with a personal UI) we could still use scrolling, but users don't like horizontal scrolls and highly prefer reading the text directly.
So they expect to see a distinctive glyph (or icon) to see the distinction between line breaks where they are significant and where they just wrap too-long lines, and still see the distinction with other regular hyphens and minus signs (which are also significant and very frequently distinct). -------------- next part -------------- An HTML attachment was scrubbed... URL: From tclancy at mozilla.com Thu Jun 11 01:08:42 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Thu, 11 Jun 2015 02:08:42 -0400 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy wrote: > The ASCII punctuation characters have been overridden for a lot of different roles. > There's simply no way to map them to a category that matches their semantic > role. [...] "Pd" (dash) is then appropriate for the ASCII hyphen-minus. > I agree, but I wasn't talking about the ASCII hyphen, U+002D (HYPHEN-MINUS). I was talking about U+2010 (HYPHEN). I also wasn't talking about changing the properties of U+0027 (APOSTROPHE). > in dictionaries I've seen small slanted tildes, or slanted small equal > signs, to make the distinction with true hyphens used in compound words > This is drifting off-topic, but I wanted to address the thing you just said above. Firstly, in the dictionaries I've seen, the slanted double hyphen is only used when a line break happens to occur at the same place as a "true hyphen". It replaces the "true hyphen". When a line is broken at a hyphenation point between letters, an ordinary-looking hyphen is displayed. Secondly, this character is encoded in Unicode at U+2E17 (DOUBLE OBLIQUE HYPHEN). - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL:
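For reference, the general category assignments that Ted and Philippe are debating can be read straight out of the Unicode Character Database. A minimal Python 3 sketch using the standard unicodedata module (note that unicodedata does not expose the Word_Break property, so only the general categories are shown):

    import unicodedata

    # ASCII hyphen-minus, the dedicated hyphen, the mathematical minus sign,
    # and the double oblique hyphen mentioned above.
    for cp in (0x002D, 0x2010, 0x2212, 0x2E17):
        ch = chr(cp)
        print("U+%04X %-22s general category = %s"
              % (cp, unicodedata.name(ch), unicodedata.category(ch)))

    # Expected output:
    #   U+002D HYPHEN-MINUS           general category = Pd
    #   U+2010 HYPHEN                 general category = Pd
    #   U+2212 MINUS SIGN             general category = Sm
    #   U+2E17 DOUBLE OBLIQUE HYPHEN  general category = Pd

As of Unicode 7.0, current at the time of this thread, U+2010 is indeed Pd rather than Pc, which is the categorisation Ted is questioning.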
From verdy_p at wanadoo.fr Thu Jun 11 01:05:24 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 08:05:24 +0200 Subject: Accessing the WG2 document register In-Reply-To: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: As far as I have seen, you cannot withdraw the irrevocable licence you gave to ISO when submitting the document. ISO requires that you grant such a licence, otherwise your document will be rejected. ISO however does not take the ownership (or authorship), and you keep the right to grant licences yourself to other people (with possibly different licensing terms). ISO just requires that the licence you grant also expose all other proprietary rights that you claim (including patents), and that you sign it with your name (you take on yourself the risks for the claims you make) and a way to contact you in case of problems. All you can do then is to instruct ISO that your past submission should no longer be considered as relevant in future discussions, but all past discussions and decisions that were based on your document will remain valid and will carry the terms of your licence, which must clearly state the terms of use and which will allow anyone to request a valid licence (not necessarily a free licence, because you could ask "reasonable" fees). Unfortunately ISO does not define clearly what a reasonable fee is that you can claim from those that will request a licence. ISO will sell its published standard and will not give you back any dime when it does (and it is a fact that the fees requested by ISO to get a copy of its standards are not adequate, as they are really far too expensive for individual users or small organizations and non-profits). This has a consequence: ISO standards can only be defined and used by large organizations (and this casts severe doubts on ISO's claim that they are building "international standards" for everyone). Even governments in small countries cannot participate; everyone has to pay the same expensive fees to ISO even if their use of the standard will not generate (proportionally) the same revenues (or savings) as those generated by large organizations or big governments that use the standard (for them the fee requested by ISO is ridiculously low, and ISO is then still lacking money to finance its activities).
---- Personally I think that Unicode does a much better job of opening its standard to many more people, by offering different levels of participation and opening a large area to every individual without paying considerable fees. I consider that the only standard that defines the UCS is TUS, not ISO/IEC 10646 (which is just a piece of junk, badly administered, and inaccessible to most people). If you want examples of really bad standards published by ISO, just consider the MPEG-related standards or the standards related to "open" documents. Really I don't trust ISO in those domains, and most people prefer what the W3C does. I just hope that ISO will withdraw its MPEG and open-document standards, to be replaced by those made by other standards bodies (W3C, IETF, CEN, IEEE... As for ITU, UPU, IATA, many of their standards are also full of patent restrictions and published with very restrictive terms and very expensive fees just to get a copy of a single document). MPEG should be completely withdrawn too, replaced by really open encodings (such as OGG). And frankly, the Linux community can also create its own standards body (there will be an immediate market for that, notably in mobile and embedded devices where Linux is present almost everywhere, including in Android and significant parts of Apple iOS) and coordinate with other foundations working in the same area of open standards. It is the Linux/Unix world that really promoted and developed the UCS to allow it to reach its current state (before that there were lots of proprietary standards approved by ISO and incorrectly labeled "international standards", even if most of them were incompatible with each other). I can even remember the time when Microsoft did not believe in the Internet and wanted to create "The Microsoft Network" (it was withdrawn, including the ISP service using MS protocols, and replaced by MSN services based on the Internet and IETF standards).
-------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu Jun 11 03:49:51 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 09:49:51 +0100 Subject: Accessing the WG2 document register In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: On 11 June 2015 at 07:05, Philippe Verdy wrote: > > Personally I think that Unicode does a much better job of opening its standard > to many more people by offering different levels of participation and > opening a large area to every individual without paying considerable > fees. I consider that the only standard that defines the UCS is TUS, not > ISO/IEC 10646 (which is just a piece of junk, badly administered, and > inaccessible to most people). You do realise that by insulting ISO/IEC 10646 you are also insulting a number of prominent members of the UTC and officers of the Unicode Consortium who actively participate in the production and editing of ISO/IEC 10646? The latest version of ISO/IEC 10646 is not inaccessible to most people, as it is (and has been since 2006) available for free download from ISO at . Whilst I agree that the standard itself is irrelevant to the vast majority of users, who can get by quite happily just knowing about the Unicode Standard, I believe that the great importance of ISO/IEC 10646 lies in the process that goes into producing it, not in the resultant standard. The Unicode Consortium is largely controlled by a few large American corporations, but ISO is open to participation by standards organizations representing countries across the globe, and there are currently thirty participating members of SC2, the committee which is responsible for ISO/IEC 10646 . The ISO ballot process allows stakeholders in scripts from these countries to participate in the encoding process, and make the views of their experts heard. The ballot process also applies important checks on the encoding process, and prevents scripts and characters being encoded with undue haste if an encoding proposal is not yet mature enough or if there is insufficient consensus among stakeholders. Not least, the ballot process allows for multiple stages of review and correction of errors. If Unicode were to go it alone, professional encoders such as Anshu and Michael, who do not have an inherent stake in most of the scripts they work on, would present their proposals to the UTC, who do not have any expertise in such minority or historic scripts, but on the basis that the proposal seems plausible they would approve it, and six months later it would be in the next version of Unicode.
Yes, this speeds up the encoding process enormously (which is usually at least two years), but at what cost? What happens when a couple of years later, users of the script in question in Africa or Asia discover that it has been encoded in Unicode but has a serious flaw or shortcoming that no-one from the user community had an opportunity to correct (and due to stability policies it is now too late to correct)? So whilst ISO/IEC 10646 is certainly irrelevent to most people, I strongly believe that the process whereby the standard is produced is extremely beneficial to the Unicode Standard, and I would urge Anshu and others to support the work of SC2 and WG2 rather than dismiss it as a hindrance or irrelevance. Andrew From jsbien at mimuw.edu.pl Thu Jun 11 04:12:15 2015 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Thu, 11 Jun 2015 11:12:15 +0200 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: <86vbeua9sg.fsf@mimuw.edu.pl> On Thu, Jun 11 2015 at 10:49 CEST, andrewcwest at gmail.com writes: [...] > The latest version of ISO/IEC 10646 is not inaccessible to most > people, as it is (and has been since 2006) available for free download > from ISO at . The page states clearly The following standards are made freely available for standardization purposes. In consequence I don't feel entitled to download it. Not only my curiosity is not a standarization purpose, but even teaching students about standards also doesn't qualify. I just show them the link and tell them to decide themselves to download or not :-) Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From andrewcwest at gmail.com Thu Jun 11 04:38:41 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 10:38:41 +0100 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: <86vbeua9sg.fsf@mimuw.edu.pl> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: On 11 June 2015 at 10:12, Janusz S. Bie? wrote: > >> The latest version of ISO/IEC 10646 is not inaccessible to most >> people, as it is (and has been since 2006) available for free download >> from ISO at . > > The page states clearly > > The following standards are made freely available for standardization > purposes. > > In consequence I don't feel entitled to download it. Not only my > curiosity is not a standarization purpose, but even teaching students > about standards also doesn't qualify. I just show them the link and tell > them to decide themselves to download or not :-) I think you are reading far too much into the phrase "for standardization purposes". The license states that you are allowed to store a copy on your personal computer and print off a single copy, but says nothing about what purposes you may use the standards for. In my opinion it is ridiculous to claim that you are not entitled to download the documents. 
The Unicode terms of use are far more restrictive, and state that "Any person is hereby authorized, without fee, to view, use, reproduce, and distribute all documents and files solely for informational purposes in the creation of products supporting the Unicode Standard, subject to the Terms and Conditions herein." So if you are not planning to create a product supporting the Unicode Standard, you are not legally allowed to view or download any of the files comprising the Unicode Standard ! Andrew From andrewcwest at gmail.com Thu Jun 11 05:00:29 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 11:00:29 +0100 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: On 11 June 2015 at 10:38, Andrew West wrote: > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! My apologies, according to the "Unicode Consortium and Trademark Usage Policy" I should always refer to "The Unicode? Standard". I hope that everyone on this list will take note of this important policy in future messages. Andrew From billposer2 at gmail.com Thu Jun 11 12:47:39 2015 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 11 Jun 2015 10:47:39 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number <7> for glottal stop.) Bill On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: > On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >> >> Looks all wrong to me. >> > Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your > points below. > > > >> You can't use simple regular expressions to find word boundaries. That's >> why we have UAX #29. >> > > And UAX #29 doesn't work for words which begin or end with apostrophes, > whether represented by U+0027 or U+2019. It erroneously thinks there's a > word boundary between the apostrophe and the rest of the word. > > But UAX #29 *would* work if the apostrophes were represented by U+02BC, > which is what I'm suggesting. > > Confusion between apostrophe and quoting -- blame the scribe who came up >> with the ambiguous use, not the people who gave it a number. >> > I'm not trying to blame anyone. I'm trying to fix the problem. > > I know this problem has a long history. > > English is taught as that squiggle being punctuation, not a letter. >> > I think we need make a distinction between the colloquial usage of the > word "punctuation" and the Unicode general category "punctuation" which has > specific technical implications. 
> > I somewhat wish that Unicode had a separate category for "Things that look > like punctuation but behave like letters", which might clear up this > taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF > RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are > actually modifiers, into that category too.) But we don't. And the English > apostrophe behaves like a letter, regardless of what your primary school > teacher might have told you, so with the options available in Unicode, it > needs to be classed as a letter. > > "don?t" is a contraction of two words, it is not one word. >> > This is utter nonsense. Should my spell-checker recognise "hasn't" as a > valid word? Or should it consider "hasn't" to be the word "hasn" followed > by the word "t", and then flag both of them as spelling errors? > > Is "fo'c'sle" the three separate words "fo", "c", and "sle"? > > The idea that words with apostrophes aren't valid words is a regrettable > myth that exists in English, which has repeatedly led to the apostrophe > being an afterthought in computing, leading to situations like this one. > > If anything, Unicode might have made a mistake in encoding two of these >> that look identical. How are normal users supposed to find both U+2019 >> and >> U+02BC on their keyboards, and how are they supposed to deal with >> incorrect >> usage? >> > Yeah, and there are fonts where I can't tell the difference between > capital I and lower-case l. But my spell-checker will underline a word > where I erroneously use an I instead of an l, and I imagine spell-checkers > of the future could underline a word where I erroneously use a closing > quote instead of an apostrophe, or vice versa. > > There are other possible solutions too, but I don't want to get into a > discussion about UI design. I'll leave that to UI designers. > > - Ted > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Thu Jun 11 13:13:26 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Thu, 11 Jun 2015 21:13:26 +0300 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: <000301d0a472$520969c0$f61c3d40$@fi> Andrew, I fail to understand what constructive goal you are supposedly aiming at, in general and especially in your most recent postings. Sincerely Erkki -----Alkuper?inen viesti----- L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Andrew West L?hetetty: 11. kes?kuuta 2015 13:00 Vastaanottaja: Unicode Discussion Aihe: Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) On 11 June 2015 at 10:38, Andrew West wrote: > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! My apologies, according to the "Unicode Consortium and Trademark Usage Policy" I should always refer to "The Unicode? Standard". I hope that everyone on this list will take note of this important policy in future messages. 
Andrew From mark at macchiato.com Thu Jun 11 13:39:17 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 11 Jun 2015 20:39:17 +0200 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: <000301d0a472$520969c0$f61c3d40$@fi> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> <000301d0a472$520969c0$f61c3d40$@fi> Message-ID: ?I think the whole thread got overheated, and Andrew was just responding to other heated ?comments. So it might be time to let this thread cool off a bit. The collaboration over the years between the Unicode Consortium and ISO has been, on the whole, a remarkable success. There have been frictions?as in any human enterprise?but the parties have worked to smooth those over, and to operate in good faith to incorporate the characters that are important to each side. The rising bureaucracy on the ISO side has made progress and collaboration increasingly difficult, but that did not originate with the SC2 or WG2 participants, who are often just as frustrated by it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 11 13:39:41 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 20:39:41 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Also used in the Breton trigram c?h (considered as a single letter of the Breton alphabet, but actually entered as two letters with a diacritic-like apostrophe in the middle (which in this case is still not a letter of the alphabet...): the trigram c?h is distinct from the digram ch. Breton **also** uses a regular apostrophe for elision. In fact what you note for the ejective in native american languages is effectively a right-combining diacritic, and still not a letter by itself. However, given its position and the fact it is "spacing", this is the spacing form of the apostrophe diacritic that should be used, and that form is then to choose between: * U+00B4 (acute, most often ugly, located too high, and too much horizontal), * U+02B9 (prime, nearly good, but still too high), * U+02BC (apostrophe), * U+02C8 (vertical high tick, but confusable with the mark of stress in IPA before a phonetic syllable), and * U+02CA (acute/2nd tone, which for me is not distinct from 00B4, only used with sinograms in Mandarin Chinese, with its metrics distinct from U+00B4 that match the Latin metrics). In my opinion 02BC is the best choice for the diacritic apostrophe. The other character for the **elision** apostrophe is a punctuation mark U+2019 (just like the full stop punctuation is also used as an abbreviation mark). There's no confusion with its alternate role as a right-side single quote because U+2019 is used in languages that normally never use the single quotes, but chevrons (or other punctuation signs in East-Asian scripts). But in English where single quote are used for small quotations, there's still a problem to represent this elision apostrophe when it does not occur between two letters where it also marks a gluing of two morphemes (as in "don't" or "Peter's"), but at the begining or end of a word. But elisions at end of words is also invalid when this is the final word of a quoted sentence. If you really want to cite a single English word terminated by an elision apostrophe, the single quotes won't be usable and you'll use chevrons like in this ?demo?? 
and not single or double quotes which are difficult to discriminate. 2015-06-11 19:47 GMT+02:00 Bill Poser : > To add a factor that I think hasn't been mentioned, there are languages in > which apostrophe is used both as a letter by itself and as part of a > complex letter. Most of the native languages of British Columbia write > glottalized consonants as C+', e.g. for an ejective alveolar stop, and > many use apostrophe by itself for the glottal stop. (Another common > convention, which produces other difficulties, is to use the number <7> for > glottal stop.) > > Bill > > On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: > >> On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >>> >>> Looks all wrong to me. >>> >> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your >> points below. >> >> >> >>> You can't use simple regular expressions to find word boundaries. That's >>> why we have UAX #29. >>> >> >> And UAX #29 doesn't work for words which begin or end with apostrophes, >> whether represented by U+0027 or U+2019. It erroneously thinks there's a >> word boundary between the apostrophe and the rest of the word. >> >> But UAX #29 *would* work if the apostrophes were represented by U+02BC, >> which is what I'm suggesting. >> >> Confusion between apostrophe and quoting -- blame the scribe who came up >>> with the ambiguous use, not the people who gave it a number. >>> >> I'm not trying to blame anyone. I'm trying to fix the problem. >> >> I know this problem has a long history. >> >> English is taught as that squiggle being punctuation, not a letter. >>> >> I think we need make a distinction between the colloquial usage of the >> word "punctuation" and the Unicode general category "punctuation" which has >> specific technical implications. >> >> I somewhat wish that Unicode had a separate category for "Things that >> look like punctuation but behave like letters", which might clear up this >> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF >> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are >> actually modifiers, into that category too.) But we don't. And the English >> apostrophe behaves like a letter, regardless of what your primary school >> teacher might have told you, so with the options available in Unicode, it >> needs to be classed as a letter. >> >> "don?t" is a contraction of two words, it is not one word. >>> >> This is utter nonsense. Should my spell-checker recognise "hasn't" as a >> valid word? Or should it consider "hasn't" to be the word "hasn" followed >> by the word "t", and then flag both of them as spelling errors? >> >> Is "fo'c'sle" the three separate words "fo", "c", and "sle"? >> >> The idea that words with apostrophes aren't valid words is a regrettable >> myth that exists in English, which has repeatedly led to the apostrophe >> being an afterthought in computing, leading to situations like this one. >> >> If anything, Unicode might have made a mistake in encoding two of these >>> that look identical. How are normal users supposed to find both U+2019 >>> and >>> U+02BC on their keyboards, and how are they supposed to deal with >>> incorrect >>> usage? >>> >> Yeah, and there are fonts where I can't tell the difference between >> capital I and lower-case l. But my spell-checker will underline a word >> where I erroneously use an I instead of an l, and I imagine spell-checkers >> of the future could underline a word where I erroneously use a closing >> quote instead of an apostrophe, or vice versa. 
>> >> There are other possible solutions too, but I don't want to get into a >> discussion about UI design. I'll leave that to UI designers. >> >> - Ted >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Thu Jun 11 13:46:01 2015 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 11 Jun 2015 11:46:01 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I agree with the recommendation of U+02BC. However, it is in fact rarely used because most of the people who write these languages or create supporting infrastructure are unawre of such issues. A small point: it isn't always the spacing diacritic that is used. In some languages, e.g. Halkomelem, people use the spacing apostrophe if they have to but prefer the non-spacing version. On Thu, Jun 11, 2015 at 11:39 AM, Philippe Verdy wrote: > Also used in the Breton trigram c?h (considered as a single letter of the > Breton alphabet, but actually entered as two letters with a diacritic-like > apostrophe in the middle (which in this case is still not a letter of the > alphabet...): the trigram c?h is distinct from the digram ch. > Breton **also** uses a regular apostrophe for elision. > > In fact what you note for the ejective in native american languages is > effectively a right-combining diacritic, and still not a letter by itself. > However, given its position and the fact it is "spacing", this is the > spacing form of the apostrophe diacritic that should be used, and that form > is then to choose between: > > * U+00B4 (acute, most often ugly, located too high, and too much > horizontal), > * U+02B9 (prime, nearly good, but still too high), > * U+02BC (apostrophe), > * U+02C8 (vertical high tick, but confusable with the mark of stress in > IPA before a phonetic syllable), and > * U+02CA (acute/2nd tone, which for me is not distinct from 00B4, only > used with sinograms in Mandarin Chinese, with its metrics distinct from > U+00B4 that match the Latin metrics). > > In my opinion 02BC is the best choice for the diacritic apostrophe. > > The other character for the **elision** apostrophe is a punctuation mark > U+2019 (just like the full stop punctuation is also used as an abbreviation > mark). There's no confusion with its alternate role as a right-side single > quote because U+2019 is used in languages that normally never use the > single quotes, but chevrons (or other punctuation signs in East-Asian > scripts). > > But in English where single quote are used for small quotations, there's > still a problem to represent this elision apostrophe when it does not occur > between two letters where it also marks a gluing of two morphemes (as in > "don't" or "Peter's"), but at the begining or end of a word. But elisions > at end of words is also invalid when this is the final word of a quoted > sentence. If you really want to cite a single English word terminated by an > elision apostrophe, the single quotes won't be usable and you'll use > chevrons like in this ?demo?? and not single or double quotes which are > difficult to discriminate. > > > 2015-06-11 19:47 GMT+02:00 Bill Poser : > >> To add a factor that I think hasn't been mentioned, there are languages >> in which apostrophe is used both as a letter by itself and as part of a >> complex letter. Most of the native languages of British Columbia write >> glottalized consonants as C+', e.g. for an ejective alveolar stop, and >> many use apostrophe by itself for the glottal stop. 
(Another common >> convention, which produces other difficulties, is to use the number <7> for >> glottal stop.) >> >> Bill >> >> On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: >> >>> On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >>>> >>>> Looks all wrong to me. >>>> >>> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your >>> points below. >>> >>> >>> >>>> You can't use simple regular expressions to find word boundaries. >>>> That's why we have UAX #29. >>>> >>> >>> And UAX #29 doesn't work for words which begin or end with apostrophes, >>> whether represented by U+0027 or U+2019. It erroneously thinks there's a >>> word boundary between the apostrophe and the rest of the word. >>> >>> But UAX #29 *would* work if the apostrophes were represented by U+02BC, >>> which is what I'm suggesting. >>> >>> Confusion between apostrophe and quoting -- blame the scribe who came up >>>> with the ambiguous use, not the people who gave it a number. >>>> >>> I'm not trying to blame anyone. I'm trying to fix the problem. >>> >>> I know this problem has a long history. >>> >>> English is taught as that squiggle being punctuation, not a letter. >>>> >>> I think we need make a distinction between the colloquial usage of the >>> word "punctuation" and the Unicode general category "punctuation" which has >>> specific technical implications. >>> >>> I somewhat wish that Unicode had a separate category for "Things that >>> look like punctuation but behave like letters", which might clear up this >>> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF >>> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are >>> actually modifiers, into that category too.) But we don't. And the English >>> apostrophe behaves like a letter, regardless of what your primary school >>> teacher might have told you, so with the options available in Unicode, it >>> needs to be classed as a letter. >>> >>> "don?t" is a contraction of two words, it is not one word. >>>> >>> This is utter nonsense. Should my spell-checker recognise "hasn't" as a >>> valid word? Or should it consider "hasn't" to be the word "hasn" followed >>> by the word "t", and then flag both of them as spelling errors? >>> >>> Is "fo'c'sle" the three separate words "fo", "c", and "sle"? >>> >>> The idea that words with apostrophes aren't valid words is a regrettable >>> myth that exists in English, which has repeatedly led to the apostrophe >>> being an afterthought in computing, leading to situations like this one. >>> >>> If anything, Unicode might have made a mistake in encoding two of these >>>> that look identical. How are normal users supposed to find both U+2019 >>>> and >>>> U+02BC on their keyboards, and how are they supposed to deal with >>>> incorrect >>>> usage? >>>> >>> Yeah, and there are fonts where I can't tell the difference between >>> capital I and lower-case l. But my spell-checker will underline a word >>> where I erroneously use an I instead of an l, and I imagine spell-checkers >>> of the future could underline a word where I erroneously use a closing >>> quote instead of an apostrophe, or vice versa. >>> >>> There are other possible solutions too, but I don't want to get into a >>> discussion about UI design. I'll leave that to UI designers. >>> >>> - Ted >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kenwhistler at att.net Thu Jun 11 13:47:52 2015 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 11 Jun 2015 11:47:52 -0700 Subject: Unicode Terms of Use Clarification (was: Re: free download of ISO/IEC 10646) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: <5579D7D8.4050108@att.net> Andrew, Fixed. Please refresh your cached copy of http://www.unicode.org/copyright.html For others who have been following this discussion, I'd like to make it clear that the Unicode terms of use have *never* been intended to be construed as legally disallowing people from viewing or downloading any publicly available content of the Unicode website or the various standards specifications and other documents posted there. The "for informational purposes" part of the Unicode terms of use is intended to discourage anyone from engaging in commercial resale of the content of the Unicode website or its standards, misrepresenting themselves either as the Unicode Consortium or as somehow licensed by the Unicode Consortium to do so, etc. The "in the creation of products supporting the Unicode Standard" part of the Unicode terms of use is intended to *permit* free use of the data and specifications in the development of products, but to discourage attempts to use the data in nonconformant or otherwise misleading implementations that would undermine the intended open interoperability of the Unicode Standard for all. Clear? --Ken Whistler, Technical Director, Unicode, Inc. On 6/11/2015 2:38 AM, Andrew West wrote: > > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! > > From shervinafshar at gmail.com Thu Jun 11 14:04:09 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 11 Jun 2015 12:04:09 -0700 Subject: Accessing the WG2 document register In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: On Thu, Jun 11, 2015 at 1:49 AM, Andrew West wrote: > The Unicode Consortium is largely controlled by a few large > American corporations, but ISO is open to participation by standards > organizations representing countries across the globe, and there are > currently thirty participating members of SC2, > Of course, a visit to Unicode Consortium Members page[1] would prove otherwise; of ten full members, three of them are not American entities. Of four institutional members (which are voting, just like full members), only one is American (UC Berkeley). Of twenty associate members, seven of them are non-American. Not to mentioning the long list of liaison members[2] from all over the world. [1]: http://www.unicode.org/consortium/members.html [2]: http://www.unicode.org/consortium/liaison-members.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Thu Jun 11 14:28:34 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 11 Jun 2015 12:28:34 -0700 Subject: =?UTF-8?Q?Unicode=C2=AE=20terms=20of=20use=20=28was=3A=20Re=3A=20free=20d?= =?UTF-8?Q?ownload=20of=20ISO/IEC=20=31=30=36=34=36=29?= Message-ID: <20150611122834.665a7a7059d7ee80bb4d670165c8327d.cfc5e3e70e.wbe@email03.secureserver.net> Andrew West wrote: > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! It looks to me like item A.3 says: "... solely for informational purposes *and* in the creation of products supporting the Unicode Standard..." (emphasis mine). I read the word "and" as "and/or", meaning that one could compliantly use the files just for personal information, OR to inform the creation of products. That is just my interpretation, and in theory the "and" might be intentionally inclusive and imply that the only compliant use of the files is to create products. But that seems unlikely, given the similar phrasing "view, use, reproduce, and distribute all documents and files," which wouldn't strictly require the reproducing and distributing parts, or the involvement of "all" documents and files. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Thu Jun 11 14:34:36 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 11 Jun 2015 12:34:36 -0700 Subject: =?UTF-8?Q?Unicode=C2=AE=20terms=20of=20use=20=28was=3A=20Re=3A=20free?= =?UTF-8?Q?=20download=20of=20ISO/IEC=20=31=30=36=34=36=29?= Message-ID: <20150611123436.665a7a7059d7ee80bb4d670165c8327d.b750eb7245.wbe@email03.secureserver.net> Oops. I guess Ken fixed the wording between Andrew's post and mine. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Thu Jun 11 21:02:39 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 12 Jun 2015 04:02:39 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: 2015-06-11 20:46 GMT+02:00 Bill Poser : > I agree with the recommendation of U+02BC. However, it is in fact rarely > used because most of the people who write these languages or create > supporting infrastructure are unawre of such issues. > > A small point: it isn't always the spacing diacritic that is used. In some > languages, e.g. Halkomelem, people use the spacing apostrophe if they have > to but prefer the non-spacing version. > True but on the examples I gave, spacing is needed: the apostrophe is intended to not collide with the previous or next letter, including when writing capital letters. In the Breton trigram "c?h" where it it plays a diacritic role, but as well in the English elision "don?t", the collision would occur after the apostrophe with the ascenders. The only alternative would have been to use a diacritic above one of the two letters for the diacritic apostrophe (and the best diacritic that would have been used for Breton or English would have been an acute accent over the first consonnant. But such usage of combining characters is non conforming for its use as an elision mark. 
An elision alone is not supposed to change the pronunciation of the remaining letters.So it would have not been appropriate for the elisions in English "don?t", or in French "j?ai" or "s?est" (this is not a strict rule, French or English also have exceptions where some combinations are used and written that change the way the letters are effectively phonetically realized, including with elisions: "don?t" is a perfect example where "n" looses its consonnant value as it is glued with the previous vowel to nasalize it and slightly stress it and in other contexts the following t is also muted as in "you don't have to do that" in fast speech: this is still the same contraction/elision and it is justified to keep the elision mark separate without noting how the following or next letter are contextually realized, but in all case the elision glues two syllables into only one and the apostrophe is written between the remaining letters of morphemes on each side). If you use a non-spacing version, this can in fact only occur graphically when the following letter is a small letter without ascenders : I still think that this is the spacing version, but what happens is just the effect of some contextual typographic kerning (the same thing that happens in pairs like "AV", "fi", "ij", "To"...) ---- Also you claim that U+02BC is rarely used for the elision apostrophe. This is plain wrong for French at least, even if people only have an ASCII apostrophe on their native keyboard (there are many word processors that will correctly enter the appropriate "curly" apostrophe as U+02BC instead of the ugly ASCII vertical quote. Even in English when you look at correctly typeset documents the ASCII quote is replaced by U+2BC (look at large section headings, book titles). U+02BC is also prefered in English for the elision apostrophe. For English you may want to read this: http://www.creativebloq.com/typography/mistakes-everyone-makes-21514129 ASCII and the computer keyboards just perpetuate the limited charset that was supported by old mechanical typewriters. I don't understand why PC keyboards could be extended to add many "multimedia" control keys or function keys, but not the traditional quotes that are needed (and even sometimes letters still missing in all "standard" physical keyboard leyouts for French, such as ?/?, ?/?, or frequent capitals with accents such as ?, which is however present on virtual onscreen keyboards for smartphones and tablets). It's high time to restore these letters (and also campaign so that manufacturer of physical keyboards will add a few more keys for national letters (they did it for Japanese only, why not for French or even English, to have more punctuation signs and missing letters or diacritics). It is perfectly possible to find a place for them on physical keyboards just above the numeric key (F1..F12 keys can be compacted if needed, and a couple of dead keys can also be mapped to the right of the Return key without reducing the size of the space bar or the Return/Backspace keys or other modifier keys). Some notebook manufacturers have used two additional preprogrammed keys (e.g. Acer, stupidly, for an unneeded additional Euro symbol whose location on AltGr+E or AltGr+4 in UK is standard, the second one being bound to the dollar symbol aslo not needed !). What is needed is 5 standard keys with standard keycodes, different from keycodes used for user-programmable keys (generally labelled PF1, PF2... 
but sometimes unlabelled) and different from application-dependent function keys (e.g. generic color keys, like on TV remote controls for navigation in menus: red, green, yellow, blue) Note that this is different from the existing feature on some keyboards defining programmable keys, whose layout is not programmable by the driver itself but by individual settings of the user, independently of the selected keyboard layout): adding about 5 or 6 keys would be very helpful and could greatly simplify layouts for languages/scripts that are complicated to input. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 12 09:57:06 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 12 Jun 2015 16:57:06 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <634725215.12783.1434121026204.JavaMail.www@wwinf2229> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote: > Confusion between apostrophe and quoting -- > blame the scribe who came up with the ambiguous use, > not the people who gave it a number. There’s a lot of confusion in writing, especially since this job was done on typewriters, from which computer keyboards are derived, while the narrowing of the character sets shifted from mechanics to code pages. This is all over, thanks to Unicode and its principle defined in TUS 1.3: > “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered.” Unfortunately the new precision and differentiation has sometimes been refused by sticking with legacy practice and for backwards compatibility’s sake. The use of a paired quotation mark (U+2019) as an English apostrophe, against the UTC’s initial successful attempt to disambiguate the two by recommending U+02BC (same glyph) for use as apostrophe, is a leading example of how the hard labor of ordering and clarification, aiming at what in ancient Greek is called “Kosmos”, can at any time be thrown back to chaos by applying short views and doubtful considerations. There’s been a discussion on this Mailing List in July of 1999, that was before the release of the 3.0.0 version of the Standard: “Apostrophes, quotation marks, keyboards and typography”, when the demand for simplification was already addressed with the corrections published as version 2.1: > Couldn't Unicode follow Microsoft and just remove the > recommendation that U+02BC be the recommended apostrophe character and > instead give U+2019 the dual meaning that it de-facto has already today? http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html [The quoted UTR#8 is now located at: http://www.unicode.org/reports/tr8/tr8-3.html] (The shift, as viewed at NamesList level, is now highlighted at http://charupdate.info#ambiguation.) On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote further: > If anything, Unicode might have made a mistake in > encoding two of these that look identical. > How are normal users supposed to > find both U+2019 and U+02BC on their keyboards, > and how are they supposed to deal with incorrect usage? I never believed it could have been a mistake, since we know that Unicode encodes semantics, not glyphs. Were there no modifier letters at all, Unicode would have had to introduce an apostrophe character, because an apostrophe is not at all the same as a quotation mark and does not work the same way either.
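That distinction between U+2019 and U+02BC is visible directly in the character properties. A minimal Python sketch (standard library only; the sample strings are made up, and the plain \w regex is only an illustration of why the Lm/Pf difference matters, not the full UAX #29 segmentation):

    import re
    import unicodedata

    for ch in ("\u2019", "\u02BC"):
        # Pf (final punctuation) for U+2019 versus Lm (modifier letter) for U+02BC
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category={unicodedata.category(ch)}, isalpha={ch.isalpha()}")

    # A word-character regex keeps U+02BC inside the word but splits on U+2019.
    print(re.findall(r"\w+", "don\u2019t"))   # ['don', 't']
    print(re.findall(r"\w+", "don\u02BCt"))   # ['donʼt']

This is why a spell-checker or word-boundary pass sees "don" plus "t" when the close-quote character is used, but a single token when the modifier letter is used.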
By handling text, not theories, Ted?Clancy at Mozilla clearly shows us that ambiguating the apostrophe with a close-quote brings up counterproductive complications that impact severely the productivity of the users. What, now, about ?normal users?? To fix the issue, consider that wishing to stay all the life long with one and the same keyboard layout while at the same time, changing for a new smartphone every year or two, needs some explanation. I guess it is because keyboards don't display anyhing by themselves except keycap labels, so you're never pretty sure about them.?? We should consider, too, that before being a matter of finding on keyboard, the matter is about using. How are we supposed to choose the right one out of four apostrophe/quotes (U+0027, U+02BC, U+2019, U+2018) while many of us seem not to know or not to bother about where to place it? But supposed we do, it would effectively be much more useful to tell the machine whether we want to type an apostrophe or a quotation mark, and as about that, the existing key is enough (see T.?Clancy?s blog). Is managing nested quotes already implemented in word processing? I never heard it is. Definitely, here?s a point where the simplification wished for a widespread word processing software worsened considerably the working conditions of all demanding people. The gap between word processing and desktop publishing is the smaller. Adding characters on your preferred keyboard on Windows is very easy using the Microsoft Keyboard Layout Creator, which has an end-user UI. As the compiled drivers are not even Windows-versioned (from NT-4 upwards), you can deploy them in your company and share among your friends without precautions. That is what users are supposed to do. If they don?t, Microsoft is not supposed to force upon. By contrast, if you want a Kana toggle to toggle the apostrophe key between U+0027 and U+02BC (and the quotation mark between U+0022 and a dead key for all quotation marks), you must use the Windows Driver Kit (along with some other resources) plus the MSKLC. If you wish to see it working, you may download an experimental keyboard layout on the unfinished webpage http://charupdate.info. It exemplifies also the Third level solution and the Compose key solution. I hope that helps. Marcel Schneider? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 12 10:02:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 12 Jun 2015 17:02:31 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> On Fri, June 5, William_J_G Overington wrote: > Markus Scherer wrote: >>> How are normal users supposed to find both U+2019 and U+02BC on their >>> keyboards, and how are they supposed to deal with incorrect usage? > I replied: >> Would it be possible to have wordprocessing software where one uses >> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC >> for input and could there be a "show in colour mode" where U+2019 is >> displayed in cyan and U+02BC is displayed in red, while >> everything else is displayed in black? > I am wondering whether some existing software packages > might be able to be used for the character inputting part using customized > keyboard short cuts. 
> https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts > I realize that the cyan and red colours cannot be done at present, > yet I have now thought of the alternative for now of being able to test what is > in the text by using a special version of an open source font > where there are distinctive glyphs one from the other > for the two characters. If your goal is to check right now what apostrophes are in a given text, an easy way is to do a search for U+02BC and to ask the software to highlight all. Of PagePlus I?ve got only an expired demo version, but I can assure that on Word, a side pane may even show you the pages with all instances highlighted, and allows you to browse them. To start, press Ctrl+F and type a modifier letter apostrophe into the search bar, or select one in your text and then press Ctrl+F. Getting the apostrophes colored and with a distinctive glyph is possible too. As you are talking about changing the font, I suppose you are in front of raw text. In this case you can do a search-and-replace which gives all U+02BC a red color and another font, say Tahoma when the text is in Arial. Again I speak for Word, where a Plus button shows a Formatting button for the replacing text (replace by the same but with a font formatting on typeface and color), but I suppose PagePlus allows the same proceeding. About suggesting options, one might think about a blinking markup which would allow to find the problematic apostrophes even faster. As a shortcut for U+02BC I?d prefer CONTROL APOSTROPHE because it may occur more often. However, adding something on your keyboard using Right Alt (that is AltGr) is much more efficient because: ? You add whenever you want and nearly what you want (if no Kana and no chained dead keys, you get the needed characters on your keyboard the time you write to lists and fora). ? You are not bound to a given high-end software (the driver works whenever you type on your keyboard). ? You go on to be an active part of your communities (Unicode, Serif, ...) by sharing the resulting drivers with other people. Definitely, any shortcut for an apostrophe would slow down the writing speed, therefore Apostrophe is preferred on Base shift state. So you may design a variant keyboard layout with U+02BC instead of U+0027, even if that be the only change, and toggle between the new one and the usual one by means of your OS's facilities. Or you may choose to add a Kana toggle to toggle the apostrophe key directly inside the driver, but achieving this is somewhat longer. For an example, you may look at the unfinished page http://charupdate.info where there is already an experimental keyboard layout for download. With U+02BC MODIFIER LETTER APOSTROPHE. I hope that helps. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Jun 12 11:07:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 12 Jun 2015 17:07:31 +0100 (BST) Subject: ISO committees (from Re: Tag characters and localizable sentence technology (from Tag characters)) In-Reply-To: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> References: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> Message-ID: <18980482.52510.1434125251835.JavaMail.defaultUser@defaultHost> In my post of 22 May 2015, reproduced below, is the following. > ... 
and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. As there has been discussion of ISO committees in this mailing list recently and it is clear that there are a number of people involved with ISO on this mailing list who have expert knowledge of the structures and rules of ISO committees, I write to ask advice. Regarding my idea that localizable sentence technology could be implemented in Unicode by reference to detailed codes in an ISO document (not yet written), which would be the best ISO committee to become in charge of producing that document please? William Overington 12 June 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 22/05/2015 - 12:01 (GMTST) To : unicode at unicode.org Subject : Tag characters and localizable sentence technology (from Tag characters) Tag characters and localizable sentence technology (from Tag characters) I refer to the following documents, the first about localizable sentences and the second about, amongst other matters, applying tag characters using a new encoding format. http://www.unicode.org/L2/L2013/13079-loc-sentance.pdf http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Starting from the idea of the markup bubble from the first document and applying the tag method and the ISO standard document method from the second document, there arises the following possibility for the future for localizable sentence technology. A single character would be added into Unicode, the name of the character being LOCALIZABLE SENTENCE BASE CHARACTER and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. Please find attached a design for the glyph for the LOCALIZABLE SENTENCE BASE CHARACTER character. I designed the glyph by adapting and then combining the designs for localizable sentence markup bubble brackets from the first of the two documents referenced earlier in this text. Each localizable sentence, carefully written so as to avoid in use any reliance as to meaning on any sentence previously used in the same document, would have a meaning expressed in words and possibly also have a glyph: more commonly used localizable sentences each having a glyph yet not all other localizable sentences necessarily having a glyph, though some could have a glyph, as desired. William Overington 22 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Jun 12 11:18:54 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 12 Jun 2015 09:18:54 -0700 Subject: ISO committees Message-ID: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> William_J_G Overington wrote: > Regarding my idea that localizable sentence technology could be > implemented in Unicode by reference to detailed codes in an ISO > document (not yet written), which would be the best ISO committee to > become in charge of producing that document please? 
Sounds like something TC 37 might enjoy: http://www.iso.org/iso/iso_technical_committee.html%3Fcommid%3D48104 https://en.wikipedia.org/wiki/ISO/TC_37 -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From petercon at microsoft.com Fri Jun 12 11:51:29 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 12 Jun 2015 16:51:29 +0000 Subject: ISO committees In-Reply-To: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> References: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> Message-ID: William (who, IIRC, lives in the UK) would need to start by engaging with BSA. People can't engage directly as individuals with TC 37 or any other ISO committee. ISO membership is not composed of individuals, but of countries, and representation is from each country's authorized standards organizations. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, June 12, 2015 9:19 AM To: Unicode Mailing List Subject: Re: ISO committees William_J_G Overington wrote: > Regarding my idea that localizable sentence technology could be > implemented in Unicode by reference to detailed codes in an ISO > document (not yet written), which would be the best ISO committee to > become in charge of producing that document please? Sounds like something TC 37 might enjoy: http://www.iso.org/iso/iso_technical_committee.html%3Fcommid%3D48104 https://en.wikipedia.org/wiki/ISO/TC_37 -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Fri Jun 12 15:13:15 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 12 Jun 2015 22:13:15 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> References: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> Message-ID: 2015-06-12 17:02 GMT+02:00 Marcel Schneider : > >> Would it be possible to have wordprocessing software where one uses > >> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC > > CONTROL and CONTROL+SHIFT cannot work on French keyboards where the > existing ASCII apostrophe is on the numeric row where there are also ascii > controls mapped matching the ASCII open brace that is itself mapped on > ALTGR (or CTRL+ALT) in order to generate instead the C0 control. > In general it is a bad idea to map any printable character or combining character or dead key with the CTRL or CTRL+SHIFT modifiers associated to any position in the alphanumerica part of the keyboard: this should remain reserved to map function keys or C0/C1 controls only, that local applications will use to assign them application-specific application functions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Jun 12 22:11:49 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 12 Jun 2015 20:11:49 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <557247F2.9050902@efele.net> Message-ID: <557B9F75.7040107@efele.net> An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Sat Jun 13 01:24:11 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 13 Jun 2015 08:24:11 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <557B9F75.7040107@efele.net> References: <557247F2.9050902@efele.net> <557B9F75.7040107@efele.net> Message-ID: I don't agree with this Grévisse definition (and I'm not alone, other grammarians and dictionaries don't follow Grévisse, and even the French Academy disagrees). Maybe this is a form of composition, but the correct view is not that it creates a new word; it just means that words take new semantics in specific contexts (here, idiomatic expressions where the term "pomme" is a minor shift of meaning, which also occurs in "pomme de pin" = "pineapple", or "chou pomme", and as well in the alternate semantic of "pomme" related only to its round shape to designate a human head and by extension a person, also used in idiomatic expressions like "c'est pour ma pomme"). But the word itself is not different and in fact the etymology is the same; this was only a progressive extension of semantics that finally created an idiomatic expression, but not a new word. A compound word ("mot composé") needs a clear gluing, by a hyphen, or apostrophe, or absence of space and punctuation. Grévisse still records much good advice that is too frequently forgotten today, but here it goes too far into details that are not needed to preserve the semantics of the language. Another proof is the cuisine expression "pomme frite", which does not mean a fried apple but a fried potato: "pomme de terre" has been abbreviated to only "pomme", and this term even disappears now when the participle verb "frite", used as an epithetic adjective, is then substantivated. The idiomatic expression "pomme de terre" is not so much idiomatic; this is just an extension lemma added to the term "pomme" (apple). The composition has in fact never been clearly attested, but if it had been, hyphens would have been used long ago (many hyphens are now starting to disappear in compound words, replaced by direct gluing, which is admitted in most cases). 2015-06-13 5:11 GMT+02:00 Eric Muller : > On 6/10/2015 9:37 PM, Philippe Verdy wrote: > The French "pomme de terre" ("potato" in English, French vulgar synonym : > "patate") is a single lemma in dictionaries, but is still 3 separate words > (only the first one takes the plural mark), it is not considered a "nom > composé" (so there's no hyphens). > > Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of > the language, chapter 7 The words, section 3 Formation of new words, > article 2, Composition, very first paragraph (179 overall): > > --- > By *composition*, language creates new words, either by combining simple > words with existing words, or by preceding these simple words with > syllables that have no independent existence: > > *Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. * > > A word, despite being formed of graphically independent elements, is > *composed* as soon as it brings to mind, not the distinct images of each > of the words from which it is composed, but a single image. Thus the > composites *hôtel de ville, pomme de terre, arc de triomphe* each remind > of a unique image, and not of the distinct images of *hôtel* and of > *ville*, of *pomme* and of *terre*, of *arc* and of > *triomphe.
* > > *---* > > *(h?tel de ville* = city hall; *pomme* = apple, *de* = of, *terre* = > earth) > > Paragraph 181, 3rd remark: > > --- > Sometimes the elements composing [the word] are welded in a simple word: > *Bonheur**, contredire, entracte; *sometimes they are connected by an > hyphen: *chou-fleur, coffre-fort;* sometimes they stay independent > graphically: > > > > *Moyen ?ge, pomme de terre. --- *(?Le Gr?visse? as we affectionately call > it, or *Le bon usage / French Grammar with remarks on today?s french > language*, is a must-have for the student of French. It is encyclopedic > in its depth, and has tons of examples and counter-examples. Interestingly, > the French wikipedia page says ?a descriptive grammar of French?, while the > English wikipedia page says ?a prescriptive grammar?; it?s both!) > > I agree that we don?t need a new space coded character. I was just > pointing out that some of the arguments for a new coded character for the > apostrophe in *don?t* apply equally well to the spaces in the word *pomme > de terre*. > > Eric. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 13 01:28:03 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 13 Jun 2015 06:28:03 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Nice article, as I recall. (Been a long time.) Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Kalvesmaki, Joel Sent: Friday, June 5, 2015 7:27 AM To: Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N'T."Language59, no. 3 (1983): 502-513. It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435 From verdy_p at wanadoo.fr Sat Jun 13 02:02:40 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 13 Jun 2015 09:02:40 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). 
- and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. So there remains in practice U+02BC and U+02C8 for this apostrophe letter (which one you'll use is a matter of preference but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. in Polynesian languages where the distinction was made even more clearer by using right or left rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonnant with two variants like in Arabic or even U+02B0 when this is just a breath without stop: the full range range U+02B0-U+02C1 offers much enough variations for this letter if you need slight phonetic distinctions). 2015-06-13 8:28 GMT+02:00 Peter Constable : > Nice article, as I recall. (Been a long time.) > > > Peter > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of > Kalvesmaki, Joel > Sent: Friday, June 5, 2015 7:27 AM > To: Unicode Mailing List > Subject: Re: Another take on the English apostrophe in Unicode > > I don't have a particular position staked out. But to this discussion > should be added the very interesting work done by Zwicky and Pullum arguing > that the apostrophe is the 27th letter of the Latin alphabet. Neither > U+2019 nor U+02BC would satisfy that position. See: > > Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. > "Cliticization vs. Inflection: English N'T."Language59, no. 3 (1983): > 502-513. > > It's nicely summarized and discussed here: > http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ > > jk > -- > Joel Kalvesmaki > Editor in Byzantine Studies > Dumbarton Oaks > 202 339 6435 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:05:15 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:05:15 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1245207844.9505.1434204315176.JavaMail.www@wwinf1d14> On Fri, Jun 5, 2015, David Starner wrote: > On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis wrote: >> I agree that conflating apostrophes and quotes is a source of >> problems, however, existence of the MODIFIER LETTER [same glyph as >> used for English contractions] in Unicode is a coincidence which >> should not have an effect on usage of apostrophes in English. > Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely then creating a new one. In fact this would be a return to the state until version 2.0.0. 
http://www.unicode.org/Public/2.0-Update/NamesList-1.txt Since version 3.0.0 (or more precisely, since update 2.1), U+2019 is preferred for apostrophe, not U+02BC any longer. http://www.unicode.org/Public/3.0-Update/NamesList-3.0.0.txt Prior to this discovery, I supposed it could have been later ISO prescriptions which triggered it the wrong way, but now it's impossible ISO initiated the move of preferred apostrophe from U+02BC to U+2019. This change took place not sooner than in update 2.1, whereas the merger was at 1.1 and ISO stands for stability. So ISO could never agree that the preferred character for English apostrophe stopped to be U+02BC and started to be U+2019, against the Stability Policy, and presumably using a gap in this policy which possibly don?t cover usage recommendations... I must do some more research in the Archives to find out more about why the apostrophe and the single close quote were ambiguated?a process that needs even a new word to put on it, as ordinarily everybody works for disambiguation... However, the 1999 Mail Archive already shows it was for simplification's sake, in word processing software. Could anybody tell us more about this issue? IMHO, the mischievous apostrophe that we use today, is due to a shortcut, narrowed design, and uncomplete check-ups. Briefly, the disconnect was between Unicode whose global approach lead to complete solutions including all you need for text handling and word processing, and Microsoft whose industrial approach prioritized the ready make-up of output appearance, letting out of scope the subsequent lifestages of text. The Windows code page 1252 apostrophe-close-quote looks nice on screen and in the documents, but as soon as you need to convert quotes from British to American or from free to nested, the only way to prevent your text from becoming unusable is to hand-process the quotes one by one. The money you saved when purchasing the software, is lost thousandfold at use. Microsoft?s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:21:14 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:21:14 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1936848241.9690.1434205274475.JavaMail.www@wwinf1d14> On Sun, Jul 18, 1999, Markus Kuhn wrote: > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0557.html > I addition, I feel that the current ISO 8859 oriented national keyboard > standards are not adequate for modern Unicode-era word processing > practices, as they put obsolete typewriter characters such as U+0027 on > too prominent keys, while they have no key positions for the extremely > frequently needed typesetting characters that are for instance supported > by CP1252 (directional single and double quotes, en and em dashes, > etc.). Software either has to use shaky algorithms to make educated > guesses on which character the user might have meant (such as Word tries > to do), or sequences of ASCII characters are interpreted with new > semantics (such as both TeX and Word do), in order to give typists some > compromise access to these characters. > > I think it is urgent time to revise national keyboard standards here. 
We > really need standardized ways to easily enter say at least > > 2018 LEFT SINGLE QUOTATION MARK > 2019 RIGHT SINGLE QUOTATION MARK > 201C LEFT DOUBLE QUOTATION MARK > 201D RIGHT DOUBLE QUOTATION MARK > 2013 EN DASH > 2014 EM DASH > > on keyboards for English language users, and corresponding extensions on > other national keyboard standards. This might be a good opportunity to > introduce on US keyboards the Level 2 Select key (AltGr), while on > European keyboards is is probably sufficient to just add appropriate > labels to a number of new Level 2 Select positions. > On Sun, Jul 18, 1999, Mark Davis wrote: > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html > However, I agree that having the curly quotes (single and double) on the > standard keyboard would be handy. I switch back and forth between a Mac and > Windows. On the Mac, the option key (a second level shift) has always made > this easy. The installable Windows international keyboard is not nearly so > useful, since you can't just leave it on all the time (it messes up your > used of quotation marks). On Thu, Jun 4, 2015 at 2:38 PM, Markus Scherer wrote: > How are normal users supposed to > find both U+2019 and U+02BC on their keyboards, > Yes this may be the main issue, how to get at hand U+20BC, U+2019 and U+2018 as well, plus the actual U+0027, on keyboards that are derived from typewriters? ones. Word processors are overasked with management of all four, while many users whish to stay typing ?apostrophe? for all of them. And not to change for another keyboard driver(?). A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout, for example in the deadlist of apostrophe on the US International keyboard, a layout where U+2019 is already found, along with U+2018. You may choose a double stroke on Apostrophe to generate the modifier letter. But as this layout obviously is not so useful, you?ll prefer to get them on the US Standard layout, or depending on where you live, on the UK standard or extended or any other layout. A more achieved solution is obtained with the Windows Driver Kit, a free development kit which allows to implement a Kana toggle, to toggle Apostrophe on the US Standard keyboard between U+0027 and U+02BC *or* U+2019. The least used among all three will be put into the deadlist, when adding one dead key on this layout, say Grave. Then, [Grave] [Apostrophe] will result in the missing apostrophe character. > how are they supposed to deal with incorrect usage? If the document is already incorrect, there will be nothing to do IMHO than check them one by one. Theoretically, word processors could integrate an exhaustive checking algorithm with an exhaustive dictionary. Which such a tool, there would be no ?Apostrophe Catastrophe? as it has been called: > http://www.newrepublic.com/article/113101/smart-quotes-are-killing-apostrophe > (found by a search engine). So, on actual keyboard layouts, avoiding the Apostrophe Catastrophe would then have been unfeasible?the like as with actual consumption habits, avoiding a number of other catastrophes is unfeasible as well... Nevertheless, this morning I opened once more the Microsoft Keyboard Layout Creator. Ten minutes later I got the finished complete package of the US American keyboard layout with U+02BC MODIFIER LETTER APOSTROPHE and all English quotation marks in one dead key on ?Grave?, that is key number E00 (ISO/IEC?9995-1). 
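Conceptually, such a dead key is just a lookup from the next keystroke to an output character. The sketch below models that in Python; the particular pairs are assumptions chosen to match the description that follows, not the actual contents of the published drivers:

    # Rough model of a dead-key table; the pairs below are illustrative
    # assumptions, not the real mappings of the layouts mentioned here.
    DEAD_GRAVE = {
        " ": "\u0060",   # dead key + space gives the grave accent itself
        "'": "\u02BC",   # MODIFIER LETTER APOSTROPHE
        "[": "\u201C",   # LEFT DOUBLE QUOTATION MARK
        "]": "\u201D",   # RIGHT DOUBLE QUOTATION MARK
        "9": "\u2018",   # LEFT SINGLE QUOTATION MARK
        "0": "\u2019",   # RIGHT SINGLE QUOTATION MARK
    }

    def apply_dead_key(next_key: str) -> str:
        # Undefined combinations typically fall back to the dead-key
        # character followed by the pressed key.
        return DEAD_GRAVE.get(next_key, "\u0060" + next_key)

    print(apply_dead_key("'"))   # ʼ  (U+02BC)
    print(apply_dead_key("["))   # “  (U+201C)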
The same way I made up the keyboard layout for the United Kingdom, which uses AltGr, so the apostrophe and all quotes are also on AltGr. Ten minutes, again. If you don?t use the grave accent (or AltGr), there is strictly no change on these keyboard layouts, because I loaded the original Windows US and UK layouts into the MSKLC. If you use the grave accent, you must type a whitespace after hitting the grave key to get the grave accent (in conformance to the standard behavior of dead keys). ? To get the modifier letter apostrophe, type ?grave? - ?apostrophe?. ? To get the first level quotation marks (UK: single quotes; US: double quotes), type ?grave? followed by a square bracket, opening or closing (left or right). ? To get the nested quotation marks (UK: double quotes; US: single quotes), type ?grave? followed by whether ?9? or ?0?, that is, the key where you have the opening or closing parenthesis. For United Kingdom only: Additionally to the above, you may type also the following: ? To get the modifier letter apostrophe, type AltGr?+??apostrophe?. ? To get the single quotation marks, type AltGr?+?a?square bracket, opening or closing (left or right). ? To get the double quotation marks, type AltGr?+??9? or AltGr?+??0?. These MSKLC-generated keyboard drivers can be used on > all Windows versions (from NT?4.0 upwards) and > all system architectures (32?bit and 64?bit). > In the documentation of the MSKLC, Microsoft writes: ?On Vista [and later], the keyboard layout will automatically be added to the language bar on install and removed on uninstall.? > You may define a shortcut to switch, add a nice icon in the language bar / Task bar, or set it as default. They are named ?kbdenukw? and ?kbdenusw?. ?w? stands for ?Wholesome?. This is completely free and unlicensed software. It can be downloaded from now on at the following short URLs: kbdenukw: http://bit.ly/1QVeby6 kbdenusw: http://bit.ly/1MRxGab However, this being the 1.0 version, and the goal was to be fast, I forgot the En and Em dashes, and that I would add U+00B1, U+2260, U+00A0... If there will be a number of downloads, a 2.0?version could follow and be announced here. To get the MODIFIER LETTER APOSTROPHE ready on the apostrophe key like today the close-quote as smart apostrophe, you must disable the smart quotes in your word processor, and then add an autocorrect to convert U+0027 to U+02BC. If your word processor allows you to set apart the smart single and double quotes, you may enable the double quotes. I hope that helps. ? Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:31:02 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:31:02 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <271675363.9860.1434205862181.JavaMail.www@wwinf1d14> On June 3, 2015, Ted Clancy wrote: > https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ I wish to thank you personally for having brought up this issue, as well as Mr?Grosshans for having posted the URL launching this thread. However, your solution is not complete, and I don?t agree fully with all your statements. So let?s try to check up what?s the matter, and then look what might be done. First, the Unicode Technical Committee is *not* very wrong. 
A look in the Standard 2.0.0 or, even simpler, a glance at the first NamesList in the UCD, that is the source code for the Version 2.0.0 Code Charts, shows that originally, the UTC recommended the use of U+02BC MODIFIER LETTER APOSTROPHE for the English apostrophe as well as for the apostrophe on the whole, and to reserve the use of U+2019 RIGHT SINGLE QUOTATION MARK for what it is: close-quote. It wasn’t until the 2.1 update that the preferred character for apostrophe was shifted from U+02BC to U+2019, to conform with the usage (and presumably at the demand) of Microsoft, which did not comply with the Standard, despite being a full member of the Unicode Consortium (and having thus agreed at the beginning that apostrophe should be U+02BC). I’m pretty sure that when they moved the apostrophe preference from U+02BC to U+2019, the Unicode Technical Committee and the Unicode Editorial Committee acted against their will. My opinion is inferred from the original UTC position and from comparing two versioned NamesList extracts among those displayed at charupdate.info#ambiguation. Second, your solution is *not* complete. Even if word-processors managed nested quotes, one single key for all occurring quotation marks of a given locale, as British English or US English, would scarcely be sufficient. Here’s why. Everybody knows that quotes are used not only to quote, but also to delimit, to warn or generally to flag otherwise than as a quotation. The latter occurs commonly when the writer (and by transposition, the speaker, making a quotes gesture) wants to flag a word or an expression as being controversial, not true, not in his belief, or ironical. From this they are sometimes called “irony quotes”. Languages that use angle quotation marks (chevrons) to quote, use comma quotation marks to flag. In English, I suppose that you need to use the “other” quotation marks to flag. So in US English you would flag using single quotes, while in British English you would use double quotes, as in French. However, I don’t know how that works inside quotations (while in languages such as French and German this is no problem). Therefore, the user should always have means to type exactly the quotes he wishes to type. This will result in the need of at least one dead key or some supplemental dead list entries, and/or supplemental AltGr positions, or even supplemental shift states (Kana). No single key position can ever do the whole job. Third (but this is an off-topic discussion in this thread and is set aside in your blog post), the close-quote as an apostrophe is not good for French either, regardless of how many words are around. The use of U+2019 as apostrophe hasn’t led in French to any “Apostrophe Catastrophe” only because in French, few people use single comma quotes (in rare cases or for special purposes), and because properly leading apostrophes are often placed otherwise, as in “Y’a” for “Il y a”, instead of “’Y a”. What shall we do? As you draw it, the so-called smart quotes algorithm must be reengineered and cannot stay working as it does, so users must be informed that to type “unexpected” quotes, they’ve to hit the key two times, or to type another character just after. But users must also make an effort by themselves instead of wishing to stay with the inherited keyboard layout regardless of what changes are on-going, and at the same time, to get more Unicode characters as reasonably supportable on this old keyboard.
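One possible shape for such a reworked rule (purely a sketch, assuming that an ASCII apostrophe typed between two letters should become U+02BC and that every other occurrence is left to the quote-pairing logic; no existing word processor is claimed to do exactly this):

    import re

    APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE

    def smart_apostrophe(text: str) -> str:
        # Replace a typed ASCII ' with U+02BC only when flanked by letters;
        # leave all other occurrences alone for the quote-pairing logic.
        return re.sub(r"(?<=[^\W\d_])'(?=[^\W\d_])", APOSTROPHE, text)

    print(smart_apostrophe("don't say 'maybe'"))
    # donʼt say 'maybe'  (only the intra-word apostrophe is rewritten)

A rule this simple still misses leading and trailing elisions such as “’twas” or “runnin’”, which are exactly the cases the thread identifies as hard, so it is at best a starting point.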
In other words, the gap between the expected rendering and the actually conceded input must be filled up whether by using a set of customised (or perhaps one day, standardised) autocorrect entries (see one suggestion at charupdate.info#curly) or by typing appropriate characters on extended keyboard layouts (which don?t lead to change for another hardware, except for special purposes). Thanks again, because without this discussion, I?would have released more keyboard layouts with the wrong apostrophe! Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 13 10:10:06 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 13 Jun 2015 15:10:06 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I should qualify my statement. The Zwicky and Pullum article was a nice piece of linguistic analysis regarding the morphological characteristics of ?n?t?. Their remark about apostrophe, however, was not so much about orthography ? which was not the focus of their article ? but was rather a way of putting an exclamation on their findings. When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That?s because there isn?t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: Di Sciullo, Anna Maria, and Edwin Williams. 1987. On the definition of word. (Linguistic Inquiry monograph fourteen.) Cambridge, Massachusetts, USA: The MIT Press. Peter From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Saturday, June 13, 2015 12:03 AM To: Peter Constable Cc: Kalvesmaki, Joel; Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). - and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. 
So there remains in practice U+02BC and U+02C8 for this apostrophe letter. Which one you’ll use is a matter of preference, but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. in Polynesian languages, where the distinction was made even clearer by using the right or left half rings U+02BE/U+02BF, or the glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonant with two variants, as in Arabic, or even U+02B0 when it is just a breath without a stop: the full range U+02B0–U+02C1 offers more than enough variation for this letter if you need slight phonetic distinctions).

2015-06-13 8:28 GMT+02:00 Peter Constable >:
Nice article, as I recall. (Been a long time.)
Peter
-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Kalvesmaki, Joel
Sent: Friday, June 5, 2015 7:27 AM
To: Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode
I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See:
Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N'T." Language 59, no. 3 (1983): 502–513.
It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/
jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mark at macchiato.com Sat Jun 13 10:27:31 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 13 Jun 2015 17:27:31 +0200
Subject: Another take on the English apostrophe in Unicode
In-Reply-To:
References:
Message-ID:
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable wrote:
> When it comes to orthography, the notion of what comprise words of a
> language is generally pure convention. That’s because there isn’t any
> single *_linguistic_* definition of word that gives the same answer when
> phonological vs. morphological or syntactic criteria are applied. There are
> book-length works on just this topic, such as this:
In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)
Mark
*« Il meglio è l’inimico del bene »*
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From charupdate at orange.fr Mon Jun 15 01:23:46 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 15 Jun 2015 08:23:46 +0200 (CEST)
Subject: Another take on the English Apostrophe in Unicode
Message-ID: <39908195.1592.1434349426297.JavaMail.www@wwinf2229>
On Fri, Jun 12, 2015, Philippe Verdy wrote:
> 2015-06-12 17:02 GMT+02:00 Marcel Schneider :
>>> Would it be possible to have wordprocessing software where one uses
>>> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC
>> CONTROL and CONTROL+SHIFT cannot work on French keyboards where
>> the existing ASCII apostrophe is on the numeric row where there are
>> also ascii controls mapped matching the ASCII open brace that is itself mapped
>> on ALTGR (or CTRL+ALT) in order to generate instead the C0 control.
> In general it is a bad idea to map any printable character or combining character or dead key with
> the CTRL or CTRL+SHIFT modifiers associated to any position in the alphanumeric part
> of the keyboard: this should remain reserved to map function keys or C0/C1 controls only,
> that local applications will use to assign them application-specific functions.

Even the Language bar uses the upper row to define shortcuts with Control, Shift+Control and Shift+Alt to switch between keyboard layouts, and these are prioritized. So to test the shortcuts with Clavier+, I first had to remove the shortcuts in the Language bar. Then the way was free to test Mr Overington’s shortcuts for curly apostrophes (I will send the result just after). When I deleted the shortcuts in Clavier+ to test your advice, I found no application shortcuts for Ctrl+4, while the keys 1, 2, 5 and 0 are usually mapped as Word shortcuts with CONTROL, and the heading formatting is with ALT. But indeed, among ASCII controls I found eight on the French keyboard:

//VirtualKey |ScanCd |ISO_# |Ctrl
{VK_ESCAPE /*T01 */ ,0x001b
{VK_CANCEL /*X46 */ ,0x0003
{VK_BACK /*T0E E13*/ ,0x007f
{VK_OEM_6 /*T1A D11*/ ,0x001b
{VK_OEM_1 /*T1B D12*/ ,0x001d
{VK_OEM_5 /*T2B C12*/ ,0x001c
{VK_RETURN /*T1C C13*/ ,'\n'
{VK_OEM_102 /*T56 B00*/ ,0x001c

On the alphanumerical block, there are always the same five, three among them near the Enter key. The British-American apostrophe key is exempt from Controls too. This is probably why Mr Overington wants to use CONTROL and SHIFT+CONTROL for U+2019 and U+02BC, as custom application shortcuts.

I had once defined a universal Latin layout in the MSKLC, but as there is neither Kana nor chained dead keys, I allocated some dead keys (among a total of about 25) on CONTROL positions where I supposed there wouldn’t be any shortcuts in any application, as on ?, ^, and even the high digits on the upper row. It must be at http://dispoclavier.monsite-orange.fr, and somebody was very astonished, precisely because this may become buggy. Even worse, it is disabled: Winwordc.exe did not process these dead keys. Other applications did, as I remember. But the layout was far too hard to remember, as I filled in the double-diacritic letters at the next free positions in the alphabet. This way I could allocate 1,921 Unicode characters (by editing the KLC source in spreadsheets), but now that I know and use the WDK, I won’t make such a layout again. Now I’m trying to put even more characters, but with chained dead keys, for double-diacritic letters and for easy-to-remember compose sequences. For example, you will enter U+01BF LATIN LETTER WYNN by typing simply COMPOSE, w, y, n, n, or less if not needed to disambiguate. Same for digraphs and ligatures.
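To make the compose idea concrete, here is a minimal sketch (my own Python illustration, not Marcel’s actual driver code; the sequence table is invented for the example) of how a chain of dead-key strokes can be resolved against a table of compose sequences, accepting a shorter prefix as soon as it is unambiguous:

    # Minimal sketch (Python) of compose-sequence resolution; the table
    # entries are invented for the example.
    COMPOSE_TABLE = {
        "wynn": "\u01BF",   # LATIN LETTER WYNN
        "thorn": "\u00FE",  # LATIN SMALL LETTER THORN
        "oe": "\u0153",     # LATIN SMALL LIGATURE OE
    }

    def compose(keys):
        """Return the composed character once `keys` matches unambiguously, else None."""
        matches = [seq for seq in COMPOSE_TABLE if seq.startswith(keys)]
        if len(matches) == 1:
            return COMPOSE_TABLE[matches[0]]   # "wyn" already identifies "wynn"
        if keys in COMPOSE_TABLE:
            return COMPOSE_TABLE[keys]         # complete sequence typed in full
        return None                            # still ambiguous: keep collecting keys

    assert compose("wyn") == "\u01BF"          # COMPOSE, w, y, n suffices here

Two distinct apostrophe letters, say U+02BC and U+2019, could be given separate, equally memorable sequences in exactly the same way.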
The test version I use is now adapted to type the letter apostrophe U+02BC (I?ll send after to the List some news about). Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 01:28:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:28:44 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <385492682.1634.1434349724328.JavaMail.www@wwinf2229> On Fri, June 5, William_J_G Overington wrote: > I replied: >> Would it be possible to have wordprocessing software where one >> uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input [...] > I am wondering whether some existing software packages > might be able to be used for the character inputting part using customized > keyboard short cuts. There is a very good shortcut utility for Windows which doesn?t modify the registry except to launch the app automatically: http://utilfr42.free.fr/util/Clavier.php Using this software, I tried, you can define CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input.?After defining the shortcut by typing it, you will have to paste the character into the text editing field. You can specify that these shortcuts work only in the word processing software you use, as you wish to. To achieve this, pick the ?target?icon, drag and drop it into an open window of the target application, its name will be added in the bar and you?ll have to choose that the shortcut be enabled in this software. You may even define that the shortcuts work with LEFT CONTROL only, in order to keep RIGHT CONTROL for other shortcuts with APOSTROPHE. As CONTROL SHIFT is not easy enough to type for character input, I?d suggest to define CONTROL L for U+2019, and to add CONTROL SEMICOLON for U+2018. This is because on the square bracket keys, there are already control characters allocated on CONTROL shift state. On these keys you may however choose LEFT ALT or RIGHT ALT for a shortcut. BTW: Clavier+ allows even to command the pointer and to enter mouse clicks, so that a shortcut can execute an action on the graphic interface of the app. This is very useful to add app shortcuts in apps that don?t allow customising. It?s free, and the interface can be switched to English. To download your copy: http://utilfr42.free.fr/util/Clavier.php > I have now thought of the alternative for now of being able to test what is > in the text by using a special version of an open source font where there are > distinctive glyphs one from the other for the two characters. I discovered that when U+02BC is input by autocorrect in replacement of U+0027, and the current font does not contain U+02BC (for example Lucida Console), then U+02BC is displayed in the fall-back font (Courier New) and the font-setting is *not* altered. This way, you have the MODIFIER LETTER APOSTROPHE displayed in a distinctive font at input. This is observed in Microsoft Word Starter, where every out-of-font character typed as such triggers the font-setting to fall-back, which is very annoying. Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Mon Jun 15 01:40:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:40:57 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> On Wed, Jun 10, 2015, Ted Clancy wrote: > The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, > which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. [...] > I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. > There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. There?s however one UI whose design is a matter of everybody, and every typist should be interested in, that is, we all, since everybody does at least partly a typist?s work. We?re all typists, and we?re all invited to help design that UI for ourselves and for our relations, friends, colleagues. This week-end I?switched my current apostrophe from U+2019 to U+02BC by updating my (already customised, but still unfinished) French keyboard layout. As we?ve already one prominent dead key, I?d added two others on Base shift state. From now on, I type GRAVE ? APOSTROPHE / QUOTATION MARK for a single or double opening quote, and get the closing one by using the ACUTE dead key. This recalls some legacy practice where spacing accents were used. The typographic apostrophe U+02BC is CIRCUMFLEX ? APOSTROPHE. (I?d U+2019 on the apostrophe key when Kana was toggled off!) In addition, I?ve added an autocorrect for U+0027 to be replaced with U+02BC when writing text on Microsoft Word Starter. The idea that we can?t touch at our keyboard except on keycaps as they?re labeled, or that we can at most change for another predefined layout which often doesn?t match these labels, is another regrettable widespread myth. As users, we confine ourselves in a receptive and waiting position, wishing and suggesting, and doing all imaginable and improbable things except adding a handful of characters on our keyboard straight before us, while in the meantime, in obliging anticipation, the world?s biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at http://www.microsoft.com/en-us/download/details.aspx?id=22339 If this call were taken serious, all these discussions about keyboards would take another turn. Every corporate manager would make sure that his employees use appropriate keyboard layouts to save time and enhance output quality. To achieve this, he would not hesitate one minute to put himself at the place of a UI designer and to get that poor keyboard UI molted to a performative worktool. And to deploy the result at corporate level. The MSKLC is worth spending a day to get started with and to create a completed keyboard layout from one?s preferred one, because this will save much time and anger. You may design one where apostrophe and single quotes are far one from another (as on Saturday?s kbdenusw), to avoid mistyping and spelling errors without having to wait for any better on-screen UI. However, I won?t hide that the MSKLC does not allow to chain dead keys, nor does it support Kana shift states, things that are useful for a number of languages using latin or other scripts and to emulate a compose functionality. 
But all this plus a Kana toggle ends up to be rather simple with additional resources to program and compile the driver in C, all free of charge as well, namely a DDK or WDK https://www.microsoft.com/en-us/download/details.aspx?id=11800 The ?kbdenukw? and ?kbdenusw? of Saturday, no matter whether they were downloaded or not, are now available in their 2.0?version, which differs from the previous by including the two missing dashes. The goal of this exercise is to prove that at this funny speed, and with such a facility of adding characters on the keyboard, there is no more reason to deprive oneself of the Unicode non-ASCII characters one needs. You may open the included *.klc source?a file format which Microsoft designed for sharing?in the Microsoft Keyboard Layout Creator and in a text editor. For more information, please see my related previous mail. (The AltGr views of the US version show the dead key content.) kbdenukw: http://bit.ly/1dFMFb1 kbdenusw: http://bit.ly/1IWO8aJ Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 01:45:25 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:45:25 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <1903098365.1884.1434350725622.JavaMail.www@wwinf2229> At the following URL, a forum page illustrates the way users struggle since a decade (and more) against the chaotic confusion Microsoft perpetuated despite of Unicode, forcing the Committee to adopt its short views: http://painintheenglish.com/case/383 Please note Persephone?s workaround, which is a way to avoid the Apostrophe Catastrophe without turning off the ?smart quotes?. This is the smartest thing I?ve ever read about ?smart quotes?. This workaround, which I ignored, might explain why Microsoft refused to reengineer the smart quotes algorithm: Users have just to type two quotes and to delete one! However, the problem of *handling* and *processing* such text stays unresolved. Users are conscious about a quote not being an apostrophe, this page shows. But they are compelled to use close-quotes for simulation of curly apostrophes. This works on the spot, but it brings bad quality text files. Regardless of whether this matches Microsoft?s business model or not, there is no right of dissuading font-designers from publishing complete fonts! Allocating the same glyph (U+2019) to a supplemental code point (U+02BC) is very easy when creating a font, but as Microsoft compelled Unicode to tell eveybody that there is no need of U+02BC in English and that our text files must not contain U+02BC, we lost sixteen years and thousands of fonts (including Arial Unicode MS, which surprisingly is lacking U+02BC!) are nearly unusable with correct text files because they don?t include any typographical apostrophe. Except that U+0027 is curly in many ornamental fonts, to meet users? expectations. A ready workaround would thus be to disable the smart quotes and keep U+0027 as apostrophe (only), while entering U+2018/U+2019 by any means, and to replace eventually all instances of U+0027 by U+02BC. Or by U+2019 but only just before printing, never to publish in PDF and even less to send as a file or to publish on the internet! 
As usual, the status quo, which originated from legacy code pages (already considerably enriched compared to ISO 8859-1, be it said to the honor of Microsoft), has been justified a posteriori with a lot of mostly biased arguments:

– The approval of U+2019 as apostrophe is based on glyphs and rendering and on a static view of text, excluding from its scope the further word processing across documents and languages.
– Unicode’s principles are misapplied and even misinterpreted. The fact that different meanings across languages do not need different code points is applied inside a given language, to argue that distinguishing semantics by different code points is not needed.
– Some arguments have become obsolete since they were uttered, such as U+02BC being a “spacing clone of Greek smooth breathing mark” (a note removed in 5.1) and thus never slanted, while in most fonts it has the same shape as U+2019, slanted or curly.
– Another fallacy cites as a proof the use of U+2019 as apostrophe in some locales, while this is already based on CP1252-inspired practice against the spirit of Unicode.
– Blurring the issue by enumerating the various values of the English apostrophe, which sometimes leads to including the close-quote function as a punctuation apostrophe...

Whatever, there is nothing of the status quo worth saving. Unfortunately, the mass of wrongly encoded text goes on increasing while discussions follow one another. At least, that does not hinder publishing good books and newspapers and sending nice mails (on paper, where nobody is asking what the code point is, because there is no need). About other media, it must be said that hand-processing wrong text files increases the job volume, which is :( for managers, but :) for workers, provided they are really paid for it.

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
Otherwise, even if it is not a minus sign, it can be: > - a connector between words in compound words (hyphen) > - a trailing mark at end of lines for indicating a word has been broken in the middle (but remember that I asked previously for another character for that role because this word-breaking hyphen is not necessarily an horisontal hyphen (in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in "pocket books" with very narrow columns and minimized spacing) > - a bullet leading items in a vertical list (this should be an en dash, follwoed by some spacing) > - a punctuation (not necessarily at begining of line) marking the change of person speaking (very common in litterature, notably in theatre). > As a connector between words, there's a demonstrated need of differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), long hyphens between two separate names that are joined (example in propers names, after mariage, there's an example in France, where INSEE encodes it for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). In most fonts, the glyph of the hyphen-minus U+002D is the same as the one of the hyphen U+2010, while the minus sign U+2212 is longer and higher, at half-height of digits, to match between or before, as opposed to the hyphen and hyphen-minus which are positioned at half height of lowercase letters. As a minus sign, these work well only with Elzevir digits. This is why, in most fonts, the hyphen-minus U+002D is very unpleasant when used as a minus sign, especially when the plus sign, equals sign and other operators are present too. In this, the hyphen differs from the apostrophe U+0027, whose differenciated characters (apostrophe U+02BC and single close-quote U+2019) have exactly the same glyph. But hyphen and apostrophe resemble in the fact that in many fonts, only the paired or assorted character is present, while the other is missing. So even in Arial, where the letter apostrophe U+02BC is present, the hyphen U+2010 is missing. The user is supposed to use U+002D as a hyphen and U+2212 as the minus sign. The system hyphen displayed in automatic word break at line end, is converted to U+002D for PDF. This isn?t ideal, as you point out, because to reverse the word break, one can?t simply replace all U+002D by nothing. Word processors allow to remove all instances of (U+002D, EOL), but this can delete some orthographic hyphens. The solution would be to use U+2010 for orthographic hyphens (with compatible fonts) and to let the system place its U+002D. The letter apostrophe U+02BC is indispensable because the glyph of U+0027 is unfit for typography. We are also told that U+0027 is unstable, but this is mainly due to the autocorrect smart quotes, which can be turned off at input. I use the autocorrect from now on to convert U+0027 to U+02BC. Another difference between apostrophes and hyphens, and perhaps the main difference, is that except if they are used for word break, hyphens generally don?t need to be replaced at further stages. At input, the user will replace U+002D with U+2212 where appropriate, and the autocorrect may replace two hyphens with an en dash U+2013. In some fonts, U+002D will need to be replaced with U+2010 for glyphic reasons. 
By contrast, quotes are to be converted, Ted?Clancy points out in his paper. Ambiguating one of them with the apostrophe was a very bad idea. Well, I still believe it was *not* the idea of any Unicode Committee, nor of any Standards Body at all. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 02:17:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 09:17:32 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> ?On Sat, Jun 13, 2015, Mark Davis wrote: > In particular, I see no need to change our recommendation on the character used > in contractions for English and many other languages (U+2019). Similarly, we wouldn't > recommend use of anything but the colon for marking abbreviations in Swedish, or > propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". > (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. ? Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. ? Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. ? ? Marcel Schneider ? ? > Message du 13/06/15 17:36 > De : "Mark Davis ??" > A : "Peter Constable" > Copie ? : "verdy_p at wanadoo.fr" , "Kalvesmaki, Joel" , "Unicode Mailing List" > Objet : Re: Another take on the English apostrophe in Unicode > > > On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable wrote: > When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That?s because there isn?t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: ? > In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". > (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) ? > Mark > ? Il meglio ? l?inimico del bene ? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Jun 15 02:44:22 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 15 Jun 2015 08:44:22 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. 
I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. William Overington 15 June 2015 ----Original message---- >From : pandey at umich.edu Date : 10/06/2015 - 11:01 (GMTST) To : babelstone at gmail.com Cc : unicore at unicode.org, unicode at unicode.org Subject : Re: Accessing the WG2 document register Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From mark at macchiato.com Mon Jun 15 03:10:10 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 15 Jun 2015 10:10:10 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: > When we take the topic down again from linguistics to the core mission of > Unicode, that is character encoding and text processing standardisation, > ellipsis and Swedish abbreviation colon differ from the single closing > quotation mark in this, that they are not to be processed. > > > > Linguistics, however, delivered the foundation on which Unicode issued its > first recommendation on what character to use for apostrophe. The result > was neither a matter of opinion, nor of probabilities. > > > > Actually, the choice is between perpetuating confusion in word processing, > and get people confused for a little time when announcing that U+2019 for > apostrophe was a mistake. 
> > ?Quite nice of you to inform me of the core mission of Unicode?I must have somehow missed that. More seriously, it is not all so black and white. As we developed? Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. In practice, whenever characters are essentially identical?and by that I mean that the overlap between the acceptable glyphs for each character is very high?people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. So we only separated essentially identical characters in limited cases: such as letters from different scripts. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 15 03:11:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 10:11:34 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <39908195.1592.1434349426297.JavaMail.www@wwinf2229> References: <39908195.1592.1434349426297.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 8:23 GMT+02:00 Marcel Schneider : > On Fri, Jun 12, 2015, Philippe Verdy wrote: > Even the Language bar uses the upper row to define shortcuts with Control, > Shift+Control, Shift+Alt to switch between keyboard layouts, which are > prioritized. > These are application shortcuts, but these modifier keys combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independantly of the capslock state) with non-control characters if the same keys are used to type non-control ASCII characters in range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcut for application menus or navigation in web forms won't work correctly, or will take the priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs this "safety checks" and will issue warnings if you do so. This is not just "my" advaice but documented in the ISO standard. > So to test the shortcuts with Clavier+, I must first remove shortcuts in > the Language bar. Then the way was free to test Mr Overington?s shortcuts > for curly apostrophes (I will send the result just after). When I deleted > the shortcuts in Clavier+ to test your advice, I found no application > shortcuts for Ctrl+4 while the keys 1, 2, 5 and 0 are usually mapped as > Word shortcut with CONTROL, while the heading formatting is with ALT. But > indeed among ASCII controls I found eight on the French keyboard: > > > //VirtualKey |ScanCd |ISO_# |Ctrl > {VK_ESCAPE /*T01 */ ,0x001b > {VK_CANCEL /*X46 */ ,0x0003 > {VK_BACK /*T0E E13*/ ,0x007f > {VK_OEM_6 /*T1A D11*/ ,0x001b > {VK_OEM_1 /*T1B D12*/ ,0x001d > {VK_OEM_5 /*T2B C12*/ ,0x001c > {VK_RETURN /*T1C C13*/ ,'\n' > {VK_OEM_102 /*T56 B00*/ ,0x001c > > On the alphanumerical block, there are always the same five, three among > them near the Enter key. 
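For readers who wonder why precisely that range matters, here is a small illustrative sketch (my own addition, in Python, not Windows driver code): the traditional convention maps CTRL plus a key producing a character in U+0040..U+005F to the C0 control obtained by clearing bit 6, which is why those 32 positions are customarily kept free of other assignments.

    # Illustrative sketch (Python): the classic CTRL-to-C0 mapping clears
    # bit 6 (0x40) of a character in the range U+0040..U+005F.
    def ctrl(ch):
        code = ord(ch.upper())
        assert 0x40 <= code <= 0x5F, "only @, A-Z, [, \\, ], ^, _ map to C0 controls"
        return chr(code & 0x1F)

    assert ctrl("[") == "\x1b"   # CTRL+[ gives ESC (0x1B)
    assert ctrl("m") == "\r"     # CTRL+M gives CR
    assert ctrl("@") == "\x00"   # CTRL+@ gives NUL, rarely useful in applications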
The British-American Apostrophe key is exempt of > Controls too. This is probably why Mr Overington wants to use CONTROL and > SHIFT+CONTROL for U+2019 and U+02BC, as custom applications shortcuts. > Assigning characters to positions defined for application shortcuts is a bad idea. Keyboard layouts should map characters in positions that are independant of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personnally I have chosen the Application key for this instead of the right control, because the Application key is rarely needed, but I frequently type control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row ("5([", "6-|", "7?`", "8_\", "9?^", "0?@", "?)]"), they are needed to get ASCII controls However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will be almost always filtered out silently, only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim which have a "binary editing" mode that avoids altering the encoding of newlines, but displays all controls explicitly, and that does not limit the "line length". Personally I prefer not using text editors to edit binary files, this is too much unsafe with their "insertion" working mode, it is highly preferable and much simpler to use an hexadecimal editor). This means that CONTROL+"0?@" may be assigned something else more useful (even if the MSKLC compiler warns about it). But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row ("?", "1&", "2?~", "3"#", "4'{" on the left side, and "+=}" on the last position to the right). This means that CONTRL+4 can be safely assigned to U+02BC for the apostrophe letter, but the most common encoding of the French apostrophe is U+2019 (the closing single quote) as French normally does not use single quotation marks, or if it does, it cannot be followed by a letter and cannot be confused with a French apostrophe that is always followed by a letter (or number 1). ---- For now I've not seen any specific need of U+02BC in French (U+2019 is enough, even if it represents two distinct things in French, but in distinct non-colliding contexts). But of course U+02BC is needed for English that needs the distinction with single quotes, because the English apostrophes are used more permissively including at end of words just before a space or punctuation or end of line In French this is not valid to use the apostrophe for elisions at end of words, you need to use instead some abbreviation mark or style.. or no mark at all. ---- The French abbreviation mark can simply be a dot (same as the ASCII full stop punctuation), or writing the last letter in superscript with styles: it is highly recommended not to use any Unicode superscript letters, the only exception being the superscript letter o used to abbreviate "primo" as "1?" or "num?ro" as "n?", but this letter is also missing on standard French keyboards that assign a degree symbol and many French documents are using a degree sign for "n?" and "1?" (however mechanical typewriters assigned a key for typing "N?" 
as a single keystroke (where it was narrower that typing N and degree, and with the letter o generally underlined), it was on the first row, and some PC keyboards are displaying it in the shift position of the first key "?"). Underlining superscripted letters for abbreviations is deprecated in French, except for "N?" where it is still frequently seen. It is no longer recommended to use any dots (or hyphens) for abbreviations (except for abbreviations using only one letter such as "M." for "monsieur") : "S.N.C.F." which was common in the 1960's and 1970's, is now just "SNCF" (and the capitalization of non-initial letters is dropped if this becomes an acronym as in "Insee", which was the ugly "I.N.S.E.E." or "I.N.S.?.?."in the 1960's; some people want also the restoration of accents when decapitalizing acronyms, so they write "Ins??"; and they also want accents on capitalized letters of non-acronym abbreviations such as "?AU" for the Arab Emirates in order to avoid the confusion with "EAU", the capitalization of the French word meaning water; some old abbreviations like "?.-U." for the English "U.S." are no longer used, it would become "?U" with the new rule and would be too much confusable with the European Union: instead we use now "US" or "USA" that have been lexicalized since long, and preferably "UE" for the European Union, but "EU" is still very common). ---- The remaining cases in French are then just the elision apostrophe which only occurs between two letters, and U+2019 is now its most common encoding, generated by spell checkers (if this is not the ASCII single quote). U+02BC cannot be found anywhere (it won't make any semantic difference though and if ever spell checkers change their autocorrector to use U+02BC, no French user will really complain, provided that it is supported in the same fonts mapping U+2019; Winword knows which fonts it is using so it should not be a problem, but it should be simple to patch the spell checker so that it will accept U+02BC or U+2019 as equivalent in French to avoid unnecessary warnings, and then suggest U+02BC instead of U+2019 to replace the ASCII quote). Unfortunately, spell checkers in web browsers are still ignoring both U+2019 and U+02BC (e.g. Chrome, IE, Firefox... and in all Android IMEs that only propose the ASCII quote in their visual layouts... I don't know what Safari does on MacOS): they still only recognize the ASCII vertical quote, and incorrectly signal an "error" in the text editor (with red wavy underlining ? which is also unnecessarily warning us almost everywhere in a way that cannot be disabled when entering texts in another language that the default locale set in the Browser, and when there's no locale selector for this spell checker enabled by default). -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 03:29:41 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 10:29:41 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> Message-ID: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> ?On Mon, Jun 15, 2015, William_J_G Overington wrote: ?> I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. 
> > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > ... > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > I'm shocked that there is still any discrimination, even against individuals, in ISO, and worse, that such discrimination has been newly introduced. ? This makes me remember the idea I got about ISO when I considered the ISO/IEC 9995 standard. This standard specifies that on all keyboards, there should be a so-called common secondary group, and that this secondary group should contain all the characters that are on the keyboard but aren't for a so-called strictly national use.? This sounds to me as if it were fascistic or neofascistic. The way this secondary group is accessed seems rather complicated and been engineered in disconnect from actual OSs and keyboard drivers. The result was that when it went on to be implemented on Windows, the secondary group was not accessed like specified but as Kana levels, which is very consistent with a real keyboard. But in the meantime, this ISO/IEC 9995 standard wastes a whole shift state by excluding it simply from use, on the pretext that you need to press more than two keys: Shift + AltGr + another key. This restriction to a maximum number of two simultaneously pressed keys was so fancy Microsoft didn't bother about. Really, to enter a character from the second level of the secondary group, you need to press Shift + Kana + another key.? That's all OK, but the ISO/IEC 9995 standard is *not*. ? I won't repeat what I already wrote on this List. Sincerely I thought that the International Association for Standardization is today a real international organization which cares for all nations on the earth, whether the proposals come from individuals or collectivities. I dimly recall that in the nineties, ISO was even likely to refuse demands made by its own national members. Reports and results showed that it even dit not consult anybody of the nations it was encoding the characters of, except a few people who were not always reliable, ISO 8859-1 showed. ? To read such things today makes me furious again. I personally wish that you, Mr Pandey, Mr West and Mr Overington, be fully heard at ISO and that *all* proposals are treated equally, fully, and successfully. What are we going to do? What are you going to do? I repeat, I'm shocked, and I hate ISO again. ? ? Best regards, Marcel Schneider ? > Message du 15/06/15 09:53 > De : "William_J_G Overington" > A : pandey at umich.edu, unicode at unicode.org, babelstone at gmail.com > Copie ? 
: > Objet : Re: Accessing the WG2 document register > > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > > Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. > > There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. > > I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. > > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > > William Overington > > 15 June 2015 > > > > ----Original message---- > From : pandey at umich.edu > Date : 10/06/2015 - 11:01 (GMTST) > To : babelstone at gmail.com > Cc : unicore at unicode.org, unicode at unicode.org > Subject : Re: Accessing the WG2 document register > > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 04:49:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 11:49:59 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: <147344343.5449.1434361799774.JavaMail.www@wwinf2229> On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ?? 
wrote: > On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: >> When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. >> Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. >> Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. > Quite nice of you to inform me of the core mission of Unicode?I must have somehow missed that. > More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. >In practice, whenever characters are essentially identical?and by that I mean that the overlap between the acceptable glyphs for each character is very high?people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. >So we only separated essentially identical characters in limited cases: such as letters from different scripts. ? It was a very good idea to disambiguate also apostrophe and single quote, and I feel it's not paid too much because it simplified greatly the processing of quotation marks in English. I mean, the replacement of each pair of one kind by a pair of another kind. When I search for quotes in a text, I don't want to be distracted by apostrophes. Don't worry about equivalence classes, they already present to us a word without apostrophe as equivalent to the same letters with an apostrophe/quote between. It's every time better the computer knows what a character is exactly, even when at output it doesn't need to let us know, than that it comes up with a useless mixup. ? You just brought up another good idea too: Period-terminated abbreviations are listed as exceptions in word processors. Another list could contain all words with leading apostrophe and all words with trailing apostrophe. This might allow to filter search results and to separate definitely apostrophes and single comma quotation marks. And at input, the smart quotes algorithms will become even smarter. Say, really smart. ? I don't believe working people would mix up letter apostrophe and close-quote if they were on keyboard. And even now that they aren't, people don't, because people just hit the apostrophe key, which without any dumb smart quotes algorithm leads always to visually satisfying results, as shown in the Unicode documentation. For good desktop publishing, people must work hard anyway, so it would be nice to give them the means, and not to overburden them with routine tasks due to deficient text encoding. ? The way things are working today is not satisfying concerning the English apostrophe. I still can't believe that the Unicode Committees were wrong when recommending U+02BC. 
Restoring this advantage today, will be at the honor of all involved parties, and we and future generations will thank you very much. ? If they'll exist. ? Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Jun 15 05:38:25 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 15 Jun 2015 12:38:25 +0200 Subject: ISO (was Re: Accessing the WG2 document register) In-Reply-To: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> Message-ID: <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> Quote/Cytat - Marcel Schneider (Mon 15 Jun 2015 10:29:41 AM CEST): > What are we going to do? What are you going to do? I repeat, I'm > shocked, and I hate ISO again. Please remember that your government supports ISO through your national standard body. So contact AFNOR and persuade them to take an appropriate action. Good luck! Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From charupdate at orange.fr Mon Jun 15 06:50:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 13:50:03 +0200 (CEST) Subject: ISO (was Re: Accessing the WG2 document register) In-Reply-To: <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> Message-ID: <1117500033.7189.1434369003403.JavaMail.www@wwinf2229> Thank you. That's done. ? I'd finished by thinking seriously that today, the ISO'd improved itself. The case of how Mr Anshuman Pandey is treated by ISO proves that it did not. This sheltered documents access policy and practice makes ISO appear like sheer moonless night, I experienced myself. And as there is no transparency, you even don't know what's about. ? I hope that Mr Pandey's work will be fully honored and be taken into account. Well, I don't understand much of these processes, but it's clear to me since a pretty long time that there's a problem with ISO somewhat. ? Best regards, Marcel > Message du 15/06/15 12:51 > De : "Janusz S. Bien" > A : "Marcel Schneider" > Copie ? : unicode at unicode.org > Objet : ISO (was Re: Accessing the WG2 document register) > > Quote/Cytat - Marcel Schneider (Mon 15 Jun 2015 > 10:29:41 AM CEST): > > > > What are we going to do? What are you going to do? I repeat, I'm > > shocked, and I hate ISO again. > > Please remember that your government supports ISO through your > national standard body. So contact AFNOR and persuade them to take an > appropriate action. > > Good luck! > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Mon Jun 15 08:19:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 15:19:26 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <1921702526.8879.1434374366809.JavaMail.www@wwinf2229> On Tue Mar 26 2002 - 10:01:43 EST, Mark Davis ?? wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0598.html > Apostrophe, hyphen, and various other puncutation by default continue > a word, but this behavior may be overriden on a per-language basis. > Heuristics or more sophisticated engines may be needed when the > apostrophe is at the end of a word, as in ?the peoples' choice?, since > it is ambiguous. The modifier letter apostrophe, on the other hand, is > always treated as a letter. ? [I replaced '<' '>' with '?' '?' to prevent confusion with a tag by the user agent.] ? On Tue Mar 26 2002 - 11:44:28 EST, Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0604.html ? > Mark Davis wrote: >> Apostrophe, hyphen, and various other puncutation by default continue >> a word, but this behavior may be overriden on a per-language basis. > This may work for things such as finding word boundaries, but not for > identifiers. > According to the ID_Start and ID_Continue properties in > , neither > U+0027 (APOSTROPHE) nor U+2019 (RIGHT SINGLE QUOTATION MARK) are allowed in > an identifier. And this is not surprising, since they are primarily > quotation marks. > On the other hand, U+02BC (MODIFIER LETTER APOSTROPHE) is allowed in any > position within an identifier. Using U+02BC as the apostrophe, would allow > to use words such as: , or <'em> in identifiers. > But this hits against the fact that Unicode's own suggestion is to use > U+2019 for the apostrophe. ? On Tue Mar 26 2002 - 12:08:41 EST , Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0608.html > But, as you say, the apostrophe is legitimate and sometimes mandatory in the > orthography of English and many other languages. So, it seems to me that its > preferred encoding should make it possible to use it in identifiers, > filenames, URI(')s, and so on. ? ? Don't we fall back into the times of all-0x27 and stay in front of on-going confusion when English apostrophe is ambiguated with closing-quote? As you told us, having both U+02BC and U+2019 in use will need some supplemental algorithms. But as you told in 2002, this is true when both are confused in only one character, too. ? I suspect that the cost of using MODIFIER LETTER APOSTROPHE for English apostrophe (and as apostrophe on the whole) today would mainly be the cost of updating implementations and text files. If this cost is too high, we would have to consider that text has not to be quoted nor to be converted between British and US English. I hope people will stay communicating and exchanging. ? Marcel Schneider ? ? ? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From qsjn4ukr at gmail.com Mon Jun 15 08:20:05 2015 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Mon, 15 Jun 2015 16:20:05 +0300 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: By the way, about smart quotes. I am using that for long time. My keyboard layout generates two characters on one key-press (so I have to enter [??][?]{sth}[?] instead of [?]{sth}[?]). 
It's not that good, but I'm not afraid neither to lose quotation marks or parentheses nor become a victim of artificial intelligence :) About what is one word. Do you know the German prefixes? "... ... macht ... ... ... ... ... ... auf". Let me ask if double-quotes are parts of word or not? For example, in this sentence "not" is a noun, not particle? Was "Titanic" titanic? From verdy_p at wanadoo.fr Mon Jun 15 09:00:51 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 16:00:51 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 15:20 GMT+02:00 QSJN 4 UKR : > By the way, about smart quotes. I am using that for long time. My > keyboard layout generates two characters on one key-press (so I have > to enter [??][?]{sth}[?] instead of [?]{sth}[?]). It's not that good, > You could generate three keystrokes [?][?][?] from a single keypress to get the same effect. Various editors already do that when you press the first key for the opening quote, and all you have to type then is the [?] key (instead of the key for a closing quote) after typing the word. Such system is used in many IDE or text editors for programmers when they enter the opening parenthese, or square bracket, or single/double quotes, or braces, or block comment prefixes, or any paired symbols or keywords used in the programming language (e.g. "begin | end" in Pascal, "#if |\n#endif" in C/C++ preprocessor directives : the pipe here notes the position of the cursor after typing what is just before it, what is after the pipe is inserted after the cursor position). If you disagree with those automatic insertions after the cursor, you can immediately press CTRL+Z to cancel this added suffix but keep what you just entered. another CTRL+Z will undo your previous keypress(es) for the character(s) just before the cursor position. Some editors are even smarter before the cursor position is not just a single position but a selected range and as long as you continue typing just before this range, the selection is preserved, and when you press [?] it will skip over this whole selection and you an also press then the backspace key to delete that autoinserted selected range. If you move your cursor elsewhere, the selection is unselected and you get back to the normal insertion cursor with an empty selection. Such system is used for example in Notepad++ (for Windows), or Eclipse (you can disable this automatic insertion in your preferences). This editor feature does not depend on the character layout but depends on the selected language for matching pairs: it does not have to be limited to programming languages and can be used as well for natural human languages, including in advanced word processors. It can also be used to insert automatically some additional space when you just press an initial quote: entering only [?] when editing French text, what you would get is [?][NNBSP]|[NNBSP][?] (with the cursor selection over the last two characters). These editors normally have a way to edit their automatic insertion rules (with the text to match before, the text to add jut after it, the new cursor position, and the text to insert just after it (and to hopefully preselect in such a way that when continuing entering text without moving the insertion position, it is not overwritten but just preseves this selected text). Such rules can be part of the parameters for the spell checker. 
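As a rough illustration of the auto-pairing behaviour described above, here is a minimal Python sketch. The rule table, the NNBSP padding for French guillemets and the function names are illustrative assumptions, not the configuration of any real editor:

    # Pressing an opening character inserts its closing counterpart after the caret;
    # French guillemets also receive NARROW NO-BREAK SPACE (U+202F) padding.
    NNBSP = "\u202F"

    # typed character -> (text placed before the caret, text placed after the caret)
    AUTO_PAIR_RULES = {
        "(": ("(", ")"),
        "[": ("[", "]"),
        "{": ("{", "}"),
        "\u201C": ("\u201C", "\u201D"),                   # English double quotes
        "\u00AB": ("\u00AB" + NNBSP, NNBSP + "\u00BB"),   # French « ... » with padding
    }

    def auto_pair(buffer, caret, typed):
        """Insert `typed` at `caret`, applying the auto-pairing rules above."""
        before, after = AUTO_PAIR_RULES.get(typed, (typed, ""))
        new_buffer = buffer[:caret] + before + after + buffer[caret:]
        return new_buffer, caret + len(before)   # the caret ends up between the pair

    text, caret = auto_pair("", 0, "\u00AB")
    print(repr(text), caret)   # '«\u202f\u202f»' 2

A real editor would additionally keep the auto-inserted suffix selected, so that it can be skipped over or undone, as described above.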
-------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 09:49:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 16:49:45 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> On Fri, Jun 12, 2015, Philippe Verdy wrote: > These are application shortcuts, but these modifier keys combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. Thank you for your advice. It'll be very useful. I was not precise enough, the upper row of the alphanumerical block is used with Ctrl, Shift+Ctrl, Shift+Alt by the language bar but optionally only. > It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independantly of the capslock state) with non-control characters if the same keys are used to type non-control ASCII characters in range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. > The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcut for application menus or navigation in web forms won't work correctly, or will take the priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs this "safety checks" and will issue warnings if you do so. The Alt shift state is unassignable in the MSKLC. When used for shortcuts with Clavier+, these are prioritized and work fine. > This is not just "my" advaice but documented in the ISO standard. That depends on which ISO Standard you refer to. If it's ISO/IEC 9995, then beware! IMHO this standard isn't to be taken seriously, otherwise you'll have to stay away from using the Shift + AltGr shift state, to take just one outstanding example. > Assigning characters to positions defined for application shortcuts is a bad idea. Keyboard layouts should map characters in positions that are independant of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personnally I have chosen the Application key for this instead of the right control, because the Application key is rarely needed, but I frequently type control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). It's indeed very useful to keep two Control modifiers. Because the modifiers at the left and right border of the block are acted with the little finger and should thus be symetrical. This does not apply to the Alt keys and other keys more or less centered around the space bar, which are acted with the thumbs. As Alt is less used than Kana (when there is a Kana key), Kana should be on left Alt, symetrical to the (on many keyboards already implemented) AltGr key. The Alt key comes then on the Applications key, which is mnemonic because of the contextual menu icon. Internally, indeed, the Alt keys (left and right) are called Menu keys (Virtual key Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked pressing the right Windows key, which is consistently missing on laptops. 
Laptops must however have an Applications key to prevent the AltGr key from being positioned too far rightwards, beside of a space bar too long, because this hardware layout has some negative impact on ergonomics, specialists say. On the US keyboard layout at http://charupdate.info however, Applications is a Kana toggle, while Right Windows is a Compose key. For laptops this shifts rightwards to get Compose on Applications, and Kana toggle on, well, Right Control. Because there are laptops with nothing between Right Alt and Right Control, so I even thought at mapping the Kana toggle on Pause, but this turned out to be buggy, besides that keyboards without Applications (Menu) often are lacking the Pause key too. > On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row ("5([", "6-|", "7?`", "8_\", "9?^", "0?@", "?)]"), they are needed to get ASCII controls > However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will be almost always filtered out silently, only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim which have a "binary editing" mode that avoids altering the encoding of newlines, but displays all controls explicitly, and that does not limit the "line length". Personally I prefer not using text editors to edit binary files, this is too much unsafe with their "insertion" working mode, it is highly preferable and much simpler to use an hexadecimal editor). > This means that CONTROL+"0?@" may be assigned something else more useful (even if the MSKLC compiler warns about it). > But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row ("?", "1&", "2?~", "3"#", "4'{" on the left side, and "+=}" on the last position to the right). I ended up assigning no characters on Control shift states at all any more. To get the most of a keyboard, the best is to use the Kana shift states. Their disadvantage is that the Caps Lock never can act on them. At least for me. Perhaps somebody can program a driver where it does? That would mean one should add some new attributes. BTW there are still unknown entities, like the mysterious GRPSELTAP. > This means that CONTRL+4 can be safely assigned to U+02BC for the apostrophe letter, but the most common encoding of the French apostrophe is U+2019 (the closing single quote) as French normally does not use single quotation marks, or if it does, it cannot be followed by a letter and cannot be confused with a French apostrophe that is always followed by a letter (or number 1). In German even less, where the single close-quote is the English open-quote, and the single open-quote looks like a comma. However, for quotations and nested quotations, the use of chevrons (angle quotation marks) is widespread. So you have U+2019 never mean anything else than an apostrophe. The problem of shortcuts is their relative clumsiness, that is, for an apostrophe I'd prefer to hit just two keys than to press Control. Ctrl + 4 would be less ergonomical for apostrophe than to have the apostrophe on Shift, which on certain keyboards lead to typos already. We must put much more into our dead key registries. U+02BC is an example of what to add on the CIRCUMFLEX dead key. ---- > For now I've not seen any specific need of U+02BC in French (U+2019 is enough, even if it represents two distinct things in French, but in distinct non-colliding contexts). 
> But of course U+02BC is needed for English that needs the distinction with single quotes, because the English apostrophes are used more permissively including at end of words just before a space or punctuation or end of line > In French this is not valid to use the apostrophe for elisions at end of words, you need to use instead some abbreviation mark or style.. or no mark at all. This is why in French there's no Apostrophe Catastrophe. Should we rely on this chance? IMO, no. Because this would lead us to: ? Avoid single quotation marks, which are very nice and useful as delimiters in texts for publishing, where U+0027 would look clumsy. ? Stay moving apostrophes to ?secure? places instead of putting them properly at the beginning, like in _?Y a_. ---- > The French abbreviation mark can simply be a dot (same as the ASCII full stop punctuation), or writing the last letter in superscript with styles: it is highly recommended not to use any Unicode superscript letters, the only exception being the superscript letter o used to abbreviate "primo" as "1?" or "num?ro" as "n?", but this letter is also missing on standard French keyboards that assign a degree symbol and many French documents are using a degree sign for "n?" and "1?" (however mechanical typewriters assigned a key for typing "N?" as a single keystroke (where it was narrower that typing N and degree, and with the letter o generally underlined), it was on the first row, and some PC keyboards are displaying it in the shift position of the first key "?"). Underlining superscripted letters for abbreviations is deprecated in French, except for "N?" where it is still frequently seen. > It is no longer recommended to use any dots (or hyphens) for abbreviations (except for abbreviations using only one letter such as "M." for "monsieur") : "S.N.C.F." which was common in the 1960's and 1970's, is now just "SNCF" (and the capitalization of non-initial letters is dropped if this becomes an acronym as in "Insee", which was the ugly "I.N.S.E.E." or "I.N.S.?.?."in the 1960's; some people want also the restoration of accents when decapitalizing acronyms, so they write "Ins??"; and they also want accents on capitalized letters of non-acronym abbreviations such as "?AU" for the Arab Emirates in order to avoid the confusion with "EAU", the capitalization of the French word meaning water; some old abbreviations like "?.-U." for the English "U.S." are no longer used, it would become "?U" with the new rule and would be too much confusable with the European Union: instead we use now "US" or "USA" that have been lexicalized since long, and preferably "UE" for the European Union, but "EU" is still very common). ---- > The remaining cases in French are then just the elision apostrophe which only occurs between two letters, and U+2019 is now its most common encoding, generated by spell checkers (if this is not the ASCII single quote). U+02BC cannot be found anywhere (it won't make any semantic difference though and if ever spell checkers change their autocorrector to use U+02BC, no French user will really complain, provided that it is supported in the same fonts mapping U+2019; Winword knows which fonts it is using so it should not be a problem, but it should be simple to patch the spell checker so that it will accept U+02BC or U+2019 as equivalent in French to avoid unnecessary warnings, and then suggest U+02BC instead of U+2019 to replace the ASCII quote). > Unfortunately, spell checkers in web browsers are still ignoring both U+2019 and U+02BC (e.g. 
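The positional difference described above (in French the elision apostrophe only ever stands between two letters, while English also allows a word-final one, as in "the peoples' choice") can be sketched with two toy regular expressions in Python. This is only an illustration, not a UAX #29 word segmenter; APOS accepts U+0027, U+2019 or U+02BC so that any of the three spellings matches:

    import re

    APOS = "['\u2019\u02BC]"   # ' (U+0027), ’ (U+2019), ʼ (U+02BC)

    # French: an apostrophe must be flanked by letters on both sides.
    FRENCH_WORD = re.compile(rf"[^\W\d_]+(?:{APOS}[^\W\d_]+)*")
    # English: additionally tolerate a word-final apostrophe.
    ENGLISH_WORD = re.compile(rf"[^\W\d_]+(?:{APOS}[^\W\d_]+)*{APOS}?")

    print(FRENCH_WORD.findall("l\u2019apostrophe de l\u2019été"))
    # ['l’apostrophe', 'de', 'l’été']
    print(ENGLISH_WORD.findall("the peoples\u2019 choice"))
    # ['the', 'peoples’', 'choice']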
Chrome, IE, Firefox... and in all Android IMEs that only propose the ASCII quote in their visual layouts... I don't know what Safari does on MacOS): they still only recognize the ASCII vertical quote, and incorrectly signal an "error" in the text editor (with red wavy underlining ? which is also unnecessarily warning us almost everywhere in a way that cannot be disabled when entering texts in another language that the default locale set in the Browser, and when there's no locale selector for this spell checker enabled by default). I agree. These spell-checkers bug me more than anything, even if they're useful. Yes, it should be simple. Thanks again for this useful advice. Sorry, sometimes I shifted somewhat off the topic :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 10:12:59 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 08:12:59 -0700 Subject: Another take on the English Apostrophe in Unicode Message-ID: <20150615081259.665a7a7059d7ee80bb4d670165c8327d.2b2882039d.wbe@email03.secureserver.net> Marcel Schneider wrote: > A free tool, the Microsoft Keyboard Layout Creator, allows every user > to add U+02BC on his preferred keyboard layout I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways. Elsewhere: > ISO stands for stability We wish. Several of us on this list have worked on standards and standard-like activities that correct for, and defend against, instability in ISO standards. > Microsoft?s choice of mashing up apostrophe and close-quote to end up > with an unprocessable hybrid was wrong. Very wrong. Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From petercon at microsoft.com Mon Jun 15 10:18:47 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 15 Jun 2015 15:18:47 +0000 Subject: Accessing the WG2 document register In-Reply-To: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> Message-ID: I suggest that people on this list that have not personally engaged directly in ISO process via their country?s designated standards bodies should stop opining and editorializing on that body. ISO isn?t perfect by any means, but in the many years I have been directly involved in ISO process I can?t say I?ve ever seen discrimination other than appropriate discrimination of ideas on technical merits. 
Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider Sent: Monday, June 15, 2015 1:30 AM To: wjgo_10009 at btinternet.com; pandey at umich.edu; unicode at unicode.org; babelstone at gmail.com Subject: Re: Accessing the WG2 document register On Mon, Jun 15, 2015, William_J_G Overington > wrote: > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > ... > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > I'm shocked that there is still any discrimination, even against individuals, in ISO, and worse, that such discrimination has been newly introduced. This makes me remember the idea I got about ISO when I considered the ISO/IEC 9995 standard. This standard specifies that on all keyboards, there should be a so-called common secondary group, and that this secondary group should contain all the characters that are on the keyboard but aren't for a so-called strictly national use. This sounds to me as if it were fascistic or neofascistic. The way this secondary group is accessed seems rather complicated and been engineered in disconnect from actual OSs and keyboard drivers. The result was that when it went on to be implemented on Windows, the secondary group was not accessed like specified but as Kana levels, which is very consistent with a real keyboard. But in the meantime, this ISO/IEC 9995 standard wastes a whole shift state by excluding it simply from use, on the pretext that you need to press more than two keys: Shift + AltGr + another key. This restriction to a maximum number of two simultaneously pressed keys was so fancy Microsoft didn't bother about. Really, to enter a character from the second level of the secondary group, you need to press Shift + Kana + another key. That's all OK, but the ISO/IEC 9995 standard is *not*. I won't repeat what I already wrote on this List. Sincerely I thought that the International Association for Standardization is today a real international organization which cares for all nations on the earth, whether the proposals come from individuals or collectivities. I dimly recall that in the nineties, ISO was even likely to refuse demands made by its own national members. Reports and results showed that it even dit not consult anybody of the nations it was encoding the characters of, except a few people who were not always reliable, ISO 8859-1 showed. 
To read such things today makes me furious again. I personally wish that you, Mr Pandey, Mr West and Mr Overington, be fully heard at ISO and that *all* proposals are treated equally, fully, and successfully. What are we going to do? What are you going to do? I repeat, I'm shocked, and I hate ISO again. Best regards, Marcel Schneider > Message du 15/06/15 09:53 > De : "William_J_G Overington" > > A : pandey at umich.edu, unicode at unicode.org, babelstone at gmail.com > Copie ? : > Objet : Re: Accessing the WG2 document register > > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > > Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. > > There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. > > I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. > > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > > William Overington > > 15 June 2015 > > > > ----Original message---- > From : pandey at umich.edu > Date : 10/06/2015 - 11:01 (GMTST) > To : babelstone at gmail.com > Cc : unicore at unicode.org, unicode at unicode.org > Subject : Re: Accessing the WG2 document register > > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Jun 15 10:28:58 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 17:28:58 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> References: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 16:49 GMT+02:00 Marcel Schneider : > It's indeed very useful to keep two Control modifiers. Because the > modifiers at the left and right border of the block are acted with the > little finger and should thus be symetrical. This does not apply to the Alt > keys and other keys more or less centered around the space bar, which are > acted with the thumbs. As Alt is less used than Kana (when there is a Kana > key), Kana should be on left Alt, symetrical to the (on many keyboards > already implemented) AltGr key. The Alt key comes then on the Applications > key, which is mnemonic because of the contextual menu icon. Internally, > indeed, the Alt keys (left and right) are called Menu keys (Virtual key > Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked > pressing the right Windows key, which is consistently missing on laptops. > Not just laptops. My desktop PC only has a single Windows key, on the left. Anyway there's little use of the Windows key that was introduced lately (and there are still lot of keyboards that don't have this key). The same remark applies to the ScrollLock key (which is now frequently remapped to Fn+Pause/SysAttn or other similar combination using the single Windows key when there's no Fn key which is typical of notebooks). However I disagree with your opinion about AltGr+Shift combinations: it works perfectly including with the ISO 9995 definitions: the unshifted and shifted position are in the same "group". However ISO 9995 allows CapsLock to be used to create other groups instead of just reproducing the shifted/unshifted layout. It can be very useful for users in India to switch between Latin and local abugidas. It could be used as well by users writing in Arabic and Hebrew abjads, or with African (Ethiopic) or North-American syllabary scripts that are complex to map on a usable keyboard. But I think that keyboard should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two windows keys and so the space bar can remain with a cumfortable width (as well for the Shift key or Backspace which is too narrow on many keyboards). On the last row therre should never be more than 7 keys on both sides of the space bar, and the most external keys (Ctrl) have to remain wide). If a Kana key or present, in fact it should be to the right of the right control, or ro the right of the right Shift AltGr needs to keep some width extension compared to letter keys, and in fact could be larger than the left Alt, because it is used for entering text. The Application key is too large for me, just like the left Windows key (its extra width should be better given to the left Control key to make it a bit more central). Those that design keyboard almost never test them for real usability: they prefer slling them with many packed multimedia functions (or buttons for Calc, Mail, Web or swtiching windows, and that are rarely used). Only keyboards for gamers have some attention, but only to give them additional programmable function keys for specific games... Keyboards on notebooks are extremely poorly designed, a complete nonsense. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 10:46:02 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 08:46:02 -0700 Subject: Accessing the WG2 document register Message-ID: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> Marcel Schneider wrote: > This makes me remember the idea I got about ISO when I considered the > ISO/IEC 9995 standard. This standard specifies that on all keyboards, > there should be a so-called common secondary group, and that this > secondary group should contain all the characters that are on the > keyboard but aren't for a so-called strictly national use. This > sounds to me as if it were fascistic or neofascistic. Please read the history of attempts to standardize keyboard layouts across national boundaries. National standard bodies have always insisted on their particular differences in layout (Q/A, W/Z, Y/Z) and convenient access to characters specific to their languages. This is not imposed from the outside. > The way this secondary group is accessed seems rather complicated and > been engineered in disconnect from actual OSs and keyboard drivers. > The result was that when it went on to be implemented on Windows, the > secondary group was not accessed like specified but as Kana levels, ? which is very consistent with a real keyboard. But in the meantime, > this ISO/IEC 9995 standard wastes a whole shift state by excluding it > simply from use, on the pretext that you need to press more than two > keys: Shift + AltGr + another key. This restriction to a maximum > number of two simultaneously pressed keys was so fancy Microsoft > didn't bother about. Really, to enter a character from the second > level of the secondary group, you need to press Shift + Kana + another > key. That's all OK, but the ISO/IEC 9995 standard is *not*. At least it was possible to implement the old ISO 9995-3 standard on Windows, treating Group 2, Levels 1 and 2 as if they were Group 1, Levels 3 and 4 -- in other words, by using AltGr and Shift+AltGr. The new ISO 9995-3 standard isn't implemented anywhere, and can't be as long as no specification exists to access the additional groups and shift states without adding more physical keys. "Figure it out for yourself" is not a specification. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Jun 15 11:11:27 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 18:11:27 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> References: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> Message-ID: <1340538145.12811.1434384687773.JavaMail.www@wwinf2229> On Mon, Jun 15, 2015, Doug Ewell wrote: > At least it was possible to implement the old ISO 9995-3 standard on > Windows, treating Group 2, Levels 1 and 2 as if they were Group 1, > Levels 3 and 4 -- in other words, by using AltGr and Shift+AltGr. The US International keyboard layout indeed conforms to ISO/IEC?9995. AFAIK it was preexistent, and was validated for conformance by considering that the AltGr and Shift + AltGr shift states contain the secondary group. I did not think about it as an _implementation_ of ISO/IEC 9995. 
> The new ISO 9995-3 standard isn't implemented anywhere, and can't be as > long as no specification exists to access the additional groups and > shift states without adding more physical keys. "Figure it out for > yourself" is not a specification. The new German standard keyboard layouts T2 and T3 are ISO/IEC 9995. Other national keyboard layouts before them are, too. There is exactly a Group 1 with three levels and a Group 2 with two. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 11:28:22 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 09:28:22 -0700 Subject: Accessing the WG2 document register Message-ID: <20150615092822.665a7a7059d7ee80bb4d670165c8327d.ead563bedf.wbe@email03.secureserver.net> Marcel Schneider wrote: > The US International keyboard layout indeed conforms to ISO/IEC 9995. > AFAIK it was preexistent, and was validated for conformance by > considering that the AltGr and Shift + AltGr shift states contain the > secondary group. > I did not think about it as an _implementation_ of ISO/IEC 9995. "ISO/IEC 9995" is a multi-part standard that covers many different aspects of keyboards. US International certainly conforms to many of the parts: ? it has alphanumeric, numeric, and editing zones with keys which can be referenced by "E01" notation, as per 9995-1 ? it has shifting keys which are used to select levels ? the primary layout (Levels 1 and 2) conforms to 9995-2, as does practically any Latin-script keyboard ? it has Escape and cursor keys in conformance with 9995-5 ? and so on. The Level 3 and "Level 4" (Shift+AltGr) allocations of US International do not conform in any way to the common secondary layout of either 9995-3:2002 or 9995-3:2010. For example, there is no ohm sign on US International in any group or level, either at D01 (2002) or D02 (2010). Perhaps we are not talking about the same thing when we say "conforms to ISO/IEC 9995." -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Jun 15 12:38:33 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 19:38:33 +0200 (CEST) Subject: Accessing the WG2 document register Message-ID: <593184955.19635.1434389913782.JavaMail.www@wwinf1n18> On Mon, Jun 15, 2015, 18:36, Doug Ewell wrote: ? > The Level 3 and "Level 4" (Shift+AltGr) allocations of US International > do not conform in any way to the common secondary layout of either > 9995-3:2002 or 9995-3:2010. For example, there is no ohm sign on US > International in any group or level, either at D01 (2002) or D02 (2010). > Perhaps we are not talking about the same thing when we say "conforms to > ISO/IEC 9995." ? I don't measure exactly the implications of a keyboard compliance to a given standard when this standard is developed "on the paper" and without taking into consideration all needs and preferences of end-users. The Ohm sign you mention reminds me that ISO perpetuated on keyboard some deprecated legacy characters that end up anyway to be replaced with their canonical equivalent, that in this example is Greek capital omega. That's? another disconnect. ? And standardizing the dead key registries to exclude all characters that are not composed ones, is a counterproductive constraint based on the belief that the only way to get aware of the content of a layout is to read the keycap labels. This is a way of never getting curly quotes and apostrophe. ? On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote: ? 
> I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100%
> compatible with the AltGr-less US keyboard and supports almost 900 other
> characters, including all of the apostrophes and quotes and dashes and
> other characters under discussion:
> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
> I spent years designing and updating my own keyboard layout and studying
> other layouts. I've ended this quest since I started using Moby Latin;
> it's the best I've seen in numerous ways.

I'm very glad to learn that there is this good keyboard layout for the USA and for the UK, and I very much wonder what is missing for everybody to use it. Thank you very much; I just downloaded the two drivers, and I'm curious about how to map nine hundred characters on two levels without chaining dead keys! Well, I hadn't looked for it, because at the beginning I was searching for a French keyboard.

>> Microsoft's choice of mashing up apostrophe and close-quote to end up
>> with an unprocessable hybrid was wrong. Very wrong.

> Windows-1252 and the other Windows code pages were developed during the
> 1980s, before Unicode, when almost all non-Asian character sets were
> limited to 256 code points. The distinctions between apostrophe and
> right-single-quote, weighed against the confusion caused by encoding two
> identical-looking characters, would never have been sufficient back then
> to justify separate encoding in this limited space.

The problem is not about code pages; it is about keeping them vividly in users' minds and letting them impact the Unicode Standard while Unicode has been around for a quarter of a century.

The amazing chance of being able to disambiguate apostrophe and close-quote was purposely overridden after Unicode had published clearly that U+02BC is the apostrophe. Nothing was simpler than leaving this recommendation as it was and tackling the job of implementing Unicode on Windows, in Microsoft Office, and in the offices. There is so much communication about word processing that there would have been a little place to introduce the difference between an apostrophe and a single closing quotation mark, but instead of that, Microsoft urged Unicode to remove the recommendation and to restore the chaos.

I can't believe that was OK. Never, never.
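As a toy model of the chained dead keys mentioned above (plain Python, not MSKLC syntax; the particular chains, including putting U+02BC on the CIRCUMFLEX dead key as suggested earlier in this thread, are only illustrative assumptions):

    # Each dead key opens another table; chaining tables multiplies the number of
    # characters reachable from two physical levels.
    DEAD_KEYS = {
        "^": {                      # circumflex dead key
            "a": "\u00E2",          # â
            "e": "\u00EA",          # ê
            "'": "\u02BC",          # hypothetical: circumflex, then ', gives U+02BC
            "^": {                  # pressing the dead key again chains a second table
                "a": "\u1EAD",      # ậ, just an example of a deeper target
            },
        },
    }

    def resolve(sequence):
        """Walk a key sequence through the dead-key tables; return None if invalid."""
        node = DEAD_KEYS
        for key in sequence:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                return None
        return node if isinstance(node, str) else None

    print(resolve("^a"), resolve("^'"), resolve("^^a"))   # â ʼ ậ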
Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From doug at ewellic.org Mon Jun 15 13:14:22 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Jun 2015 11:14:22 -0700
Subject: Accessing the WG2 document register
Message-ID: <20150615111422.665a7a7059d7ee80bb4d670165c8327d.42e4fd3395.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I don't measure exactly the implications of a keyboard compliance to
> a given standard when this standard is developed "on the paper" and
> without taking into consideration all needs and preferences of end-
> users.

ISO did not come up with the 2010 revision to 9995-3 on their own. It originated with the German NB.

> The Ohm sign you mention reminds me that ISO perpetuated on
> keyboard some deprecated legacy characters that end up anyway to be
> replaced with their canonical equivalent, that in this example is
> Greek capital omega. That's another disconnect.

The relationship between U+2126 OHM SIGN and U+03A9 GREEK CAPITAL LETTER OMEGA is not at issue here. Neither of these characters is present on US International.
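For reference, the canonical relationship mentioned in the quoted text can be checked directly in Python:

    import unicodedata

    ohm, omega = "\u2126", "\u03A9"   # OHM SIGN, GREEK CAPITAL LETTER OMEGA

    print(unicodedata.decomposition(ohm))              # '03A9'  (a canonical singleton)
    print(unicodedata.normalize("NFC", ohm) == omega)  # True: NFC folds U+2126 to U+03A9

This is why, as remarked above, an ohm sign entered from a keyboard tends to end up normalized away to the Greek capital omega.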
> And standardizing the dead key registries to exclude all characters > that are not composed ones, is a counterproductive constraint based on > the belief that the only way to get aware of the content of a layout > is to read the keycap labels. This is a way of never getting curly > quotes and apostrophe. Dead keys under Windows are not constrained in the way you describe. As I said earlier today, I use a keyboard on Windows on which all of these characters are available via dead keys: ? ? ? ? ? > I'm very glad to learn there is this good keyboard layout for the USA > and for the UK, and I wonder very much what's missing for everybody to > use it. > Thank you very much, I just downloaded the two drivers and I'm curious > about how to map nine hundred characters on two levels without > chaining dead keys! > Well I didn't look for, because at the beginning I searched for the > French keyboard. Since John made the .klc source file available with the download, I'm sure it would not be too difficult to adapt it to a French-based layout. > The problem is not about code pages, it is about keeping them vividly > in users' minds and letting them impact the Unicode Standard while > since a quarter of a century, Unicode is on. I'd guess there are very few users who consciously see the use of U+2019 as both apostrophe and right-single-quote as a vestige of code pages, or as a conscious effort by Evil Microsoft? to force them into anything. > There's so much communication about word processing, that there would > have been a little place to introduce the difference between an > apostrophe and a single closing quotation mark, but instead of that, > Microsoft urged Unicode to remove the recommendation and to restore > the chaos. Perhaps a UTC member can confirm whether this is fact or speculation. Markus Kuhn's comment from 1999 about "couldn't Unicode follow Microsoft...?" doesn't prove that Unicode was in fact strong-armed by Microsoft. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Tue Jun 16 12:02:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:02:26 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote: > Marcel Schneider wrote: [...] >> Microsoft?s choice of mashing up apostrophe and close-quote to end up >> with an unprocessable hybrid was wrong. Very wrong. > Windows-1252 and the other Windows code pages were developed during the > 1980s, before Unicode, when almost all non-Asian character sets were > limited to 256 code points. The distinctions between apostrophe and > right-single-quote, weighed against the confusion caused by encoding two > identical-looking characters, would never have been sufficient back then > to justify separate encoding in this limited space. I replied: > The problem is not about code pages [...] I thank you for your answers and I'll come back upon some of them below. There's some new fact to bring first. I concede that my last reply yesterday in the evening was incorrect. Additionally to Microsoft?s action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is in my scope, indeed. 
Since I learned that there are very good and outweighing reasons to use U+02BC in English, and that Unicode's respective recommendation was withdrawn out of regard for a widespread practice founded on the Windows-1252 code page, I soon suspected there would have been means to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for completing the ISO 8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15).

* Please read this paper (in French): http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf

Now that I have examined CP1252's layout closely, I found five empty code points, five code points left out, in the C1 range that Microsoft allocated to complete ISO 8859-1. Further, in this range, I found two MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side, and the diacritics on the other side. It must be said that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters plus ? and ? came later. However, CP1252 has remained stable since Windows 98, for which the euro sign and the Ž/ž pair were added. And five places were left empty.

From this I got convinced that it would have been very easy to place the letter apostrophe, for example, at code point 144 (0x90), near the single turned comma quotation mark at 0x91 and the single comma quotation mark (right single quote) at 0x92, which Microsoft recommended for use as the apostrophe. About the "confusion" everybody refers to, it must be said that the only way to get people confused is to do things and not explain anything to anybody. The core problem would have been that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding.

I repeat that others had done even worse. Others, that is, some of the so-called expert members of the ISO WG designing 8859-1, as two of them did not even aim at encoding all needed characters, deliberately refusing to encode the lower- and uppercase œ digraph, and even the uppercase Ÿ. Microsoft's big merit has been to produce a ready remedy to this bungling, which, as far as the OE digraph is concerned, was meant to match defective peripherals. Unfortunately, Microsoft visibly didn't finish the job, by aiming at encoding characters only, and thus not allocating more than one code point to that squiggle, whilst several places were left. Well, all of that are errors of the past. If I don't see a need, I won't meet it. By leaving ? and ? off the charset, they got ? and ? in, at least.

Where things ran really bad was when Unicode was on, and the Procrustean beds of the code pages were out. At least, they should have been. Whence that survival of CP1252-based confusion? Briefly, today's text processing is suffering from the apostrophe-close-quote confusion. This confusion is firstly out of date, and secondly it was unnecessary from the beginning. Avoiding this confusion at a trivial level (by not confusing users with two similar-looking squiggles) shifts it to the process level, where the damage it causes is far bigger. Trust me, users who find themselves unable to set the apostrophes apart when they are going to replace single quotes won't bless Microsoft for the input simplicity! Ted Clancy's blog post proves it:
https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/

It was time to get rid of that confusion when Unicode recommended U+02BC for the apostrophe. Microsoft's choice not to comply was wrong again. Very wrong.

Let's come back to some of your replies.

On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote:

> I'd guess there are very few users who consciously see the use of U+2019
> as both apostrophe and right-single-quote as a vestige of code pages, or
> as a conscious effort by Evil Microsoft? to force them into anything.

Quite sure. These are habits, not constraints. I don't share such views about a battle between Google and Microsoft, or about ethical prefixes to allocate to companies. The problem is that when the result proves to be bad, so was the idea.

The mismatch between apostrophe and close-quote is now part of our culture. We must get pragmatic again and weigh the advantages and disadvantages of each option (ambiguating, disambiguating), not say "I believe there are no disadvantages in ambiguating" or "there is no reason to disambiguate" or "people will get confused, let them alone" or the like. These are all mere assertions. We must look at real people and listen to what they say to us. Ted Clancy is one of them. When he is worried about that malfunctioning of text processing, who will keep smiling and go on saying "There's no problem, there's no reason to fix that, it's all OK as it is"? That is to despise people; that is to spit in their face.
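The five empty code points mentioned above can be verified with Python's cp1252 codec, which leaves exactly five bytes of the 0x80-0x9F block unassigned (and maps 0x92 to U+2019):

    undefined = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(hex(b))

    print(undefined)                        # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
    print(bytes([0x92]).decode("cp1252"))   # ’  (U+2019, the character discussed here)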
It would be nice if you too, Mr Constable, thanks to your inside experience and the relationships from your ISO activity, would help Mr Pandey to get heard at ISO Working Group 2 and to access the document register. As everybody knows, every person who comes up with proposals deserves full attention, respect and consideration, especially when that person has already done great work and earned merit. ISO managers who persistently keep working groups from acting ethically deserve to be relieved of the responsibilities they do not fulfill. Everybody on the Unicode Mailing List is well placed to know that Unicode publicly reports on its activities and accepts public feedback. Quality assurance seems little reason for ISO not to accept input from outside the national Standards Bodies. What do you know about the reasons why ISO does not, and even recently narrowed its eligibility conditions?

Best wishes,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:08:05 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:08:05 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: <1302383882.21163.1434474485938.JavaMail.www@wwinf1n18>

On Sat, Jun 13, 2015, Mark Davis wrote:
> In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis wrote:
> On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote:
>> When we take the topic down again from linguistics to the core mission of Unicode, that is, character encoding and text processing standardisation, the ellipsis and the Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed.
>> [...]
> Quite nice of you to inform me of the core mission of Unicode; I must have somehow missed that.

I was rather astonished and amused when I read that I could have aimed at informing you of Unicode's core. The goal was to check that I'm at the right level. Well, there would have been another way to say it... which didn't come to my mind. However, what surprises me even more as I think about it is that, while knowing everything about Unicode, you have only a weak opinion on which apostrophe recommendation is the right one...

> More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, e.g., an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits.

It is another proof of Unicode's professionalism to have thought about distinguishing DIAERESIS and UMLAUT. Despite being a French-German bilingual and knowing the diacritics, I first encountered that distinction in Microsoft's kbd.h, where the one is called DIARESIS and is mapped to UMLAUT. I'm not a friend of such distinctions (except in vocabulary and grammar), because in writing practice they would be nothing but useless and counterproductive complications.
An abbreviation dot would have been much more useful, but to deploy its benefits it would have needed a supplemental key mapping. Against this background, Unicode's choice of recommending to disambiguate the apostrophe is even more meritorious. I see it as proof that there is really a good reason for people to mind the difference whenever they don't use the ASCII apostrophe for all of them. What would have bugged Microsoft then is that it would have had to implement this difference in its word processing and desktop publishing software, and to tell users about it. Nothing easier for Microsoft, with all the Help and Info: "The new smart quotes help you to check whether you need an apostrophe or a quote. This makes quote conversion easy." Or the like.

> In practice, whenever characters are essentially identical (and by that I mean that the overlap between the acceptable glyphs for each character is very high), people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.

Based on the Unicode principle of encoding characters, not glyphs, I doubt whether two characters may be called _essentially_ identical when they merely look the same. A huge subset of the Code Charts' cross-references is there to help font designers on this point. As for people mixing things up, they are most likely to do so when the keyboard offers only one of the two. This is not the case for U+02BC and U+2019, neither of which is on standard keyboards. Here it is the smart quotes algorithm that will mix them up! And that algorithm is easily helped not to do so, since it is embedded in high-end software with all its display and shortcut capabilities.

Eventually, the only one who wanted to keep mixing them up was (guess who?) Microsoft. The reason? Word processing that depends on the distinction between opening and closing quotation marks, which needs only a very tiny algorithm, is much easier to implement than processing that depends on the distinction between apostrophe and single closing quotation mark, and between apostrophe and single quotation marks on the whole. Informal English word forms are so rich and varied that some are ambiguous, and scarcely any software dictionary can contain them all. But even formal English is not wholly supported, since nested quotes often are not. Why would users not be interested in improved software, even if it cost a little more?

About searching and equivalence classes: there is already plenty of equivalence implemented in the simplest search algorithm: casing! One more class with (U+0027, U+02BC, U+2019) wouldn't change that a lot.

> So we only separated essentially identical characters in limited cases: such as letters from different scripts.

I repeat myself: calling like-looking glyphs "essentially identical characters" is inconsistent with Unicode's encoding characters, not glyphs. But whatever, I repeat myself again: under these circumstances, Unicode's recommendation of preferring U+02BC for the apostrophe weighs all the heavier!

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
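[A minimal sketch, assuming Python, of the kind of search equivalence class discussed in the message above; this is an illustration, not any product's actual search code. It folds U+0027, U+2019 and U+02BC to one representative, much as case folding maps A to a, so "don't" is found whichever of the three squiggles the document used.]

    # Hypothetical equivalence class: fold the three apostrophe-like
    # characters to a single representative before comparing, the same
    # way case folding maps A to a.
    APOSTROPHE_CLASS = {
        "\u0027": "'",   # APOSTROPHE
        "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
        "\u02BC": "'",   # MODIFIER LETTER APOSTROPHE
    }

    def fold(text):
        """Case-fold and collapse the apostrophe class (lengths are
        preserved for these characters, so indices stay comparable)."""
        return "".join(APOSTROPHE_CLASS.get(c, c) for c in text).casefold()

    def find(needle, haystack):
        """Index of needle in haystack under the folding, or -1."""
        return fold(haystack).find(fold(needle))

    print(find("don't", "He said: Don\u2019t"))   # 9 -- matches U+2019
    print(find("don't", "He said: Don\u02BCt"))   # 9 -- matches U+02BC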
From charupdate at orange.fr Tue Jun 16 12:09:40 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:09:40 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <648652693.21249.1434474580468.JavaMail.www@wwinf1n18>

On Mon, Jun 15, Philippe Verdy wrote:
> But I think that keyboards should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two Windows keys and so the space bar can remain with a comfortable width [...].

IMHO the space bar should not exceed five keys in width.

> If a Kana key is present, in fact it should be to the right of the right Control, or to the right of the right Shift

The best is always that the asymmetric modifiers be operated with the thumbs. If I had to choose between AltGr and Kana, I would prefer the latter because it does not interfere with Ctrl+Alt and does not disable dead keys in Word. But alternately we could map the MODIFIER LETTER APOSTROPHE on the right-hand Alt key for fluid input of high-quality text files.

> [...] Keyboards on notebooks are extremely poorly designed, a complete nonsense.

Yes, there are many models from big manufacturers whose key layout I don't like. By contrast, my computer is a netbook where I nevertheless find all the keys I need, in an ergonomic arrangement. I'm not bound, and I'm not paid to advertise; it's just advice. The manufacturer my netbook is from shipped the same model for the United States *with* an Applications key, *with* a Pause key, *with* a second Function modifier key on the right, with up and down keys of the *same size* as left and right, and *with* an overlaid numpad: when you disable the numpad specials on a customised layout, you just press Fn while entering digits (or press the toggle before and after), the same as on MacBooks, as I have read and heard. It's Asus.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:11:07 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:11:07 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1169482423.21305.1434474667939.JavaMail.www@wwinf1n18>

On Mon, Jun 15, 2015, Doug Ewell wrote:
> Marcel Schneider wrote:
>> A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout
> I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion:
>
> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
>
> I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways.

Yesterday, late in the evening, I looked up John Cowan's keyboard layouts. They are the best MSKLC-based keyboard layouts I've ever seen. They are mnemonic. I note that they naturally use AltGr (right-hand Alt, or Alt+Ctrl). In my last reply yesterday I mentioned a multilingual layout from a research institute which really does not use more than two shift states. It's not free. Mr Cowan writes that some allocations are temporary until a new MSKLC version with chained dead keys is released.
This MSKLC 2.0 is still not born, and I fear it never will be. IMO this is the result of the disinterest of many people. You and others probably represent exceptions. This goes so far that MSKLC is flagged as "appears very rarely" in the Acronym Finder. Normally the release and updates of MSKLC should have created a buzz on social media, and today nobody would complain about missing characters. Well, I too complained for a whole year without knowing about MSKLC. One year ago today, I installed my copy of MSKLC. Later I tried to define a universal Latin layout too, but when I was at 1,921 Unicode characters, I could never remember it. I gave up that approach; it's hard to get onto one keyboard, among other Unicode characters, all 1,736 characters of 8.0.0 used in the Latin script (if my subset is right). Do you know Ilya Zakharewich's approach?

http://search.cpan.org/~ilyaz/UI-KeyboardLayout-0.64/lib/UI/KeyboardLayout.pm

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:27:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:27:01 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID: <975185747.21752.1434475621961.JavaMail.www@wwinf1n18>

Ten minutes ago, I wrote:

> Microsoft's big merit has been to produce a ready remedy to this bungling, which, as far as the OE digraph is concerned, was meant to accommodate defective peripherals.

Too long a sentence, a comma too many, and a big ellipsis... Please read:

Microsoft's big merit has been to produce a ready remedy to this bungling. I call it "bungling", but in fact, as far as the OE digraph is concerned, its exclusion from ISO 8859-1 was meant to accommodate defective peripherals (that couldn't handle the Œ/œ digraph).

Sorry.

Marcel Schneider

> Message du 16/06/15 19:12
> De : "Marcel Schneider"
> A : "Doug Ewell"
> Copie à : "Unicode Mailing List"
> Objet : Re: Another take on the English Apostrophe in Unicode
>
> On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote:
>> Marcel Schneider wrote: [...]
>>> Microsoft's choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong.
>> Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space.
>
> I replied:
>> The problem is not about code pages [...]
>
> I thank you for your answers and I'll come back to some of them below. There is a new fact to bring up first. I concede that my last reply yesterday evening was incorrect. In addition to Microsoft's action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is indeed in my scope.
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Tue Jun 16 12:33:50 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 16 Jun 2015 10:33:50 -0700 Subject: Another take on the English Apostrophe in Unicode Message-ID: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> That's to despise people, that's to spit in their face.

You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

I do wish we could put an end to all the accusations of malfeasance.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From michel at suignard.com Tue Jun 16 12:47:58 2015 From: michel at suignard.com (Michel Suignard) Date: Tue, 16 Jun 2015 17:47:58 +0000 Subject: Accessing the WG2 document register In-Reply-To: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID:

> It would be nice if you too, Mr Constable, thanks to your inside experience and the relationships from your ISO activity, would help Mr Pandey to get heard at ISO Working Group 2 and to access the document register. As everybody knows, every person who comes up with proposals deserves full attention, respect and consideration, especially when that person has already done great work and earned merit. ISO managers who persistently keep working groups from acting ethically deserve to be relieved of the responsibilities they do not fulfill.

The ISO WG2 chair is monitoring this discussion (as well as the 10646 project editor) and is very tired of it. ISO on the SC2 side of things is just a group of volunteers who are doing their best at accommodating various needs. Anshuman knows me very well; he has all the consideration he deserves from the WG2 participants where his contributions are made.
Also don't underestimate the role of the Script Encoding Initiative, which is in fact endorsing a lot of Anshuman's work (including his contributions to UTC and WG2). Anshuman, I and a few others have had some private exchanges and I am sure he understands the situation better. There are no ISO "managers" that can act on your demand, unless I am the one, being the newly appointed WG2 convenor. BTW you can have my job if you think you are so much better. I thought I had explained the situation a few days ago. ISO is not a monolithic organization, and most of us are unpaid volunteers who are barely recognized for their contribution. At the same time ISO has an infrastructure which needs to be paid for; we can all argue about the new directions that the overhead has taken, but blaming the peons at the WG level is not doing any good. BTW Peter and I are good friends, he is the Unicode liaison rep for both SC2 and WG2 and we are in frequent contact.

> Everybody on the Unicode Mailing List is well placed to know that Unicode publicly reports on its activities and accepts public feedback. Quality assurance seems little reason for ISO not to accept input from outside the national Standards Bodies. What do you know about the reasons why ISO does not, and even recently narrowed its eligibility conditions?

Sorry, you have no idea what you are talking about. The day you can have a civilized conversation, maybe I will help you.

Michel Suignard
WG2 convenor, ISO/IEC 10646 Project Editor and Unicode Secretary (just to show that we work in some symbiosis); I also do most of the draft chart work for both sides. Been in the trenches on both sides for the last 25 years (and more).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Tue Jun 16 13:57:03 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 16 Jun 2015 20:57:03 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net> References: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net> Message-ID:

And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Mark
*"Il meglio è l'inimico del bene"*

On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell wrote:
> Marcel Schneider wrote:
>> That's to despise people, that's to spit in their face.
>
> You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.
>
> I do wish we could put an end to all the accusations of malfeasance.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Tue Jun 16 14:08:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 16 Jun 2015 21:08:22 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID:

When ISO 8859-1 was designed (in fact in an early version by Digital for its own variant of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority.
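[For readers following the code-page details in this exchange, here is a small illustration, assuming Python and its bundled codecs (it is not part of the original messages): ISO 8859-1 keeps bytes 0x80-0x9F as C1 control codes, while Windows-1252 reuses that range for printable characters and leaves five byte values unassigned, the five empty code points mentioned earlier in the thread.]

    # Same byte, two interpretations: a C1 control under ISO 8859-1,
    # a printable character under Windows-1252.
    for byte in (0x88, 0x91, 0x92, 0x98):
        raw = bytes([byte])
        as_latin1 = raw.decode("latin-1")   # always the U+0080..U+009F controls
        as_cp1252 = raw.decode("cp1252")    # modifier letters, curly quotes...
        print(f"0x{byte:02X}: latin-1 -> U+{ord(as_latin1):04X}, "
              f"cp1252 -> {as_cp1252!r} (U+{ord(as_cp1252):04X})")

    # The five code points left unassigned in Windows-1252:
    for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
        try:
            bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            print(f"0x{byte:02X}: unassigned in cp1252")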
Microsoft abandoned its own development of Unix to develop DOS and extend it with Windows, in parallel with its work with IBM, which had wanted DOS to be a very lightweight version of CP/M, but without a scheduler, in order to run software on personal computers that could be used in small organisations that could not buy its mainframes, but had to prepare documents and data that could be reused on IBM mainframes...

2015-06-16 19:02 GMT+02:00 Marcel Schneider :
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Tue Jun 16 14:31:12 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 16 Jun 2015 20:31:12 +0100 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> References: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> Message-ID: <20150616203112.59e02f27@JRWUBU2>

On Mon, 15 Jun 2015 08:40:57 +0200 (CEST) Marcel Schneider wrote:

> ...while in the meantime, in obliging anticipation, the world's biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at
> http://www.microsoft.com/en-us/download/details.aspx?id=22339

I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and its output there is appropriate for Windows 7, generating multiple versions of the DLL and its installer.

Richard.

From petercon at microsoft.com Tue Jun 16 17:53:12 2015 From: petercon at microsoft.com (Peter Constable) Date: Tue, 16 Jun 2015 22:53:12 +0000 Subject: Accessing the WG2 document register In-Reply-To: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID:

There are changes in processes, but nothing that I would consider new _discrimination_. Also, Mr. Pandey's positions have always been and continue to be very well represented in ISO/IEC JTC1/SC2/WG2.

Again, if you are not yourself engaging in ISO processes or working with your country's national standards body in connection with ISO processes, then you are not in a good position to be critiquing ISO processes.

Peter

From: Marcel Schneider [mailto:charupdate at orange.fr] Sent: Tuesday, June 16, 2015 10:05 AM To: Peter Constable Cc: Unicode Mailing List Subject: RE: Accessing the WG2 document register Importance: High

[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eik at iki.fi Wed Jun 17 02:23:16 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Wed, 17 Jun 2015 10:23:16 +0300 Subject: Accessing the WG2 document register In-Reply-To: References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID: <001001d0a8ce$7a950420$6fbf0c60$@fi>

I fully agree with Peter. I used to be heavily involved in SC2 and its working groups since the turn of the century, and I have been party to several proposals and other contributions. The activity of the National Bodies has decreased in the past few years (particularly that of the European countries), mostly because the bulk of the encoding work directly related to them is nearly complete, but SC2 still has an important role to play.

Although I'm strongly against undue bureaucracy, I also understand that the members of ISO, the National Bodies, want to ensure that technical proposals have been vetted prior to their submission, particularly if they end up being somehow attached to the National Body of the submitter's home country.

I highly appreciate the work of the conveners (Mike Ksar and more recently Michel Suignard), the project editors and the recording secretary (Uma Umamaheswaran). The liaison with Unicode has brought in a lot of technical expertise, which has been most beneficial over the years. WG2 hasn't had a general email discussion facility (although Michael Everson has privately maintained some lists for specific topics), for which purpose the Unicode list has been used.

Sincerely, Erkki

Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Peter Constable Lähetetty: 17. kesäkuuta 2015 01:53 Vastaanottaja: Marcel Schneider Kopio: Unicode Mailing List Aihe: RE: Accessing the WG2 document register

[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:09:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:09:31 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <001001d0a8ce$7a950420$6fbf0c60$@fi> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> <001001d0a8ce$7a950420$6fbf0c60$@fi> Message-ID: <1672470426.12114.1434553771360.JavaMail.www@wwinf2229>

I thank you, Mr Suignard, Mr Constable and Mr Kolehmainen, for your kind replies, and I assure you that my blame targeted the overhead you refer to, since Mr Pandey pointed clearly at higher-level decisions, not at WG2. I'm now pretty sure that, thanks to your close relationships within the ISO working group, Mr Pandey will get access to the documents he wishes to consult in the document register, and will never be starved of the information he needs for his work and contributions.
I'm grateful for the time you took, especially Mr Suignard and Mr Kolehmainen, to write up this information for my (and the Mailing List subscribers') attention, which gives me some wholesome insight I would never have gained through the extremely sheltered and repellent ISO website. Accordingly, I am very sorry about the suspicions I uttered, notably in the two threads in which I've had the honor of taking part over the past few days.

About my taking part in this thread, I must confess that I had paid very little attention to the threads related to ISO work, partly because of my lack of interest in ISO topics. Now, however, I understand why this list contains threads specifically related to ISO SC2 WG2. It was not until the second time Mr Overington mailed in this thread "in reply" (as I imagined, because of the short time lapse) to the suggestions I had sent him to answer his requests about input and display facilities (A new take on the English apostrophe in Unicode) that I read his message thoroughly, which on Monday, June 15, was particularly touching and appealed to my emotions. Whence I got very angry at ISO, all the more as I recalled my past ideas.

I won't hide that I refrained from sending copies, in view of my recent brief mail contact with ISO, which had considerably enhanced my image of the Standards Body as a whole, by extrapolation. I'm glad again to have such good news, and I would share that I feel it's a pity that there is AFAIK no source where everybody could inform themselves, like a website. But now I can refer to this thread, if you agree, whenever the topic comes up somewhere.

I would like to ask all persons who were affected by my e-mails to excuse me.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:35:46 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:35:46 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: <20150616203112.59e02f27@JRWUBU2> References: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> <20150616203112.59e02f27@JRWUBU2> Message-ID: <711874393.12630.1434555346659.JavaMail.www@wwinf2229>

On Mon, Jun 16, 2015, "Richard Wordingham" wrote:

> I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and its output there is appropriate for Windows 7, generating multiple versions of the DLL and its installer.

I'm sorry, I didn't think about that issue. The download link is not wrong; AFAIK it's the only available download page for the (most recent) 1.4 version. And this version works for Windows 8 too [and, I hope, for the coming Windows 10], as this thread on Microsoft Community shows:

http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb

Marcel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:43:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:43:57 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID: <1507910160.12770.1434555837463.JavaMail.www@wwinf2229>

On Tue, Jun 16, 2015, Philippe Verdy wrote:

> When ISO 8859-1 was designed (in fact in an early version by Digital for its own variant of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority.
> [...]

Thank you, Philippe, for the information. It was a very good idea to build a system without need of the C1 controls and to remap those two ranges to completing characters, which are indispensable, notably in French, and to start with the single quotes.

Marcel

> Message du 16/06/15 21:08
> De : "Philippe Verdy"
> A : "Marcel Schneider"
> Copie à : "Doug Ewell", "Unicode Mailing List"
> Objet : Re: Another take on the English Apostrophe in Unicode
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 11:18:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 18:18:32 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229>

On Tue, Jun 16, Mark Davis wrote:

> And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Mailing List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this. Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b

Since your message could not reach me yesterday, I prepared two replies I wanted to send today, one to Doug and one to you. If you agree, I'll paste them both hereafter.

On Tue, Jun 16, 2015, Doug Ewell wrote:

> You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

You know I did, and if it were just for my own sake, I'd probably never have started mailing in this thread. A big part of the text to be processed for quotes originates from other people. So when I use U+02BC, I did good work (if I ever get quoted :)). An essential condition is that all text-handling software be updated to handle the letter apostrophe correctly. Without an official recommendation, this is not likely to be done. And this recommendation cannot usefully be issued unless Microsoft agrees. We remember that without Microsoft, the Unicode Consortium probably wouldn't have been founded, and character encoding wouldn't thrive as it does today.

On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote:

> Perhaps a UTC member can confirm whether this is fact or speculation. Markus Kuhn's comment from 1999 about "couldn't Unicode follow Microsoft...?" doesn't prove that Unicode was in fact strong-armed by Microsoft.
I know that Markus Kuhn's concern was very valuable, and he did a great job showing how to eradicate the clumsy quote simulation that was current at the time, due to the lack of characters. You remember: they used accents as quotes, and at that stage the mix-up was between apostrophe and acute!

https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html

The curly glyph for 0x27 in old ASCII fonts and its reversed counterpart mapped to 0x60, which Mr Kuhn shows on this page together with how to replace them properly, are reminiscent of the U+201B/U+2019 quote pair, where the deprecated REVERSED SINGLE COMMA QUOTATION MARK was discussed on this list, the conclusion being:

On Thu, Jun 15, 2006, Andreas Prilop wrote: http://www.unicode.org/mail-arch/unicode-ml/y2006-m06/0265.html
> Actually, I have seen such quotation marks in English-language books printed in Britain and the USA. But, as I wrote, they are certainly not preferred. *If* you want such quotation marks, then please use U+201B for them!

At that time, the matter was correct rendering. Today, it is correct processing. Yes, fortunately U+02BC is *not deprecated* for the English apostrophe, and looking closer, IMO there is *no recommendation* for U+2019 either, just a stated preference. As I wrote earlier in this thread, Unicode logically and seemingly changed the preference against its will. Logically, because the first recommendation (like the whole Standard) was consciously designed, as Mr Davis reminded us the day before yesterday. Seemingly, because the U+0027 comment line in the Code Chart was changed from

> preferred character for apostrophe is 2019

to

> 2019 is preferred for apostrophe

between versions 3.0.0 and 4.0.0 (while the line "preferred characters in English for paired quotation marks are 2018 & 2019" remained unchanged; see the complete comparison at http://charupdate.info#ambiguation).

On Tue, Jun 16, 2015, Doug Ewell wrote:
> I do wish we could put an end to all the accusations of malfeasance.

Experience proves that often a lot of mails, e-mails, blog posts, forum posts, tweets and so on are needed to get things moving. The best way to get nothing done is to get everybody convinced it's all OK. That's what I sometimes feel reading this thread, or the one about ISO/IEC JTC1/SC2/WG2 that is ongoing in the meantime! And the only way to get something changed has always been to show that it's wrong. From there on, the next step would be to find out who is responsible.

About the apostrophe, we're all a bit responsible. Why hide that British English usage does not do much to disambiguate things, by preferring single quotes as the usual quotation marks, leading some authors to end up preferring chevrons even in English; see Chris Harvey (pleading for U+2019 as apostrophe) at http://www.languagegeek.com/typography/apostrophes.html#Anchor-Potentia-61409

But Microsoft is responsible, too. And Microsoft and we have the power to bring this to a solution: everybody on his own PC, and Microsoft together with Unicode and ISO at a global level. So let's tackle it.

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis wrote:

> In practice, whenever characters are essentially identical (and by that I mean that the overlap between the acceptable glyphs for each character is very high), people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.
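[As a neutral illustration of the processing difference being argued over here (my own sketch, assuming Python; it is not a statement about any particular product): the two characters carry different General_Category values, so generic word-matching code treats "don't" differently depending on which one is used.]

    import re
    import unicodedata

    # U+02BC is a letter (Lm); U+2019 and U+0027 are punctuation.
    for ch in ("\u0027", "\u2019", "\u02BC"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category {unicodedata.category(ch)}")

    # A plain "word" regex splits don't at a punctuation apostrophe
    # but keeps the word whole with the letter apostrophe.
    print(re.findall(r"\w+", "don\u2019t"))   # ['don', 't']
    print(re.findall(r"\w+", "don\u02BCt"))   # ['donʼt']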
Now I use U+02BC, I experience that in most applications, this is not yet a part of the equivalence class of apostrophe-single-quote, where only U+0027, U+2019 and U+2018 seem to be in. However, when at the occasion of the next software updates, U+02BC is added to this class, that wouldn?t always be enough for the software to work fine. Options should be added to disable these equivalences, like today case-sensitivity can already be enabled in most search dialogs. But without an official recommendation, all this will scarcely be done. Could Unicode please add again a recommendation for U+02BC at U+0027? You could for example recommend to prefer U+02BC for processing, and U+2019 for printing while waiting that fonts are updated. Or you could recommend U+02BC, and admit that U+2019 is used in legacy compatibility mode. The main reason for the status quo to be protected (as it seems to be), could however be the fear of image damages. Imagine people learning that there is a flaw in the apostrophe. It will be hard to explain why it was ambiguated and why we come up today with disambiguation; why there are new radio buttons for LETTER APOSTROPHE and PUNCTUATION APOSTROPHE (to give it a cool name; the former converts U+0027 always to U+02BC, the latter works as today...); how the nested quotes algorithm works (supposing that today, it isn?t still implemented); and why to hit the quotation mark two times when the ?other? quotation mark is wished. Quite a lot of job. There?s a nice workaround to input high quality text files. Turn off smart quotes, use U+0027 for apostrophe only, and type a left square bracket to open a quotation, a curly for a nested or alternate one. The brackets pairing algorithm will accurately close. Square brackets for output may be entered as or as two parentheses. Once finished, save that file at a secure place. Then open a copy and replace the apostrophes with U+02BC (or U+2019, depending on whether U+02BC is in the target font), then the four bracketing characters with whatever quotes you need, and finish with the definite square brackets. That should work on every text or wysiwyg editor. However, I believe we should start at another end, stopping to eat that insane stuff that is processed from insulted, tortured, poisoned, and slowly killed animals and brings us but acidosis, osteoporosis, and much more... but nothing good, nothing that were worth the confusion of ethical values on the pattern intiated by the Nazi government.? I know it?s off the topic, but it should bring us nearer to a helpful solution.Therefore I permit me to suggest to (re)watch ?Earthlings? and visit Gary Yourofsky?s website and Facebook profile. Once we?ve resolved the problems pointed out there?which at a personal level is very easy to perform?, I believe we shall stop redoing the errors of the past: http://adaptt.org http://www.facebook.com/therealgaryyourofsky http://youtube.com/GaryYourofskyAdaptt http://earthlings.com/ (also on YouTube). ? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Thu Jun 18 03:17:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 18 Jun 2015 10:17:57 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229> References: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229> Message-ID: <984264692.4358.1434615477728.JavaMail.www@wwinf2229> Dear Mr Ewell, as I was very puzzled reading Mr Davis' last reply yesterday, I stood away from mailing to you separately as I'd the purpose to do. For the same reason, I forgot to remove an outdated period I'd never have written after reading Mr Kolehmainen's, Mr Suignard's and Mr Constable's e-mails I found yesterday. I beg everybody's pardon. On Wen, Jun 17, I?wrote: > Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. > The best way of getting nothing to be done is to get everybody convinced it?s all OK. That?s what I sometimes feel reading this thread, > or the one about ISO/IEC JTC1/SC2/WG2 that is on-going in the meantime! > And the only way to get something change has always been to show it?s wrong. > From there on, the next step would be to find out who is responsible. Please read instead: | Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. | The best way of getting nothing to be done is to get everybody convinced it?s all OK. That?s what I sometimes feel reading this thread. | And the only way to get something change has always been to show it?s wrong. | From there on, the next step would be to find out who is responsible. ? Best regards, Marcel S.? > Message du 17/06/15 18:29 > De : "Marcel Schneider" > A : "MarkDavis??" , "DougEwell" > Copie ? : "TedClancy" , "UnicodeMailingList" > Objet : Re: Another take on the English Apostrophe in Unicode > > > On Tue, Jun 16, Mark Davis ?? wrote: > And, Marcel, while you are at it, this is getting tiresome. > Please find some other place to vent about events you know very little about; the internet is full of them. Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Maliling List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this. Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b Since your message could not reach me yesterday, I prepared two replies I wanted to send today. It was exactly one to Doug and one to you. If you agree, I'll paste them both hereafter. On Tue, Jun 16, 2015, Doug Ewell wrote: > You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO. You know I did, and if it were just for my own?s sake, I?d probably never started mailing in this thread. A big part of text to be processed on quotes originates from other people. So when I?use U+02BC, I?did a good work (if I were quoted :)). A essential condition is that all text handling software is updated to handle correctly the letter apostrophe. Without an official recommendation, this is not likely to be done. 
Best regards, Marcel From roche+kml2 at exalead.com Thu Jun 18 02:54:03 2015 From: roche+kml2 at exalead.com (Xavier Roche) Date: Thu, 18 Jun 2015 09:54:03 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? Message-ID: <5582791B.3080405@exalead.com> Hi! There are some differences in character fallback substitutions introduced between version 24 (http://www.unicode.org/cldr/charts/24/supplemental/character_fallback_substitutions.html) and 25 (http://www.unicode.org/cldr/charts/25/supplemental/character_fallback_substitutions.html) ; for example, these two letters have been removed: 0153 ? LATIN SMALL LIGATURE OE Explicit 006F, 0065 oe LATIN SMALL LETTER O, LATIN SMALL LETTER E 0152 ? LATIN CAPITAL LIGATURE OE Explicit 004F, 0045 OE LATIN CAPITAL LETTER O, LATIN CAPITAL LETTER E However, they are still listed at: http://unicode.org/repos/cldr/trunk/common/supplemental/characters.xml OE oe I was wondering what was the rationale behind ? Could it be a bug ? Regards, Xavier From markus.icu at gmail.com Thu Jun 18 11:36:49 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 18 Jun 2015 18:36:49 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: <5582791B.3080405@exalead.com> References: <5582791B.3080405@exalead.com> Message-ID: If the chart does not reflect the data, then please submit a bug ticket. http://unicode.org/cldr/trac/newticket The data is what counts. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Thu Jun 18 11:39:11 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 18 Jun 2015 18:39:11 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: <5582791B.3080405@exalead.com> References: <5582791B.3080405@exalead.com> Message-ID: It sounds like a bug in the CLDR chart. Can you file a ticket at http://unicode.org/cldr/trac/newticket please? Mark *? Il meglio ? l?inimico del bene ?* On Thu, Jun 18, 2015 at 9:54 AM, Xavier Roche wrote: > Hi! > > There are some differences in character fallback substitutions introduced > between version 24 ( > http://www.unicode.org/cldr/charts/24/supplemental/character_fallback_substitutions.html) > and 25 ( > http://www.unicode.org/cldr/charts/25/supplemental/character_fallback_substitutions.html) > ; for example, these two letters have been removed: > > 0153 ? LATIN SMALL LIGATURE OE Explicit 006F, 0065 > oe LATIN SMALL LETTER O, LATIN SMALL LETTER E > 0152 ? LATIN CAPITAL LIGATURE OE Explicit 004F, > 0045 OE LATIN CAPITAL LETTER O, LATIN CAPITAL LETTER E > > However, they are still listed at: > http://unicode.org/repos/cldr/trunk/common/supplemental/characters.xml > > OE > oe > > I was wondering what was the rationale behind ? Could it be a bug ? > > > Regards, > Xavier > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roche+kml2 at exalead.com Fri Jun 19 00:07:20 2015 From: roche+kml2 at exalead.com (Xavier Roche) Date: Fri, 19 Jun 2015 07:07:20 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: References: <5582791B.3080405@exalead.com> Message-ID: <5583A388.2020108@exalead.com> Le 18/06/2015 18:36, Markus Scherer a ?crit : > If the chart does not reflect the data, then please submit a bug ticket. 
> http://unicode.org/cldr/trac/newticket Thanks, done: http://unicode.org/cldr/trac/ticket/8662 Regards, Xavier From public at khwilliamson.com Fri Jun 19 15:29:06 2015 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 19 Jun 2015 14:29:06 -0600 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' Message-ID: <55847B92.3020201@khwilliamson.com> I haven't found any information on this. It can't just be a transliteration difference, because the number of code points is vastly different between them. Is it the case that the version 1 syllables is a failed abstraction that was replaced by the later versions? Thanks From public at khwilliamson.com Fri Jun 19 15:51:20 2015 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 19 Jun 2015 14:51:20 -0600 Subject: Why aren't the emoji modifiers GCB=Extend? Message-ID: <558480C8.90707@khwilliamson.com> Someone writing code using Unicode 8 found that the FITZPATRICK modifiers are considered separate graphemes from what they modify. This is surprising, and seems contrary to not only the concept of a grapheme cluster, but the spirit of tr51 2.2.3 "A supported emoji modifier sequence should be treated as a single grapheme cluster for editing purposes" From kenwhistler at att.net Fri Jun 19 17:12:59 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 19 Jun 2015 15:12:59 -0700 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <55847B92.3020201@khwilliamson.com> References: <55847B92.3020201@khwilliamson.com> Message-ID: <558493EB.4000807@att.net> Karl, As usual, the situation is way more complicated that perhaps it has any business being! It isn't just Version 1 Hangul that have to be considered, but also Version 1.1 Hangul. Version 1.0 contained 2350 Hangul syllables, encoded in the range 3400..3D2D. Version 1.1 contained 6646 Hangul syllables, encoded in the range 3400..3D2D and a distinct new range 3D2E..4DFF. It thus added 4306 to what was in Version 1.0 already. Version 2.0 (and all subsequent versions) contained the 11172 Hangul syllables we now see, encoded in the range AC00..D7A3. Version 2.0 *deleted* all the Hangul syllables in the range 3400..4DFF. You also need to pay attention to the history of the encoding of jamo. Version 1.0 contained 94 "Hangul Elements", encoded in the range 3131..318E. Version 1.1 retained the same 94 "Hangul Letters" in the range 3131..318E. Version 1.1 added 240 conjoining jamo letters in the range 1100..11F9. Version 2.0 retained both of those sets. O.k., now what were those various chunks? The Unicode 1.0 set of 2350 was encoded for compatibility with KS C 5601-1987. They were given no formal decompositions (the concept didn't yet exist), but the implication in the standard was essentially that Hangul syllables could just be spelled out with jamo letter sequences. The details were an exercise for implementation, however, and were soon overtaken by events in the Unicode/10646 merger. The Unicode 1.1 set of 4306 additions came from the 10646 merger work, and comprised two actual subsets: Hangul Supplementary Syllables A (1930 modern syllables) from KS C 5659-1990. (See the Unicode 1.1 subrange: 3D2E..44BD.) Hangul Supplementary Syllables B (2376 old Korean syllables) from KS C 5657-1991. (See the Unicode 1.1 subrange: 44BE..4DFF.) *All* of the Unicode 1.1 Hangul syllables were given decompositions. 
(Although the formalization of Unicode normalization did not yet exist.) The decompositions can be see in UnicodeData-1.1.5.txt. Because the syllables were then encoded in three "alphabetical" extents, with a few stragglers tucked on, the decompositions were not algorithmically defined -- they were just enumerated in the data file. The decompositions involved the new set of conjoining jamo letters, rather than the older set, which were relegated to compatibility mapping status. The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C 5601-1992. That was an algorithmically designed replacement of the earlier sets from Korean standards -- designed to cover all modern syllables algorithmically, by putting all the combinations of initial, medial and final jamos in a defined alphabetical order, whether or not each syllable that resulted was actually attested in modern Korean use or not. There was an enormous hullabaloo at the time, of course, about the changes required to switch over from the old ranges to the new set. But the whole shebang was balloted as Amendment 5 to ISO/IEC 10646-1:1993, and when that ballot passed, Unicode adopted the change wholesale into the documentation and data files for Unicode 2.0, to stay in synch. But "The Korean Mess", as it was then known, led directly to the determination by both SC2 and the UTC that such re-encoding of already standardized and published characters was enormously damaging to both standards. It was also expensive to the early implementers: Oracle, for example, long maintained distinct database support for the Unicode 1.1 Korean, which was incompatible with the Unicode 2.0 Korean. In any case, if anybody has any lingering questions about why the following policy exists and is *strictly* enforced: http://www.unicode.org/policies/stability_policy.html#Encoding or why the applicable version for that stability policy is 2.0+, the answer is that it was a direct reaction to "The Korean Mess". --Ken On 6/19/2015 1:29 PM, Karl Williamson wrote: > I haven't found any information on this. It can't just be a > transliteration difference, because the number of code points is > vastly different between them. > > Is it the case that the version 1 syllables is a failed abstraction > that was replaced by the later versions? From kenwhistler at att.net Fri Jun 19 17:24:26 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 19 Jun 2015 15:24:26 -0700 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <558480C8.90707@khwilliamson.com> References: <558480C8.90707@khwilliamson.com> Message-ID: <5584969A.7030406@att.net> Karl, This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. That is also related to why they are treated as gc=Sk symbol modifiers, rather than as combining marks or format characters. If you *support* emoji modifier sequences, then yes, you should treat them as single grapheme clusters for editing -- but their behavior is more akin then to ligatures or conjuncts than to combining character sequences. You need additional, specific knowledge about these sequences -- it doesn't just fall out from a *default* implementation of UAX #29 rules for grapheme clusters. --Ken On 6/19/2015 1:51 PM, Karl Williamson wrote: > Someone writing code using Unicode 8 found that the FITZPATRICK > modifiers are considered separate graphemes from what they modify. 
> This is surprising, and seems contrary to not only the concept of a > grapheme cluster, but the spirit of tr51 2.2.3 "A supported emoji > modifier sequence should be treated as a single grapheme cluster for > editing purposes" > > > > From mark at macchiato.com Sat Jun 20 04:02:45 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 20 Jun 2015 11:02:45 +0200 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <5584969A.7030406@att.net> References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> Message-ID: On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler wrote: > This results from the fact that the fallback behavior for the modifiers is > simply as independent pictographic blorts, i.e. the color swatch images. > That is also related to why they are treated as gc=Sk symbol modifiers, > rather than as combining marks or format characters. > > If you *support* emoji modifier sequences, then yes, you should treat > them as single grapheme clusters for editing -- but their behavior is > more akin then to ligatures or conjuncts than to combining character > sequences. You need additional, specific > knowledge about these sequences -- it doesn't just fall out from a > *default* implementation of UAX #29 rules for grapheme clusters. > ?Looks like this would be a good FAQ addition...? Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 20 04:32:47 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 20 Jun 2015 11:32:47 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= In-Reply-To: <5581DA60.6080003@unicode.org> References: <5581DA60.6080003@unicode.org> Message-ID: <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> This is intrinsicly the nicest version announcement of all the history of Unicode, because of the opportune use of the newly encoded emoji U+1F37E BOTTLE WITH POPPING CORK. Even if I wouldn?t drink what?s in, nor eat any more U+1F9C0 CHEESE WEDGE (you know I?ve become a vegan between my beta feedback* and now), I was already very pleased when Unicode started adding emojis, and I'm still more as emojis are now thriving and covering the complete cultural range. I?d the purpose not to mail to the List for a time. But I?ve got some other topics I need to discuss. And, first, it would be a pity if there were no follow-up in this 8.0.0 version announcement thread (even if it wasn't sent as a "new topic" to discuss). --- * On Fri Apr 24 12:51:50 CDT 2015, I wrote: http://www.unicode.org/review/pri297/feedback.html > U+1F9C0 CHEESE WEDGE and Translations of the Code Charts > Dear Unicode Consortium, I'm pleased to read the Feedback from Mr Lawson and would join my > congratulations to his' [apostrophe mistake; read: his].? > The Cheese Wedge symbol he underscores, recalls me the new sets have already been translated to French [...]. > > More precisely about the Cheese Wedge, I'm glad to see unbloody, no-slaughter > food is now strongly promoted and is given a fabulous opportunity of becoming > a wide-spread cultural phenomenon. [Alas! That turned out not to be so pleasing at all.] I know, the purpose of this Mailing List is encoding and implementation, not civilisation. That?s why I?ve made up a new keyboard layout for the United Kingdom. It should help British users to get readily fully processible Unicode text files. 
That means, quotation marks can be simply converted to US?usage by doing two research-and-replace-all. I?ve called this keyboard layout ?typographic?, because U+02BC MODIFIER LETTER APOSTROPHE is now inserted by default, while U+0027 for smart quotes (and names of archive files) is obtained with AltGr, a shift state that is already present in the shipped layout, and where now all comma (and angle) quotation marks for use in English and Welsh are equally found, along with em and en dashes. (As is well known, Welsh is the locale of the UK?extended keyboard layout shipped with Windows, which this driver is based on.) If a header or a readme can be provided with the input text, the use of U+02BC for apostrophe should be mentioned, until all software has been updated (by adding U+02BC to the equivalence class for U+0027), because as a collateral damage of legacy practice, searches for apostrophe-containing words are actually prevented from being successful when U+0027 is used in the search bar while the matching words present in the text are accurately spelled with U+02BC. For future readers: For more information about MODIFIER LETTER APOSTROPHE, please look up the thread ?A new take on the English apostrophe in Unicode?. This time, the layout is released for UK only, not for USA because disambiguating apostrophe and single closing-quote seems not to be worth-while in US English, where indeed single quotation marks must scarcely be in current use. The (again unlicensed) drivers (several architecture versions for all actual Windows versions) are for free download at: http://bit.ly/1K1XGBs For more information about keyboard drivers and the Microsoft Keyboard Layout Creator, please download your free copy of MSKLC at: http://www.microsoft.com/en-us/download/details.aspx?id=22339 and look up the Help. Compatibility extends to Windows?7?and?8: http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb Best regards, Marcel Schneider > Message du 17/06/15 23:17 > De : announcements at unicode.org > A : announcements at unicode.org > Copie ? : > Objet : Announcing The Unicode? Standard, Version 8.0 > > > Version 8.0 of the Unicode Standard is now available. It includes 41 new emoji characters (including five modifiers for diversity), 5,771 new ideographs for Chinese, Japanese, and Korean, the new Georgian lari currency symbol, and 86 lowercase Cherokee syllables. It also adds letters to existing scripts to support Arwi (the Tamil language written in the Arabic script), the Ik language in Uganda, Kulango in the C?te d?Ivoire, and other languages of Africa. In total, this version adds 7,716 new characters and six new scripts. > The first version of Unicode Technical Report #51, Unicode Emoji is being released at the same time. That document describes the new emoji characters. It provides design guidelines and data for improving emoji interoperability across platforms, gives background information about emoji symbols, and describes how they are selected for inclusion in the Unicode Standard. The data is used to support emoji characters in implementations, specifying which symbols are commonly displayed as emoji, how the new skin-tone modifiers work, and how composite emoji can be formed with joiners. The Unicode website now supplies charts of emoji characters, showing vendor variations and providing other useful information. 
> The 41 new emoji in Unicode 8.0 include the following: > Diversity > five emoji modifiers > Faces and Hands > NERD FACE, FACE WITH ROLLING EYES, ROBOT FACE > Food-Related > HOT DOG, TACO, CHEESE WEDGE, POPCORN > Sports > CRICKET BAT AND BALL, VOLLEYBALL, BOW AND ARROW > Animals > UNICORN FACE, LION FACE, CRAB, SCORPION > Religious > MOSQUE, SYNAGOGUE, PRAYER BEADS > (For the full list, including images, see emoji additions for Unicode 8.0.) > Phones and computers often need operating system updates to support new emoji, which may take some time. It is also now clear which existing characters, such as the often requested SHOPPING BAGS, can be used as emoji. Once phones and computers support these characters, people will be able to see colorful images such as the BOTTLE WITH POPPING CORK above. > Three other important Unicode specifications are updated for Version 8.0: UTS #10, Unicode Collation Algorithm ? for sorting Unicode text UTS #39, Unicode Security Mechanisms ? for reducing Unicode spoofing UTS #46, Unicode IDNA Compatibility Processing ? for compatible processing of non-ASCII URLs > Some of the changes in Version 8.0 and associated Unicode technical standards may require modifications in implementations. For more information, see Unicode 8.0 Migration and the migration sections of UTS #10, UTS #39, and UTS #46. For full details on Version 8.0, see Unicode 8.0. > http://blog.unicode.org/2015/06/announcing-unicode-standard-version-80.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: champagne-bottle-vector2.jpg Type: image/jpeg Size: 20694 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Sat Jun 20 05:44:58 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 20 Jun 2015 11:44:58 +0100 (BST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= In-Reply-To: <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> References: <5581DA60.6080003@unicode.org> <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> Message-ID: <12873937.14634.1434797098469.JavaMail.defaultUser@defaultHost> Marcel Schneider wrote: ... I?ve become a vegan ... I too am a vegan, in fact a gluten-avoiding vegan. Could there be emoji to signal those two diets in descriptions of food please? William Overington 20 June 2015 / -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 20 10:06:09 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 20 Jun 2015 17:06:09 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= Message-ID: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> On Sat, Jun 20, 2015, William_J_G Overington wrote: > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? This would be very important, to get more people take the move. Today where everything is emoji-powered, Unicode should encode the sooner the better, some striking emojis carrying the message of veganism. Because today, AFAIK, there are only food-labels as the wavy-barred circled ear of wheat for gluten-free food, or something like a barred glass of milk for dairy-free food. There will be to fix a flaw on designations too, because dairy-free liquids and bifidus-fermented products may be referred to as for example soya-based dairy. 
The extremely precise non-vegan food-emojis that actually exist, need to be counter-balanced by an even greater variety of vegan emojis. Marcel Schneider > Message du 20/06/15 12:44 > De : "William_J_G Overington" > A : "Marcel Schneider" , unicode at unicode.org > Copie ? : > Objet : Re: Announcing The Unicode? Standard, Version 8.0 > > Marcel Schneider wrote: > > ... I?ve become a vegan ... > > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? > > William Overington > > 20 June 2015 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > / > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sun Jun 21 12:15:10 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 21 Jun 2015 11:15:10 -0600 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> Message-ID: <5586F11E.9040208@khwilliamson.com> On 06/20/2015 03:02 AM, Mark Davis ?? wrote: > > On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler > wrote: > > This results from the fact that the fallback behavior for the > modifiers is > simply as independent pictographic blorts, i.e. the color swatch images. > That is also related to why they are treated as gc=Sk symbol modifiers, > rather than as combining marks or format characters. > > If you *support* emoji modifier sequences, then yes, you should treat > them as single grapheme clusters for editing -- but their behavior is > more akin then to ligatures or conjuncts than to combining character > sequences. You need additional, specific > knowledge about these sequences -- it doesn't just fall out from a > *default* implementation of UAX #29 rules for grapheme clusters. > > > ?Looks like this would be a good FAQ addition...? Yes please > > > > Mark > / > / > /? Il meglio ? l?inimico del bene ?/ > ////// From doug at ewellic.org Sun Jun 21 12:38:09 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 21 Jun 2015 11:38:09 -0600 Subject: International Register of Coded Character Sets Message-ID: Does anyone know what happened to the International Register of Coded Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is the repository for character sets registered for use with ISO 2022. The page was redirected to a general "we've reorganized our site" page a few weeks ago, and now the entire site seems to be down. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From eric.muller at efele.net Sun Jun 21 15:03:17 2015 From: eric.muller at efele.net (Eric Muller) Date: Sun, 21 Jun 2015 13:03:17 -0700 Subject: Help with African characters, please Message-ID: <55871885.7040406@efele.net> Can you help me identify the characters used in the Kulango, Bouna translation of the UDHR? The text is at . Look for article 14. What is the second letter of the word for "article" (after the N, looks like a greek nu), and what is the second letter of the first word (after the M, looks similar but different)? What is the letter that looks somewhat like an epsilon (but compare with the epsilon like in articles 13 and 15)? Thanks, Eric. 
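When the text in question is available as encoded characters rather than only as a rendered page or scan, a quick way to answer this kind of identification question is to dump the code points and names of the word in doubt. The following is a minimal sketch using Python's standard unicodedata module; the sample string is hypothetical and only stands in for the word being examined.

    import unicodedata

    def describe(text):
        # Print each code point with its Unicode name, so look-alike letters
        # (for example U+028B versus U+03BD) are easy to tell apart.
        for ch in text:
            name = unicodedata.name(ch, "<unnamed>")
            print(f"U+{ord(ch):04X}  {name}")

    describe("N\u028B")  # hypothetical sample word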
From everson at evertype.com Sun Jun 21 15:15:14 2015 From: everson at evertype.com (Michael Everson) Date: Sun, 21 Jun 2015 21:15:14 +0100 Subject: Help with African characters, please In-Reply-To: <55871885.7040406@efele.net> References: <55871885.7040406@efele.net> Message-ID: <13D59640-0FE6-4596-82FD-19F43C3C1943@evertype.com> On 21 Jun 2015, at 21:03, Eric Muller wrote: > > Can you help me identify the characters used in the Kulango, Bouna translation of the UDHR? I believe so. > The text is at . Look for article 14. > > What is the second letter of the word for "article" (after the N, looks like a greek nu), and what is the second letter of the first word (after the M, looks similar but different)? U+028B LATIN SMALL LETTER V WITH HOOK > What is the letter that looks somewhat like an epsilon (but compare with the epsilon like in articles 13 and 15)? U+025B LATIN SMALL LETTER OPEN E The variations you see are font variations only. Note that ?y??n w??? occurs in both styles. Michael Everson * http://www.evertype.com/ From frederic.grosshans at gmail.com Sun Jun 21 15:37:28 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Sun, 21 Jun 2015 20:37:28 +0000 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: I don't know if it's what you're looking for but Google brought me to the following URL. https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf I managed to download the pdf without problems. I also successfully downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to check the URLs from the register. Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > Does anyone know what happened to the International Register of Coded > Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is > the repository for character sets registered for use with ISO 2022. > > The page was redirected to a general "we've reorganized our site" page a > few weeks ago, and now the entire site seems to be down. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Sun Jun 21 16:19:59 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Sun, 21 Jun 2015 14:19:59 -0700 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: There are fairly recent copies on Internet Archive as well: https://web.archive.org/web/20150318013320/http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? Shervin On Sun, Jun 21, 2015 at 1:37 PM, Fr?d?ric Grosshans < frederic.grosshans at gmail.com> wrote: > I don't know if it's what you're looking for but Google brought me to the > following URL. > https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf > I managed to download the pdf without problems. I also successfully > downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to > check the URLs from the register. > > Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > >> Does anyone know what happened to the International Register of Coded >> Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is >> the repository for character sets registered for use with ISO 2022. >> >> The page was redirected to a general "we've reorganized our site" page a >> few weeks ago, and now the entire site seems to be down. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO ???? 
>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Sun Jun 21 23:09:17 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 22 Jun 2015 13:09:17 +0900 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: <55878A6D.5040501@it.aoyama.ac.jp> On 2015/06/22 05:37, Fr?d?ric Grosshans wrote: > I don't know if it's what you're looking for but Google brought me to the > following URL. > https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf > I managed to download the pdf without problems. I also successfully > downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to > check the URLs from the register. I was able to access https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/, but that just says "page not found" in Japanese. Same for https://www.itscj.ipsj.or.jp/ISO-IR/, http://www.itscj.ipsj.or.jp/ISO-IR/, and http://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ (the http versions redirect to the https versions). I left a note on their contact page (https://www.itscj.ipsj.or.jp/contact/index.html), in Japanese. I'll tell you when I hear back from them. If I don't, I'll call them; I remember having done that a few years ago. Regards, Martin. > Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > >> Does anyone know what happened to the International Register of Coded >> Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is >> the repository for character sets registered for use with ISO 2022. >> >> The page was redirected to a general "we've reorganized our site" page a >> few weeks ago, and now the entire site seems to be down. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO ???? >> >> > From mark at macchiato.com Mon Jun 22 03:04:25 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 22 Jun 2015 10:04:25 +0200 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <5586F11E.9040208@khwilliamson.com> References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> <5586F11E.9040208@khwilliamson.com> Message-ID: BTW, Karl, one of our TODOs is to look at the breaking behavior of the emoji sequences.... Mark *? Il meglio ? l?inimico del bene ?* On Sun, Jun 21, 2015 at 7:15 PM, Karl Williamson wrote: > On 06/20/2015 03:02 AM, Mark Davis [image: ?]? wrote: > >> >> On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler > > wrote: >> >> This results from the fact that the fallback behavior for the >> modifiers is >> simply as independent pictographic blorts, i.e. the color swatch >> images. >> That is also related to why they are treated as gc=Sk symbol >> modifiers, >> rather than as combining marks or format characters. >> >> If you *support* emoji modifier sequences, then yes, you should treat >> them as single grapheme clusters for editing -- but their behavior is >> more akin then to ligatures or conjuncts than to combining character >> sequences. You need additional, specific >> knowledge about these sequences -- it doesn't just fall out from a >> *default* implementation of UAX #29 rules for grapheme clusters. >> >> >> ?Looks like this would be a good FAQ addition...? >> > > Yes please > > >> >> >> Mark >> / >> / >> /? Il meglio ? l?inimico del bene ?/ >> ////// >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: From charupdate at orange.fr Mon Jun 22 10:09:23 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 22 Jun 2015 17:09:23 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Vegan_and_gluten-avoiding_vegan_emojis_(was:_?= =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0)?= In-Reply-To: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> References: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> Message-ID: <1081888372.11604.1434985763851.JavaMail.www@wwinf2229> On Sat, Jun 20, 2015, William_J_GOverington wrote: > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? I replied: > This would be very important, to get more people take the move. Today where everything is emoji-powered, Unicode should encode the sooner the better, some striking emojis carrying the message of veganism. > > Because today, AFAIK, there are only food-labels as the wavy-barred circled ear of wheat for gluten-free food, or something like a barred glass of milk for dairy-free food. There will be to fix a flaw on designations too, because dairy-free liquids and bifidus-fermented products may be referred to as for example soya-based dairy. > > The extremely precise non-vegan food-emojis that actually exist, need to be counter-balanced by an even greater variety of vegan emojis. To greet Mr Overington?s idea I?replied on the spot, but a closer review of the U+1F300???U+1F5FF block reveals to me that the already huge number of vegan food emojis (overweighing today in a 2:1 ratio) could have triggered the demand for meat&cheese emojis. This could be the beginning of an emoji battle between vegan and non-vegan. For the vegan lifestyle on the whole, I think now about encoding some of the many already existing vegan food labels. For the gluten-avoiding vegan diet, the question could then be how to combine both this one and one of the circled and (swung-dash-)barred ears of wheat used in labelling. Perhaps this could be added above right. However, to promote diets and lifestyle, one would probably better prefer the circled GF logo because negation is perhaps not the best idea, it brings a connotation of starvation, while in truth, paradoxically, starvation is the counter-part of meat production when looking at the local populations expropriated of their farms by multinational companies or poisoned by pesticides in the neighborhood where food for our cattle is produced, as well as the end-point of meat&cheese&egg&company because of the serious disease, performance-breakdown, illness and finally prematured death they bring to people who eat them. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Jun 24 09:51:21 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 24 Jun 2015 15:51:21 +0100 (BST) Subject: Summer 2015 Localizable Sentence Concept Assessment Experiment Message-ID: <16615654.42957.1435157481849.JavaMail.defaultUser@defaultHost> Summer 2015 Localizable Sentence Concept Assessment Experiment Please use the Base Character followed by Tags concept to express two localizable sentences so as to facilitate transmission and reception of a message through the language barrier. However, only plane 0 Private Use Area characters are used for base character and tags. 
This is so as to use only Private Use Area characters because the Base Character followed by Tags concept applied to localizable sentences has not at this time been officially accepted, in fact at this time not having been put forward formally for consideration regarding official acceptance either. Also, an all plane 0 initial concept proving may possibly be somewhat easier in practice than a plane 15 concept proving. U+EFFF EXPERIMENTAL LOCALIZABLE SENTENCE BASE CHARACTER U+EE20 .. U+EE7E EXPERIMENTAL TAG CHARACTERS The experimental tag characters are the same meanings as, respectively, the tag characters U+E0020 .. U+E007E of regular Unicode. The experiment needs to provide for at least the following. ---- Enter each sentence from a menu where the sentence is listed in English. Selecting from the menu to cause the Private Use Area codes for the sentence to be included in a message, with the English text not appearing in the message. Transmitting and receiving the message. Decoding the message to produce the message displayed localized into Swedish. ---- The sentences are as follows, shown in English, then the sequence of code point descriptions, then shown in Swedish. ---- Good day. U+EFFF U+EE31 U+EE30 U+EE30 U+EE30 U+EE31 God dag! ---- Best regards, U+EFFF U+EE31 U+EE30 U+EE30 U+EE31 U+EE34 V?nliga h?lsningar, ---- The translations are from the following post by Magnus Bodin. http://www.unicode.org/mail-arch/unicode-ml/y2009-m04/0231.html ---- Just in case the accented characters are displayed wrongly in either the mailing list email or in the archive, please know that there are only two accented characters and that the two accented characters are both the same and are as follows. U+00E4 LATIN SMALL LETTER A WITH DIAERESIS The character is listed in the following document. http://www.unicode.org/charts/PDF/U0080.pdf ---- Glyphs for the two localizable sentences are not necessary for this experiment, but should they be of interest and useful, please find attached an image of the two glyphs, the less complex one, at the left, being for Good day. ---- The following post is mentioned in case it is helpful. http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0196.html ---- As it happens I do not personally at present have the knowledge, skills and facilities to carry out the experiment and prove the concept myself. Alas, there is no prize for participating, yet it is not a competition either. Participation could however potentially have far reaching beneficial advantages for the future of communication through the language barrier. William Overington 24 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: glyphs.png Type: image/png Size: 7935 bytes Desc: not available URL: From petercon at microsoft.com Wed Jun 24 10:57:22 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 24 Jun 2015 15:57:22 +0000 Subject: moratorium on repeated discussion of rejected topics Message-ID: Dear Sarasvati: There is a new thread on the topic of using characters to give abstract representation of semantic propositions that can be rendered as sentences in various languages - so called "localizable sentences". This idea has been brought up repeatedly over several years now and has gained no traction as having potential for a Unicode encoding proposal. 
Having this topic continually re-opened is tiresome; it's a form of spam on this list, degrading the experience for all who come to the list to discuss reasonable proposals or to get help with real usage scenarios. I wonder if you might want to consider putting a moratorium on further discussion of this topic. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Wed Jun 24 11:16:49 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 24 Jun 2015 18:16:49 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: On Wed, Jun 24, 2015 at 5:57 PM, Peter Constable wrote: > There is a new thread on the topic of using characters to give abstract > representation of semantic propositions that can be rendered as sentences > in various languages - so called "localizable sentences". This idea has > been brought up repeatedly over several years now and has gained no > traction as having potential for a Unicode encoding proposal. Having > this topic continually re-opened is tiresome; it's a form of spam on this > list, degrading the experience for all who come to the list to discuss > reasonable proposals or to get help with real usage scenarios. I wonder if > you might want to consider putting a moratorium on further discussion of > this topic. > I strongly agree. It is simply a waste of time. (Even though I have blacklisted Overington on my email, I still get other people responding to him.) Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 24 11:30:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 24 Jun 2015 18:30:22 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: I agree, but this thread just restarted because the very active encoding of emojis creates such an opportunity to encode some ideas/words with symbols (though these symbols are just symbols: they have no grammar and do not attempt to represent full text, they are just pictorial substitutes for what they represent directly). Emojis are a sort of reintroduction of ideograms (but without simplifying them with counted strokes, or reducing them to something drawn with a brush and a single ink, or reducing them to single syllables as in Chinese: emojis are true ideograms, just like prehistoric inscriptions, and contain a lot of pictorial art and offer wide-open creativity, much more than conventional glyphs for letters or syllables). The other difference is that emojis are actively supported by vendors and by many users in the world, profiting from the fact that some instant messaging protocols allowed inserting small bitmap icons. Vendors then wanted to support these on larger ranges of devices as well, using different resolutions (or an absence of colors, something rare now). For some applications like SMS and Twitter, using icons was too costly, so they wanted a more compact representation (one that did not require shifting to costly MMS or posting URLs hosted on random hosts, with their security and privacy problems). It's natural that emojis came first from Asia (hence their name), where the creation of sinograms is still very active, but with glyphs that are difficult to interpret by most readers. They wanted more attractive ideograms that everybody could read, notably on the social media where they are targeting the masses that don't want to learn a new language.
2015-06-24 17:57 GMT+02:00 Peter Constable : > Dear Sarasvati: > > > > There is a new thread on the topic of using characters to give abstract > representation of semantic propositions that can be rendered as sentences > in various languages ? so called ?localizable sentences?. This idea has > been brought up repeatedly over several years now and has gained no > traction as having potential for a Unicode encoding proposal. To having > this topic continually re-opened is tiresome; it?s a form of spam on this > list, degrading the experience for all who come to the list to discuss > reasonable proposals or to get help with real usage scenarios. I wonder if > you might want to consider putting a moratorium on further discussion of > this topic. > > > > > > > > Peter > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From root at unicode.org Wed Jun 24 11:37:04 2015 From: root at unicode.org (Sarasvati) Date: Wed, 24 Jun 2015 11:37:04 -0500 Subject: moratorium on repeated discussion of rejected topics Message-ID: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> By popular and repeated request, a moratorium is hereby declared on discussion of so-called "localizable sentences". Please do not respond any further on that topic. If you have additional comments, you are welcome to e-mail privately. Your, -- Sarasvati From mpsuzuki at hiroshima-u.ac.jp Wed Jun 24 11:38:27 2015 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Thu, 25 Jun 2015 01:38:27 +0900 Subject: ["Unicode"] Re: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: <558ADD03.8000803@hiroshima-u.ac.jp> > They wanted more attractive > ideograms that everybody could read, notably on the social medias where > they are targetting the mass that don't wnat to learn a new language. Who they are? Regards, mpsuzuki Philippe Verdy wrote: > I agree, but this thread just restarted because the very active encoding of > emojis creates such opporutnity to encode some ideas/words with symbols > (though these symbols are just symbols and have no grammar and do not > attempt to represent full text, they are just pictural substitutes for what > they represent directly). > > Emojis are sort of reintroducting of ideograms (but not simplifying them > with counted strokes or reducing them to be dran with a brish and single > ink or reducing them to single syllables as in Chinese: emojis are true > ideograms, just like prehistoric inscriptions, and contain a lot of > pictural art and offer a wide-open creativity, much more than conventional > glyphs for letters or syllables). > > The other iddiference is that emojis are actively supported by vendors and > by many users in the world, profiting the fact that some instant messaging > protocols allowed inserting small bitmap icons. Vendors wanted then to > support these also on larger ranges of devices using different resolutions > (or absence of colors, something rare now). For some applications like SMS > and Twitter, using icons was too costly they wanted a more compact > representation (that did not require shifting to costly MMS or posting URLs > hosted on random hosts, with security and privacy problems). > > It's natural that emojis came first from Asia (hence their name), where the > creation of sinograms is still very active, but with glyphs that are > difficult to interpret by most readers. 
They wanted more attractive > ideograms that everybody could read, notably on the social medias where > they are targetting the mass that don't wnat to learn a new language. > > 2015-06-24 17:57 GMT+02:00 Peter Constable : > >> Dear Sarasvati: >> >> >> >> There is a new thread on the topic of using characters to give abstract >> representation of semantic propositions that can be rendered as sentences >> in various languages ? so called ?localizable sentences?. This idea has >> been brought up repeatedly over several years now and has gained no >> traction as having potential for a Unicode encoding proposal. To having >> this topic continually re-opened is tiresome; it?s a form of spam on this >> list, degrading the experience for all who come to the list to discuss >> reasonable proposals or to get help with real usage scenarios. I wonder if >> you might want to consider putting a moratorium on further discussion of >> this topic. >> >> >> >> >> >> >> >> Peter >> >> >> >> >> >> >> > From verdy_p at wanadoo.fr Wed Jun 24 12:09:05 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 24 Jun 2015 19:09:05 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> References: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> Message-ID: I have NEVER actively supported the "localizable sentences". Only one user wanted to discuss it here and I gave him my same opinion repeatedly. In fact you may even have also used my own opinion as one (among others) wanting to stop discussing this topic. But if you want my opinion, there's also really too much discussions about emojis. They are however encoded due to popular demand and very active and demonstrated usage (in conformance with the encoding policy), unlike what William Overrington posts here instead of an appropriate online community for people really interested in developing his project. I have always seen the posts by William Overrington on this list being a real form of "free" advertizing (trying to advertize his own web site). William should better find a Usenet group, or create his own Yahoo group, or social Facebook/Twitter group, or Github group, or similar and advertize it there. 2015-06-24 18:37 GMT+02:00 Sarasvati : > By popular and repeated request, a moratorium is hereby declared > on discussion of so-called "localizable sentences". > > Please do not respond any further on that topic. If you have > additional comments, you are welcome to e-mail privately. > > Your, > -- Sarasvati > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 24 12:47:42 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 24 Jun 2015 10:47:42 -0700 Subject: Old Hungarian font Message-ID: <20150624104742.665a7a7059d7ee80bb4d670165c8327d.2e00205992.wbe@email03.secureserver.net> Now that Old Hungarian is encoded in Unicode, is anyone aware of a font (freely available or not) that supports it, or of plans by anyone to develop one? I'm not looking for a font that maps OH to the ASCII range, such as the original Csenge. I've already tried the major search engines and the well-known font pages, such as Alan Wood and SIL and Wazu Japan. Please send a link only if you've already confirmed there is an OH font there. Thanks, -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From public at khwilliamson.com Wed Jun 24 15:03:09 2015 From: public at khwilliamson.com (Karl Williamson) Date: Wed, 24 Jun 2015 14:03:09 -0600 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <558493EB.4000807@att.net> References: <55847B92.3020201@khwilliamson.com> <558493EB.4000807@att.net> Message-ID: <558B0CFD.2030008@khwilliamson.com> On 06/19/2015 04:12 PM, Ken Whistler wrote: > Karl, > > As usual, the situation is way more complicated that perhaps it has any > business > being! > > It isn't just Version 1 Hangul that have to be considered, but also > Version 1.1 Hangul. > > Version 1.0 contained 2350 Hangul syllables, encoded in the range > 3400..3D2D. > > Version 1.1 contained 6646 Hangul syllables, encoded in the range > 3400..3D2D > and a distinct new range 3D2E..4DFF. It thus added 4306 to what was in > Version 1.0 already. > > Version 2.0 (and all subsequent versions) contained the 11172 Hangul > syllables we now see, encoded in the range AC00..D7A3. Version 2.0 > *deleted* all the Hangul syllables in the range 3400..4DFF. > > You also need to pay attention to the history of the encoding of jamo. > > Version 1.0 contained 94 "Hangul Elements", encoded in the range > 3131..318E. > > Version 1.1 retained the same 94 "Hangul Letters" in the range 3131..318E. > Version 1.1 added 240 conjoining jamo letters in the range 1100..11F9. > > Version 2.0 retained both of those sets. > > O.k., now what were those various chunks? > > The Unicode 1.0 set of 2350 was encoded for compatibility with KS C > 5601-1987. > They were given no formal decompositions (the concept didn't yet exist), > but > the implication in the standard was essentially that Hangul syllables could > just be spelled out with jamo letter sequences. The details were an > exercise > for implementation, however, and were soon overtaken by events in > the Unicode/10646 merger. > > The Unicode 1.1 set of 4306 additions came from the 10646 merger work, > and comprised two actual subsets: > > Hangul Supplementary Syllables A (1930 modern syllables) from KS C > 5659-1990. > (See the Unicode 1.1 subrange: 3D2E..44BD.) > > Hangul Supplementary Syllables B (2376 old Korean syllables) from KS C > 5657-1991. > (See the Unicode 1.1 subrange: 44BE..4DFF.) > > *All* of the Unicode 1.1 Hangul syllables were given decompositions. > (Although the formalization of Unicode normalization did not yet exist.) > The decompositions can be see in UnicodeData-1.1.5.txt. Because the > syllables were then encoded in three "alphabetical" extents, with a few > stragglers tucked > on, the decompositions were not algorithmically defined -- they were just > enumerated in the data file. The decompositions involved the new set of > conjoining jamo letters, rather than the older set, which were relegated > to compatibility mapping status. > > The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C > 5601-1992. > That was an algorithmically designed replacement of the earlier sets from > Korean standards -- designed to cover all modern syllables algorithmically, > by putting all the combinations of initial, medial and final jamos in a > defined > alphabetical order, whether or not each syllable that resulted was actually > attested in modern Korean use or not. 
Does this mean the original 2 standards (KS C 5601-1987 and KS C 5657-1991) fell into disuse (or perhaps never were actually used) so there was no need to map the new code points to them (hence no round-trip defined)? > > There was an enormous hullabaloo at the time, of course, about the changes > required to switch over from the old ranges to the new set. But the whole > shebang was balloted as Amendment 5 to ISO/IEC 10646-1:1993, and when > that ballot passed, Unicode adopted the change wholesale into the > documentation and data files for Unicode 2.0, to stay in synch. > > But "The Korean Mess", as it was then known, led directly to the > determination > by both SC2 and the UTC that such re-encoding of already standardized > and published characters was enormously damaging to both standards. > It was also expensive to the early implementers: Oracle, for example, long > maintained distinct database support for the Unicode 1.1 Korean, which was > incompatible with the Unicode 2.0 Korean. > > In any case, if anybody has any lingering questions about why the following > policy exists and is *strictly* enforced: > > http://www.unicode.org/policies/stability_policy.html#Encoding > > or why the applicable version for that stability policy is 2.0+, the > answer is > that it was a direct reaction to "The Korean Mess". > > --Ken > > On 6/19/2015 1:29 PM, Karl Williamson wrote: >> I haven't found any information on this. It can't just be a >> transliteration difference, because the number of code points is >> vastly different between them. >> >> Is it the case that the version 1 syllables is a failed abstraction >> that was replaced by the later versions? > > From kenwhistler at att.net Wed Jun 24 15:20:20 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 24 Jun 2015 13:20:20 -0700 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <558B0CFD.2030008@khwilliamson.com> References: <55847B92.3020201@khwilliamson.com> <558493EB.4000807@att.net> <558B0CFD.2030008@khwilliamson.com> Message-ID: <558B1104.8010604@att.net> No, there were in fact round-trip mappings defined (and used) at the time. See, e.g.: http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/OLD5601.TXT which shows the Unicode 1.1 <--> KS C 5601-1987 mappings for the old range of Unicode 1.1 Hangul syllables 3400..3D2D. http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT shows the updated mappings for the complete johab set for Unicode 2.0 to an EUC encoding of KS C 5601-1992. I'm not sure about the details of the implementation of KS C 5657-1991. Somebody more familiar with the churn in Korean standards from the early 1990's might know, however. --Ken On 6/24/2015 1:03 PM, Karl Williamson wrote: > On 06/19/2015 04:12 PM, Ken Whistler wrote: >> >> >> The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C >> 5601-1992. >> That was an algorithmically designed replacement of the earlier sets >> from >> Korean standards -- designed to cover all modern syllables >> algorithmically, >> by putting all the combinations of initial, medial and final jamos in a >> defined >> alphabetical order, whether or not each syllable that resulted was >> actually >> attested in modern Korean use or not. > > Does this mean the original 2 standards (KS C 5601-1987 and KS C > 5657-1991) fell into disuse (or perhaps never were actually used) so > there was no need to map the new code points to them (hence no > round-trip defined)? 
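The arithmetic behind the "algorithmically designed" Johab arrangement described above fits in a few lines, which is why no per-syllable mapping table is needed for the 11,172 code points. A minimal C sketch of the composition formula from the standard follows; the constant and function names here are only illustrative:

    #include <stdio.h>

    /* Composition of a modern Hangul syllable from conjoining-jamo indices,
       as used for the Unicode 2.0 "Johab" block U+AC00..U+D7A3.
       L, V, T are the leading-consonant, vowel and trailing-consonant
       indices; T == 0 means "no trailing consonant". */
    enum { SBase = 0xAC00, VCount = 21, TCount = 28 };

    static unsigned int compose_syllable(unsigned int L, unsigned int V, unsigned int T)
    {
        return SBase + (L * VCount + V) * TCount + T;
    }

    int main(void)
    {
        /* L=18 (HIEUH), V=0 (A), T=4 (NIEUN) gives U+D55C, the syllable HAN */
        printf("U+%04X\n", compose_syllable(18, 0, 4));
        return 0;
    }

Decomposition is just the inverse arithmetic (divide and take remainders by TCount and VCount), which is what distinguishes this set from the enumerated decompositions of the Unicode 1.1 ranges.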
From charupdate at orange.fr Fri Jun 26 05:48:39 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 26 Jun 2015 12:48:39 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
Message-ID: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>

I've got a problem with the word joiner and would ask anybody if things could be changed please. After two examples, I'll outline the issue.

To do traditional French typography on the PC, a justifying no-break space is needed along with the colon, because this punctuation must be placed in the middle between the word it belongs to and the following word. According to the Standard, page 799 (§ 23.2), such a space is obtained by bracketing a white space with word joiners: U+2060 U+0020 U+2060. To make this colon readily available on the keyboard, I should therefore program the sequence:

{VK_OEM_2 /*T34 B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

Still in French, the letter apostrophe, when used as the current apostrophe, prevents the following word from being identified as a word because of the missing word boundary and, subsequently, prevents the autoexpand from working. This can be fixed by adding a word joiner after the apostrophe, thanks to an autocorrect entry that replaces U+02BC, inserted by default in typographic mode, with U+02BC U+2060. (About why to use U+02BC, even in French, please refer to the preceding thread "A new take on the English Apostrophe in Unicode". I'll just add now that without disambiguating apostrophes and close-quotes, any search for quotations, e.g. to mark them up, using the generic character * bracketed like '*', must fail because results are cut at the next apostrophe instead of extending to the closing quote.)

However, despite the word joiner having been encoded and recommended since version 3.2 of the Standard, it is still not implemented on Windows 7. Therefore I must use the traditional zero width no-break space U+FEFF instead.

In TUS, sections 23.2 (page 799) and 23.8 (pages 821 sqq), we are taught that for the semantics of word joining, U+2060 is strongly preferred, but U+FEFF must still be supported for backward compatibility. As well, it results from § 23.8 that in careful text processing, U+FEFF occurs only at the very beginning of text files when used as a byte order mark (page 822), while applications where Unicode has been carefully implemented are expected to always mention the charset and the transformation format the files are written in, and don't need U+FEFF as a BOM. Therefore, it seems that U+FEFF can still be used as a ZWNBSP in *new* text files, despite its use being strongly discouraged and U+2060 being preferred.

Supposing that Microsoft chose not to implement U+2060 WJ because quitting the usage of U+FEFF ZWNBSP appeared needless and would have brought much trouble for no use (or at least, not much), please permit me to ask whether Unicode couldn't follow Microsoft once again and remove the recommendation of U+2060. Most people just *can't* use this character, and keyboard implementations *must* avoid it.

Best regards,

Marcel Schneider

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tomasek at etf.cuni.cz Fri Jun 26 06:02:43 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Fri, 26 Jun 2015 13:02:43 +0200 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> Message-ID: <20150626110243.GB18139@ebed.etf.cuni.cz> On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. Therefore you should complain by Microsoft, not here. > Supposing that Microsoft choose not to implement U+2060?WJ Then you should probably choose another operating system which does... Petr Tomasek From charupdate at orange.fr Fri Jun 26 06:16:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 26 Jun 2015 13:16:16 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150626110243.GB18139@ebed.etf.cuni.cz> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: <988090788.6459.1435317376570.JavaMail.www@wwinf2229> On Fri, Jun 26, Petr Tomasek wrote: > Therefore you should complain by Microsoft, not here. U+FEFF works fine for me, no complaint from me now except about recommendations... > Then you should probably choose another operating system which does... You know, the issue is about keyboard layouts, not about me. Thanks for your advice... Regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Jun 26 09:10:35 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 26 Jun 2015 07:10:35 -0700 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> Message-ID: <558D5D5B.506@efele.net> An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 26 09:44:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 26 Jun 2015 16:44:44 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <558D5D5B.506@efele.net> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <558D5D5B.506@efele.net> Message-ID: <1683796058.19352.1435329884593.JavaMail.www@wwinf1m18> On Fri, Jun 26, 2015, Eric Muller wrote: > On 6/26/2015 3:48 AM, Marcel Schneider wrote: >> To do traditional French typography on the PC, > or anywhere You want to say, on any computer. >> a justifying no-break space is needed along with the colon, because this punctuation must be placed in the middle between the word it belongs to and the following word. > Actually, it's non-justifying and it's thin. U+202F ??? NARROW NO-BREAK SPACE is your friend. U+202F is a very good friend of mine, and it's a part of ready sequences with all spaced French punctuations (;:?!??) I program for the keyboard driver, as well as with U+00A0 for use with monospaced fonts (or following user preferences, since word processors got habits with NBSP). That are things everybody knows. Right now, I'm talking about *traditional French typography* on a computer. And I'm talking about the *colon*. As you can read in old style manuals and as I know from more recent sources and from authoritative examples, things must work just as I wrote a couple of hours ago. 
Love it or hate it, you should provide the facility.

Thank you for the advice.

Regards,

Marcel Schneider

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Fri Jun 26 13:28:34 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 26 Jun 2015 19:28:34 +0100
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
Message-ID: <20150626192834.701021ff@JRWUBU2>

On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:

> To do traditional French typography on the PC, a justifying no-break
> space is needed along with the colon, because this punctuation must
> be placed in the middle between the word it belongs to and the
> following word. According to the Standard, page 799 (§ 23.2), such a
> space is obtained by bracketing a white space with word joiners:
> U+2060 U+0020 U+2060. To make this colon readily available on
> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

For readability, I strongly recommend 0x0020 over ' ' in this context.

What is the behavioural difference between and U+00A0? However, if you reread the section, you will see that the sequence they have in mind is .

> Still in French, the letter apostrophe, when used as current
> apostrophe, prevents the following word from being identified as a
> word because of the missing word boundary and, subsequently, prevents
> the autoexpand from working. This can be fixed by adding a word
> joiner after the apostrophe, thanks to an autocorrect entry that
> replaces U+02BC inserted by default in typographic mode, with U+02BC
> U+2060.

No, this doesn't work. While the primary purpose of U+2060 is to prevent line breaks, it is also used to overrule word boundary detectors in scriptio continua. (It works quite well for spell-checking Thai in LibreOffice). Its name implies to me that it is intended to prevent a word boundary being deduced, through the strong correlation between word boundaries and line break opportunities. There doesn't seem to be a code for 'zero-width word boundary at which lines should not normally be broken'.

Richard.

From verdy_p at wanadoo.fr Fri Jun 26 15:16:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 26 Jun 2015 22:16:48 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1474556907.20113.1435331416802.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <179395084.6523.1435317700283.JavaMail.www@wwinf2229> <1474556907.20113.1435331416802.JavaMail.www@wwinf1m18>
Message-ID: 

When I replied I was using a smartphone (the PC was busy with an update). "trait DD jonction" should have been "trait de jonction" (joining stroke): in Arabic and Devanagari, where the letters are joined, these joining strokes are needed, and they are also needed for cursive Latin writing. Only the Arabic joining stroke is encoded in Unicode (for compatibility with old coded character sets, but Arabic rendering engines use the mapping of this character and the metrics of the associated glyph in the fonts to position the joining stroke correctly, juxtaposing it, partially overlapping it, or truncating it).

The Imprimerie nationale does not even itself respect this pseudo-rule allowing thin spaces to be justified.
Newspapers, magazines and printed editions use fixed-width thin spaces. On the other hand, microjustification (letter-spacing in CSS) applies once the normal justification of the justifiable spaces has reached a maximum (it can also be used negatively when the maximum is exceeded but could be avoided by slightly tightening the tracking between all the characters; in practice this does not go below -0.15 em, otherwise undesirable collisions occur). It makes it possible to distribute the remaining width evenly between the characters (including the justifiable spaces that have already been enlarged to their maximum without micro-justification).

In the days of lead type, the thin space was either integrated into the punctuation characters or was a piece of lead type like the others: you started by setting everything that fits on a line between the composing rules, then you inserted the wedge rule into the spaces between words and pressed vertically. If the wedges were already pushed to their maximum, they were replaced by fixed-width space characters, and the justification was redone by using the wedge rule to insert wedges between all the characters of the line, including the fixed-width space characters. Microjustification was then in effect. For negative microjustification, the characters were in fact replaced by narrower characters, or by characters whose internal set width was specially reduced, to the point that if the letters had been juxtaposed they would have ended up joined by the inking on the paper.

Press and book publishers in France all use fixed-width thin spaces in their composition engines (even the Imprimerie nationale, which does not do without standard software either and also works for other publishers; when it has to reproduce works, it has to respect their form). The Imprimerie nationale is not free of typographical errors in its own editions either, and in fact it does not always follow its own "rules", which are only suggestions. There have always been other publishers just as attached to typography and its tradition. But a "fine" (thin space) that would be justifiable does not correspond to the tradition at all. I know what I am talking about, having worked with almost all of the press and book publishers in France (and also, in part, in other European countries, in North America and in the Middle East) and most advertising agencies, including the professional trade press (that only leaves the small advertising-communication publishers who merely reproduce and distribute the proofs requested by their clients, large and small, mostly for leaflets and advertising flyers, or business forms, or individuals for their invitation cards, or restaurants for their menus: there it is the client who decides what they want, even if it is ugly or contrary to certain "official" usages... The press is free and can do without the official rules, and even the administrations each do what they want).

So there is no need for a "zero-width (dis)joiner" in this case to be inserted in addition to a narrow no-break space and a colon, or worse, to appear twice.
The U+202F thin space is sufficient even for the case of microjustification (letter-spacing in CSS).

On 26 June 2015 at 17:10, Marcel Schneider wrote:

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Jun 27 10:48:41 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 27 Jun 2015 17:48:41 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
Message-ID: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>

On Fri, Jun 26, Richard Wordingham wrote:

> On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:
>> To do traditional French typography on the PC, a justifying no-break
>> space is needed along with the colon, because this punctuation must
>> be placed in the middle between the word it belongs to and the
>> following word. According to the Standard, page 799 (§ 23.2), such a
>> space is obtained by bracketing a white space with word joiners:
>> U+2060 U+0020 U+2060. To make this colon readily available on
>> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
>> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

> For readability, I strongly recommend 0x0020 over ' ' in this context.

I pasted the line from the C source, where all ASCII characters, including 0x20, are written in clear. To ensure readability, I inserted a line break before this line. This line break must have been deleted. I don't write 0x0020 in C when it's not necessary. However, I take notice of your recommendation.

> What is the behavioural difference between and U+00A0?

The difference appears in word processing, where justification works with U+0020, while all other spaces, including U+00A0, are not justified.

> However, if you reread the section, you will see that the sequence they have in mind is .

The section I cited reads as follows: "The word joiner can be used to prevent line breaking with other characters that do not have nonbreaking variants, such as U+2009 thin space or U+2015 horizontal bar, by bracketing the character." I don't believe that U+2009 is a specific target character rather than a mere example. IMHO you can bracket with U+2060s whatever character you need.

>> Still in French, the letter apostrophe, when used as current
>> apostrophe, prevents the following word from being identified as a
>> word because of the missing word boundary and, subsequently, prevents
>> the autoexpand from working. This can be fixed by adding a word
>> joiner after the apostrophe, thanks to an autocorrect entry that
>> replaces U+02BC inserted by default in typographic mode, with U+02BC
>> U+2060.

> No, this doesn't work. While the primary purpose of U+2060 is to prevent line breaks, it is also used to overrule word boundary detectors in scriptio continua. (It works quite well for spell-checking Thai in LibreOffice). It's name implies to me that it is intended to prevent a word boundary being deduced, through the strong correlation between word boundaries and line break opportunities. There doesn't seem to be a code for 'zero-width word boundary at which lines should not normally be broken'.

Well, I extrapolated from U+FEFF, which works fine for me, even in this particular context. The fact that U+2060 does not work is another reason not to use it, and all the more I agree with Microsoft, which did not implement U+2060 in Windows 7. Do you have any news about whether U+2060 is a part of at least one font on Windows 8?

Marcel Schneider
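For anyone wanting to experiment with the two sequences being compared in this thread, here is a minimal C sketch, not an actual Windows keyboard-driver table, with illustrative array names only: one colon bracketed with U+2060 WORD JOINER and one with the legacy U+FEFF ZWNBSP as a fallback.

    #include <stdio.h>
    #include <wchar.h>

    /* The "justifying no-break space + colon" sequence discussed above,
       once bracketed with U+2060 WORD JOINER and once with the legacy
       U+FEFF ZERO WIDTH NO-BREAK SPACE as a fallback. */
    static const wchar_t colon_wj[]   = { 0x2060, 0x0020, 0x2060, L':', 0 };
    static const wchar_t colon_feff[] = { 0xFEFF, 0x0020, 0xFEFF, L':', 0 };

    int main(void)
    {
        wprintf(L"WJ-bracketed colon:     %u code units\n", (unsigned)wcslen(colon_wj));
        wprintf(L"ZWNBSP-bracketed colon: %u code units\n", (unsigned)wcslen(colon_feff));
        return 0;
    }

Whether a renderer actually treats the bracketed space as non-breaking and still justifying is up to its line-breaking and justification implementation, which is precisely what the thread is arguing about.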
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 12:33:44 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 19:33:44 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
Message-ID: 

2015-06-27 17:48 GMT+02:00 Marcel Schneider :

> On Fri, Jun 26, Richard Wordingham wrote:
>
> > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider <charupdate at orange.fr> wrote:
> >> To do traditional French typography on the PC, a justifying no-break
> >> space is needed along with the colon, because this punctuation must
> >> be placed in the middle between the word it belongs to and the
> >> following word. According to the Standard, page 799 (§ 23.2), such a
> >> space is obtained by bracketing a white space with word joiners:
> >> U+2060 U+0020 U+2060. To make this colon readily available on
> >> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
> >> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }
>
> > For readability, I strongly recommend 0x0020 over ' ' in this context.
>
> I pasted the line from the C source, where all ASCII characters, including
> 0x20, are written in clear. To ensure readibility, I inserted a line break
> before this line. This line break must have been deleted. I don't write
> 0x0020 in C when it's not necessary. However I take notice of your
> recommendation.
>
> > What is the behavioural difference between and U+00A0?
>
> The difference appears in word processing, where justification works with
> U+0020, while all other spaces, including U+00A0, are not justified.

This is untrue: U+00A0 is not a fixed-width space, and it remains justifiable as well. However, it has a default width (when not justified) which is too large for the "fine" we want (its width and justifiability are exactly like the regular SPACE U+0020, the only difference being that lines normally do not break before or after it).

That's why there's NNBSP U+202F, whose default width when not justified is narrower (about one half of the regular 0.5 em SPACE, or one third in English typography, i.e. between 1/6 and 1/4 em: the NNBSP glyph should be sized by default to about 1/5 em (0.2 em) to work with both conventions, unless the language can be determined).

The "hair space" is even thinner (about 0.1 em, or nearly 1 px in CSS sizes with the default font size used in HTML of 13 pt at 96 logical dpi; on HiDPI displays or in zoomed-in modes working at higher physical resolutions, the 96 logical dpi of CSS map to a dppx equal to several times the logical dpi, and there will be more than one physical pixel: the CSS pixel unit is a logical one, and the 13 pt default font maps to about 17.3 logical pixels, so the hair space is then about 1/17 of the em square width, but it is generally rounded up to 1/10 em; the exact metric depends on several rendering factors and visual hints, but it should be the minimum distance that separates two dots and keeps them visibly and contrastingly separated, without blurring; it is about the same distance that separates the dot of an "i" from its vertical stem).

NNBSP, however, will normally not be justified: its width remains constant while NBSP will be expanded like other spaces.
This also makes NBSP not suitable for French punctuations and group separators, as it could be really too large (just like the regular SPACE). As a group separator, some argue that this space should be replaced by the punctuation space (to match the width of the comma used as the decimal separator in French). But traditionally, the metal typographers just did not make this discrimination, which was only introduced on computers due to legacy software.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nslater at tumbolia.org Sat Jun 27 12:26:22 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Sat, 27 Jun 2015 17:26:22 +0000
Subject: Adding RAINBOW FLAG to Unicode
Message-ID: 

Hello!

It is Pride Month and the US just legalised queer marriage in every state. No better time to start a conversation about including the internationally recognised rainbow flag in Unicode!

Here's some background reading on the flag itself:

https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement)

Here's Bustle on the inclusion of the rainbow flag:

> Nearly 40 years after it was first flown, the rainbow flag remains a powerful and potent symbol of not only current gay rights struggles, but the history of gay rights in America. So why isn't it available as an emoji? The flag is in the public domain, so it certainly isn't being held up by copyright issues. And the current range of rainbow-related emoji show that the technology to jam all those colors distinctly into a very tiny space is available. Numerous national flags have been emojified. And given that the flag has recently been added to the Museum of Modern Art's design collection, everyone is in agreement about its ongoing cultural significance. So what gives?

http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our

This article also includes an example (via screenshot) of how many people "make do" without the rainbow flag. Typically, they use U+1F308 RAINBOW. This can be seen by searching on Twitter (or any other social media platform) for that character.

Indeed, GitHub uses RAINBOW for this:

http://i.imgur.com/KaKQzIC.png

Facebook did the same sort of thing, as seen here:

http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/

They also did this:

http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/

These emojis are *derivative* of the rainbow flag, or include characters displaying the rainbow flag.

While it can be argued that the RAINBOW emoji itself is usable as a stand-in (as above), it usually requires some sort of additional context to work. There is a clear need for a rainbow flag that unambiguously symbolises queer pride.

This is already going on, with some platforms choosing to use a custom emoji shim where no Unicode code-point exists.

This is Twitter's rainbow flag:

https://twitter.com/ericajoy/status/614822988609794048

Screenshot: http://i.imgur.com/1kewdN1.png

Slack has one too:

https://twitter.com/SlackHQ/status/602779337784430592

Screenshot: http://i.imgur.com/8cOK8MH.png

Reddit also offers one:

http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/

Screenshot: http://i.imgur.com/p6YDRkF.png

In all three examples, the symbol is being used in running text.

I found this:

> [...] the UTC does not wish to entertain further proposals for encoding of symbol characters for flags, whether national, state, regional, international, or otherwise. References to UTC Minutes: [134-C2], January 28, 2013.
http://www.unicode.org/alloc/nonapprovals.html

I looked up the minutes, but could not find a more detailed explanation. My guess is that these concerns related to geopolitical issues. Hopefully the same rationale does not apply to the rainbow flag.

Looking at:

http://unicode.org/reports/tr51/#Selection_Factors

Here's a quick list of summary answers:

a. Compatibility: yes. There are existing platform-specific rainbow flag emojis, as demonstrated above. To build a Twitter or Slack client that replicated the native functionality, you would have to use an image instead of a Unicode code point.

b. Expected usage level: the rainbow emoji is listed at #168 on emojitracker.com, and as demonstrated, the rainbow flag has been in wide use since the 1970s.

c. Image distinctiveness: the rainbow flag is visually distinct.

d. Disparity: the rainbow flag is a missing flag.

e. Frequently requested: unsure. I could organise a petition if this would help to sway the decision.

f. Generality: the rainbow flag is not overly specific. Indeed it is the most general of all the pride flags.

g. Open-ended: the rainbow flag is open ended, being the most general of all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols page, but there are many more in the wild.)

h. Representable already: a rainbow can be represented, but it is ambiguous. The RAINBOW emoji cannot be combined with anything pictorial that makes the meaning clear. Context is required, such as pairing it with the word "pride".

i. Logos, Brands, UI icons, signage, specific people, deities: the image is suitable for encoding as a character.

What is the best thing for me to do next?

My proposal is that we add RAINBOW FLAG to Unicode, and that we use the "six-color version popular since 1979".

I only found one official proposal for a single emoji:

http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf

I couldn't find any templates for proposals, though I did look through a number of different examples.

I noticed that a number of them include the ISO/IEC form at the end. Can someone explain that to me? Does it make sense to submit a proposal to the UTC without one of these?

I also notice that it looks like I have to provide (or find a person to provide) a font for the character. Is there any guidance on that? I am happy to pay someone to prepare such a thing for me.

Thank you in advance for your help.

Noah Slater

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 13:49:39 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 20:49:39 +0200
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To: 
References: 
Message-ID: 

2015-06-27 19:26 GMT+02:00 Noah Slater :

> c. Image distinctiveness: the rainbow flag is visually distinct.

Not so distinct from several other former rainbow flags used in South America.

In fact the number of colours in the rainbow varies culturally depending on countries, even when it is intended to refer to the LGBT communities. The exact list of colors is not really fixed, and the number of bands varies between 6 and 7: in the US, it generally has 6 bands.
But in France it frequently has 7 bands, adding fuschia/magenta after violet, because traditionally rainbows are described and drawn in France with 7 colors; the exact tints also vary, notably the lightness of blue and green which may be lighter as lime and skyblue or royal blue, violet becoming sometimes dark blue, and the last one magenta/fuschia becoming sometimes rose, the initial red being also frequently darker than in US where it is in fact more orange than red, and where US orange is nearly gold. The presence also of an cyan/aquamarine band between blue and green bands is also common (with a reduced contrast between the yellow and green, using a lighter shade of green), or simply the light cyan/aqua, or sky blue, replaces the darker blue band. In fact as long as it locally unambiguously represents a rainbow, it is accurate (there's no legal authority defining or restricting its definition, this is not a national emblem anywhere, except for the Jewish community in Russia where the rainbox is a large horizontal one with thin bands over a white flag). The dimensions/proportions are also not fixed: flags are just scaled to fit well with other flags or symbols. In many countries and events, the rainbow flag is displayed along with other national or regional flags. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sat Jun 27 14:06:05 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:06:05 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Nothing really needs to be added to Unicode; vendors could already use: ????? U+1F3F3, U+200D, U+1F308 WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW credit to Shervin for the idea Mark *? Il meglio ? l?inimico del bene ?* On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > Hello! > > It is Pride Month and the US just legalised queer marriage in every state. > No better time to start a conversation about including the internationally > recognised rainbow flag in Unicode! > > Here?s some background reading on the flag itself: > > *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) > * > > Here's Bustle on the inclusion of the rainbow flag: > > > Nearly 40 years after it was first flown, the rainbow flag remains a > powerful and potent symbol of not only current gay rights struggles, but > the history of gay rights in America. So why isn?t it available as an > emoji? The flag is in the public domain, so it certainly isn?t being held > up by copyright issues. And the current range of rainbow-related emoji show > that the technology to jam all those colors distinctly into a very tiny > space is available. Numerous national flags have been emojified. And given > that the flag has recently been added to the Museum of Modern Art?s design > collection, everyone is in agreement about its ongoing cultural > significance. So what gives? > > > http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our > > This article also includes an example (via screenshot) of how many people > ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. > This can be seen by searching on Twitter (or any other social media > platform) for that character. 
> > Indeed, GitHub uses RAINBOW for this: > > http://i.imgur.com/KaKQzIC.png > > Facebook did the same sort of thing, as seen here: > > http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ > > They also did this: > > > http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ > > These emojis are *derivative* of the rainbow flag, or include characters > displaying the rainbow flag. > > While it can be argued that the RAINBOW emoji itself is usable as a > stand-in (as above), it usually requires some sort of additional context to > work. There is a clear need for a rainbow flag that unambiguously > symbolises queer pride. > > This is already going on, with some platforms choosing to use a custom > emoji shim where no Unicode code-point exists. > > This is Twitter?s rainbow flag: > > https://twitter.com/ericajoy/status/614822988609794048 > > Screenshot: http://i.imgur.com/1kewdN1.png > > Slack has one too: > > https://twitter.com/SlackHQ/status/602779337784430592 > > Screenshot: http://i.imgur.com/8cOK8MH.png > > Reddit also offers one: > > http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ > > Screenshot: http://i.imgur.com/p6YDRkF.png > > In all three examples, the symbol is being used in running text. > > I found this: > > > [...] the UTC does not wish to entertain further proposals for encoding > of symbol characters for flags, whether national, state, regional, > international, or otherwise. References to UTC Minutes: [134-C2], January > 28, 2013. > > http://www.unicode.org/alloc/nonapprovals.html > > I looked up the minutes, but could not find a more detailed explanation. > My guess is that these concerns related to geopolitical issues. Hopefully > the same rationale does not apply to the rainbow flag. > > Looking at: > > http://unicode.org/reports/tr51/#Selection_Factors > > Here's a quick list of summary answers: > > a. Compatibility: yes. There are existing platform-specific rainbow flag > emojis, as demonstrated above. To build a Twitter or Slack client that > replicated the native functionality, you would have to use an image instead > of a Unicode code point. > > b. Expected usage level: the rainbow emoji is listed at #168 on > emojitracker.com, and as demonstrated, the rainbow flag has been in wide > use since the 1970s. > > c. Image distinctiveness: the rainbow flag is visually distinct. > > d. Disparity: the rainbow flag is a missing flag. > > e. Frequently requested: unsure. I could organise a petition if this would > help to sway the decision. > > f. Generality: the rainbow flag is not overly specific. Indeed it is the > most general of all the pride flags. > > g. Open-ended: the rainbow flag is open ended, being the most general of > all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols > page, but there are many more in the wild.) > > h. Representable already: a rainbow can be represented, but it is > ambiguous. The RAINBOW emoji cannot be combined with anything pictorial > that makes the meaning clear. Context is required, such as paring it with > the word "pride". > > i. Logos, Brands, UI icons, signage, specific people, deities: the image > is suitable for for encoding as a character. > > What is the best thing for me to do next? > > My proposal is that we add RAINBOW FLAG to Unicode, and that we use the > ?six-color version popular since 1979?. 
> > I only found one official proposal for a single emoji: > > http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf > > I couldn?t find any templates for proposals, though I did look through a > number of different examples. > > I noticed that a number of them include the ISO/IEC form at the end. Can > someone explain that to me? Does it make sense to submit a proposal to the > UTC without one of these? > > I also notice that it looks like I have to provide (or find a person to > provide) a font for the character. Is there any guidance on that? I am > happy to pay someone to prepare such a thing for me. > > Thank you in advance for your help. > > Noah Slater > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 14:06:48 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 20:06:48 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, 27 Jun 2015 at 19:49 Philippe Verdy wrote: > 2015-06-27 19:26 GMT+02:00 Noah Slater : > >> c. Image distinctiveness: the rainbow flag is visually distinct. >> > > Not so distinct from several other former rainbow flags used in South > America. > > In fact the number of colours in the rainbow varies culturally depending > on countries, even when it is intended to refer to the LGBT communities. > The exact list of colors is not really fixed > Thanks for the info! As I read it, item (c) of the Selection Factors annex is about whether it is possible to have a "clearly recognisable" image. It does not appear to be talking about whether there is a single visual representation. In fact, on the strength of that, I strike the "six-color version popular since 1979" part of my proposal. Instead, I'd suggest that how implementors represent the rainbow flag is up to them. As you point out, there may be multiple valid ways of representing this single concept. Should it be entered as RAINBOW FLAG, in a generic sense, with the intention that it could be used for many different things, perhaps with an comment about it being a pride flag or an LGBT flag? Or should it be entered as PRIDE FLAG, with it's use as a rainbow flag as noted as a comment? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 14:12:22 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 20:12:22 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Mark, are there any other instances of a ZERO WIDTH JOINER being used in this way? (i.e. Outside of its intended use with Arabic and Indic scripts, etc.) Please excuse my ignorance. On 27 June 2015 at 20:06, Mark Davis ?? wrote: > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! 
>> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. >> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. 
Expected usage level: the rainbow emoji is listed at #168 on
>> emojitracker.com, and as demonstrated, the rainbow flag has been in wide
>> use since the 1970s.
>>
>> c. Image distinctiveness: the rainbow flag is visually distinct.
>>
>> d. Disparity: the rainbow flag is a missing flag.
>>
>> e. Frequently requested: unsure. I could organise a petition if this
>> would help to sway the decision.
>>
>> f. Generality: the rainbow flag is not overly specific. Indeed it is the
>> most general of all the pride flags.
>>
>> g. Open-ended: the rainbow flag is open ended, being the most general of
>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols
>> page, but there are many more in the wild.)
>>
>> h. Representable already: a rainbow can be represented, but it is
>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial
>> that makes the meaning clear. Context is required, such as paring it with
>> the word "pride".
>>
>> i. Logos, Brands, UI icons, signage, specific people, deities: the image
>> is suitable for for encoding as a character.
>>
>> What is the best thing for me to do next?
>>
>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the
>> "six-color version popular since 1979".
>>
>> I only found one official proposal for a single emoji:
>>
>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf
>>
>> I couldn't find any templates for proposals, though I did look through a
>> number of different examples.
>>
>> I noticed that a number of them include the ISO/IEC form at the end. Can
>> someone explain that to me? Does it make sense to submit a proposal to the
>> UTC without one of these?
>>
>> I also notice that it looks like I have to provide (or find a person to
>> provide) a font for the character. Is there any guidance on that? I am
>> happy to pay someone to prepare such a thing for me.
>>
>> Thank you in advance for your help.
>>
>> Noah Slater
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: emoji_u1f308.png
Type: image/png
Size: 3284 bytes
Desc: not available
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 14:29:32 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 21:29:32 +0200
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To: 
References: 
Message-ID: 

A zero-width joiner between two spacing symbols does not mean that they should overlap completely, even if it allows some limited form of ligature (but mostly for true letters or letter-like symbols, such as between a long dash and an arrow head to connect them together into a long arrow...). Your idea would mean that the joiner changes the width of the rainbow to zero, using in fact a negative placement to overlap the flag, and then clipping the rainbow exactly to its dimensions.

Also, the rainbow symbol alone, U+1F308, is more like the one in the sky: it is circular, and has a central uncolored area. But the flag is meant to be fully covered (unlike the flag of the Jewish Autonomous Oblast in Russia) and should use parallel horizontal bands.

If it is encoded, the flag will certainly become part of the emoji set (it certainly has support for it in instant messaging; soon many apps for mobile phones will feature it in the US, Google will include it as well for Android and Hangouts applications, Apple for iOS, and various IRC tools).
Mobile phone providers will include it even if on such LGBT topic the Japanese manufacturers were more "discrete" (there's still a social taboo even if there's some level of acceptation). It is already sent via MMS only as bitmap icons, but users will want to pay less to send them using SMS, or to send them in Twitter. 2015-06-27 21:06 GMT+02:00 Mark Davis ?? : > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! >> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. >> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] 
the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. Expected usage level: the rainbow emoji is listed at #168 on >> emojitracker.com, and as demonstrated, the rainbow flag has been in wide >> use since the 1970s. >> >> c. Image distinctiveness: the rainbow flag is visually distinct. >> >> d. Disparity: the rainbow flag is a missing flag. >> >> e. Frequently requested: unsure. I could organise a petition if this >> would help to sway the decision. >> >> f. Generality: the rainbow flag is not overly specific. Indeed it is the >> most general of all the pride flags. >> >> g. Open-ended: the rainbow flag is open ended, being the most general of >> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >> page, but there are many more in the wild.) >> >> h. Representable already: a rainbow can be represented, but it is >> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >> that makes the meaning clear. Context is required, such as paring it with >> the word "pride". >> >> i. Logos, Brands, UI icons, signage, specific people, deities: the image >> is suitable for for encoding as a character. >> >> What is the best thing for me to do next? >> >> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >> ?six-color version popular since 1979?. >> >> I only found one official proposal for a single emoji: >> >> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >> >> I couldn?t find any templates for proposals, though I did look through a >> number of different examples. >> >> I noticed that a number of them include the ISO/IEC form at the end. Can >> someone explain that to me? Does it make sense to submit a proposal to the >> UTC without one of these? >> >> I also notice that it looks like I have to provide (or find a person to >> provide) a font for the character. Is there any guidance on that? I am >> happy to pay someone to prepare such a thing for me. >> >> Thank you in advance for your help. >> >> Noah Slater >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From mark at macchiato.com Sat Jun 27 14:31:23 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:31:23 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Take a look at http://unicode.org/reports/tr51/ for details. Mark *? Il meglio ? l?inimico del bene ?* On Sat, Jun 27, 2015 at 9:12 PM, Noah Slater wrote: > Mark, are there any other instances of a ZERO WIDTH JOINER being used in > this way? 
(i.e. Outside of its intended use with Arabic and Indic scripts, > etc.) Please excuse my ignorance. > > On 27 June 2015 at 20:06, Mark Davis [image: ?]? > wrote: > >> Nothing really needs to be added to Unicode; vendors could already use: >> >> ???[image: ??] >> U+1F3F3, U+200D, U+1F308 >> WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW >> >> credit to Shervin for the idea >> >> >> >> Mark >> >> *? Il meglio ? l?inimico del bene ?* >> >> On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater >> wrote: >> >>> Hello! >>> >>> It is Pride Month and the US just legalised queer marriage in every >>> state. No better time to start a conversation about including the >>> internationally recognised rainbow flag in Unicode! >>> >>> Here?s some background reading on the flag itself: >>> >>> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >>> * >>> >>> Here's Bustle on the inclusion of the rainbow flag: >>> >>> > Nearly 40 years after it was first flown, the rainbow flag remains a >>> powerful and potent symbol of not only current gay rights struggles, but >>> the history of gay rights in America. So why isn?t it available as an >>> emoji? The flag is in the public domain, so it certainly isn?t being held >>> up by copyright issues. And the current range of rainbow-related emoji show >>> that the technology to jam all those colors distinctly into a very tiny >>> space is available. Numerous national flags have been emojified. And given >>> that the flag has recently been added to the Museum of Modern Art?s design >>> collection, everyone is in agreement about its ongoing cultural >>> significance. So what gives? >>> >>> >>> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >>> >>> This article also includes an example (via screenshot) of how many >>> people ?make do? without the rainbow flag. Typically, they use U+1F308 >>> RAINBOW. This can be seen by searching on Twitter (or any other social >>> media platform) for that character. >>> >>> Indeed, GitHub uses RAINBOW for this: >>> >>> http://i.imgur.com/KaKQzIC.png >>> >>> Facebook did the same sort of thing, as seen here: >>> >>> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >>> >>> They also did this: >>> >>> >>> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >>> >>> These emojis are *derivative* of the rainbow flag, or include characters >>> displaying the rainbow flag. >>> >>> While it can be argued that the RAINBOW emoji itself is usable as a >>> stand-in (as above), it usually requires some sort of additional context to >>> work. There is a clear need for a rainbow flag that unambiguously >>> symbolises queer pride. >>> >>> This is already going on, with some platforms choosing to use a custom >>> emoji shim where no Unicode code-point exists. >>> >>> This is Twitter?s rainbow flag: >>> >>> https://twitter.com/ericajoy/status/614822988609794048 >>> >>> Screenshot: http://i.imgur.com/1kewdN1.png >>> >>> Slack has one too: >>> >>> https://twitter.com/SlackHQ/status/602779337784430592 >>> >>> Screenshot: http://i.imgur.com/8cOK8MH.png >>> >>> Reddit also offers one: >>> >>> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >>> >>> Screenshot: http://i.imgur.com/p6YDRkF.png >>> >>> In all three examples, the symbol is being used in running text. >>> >>> I found this: >>> >>> > [...] 
the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, regional, >>> international, or otherwise. References to UTC Minutes: [134-C2], January >>> 28, 2013. >>> >>> http://www.unicode.org/alloc/nonapprovals.html >>> >>> I looked up the minutes, but could not find a more detailed explanation. >>> My guess is that these concerns related to geopolitical issues. Hopefully >>> the same rationale does not apply to the rainbow flag. >>> >>> Looking at: >>> >>> http://unicode.org/reports/tr51/#Selection_Factors >>> >>> Here's a quick list of summary answers: >>> >>> a. Compatibility: yes. There are existing platform-specific rainbow flag >>> emojis, as demonstrated above. To build a Twitter or Slack client that >>> replicated the native functionality, you would have to use an image instead >>> of a Unicode code point. >>> >>> b. Expected usage level: the rainbow emoji is listed at #168 on >>> emojitracker.com, and as demonstrated, the rainbow flag has been in >>> wide use since the 1970s. >>> >>> c. Image distinctiveness: the rainbow flag is visually distinct. >>> >>> d. Disparity: the rainbow flag is a missing flag. >>> >>> e. Frequently requested: unsure. I could organise a petition if this >>> would help to sway the decision. >>> >>> f. Generality: the rainbow flag is not overly specific. Indeed it is the >>> most general of all the pride flags. >>> >>> g. Open-ended: the rainbow flag is open ended, being the most general of >>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >>> page, but there are many more in the wild.) >>> >>> h. Representable already: a rainbow can be represented, but it is >>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >>> that makes the meaning clear. Context is required, such as paring it with >>> the word "pride". >>> >>> i. Logos, Brands, UI icons, signage, specific people, deities: the image >>> is suitable for for encoding as a character. >>> >>> What is the best thing for me to do next? >>> >>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >>> ?six-color version popular since 1979?. >>> >>> I only found one official proposal for a single emoji: >>> >>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >>> >>> I couldn?t find any templates for proposals, though I did look through a >>> number of different examples. >>> >>> I noticed that a number of them include the ISO/IEC form at the end. Can >>> someone explain that to me? Does it make sense to submit a proposal to the >>> UTC without one of these? >>> >>> I also notice that it looks like I have to provide (or find a person to >>> provide) a font for the character. Is there any guidance on that? I am >>> happy to pay someone to prepare such a thing for me. >>> >>> Thank you in advance for your help. >>> >>> Noah Slater >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From mark at macchiato.com Sat Jun 27 14:36:52 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:36:52 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, Jun 27, 2015 at 9:29 PM, Philippe Verdy wrote: > A zero-width joiner between two spacing symbols does not mean that they > should overlap completely, even if it allows some limited form of ligature > (but mostly for true letters or letter-like symbols, such as between a long > dash and an arrow head to connect them together in a long arrow...) > Your idea would mean that the joiner changes the width of the rainbow to > zero, using in fact a negative placement to overlap the flag, and then > cutting the rainbow exactly to its dimensions. > The use of joiner with emoji can be rather different. See http://unicode.org/reports/tr51/ for details. > > Also the rainbow symbol alone U+1F308 is more like the one in the sky: it > is circular, and has a central uncolored area. > But the flag is meant to be fully covered (not like the flag of the Jewish Autonomous > Oblast in Russia) and should be using parallel horizontal bands. > Vendors have a fair degree of latitude as far as shapes, and the resulting glyph can be shown with a shape similar to the national flags, and with horizontal bands. > If it is encoded, the flag will certainly become part of the emoji set (it > certainly has support for it in instant messaging; soon many apps for > mobile phones will feature it in the US, Google will include it as well for > Android and Hangouts applications, Apple for iOS, and various IRC tools.) > > Mobile phone providers will include it even if on such LGBT topic the > Japanese manufacturers were more "discrete" (there's still a social taboo > even if there's some level of acceptation). It is already sent via MMS only > as bitmap icons, but users will want to pay less to send them using SMS, or > to send them in Twitter. > Mark *« Il meglio è l’inimico del bene »* -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Jun 27 15:14:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 27 Jun 2015 22:14:33 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: This UTR just addresses the case of a combining coloring symbol for faces, and those color symbols were designed since the beginning to be combined as much as possible (and not meant to be used in isolation); this is not the case of the rainbow symbol, which is much more figurative. Why would associating a flag and a rainbow this way mean that the flag will just be recolored (but the rainbow form itself is completely lost)? Couldn't this be to display a flying flag over a sky with a rainbow? Compare this to the association of the sun and the rainbow symbols, or the cloud and a rainbow (and compare to the sun or moon and a cloud associated the same way, or the association of two clouds: none of them will overlap completely). Imagine the use in a weather application: I don't see why the rainbow would disappear when the flying flag is just there to mean the windy condition, and the rainbow is meant for variable weather mixing rainy and sunny periods. Your proposed use of ZWJ to create a complete overlap of one symbol into another is unexpected. ZWJ + symbol does not transform that symbol into an "emoji modifier" (this is not anywhere in UTR #51).
It may just create a small partial overlap of one symbol into the other, but each one is still clearly identifiable separately. The examples shown are for grouping multiple persons in Annex E but each person is still separately visible and recognizable as such even if they are combined in the same final glyph. Annexe E even requires some specific orders (e.g. for families: the man can only come before a woman, and is then necessarily visible to the left side of the icon, i.e. to the right of the woman; children are necessarily after and below adults...). 2015-06-27 21:31 GMT+02:00 Mark Davis ?? : > Take a look at http://unicode.org/reports/tr51/ for details. > >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From olopierpa at gmail.com Sat Jun 27 16:23:53 2015 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Sat, 27 Jun 2015 23:23:53 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy wrote: > > Why would associating a flag and a rainbow this way means the flag will > just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? > Compare this to the association of the sun and the rainbow symbols, or the > cloud and a rainbow (and compare to the sun or moon and a cloud associated > the same way, or the association of two clouds: none of them will overlap > completely). > > Imagine the use in a weather application, I don't wee why the rainbox > would disappear when the flying flag is just there to mean the windy > condition, and the rainbox meant for variable weather mixing rainy and > sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into > another is unexpected. > A ZWJ does not cause two random characters to overlap. It creates a ligature, and the ligature can be rendered in any way the font designers prefer. If there's a need for this character, font designers could agree to render this ligature in the desired way. In case there's the need, the Unicode Consortium could hint at the intended meaning of this ligature, I think? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 16:28:10 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 21:28:10 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: I think it's a bit of a stretch to propose that a rainbow flag is a "white flag" and "rainbow" ligature. That's certainly well beyond any understanding I have of what a ligature is, from a typographical perspective. On Sat, 27 Jun 2015 at 22:23 Pierpaolo Bernardi wrote: > On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy > wrote: > >> >> Why would associating a flag and a rainbow this way means the flag will >> just be recolored (but the rainbox form itself is completely lost)? >> Couldn't this be to display a flying flag over a sky with a rainbow? >> Compare this to the association of the sun and the rainbow symbols, or the >> cloud and a rainbow (and compare to the sun or moon and a cloud associated >> the same way, or the association of two clouds: none of them will overlap >> completely). >> >> Imagine the use in a weather application, I don't wee why the rainbox >> would disappear when the flying flag is just there to mean the windy >> condition, and the rainbox meant for variable weather mixing rainy and >> sunny periods. 
>> >> Your proposed use of ZWJ to create a complete overlap of one symbol into >> another is unexpected. >> > > A ZWJ does not cause two random characters to overlap. It creates a > ligature, and the ligature can be rendered in any way the font designers > prefer. If there's a need for this character, font designers could agree > to render this ligature in the desired way. > > In case there's the need, the Unicode Consortium could hint at the > intended meaning of this ligature, I think? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Sat Jun 27 16:46:07 2015 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Sun, 28 Jun 2015 01:46:07 +0400 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: U+1F3F3, U+200D, U+2620 WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES Wanna this one, too :) Konstantin 2015-06-27 23:06 GMT+04:00 Mark Davis ?? : > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! >> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. 
>> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. Expected usage level: the rainbow emoji is listed at #168 on >> emojitracker.com, and as demonstrated, the rainbow flag has been in wide >> use since the 1970s. >> >> c. Image distinctiveness: the rainbow flag is visually distinct. >> >> d. Disparity: the rainbow flag is a missing flag. >> >> e. Frequently requested: unsure. I could organise a petition if this >> would help to sway the decision. >> >> f. Generality: the rainbow flag is not overly specific. Indeed it is the >> most general of all the pride flags. >> >> g. Open-ended: the rainbow flag is open ended, being the most general of >> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >> page, but there are many more in the wild.) >> >> h. Representable already: a rainbow can be represented, but it is >> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >> that makes the meaning clear. Context is required, such as paring it with >> the word "pride". >> >> i. Logos, Brands, UI icons, signage, specific people, deities: the image >> is suitable for for encoding as a character. >> >> What is the best thing for me to do next? >> >> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >> ?six-color version popular since 1979?. >> >> I only found one official proposal for a single emoji: >> >> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >> >> I couldn?t find any templates for proposals, though I did look through a >> number of different examples. >> >> I noticed that a number of them include the ISO/IEC form at the end. Can >> someone explain that to me? Does it make sense to submit a proposal to the >> UTC without one of these? >> >> I also notice that it looks like I have to provide (or find a person to >> provide) a font for the character. Is there any guidance on that? I am >> happy to pay someone to prepare such a thing for me. >> >> Thank you in advance for your help. 
>> >> Noah Slater >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From pedberg at apple.com Sat Jun 27 16:48:22 2015 From: pedberg at apple.com (Peter Edberg) Date: Sat, 27 Jun 2015 14:48:22 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Philippe and others, You are missing the relevant parts of UTR #51. See: ? http://www.unicode.org/reports/tr51/#Multi_Person_Groupings ? http://www.unicode.org/reports/tr51/#ZWJ_Sequences This type of behavior with ZWJ for emoji is already in use. - Peter E > On Jun 27, 2015, at 1:14 PM, Philippe Verdy wrote: > > This UTR just addresses the case of a combining coloring symbol for faces and those color symbols were designed since the begining to be combined as much as possible (and not meant to be used in isolation), this is not the case of the rainbow symbol which is much more figurative). > > Why would associating a flag and a rainbow this way means the flag will just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? Compare this to the association of the sun and the rainbow symbols, or the cloud and a rainbow (and compare to the sun or moon and a cloud associated the same way, or the association of two clouds: none of them will overlap completely). > > Imagine the use in a weather application, I don't wee why the rainbox would disappear when the flying flag is just there to mean the windy condition, and the rainbox meant for variable weather mixing rainy and sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into another is unexpected. > > ZWJ+symbol does not transfor that symbol into a "emoi modifier" (this is not anywhere in UTF51). It may just create a small partial overlap of one symbol into the other, but each one is still clearly identifiable separately. The examples shown are for grouping multiple persons in Annex E but each person is still separately visible and recognizable as such even if they are combined in the same final glyph. Annexe E even requires some specific orders (e.g. for families: the man can only come before a woman, and is then necessarily visible to the left side of the icon, i.e. to the right of the woman; children are necessarily after and below adults...). > > > 2015-06-27 21:31 GMT+02:00 Mark Davis ?? >: > Take a look at http://unicode.org/reports/tr51/ for details. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Sat Jun 27 16:48:13 2015 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Sun, 28 Jun 2015 01:48:13 +0400 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Actually, U+1F3F4, U+200D, U+2620 WAVING BLACK FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES Konstantin 2015-06-28 1:46 GMT+04:00 Konstantin Ritt : > U+1F3F3, U+200D, U+2620 > WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES > > Wanna this one, too :) > > > Konstantin > > 2015-06-27 23:06 GMT+04:00 Mark Davis [image: ?]? : > >> Nothing really needs to be added to Unicode; vendors could already use: >> >> ???[image: ??] >> U+1F3F3, U+200D, U+1F308 >> WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW >> >> credit to Shervin for the idea >> >> >> >> Mark >> >> *? Il meglio ? 
l?inimico del bene ?* >> >> On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater >> wrote: >> >>> Hello! >>> >>> It is Pride Month and the US just legalised queer marriage in every >>> state. No better time to start a conversation about including the >>> internationally recognised rainbow flag in Unicode! >>> >>> Here?s some background reading on the flag itself: >>> >>> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >>> * >>> >>> Here's Bustle on the inclusion of the rainbow flag: >>> >>> > Nearly 40 years after it was first flown, the rainbow flag remains a >>> powerful and potent symbol of not only current gay rights struggles, but >>> the history of gay rights in America. So why isn?t it available as an >>> emoji? The flag is in the public domain, so it certainly isn?t being held >>> up by copyright issues. And the current range of rainbow-related emoji show >>> that the technology to jam all those colors distinctly into a very tiny >>> space is available. Numerous national flags have been emojified. And given >>> that the flag has recently been added to the Museum of Modern Art?s design >>> collection, everyone is in agreement about its ongoing cultural >>> significance. So what gives? >>> >>> >>> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >>> >>> This article also includes an example (via screenshot) of how many >>> people ?make do? without the rainbow flag. Typically, they use U+1F308 >>> RAINBOW. This can be seen by searching on Twitter (or any other social >>> media platform) for that character. >>> >>> Indeed, GitHub uses RAINBOW for this: >>> >>> http://i.imgur.com/KaKQzIC.png >>> >>> Facebook did the same sort of thing, as seen here: >>> >>> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >>> >>> They also did this: >>> >>> >>> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >>> >>> These emojis are *derivative* of the rainbow flag, or include characters >>> displaying the rainbow flag. >>> >>> While it can be argued that the RAINBOW emoji itself is usable as a >>> stand-in (as above), it usually requires some sort of additional context to >>> work. There is a clear need for a rainbow flag that unambiguously >>> symbolises queer pride. >>> >>> This is already going on, with some platforms choosing to use a custom >>> emoji shim where no Unicode code-point exists. >>> >>> This is Twitter?s rainbow flag: >>> >>> https://twitter.com/ericajoy/status/614822988609794048 >>> >>> Screenshot: http://i.imgur.com/1kewdN1.png >>> >>> Slack has one too: >>> >>> https://twitter.com/SlackHQ/status/602779337784430592 >>> >>> Screenshot: http://i.imgur.com/8cOK8MH.png >>> >>> Reddit also offers one: >>> >>> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >>> >>> Screenshot: http://i.imgur.com/p6YDRkF.png >>> >>> In all three examples, the symbol is being used in running text. >>> >>> I found this: >>> >>> > [...] the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, regional, >>> international, or otherwise. References to UTC Minutes: [134-C2], January >>> 28, 2013. >>> >>> http://www.unicode.org/alloc/nonapprovals.html >>> >>> I looked up the minutes, but could not find a more detailed explanation. >>> My guess is that these concerns related to geopolitical issues. Hopefully >>> the same rationale does not apply to the rainbow flag. 
>>> >>> Looking at: >>> >>> http://unicode.org/reports/tr51/#Selection_Factors >>> >>> Here's a quick list of summary answers: >>> >>> a. Compatibility: yes. There are existing platform-specific rainbow flag >>> emojis, as demonstrated above. To build a Twitter or Slack client that >>> replicated the native functionality, you would have to use an image instead >>> of a Unicode code point. >>> >>> b. Expected usage level: the rainbow emoji is listed at #168 on >>> emojitracker.com, and as demonstrated, the rainbow flag has been in >>> wide use since the 1970s. >>> >>> c. Image distinctiveness: the rainbow flag is visually distinct. >>> >>> d. Disparity: the rainbow flag is a missing flag. >>> >>> e. Frequently requested: unsure. I could organise a petition if this >>> would help to sway the decision. >>> >>> f. Generality: the rainbow flag is not overly specific. Indeed it is the >>> most general of all the pride flags. >>> >>> g. Open-ended: the rainbow flag is open ended, being the most general of >>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >>> page, but there are many more in the wild.) >>> >>> h. Representable already: a rainbow can be represented, but it is >>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >>> that makes the meaning clear. Context is required, such as paring it with >>> the word "pride". >>> >>> i. Logos, Brands, UI icons, signage, specific people, deities: the image >>> is suitable for for encoding as a character. >>> >>> What is the best thing for me to do next? >>> >>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >>> ?six-color version popular since 1979?. >>> >>> I only found one official proposal for a single emoji: >>> >>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >>> >>> I couldn?t find any templates for proposals, though I did look through a >>> number of different examples. >>> >>> I noticed that a number of them include the ISO/IEC form at the end. Can >>> someone explain that to me? Does it make sense to submit a proposal to the >>> UTC without one of these? >>> >>> I also notice that it looks like I have to provide (or find a person to >>> provide) a font for the character. Is there any guidance on that? I am >>> happy to pay someone to prepare such a thing for me. >>> >>> Thank you in advance for your help. >>> >>> Noah Slater >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From everson at evertype.com Sat Jun 27 16:56:53 2015 From: everson at evertype.com (Michael Everson) Date: Sat, 27 Jun 2015 22:56:53 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On 27 Jun 2015, at 22:46, Konstantin Ritt wrote: > > U+1F3F3, U+200D, U+2620 > WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES And thus the slippery slope is well and truly discovered. Gosh, I wish we could add capital equivalents to all (or most of) the un-cased lower-case letters we?ve got for Latin. That at least would be practical. 
Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Sat Jun 27 17:15:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 00:15:13 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Me too. Not because the semantics of the flag would be lost, but because here the relation with the rainbow is much less evident, as the flag does not mean the meteorological object or the interaction of solar light with the atmosphere, but only a few of its colors, ordered not just like what happens in a rainbow but also as in an optical prism; here, though, the rainbow is disposed in clearly contrasting bands (something that never happens in true rainbows). We are too far from the ligature, as we don't see that as a flag *and* a rainbow; the subject is in fact unbreakable. If it had to be broken, we would also need to add the semantics for the horizontal contrasting stripes (completely missing in the rainbow symbol), and something to mean that we don't want to include the arch form, or any sun ray, or cloud possibly raining, or the earth ground that the rainbow is cutting. In fact the form of the rainbow is not the form of the Earth, but the intersection of a cone centered on the observer's eye, which is near the ground because the direct sunlight behind you is almost parallel and focused at infinite distance: the rainbow is in fact a circle at a well defined distance, but part of it is masked by the ground which is nearer to the observer (and at this shorter observable distance, the angle of light on the cone intersecting there is not correct to see the rainbow light effect); however a small part of the arc falls in front of the ground on the horizon, if your horizon is far enough (only the bottom part of the circle is masked). If you observe the rainbow directly from the ground level, you'll see only a half-circle, but if you climb a few meters up on a ladder, you can see the full circle with the correct opening angle in the air, provided that the sun is not too high in the sky. You cannot observe any rainbow when the sun is at the zenith because the circle of the rainbow is fully below the ground level, so the best and largest rainbows are observed in early mornings or late evenings. 2015-06-27 23:28 GMT+02:00 Noah Slater : > I think it's a bit of a stretch to propose that a rainbow flag is a "white > flag" and "rainbow" ligature. That's certainly well beyond any > understanding I have of what a ligature is, from a typographical > perspective. > > On Sat, 27 Jun 2015 at 22:23 Pierpaolo Bernardi > wrote: > >> On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy >> wrote: >> >>> >>> Why would associating a flag and a rainbow this way mean that the flag will >>> just be recolored (but the rainbow form itself is completely lost)? >>> Couldn't this be to display a flying flag over a sky with a rainbow? >>> Compare this to the association of the sun and the rainbow symbols, or the >>> cloud and a rainbow (and compare to the sun or moon and a cloud associated >>> the same way, or the association of two clouds: none of them will overlap >>> completely). >>> >>> Imagine the use in a weather application: I don't see why the rainbow >>> would disappear when the flying flag is just there to mean the windy >>> condition, and the rainbow is meant for variable weather mixing rainy and >>> sunny periods. >>> >>> Your proposed use of ZWJ to create a complete overlap of one symbol into >>> another is unexpected. >>> >> >> A ZWJ does not cause two random characters to overlap. It creates a >> ligature, and the ligature can be rendered in any way the font designers >> prefer. If there's a need for this character, font designers could agree >> to render this ligature in the desired way. >> >> In case there's the need, the Unicode Consortium could hint at the >> intended meaning of this ligature, I think? >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Jun 27 17:40:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 00:40:32 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: No, I had read it; the persons are still clearly separate. The "rainbow" on the flag is not in fact a rainbow, only its colours. The groups of persons are showing persons themselves, side by side, not one into the other one or one indirectly drawn on the face of another one. This new proposal of use of ZWJ is *definitely NOT in use*; it assumes a strong alteration of semantics. What is represented is NOT a flag and a rainbow side by side. The physical natural phenomenon (and its real 3D cone shape) is NOT represented at all on the flag, but the flag also adds parallel stripes (not encoded by the rainbow symbol itself and not by the white flag symbol alone: if you need to use a country flag to have these bands, you'll add a country-specific semantic that is not part of the international flag). The case is different from the black flag with skull and crossbones: what is represented is effectively a realistic skull and crossbones, not just the color or impression left by these bones. The nearest equivalent you can see is with the Fitzpatrick modifiers for skin colours (which do NOT use any ZWJ, because the Fitzpatrick modifiers are intended to be modifiers and have no shape semantics by themselves). The only interpretation of FLAG + ZWJ + RAINBOW is two separate objects side by side (or one partly covering the other one), like in the Family examples. There's already a strong resistance for just embedding letters on a flag: country flags had then to be encoded differently, and letters enclosed in other shapes such as boxes (similar to the flag) are using combining boxes; we would need a combining flag character to do that, but before this happens we need a way to create "cartouches" for hieroglyphs or sinograms or even Latin letters. This did not occur, and instead emojis are encoding these enclosed letters distinctly, without using any sequence (with combining characters or with joiners). For the same reason, overstriking combining characters are best avoided for letters (this causes interpretation problems). You can expect interpretation problems if you intend to use ZWJ to create a ligature that completely drops the essential shape of the rainbow to keep only its colors in a tiny part of it. By evidence this flag is NOT a ligature. Or otherwise, country flags are ALL ligatures (even if they don't represent the two letters with which they were internally encoded, they don't contain these letters and don't have the semantics of these letters; all that is meant is an association with a country name, and then with its current colors). If we only wanted to include the semantics of the colour, then we would not even need Fitzpatrick modifiers; we would have used ZWJ with white or black filled shapes (boxes, discs, independently of their size and shapes...). ZWJ is NOT a semantics killer.
2015-06-27 23:48 GMT+02:00 Peter Edberg : > Philippe and others, > You are missing the relevant parts of UTR #51. See: > ? http://www.unicode.org/reports/tr51/#Multi_Person_Groupings > ? http://www.unicode.org/reports/tr51/#ZWJ_Sequences > > This type of behavior with ZWJ for emoji is *already in use.* > > - Peter E > > > > On Jun 27, 2015, at 1:14 PM, Philippe Verdy wrote: > > This UTR just addresses the case of a combining coloring symbol for faces > and those color symbols were designed since the begining to be combined as > much as possible (and not meant to be used in isolation), this is not the > case of the rainbow symbol which is much more figurative). > > Why would associating a flag and a rainbow this way means the flag will > just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? > Compare this to the association of the sun and the rainbow symbols, or the > cloud and a rainbow (and compare to the sun or moon and a cloud associated > the same way, or the association of two clouds: none of them will overlap > completely). > > Imagine the use in a weather application, I don't wee why the rainbox > would disappear when the flying flag is just there to mean the windy > condition, and the rainbox meant for variable weather mixing rainy and > sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into > another is unexpected. > > ZWJ+symbol does not transfor that symbol into a "emoi modifier" (this is > not anywhere in UTF51). It may just create a small partial overlap of one > symbol into the other, but each one is still clearly identifiable > separately. The examples shown are for grouping multiple persons in Annex E > but each person is still separately visible and recognizable as such even > if they are combined in the same final glyph. Annexe E even requires some > specific orders (e.g. for families: the man can only come before a woman, > and is then necessarily visible to the left side of the icon, i.e. to the > right of the woman; children are necessarily after and below adults...). > > > 2015-06-27 21:31 GMT+02:00 Mark Davis [image: ?]? : > >> Take a look at http://unicode.org/reports/tr51/ for details. >> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: From nslater at tumbolia.org Sat Jun 27 17:51:27 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 22:51:27 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Thanks to Philippe for the addition of technical arguments in favour of a new code point. This is... a little beyond me. (Though fascinating reading!) I would particularly like to add me +1 to the knock-on effect this would have on downstream vendors. (I would like to see this become a standard emoji, and I'd like us to take whatever action increases the chances of that.) I did just want to respond to the "slippery slope" comment. Firstly to note that this is the name of a logical fallacy :) and that it is a fallacy because it presume that people are unable to make reasonable judgements calls. As it happens, the Consortium appears to have mechanisms in place precisely to handle this sort of thing. When I mentioned my email to a queer friend, they asked if I might propose other pride flags (as there are many). 
As I pointed out to them, I would be happy to do so, should I be able to justify their inclusion in accordance with Annex C. (As it stands, I am not sure any of them receive wide enough applicable use for that, though perhaps there is evidence to the contrary) -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 27 18:37:59 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 27 Jun 2015 23:37:59 +0000 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150626110243.GB18139@ebed.etf.cuni.cz> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Petr Tomasek Sent: Friday, June 26, 2015 4:48 PM To: Marcel Schneider Cc: Unicode Mailing List Subject: Re: WORD JOINER vs ZWNBSP On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. Therefore you should complain by Microsoft, not here. > Supposing that Microsoft choose not to implement U+2060?WJ Then you should probably choose another operating system which does... Petr Tomasek From doug at ewellic.org Sat Jun 27 19:33:51 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 27 Jun 2015 18:33:51 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: Noah Slater wrote: > I found this: > >> [...] the UTC does not wish to entertain further proposals for >> encoding of symbol characters for flags, whether national, state, >> regional, international, or otherwise. References to UTC Minutes: >> [134-C2], January 28, 2013. > > http://www.unicode.org/alloc/nonapprovals.html I think the phrase "or otherwise" above might have been intended to mean "or otherwise." > I looked up the minutes, but could not find a more detailed > explanation. My guess is that these concerns related to geopolitical > issues. Hopefully the same rationale does not apply to the rainbow > flag. My guess is that one reason certain rejected requests are added to the Archive of Notices of Non-Approval is so that the UTC doesn't have to haul out their original explanation or re-argue the same points when the same request, or a similar one, is made again. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From nslater at tumbolia.org Sat Jun 27 20:35:32 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 02:35:32 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On 28 June 2015 at 01:33, Doug Ewell wrote: > > I think the phrase "or otherwise" above might have been intended to mean > "or otherwise." > Perhaps. I'm hoping not. I think there is a strong case for the inclusion of the symbol given that Twitter (one of the largest electronic communication platforms, and archived by the US Library of Congress) is using a non-Unicode rainbow flag in running text. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Sat Jun 27 22:46:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 05:46:22 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: 2015-06-28 2:33 GMT+02:00 Doug Ewell : > Noah Slater wrote: > > I found this: >> >> [...] the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, >>> regional, international, or otherwise. References to UTC Minutes: >>> [134-C2], January 28, 2013. >>> >> >> http://www.unicode.org/alloc/nonapprovals.html >> > > I think the phrase "or otherwise" above might have been intended to mean > "or otherwise." But this statement of early 2013 was contradicted by the addition of hundreds of new national flags (only because a few national flags were part of some Japanese emojis sets, and it was not admissible to have just a handlful of countries with flags but not all others). > I looked up the minutes, but could not find a more detailed >> explanation. My guess is that these concerns related to geopolitical >> issues. Hopefully the same rationale does not apply to the rainbow >> flag. >> > > My guess is that one reason certain rejected requests are added to the > Archive of Notices of Non-Approval is so that the UTC doesn't have to haul > out their original explanation or re-argue the same points when the same > request, or a similar one, is made again. > As soon as Unicode accepted the Japanese emojis sets promoted by its local telcos, including the few national flags the argument was dead. In fact there are also lot of redundant emojis from these sets that were accepted or were just minor variants of other existing Dings already encoded. Now we see an explosion of emojis, but less efforts for historic scripts found in our museums and libraries. The reason being that popular demand won (e.g. look at the Japanese-specific symbol for newbie: a yellow & blue open book: for most others looking at the symbol it will look just like a bicolor tick vertical arrow and will wonder why it is restricted to those colors which are not even part of the name; others will wonder why they can't just have a neutral symbol for an open book, when we have an open envelope, or why there's no incription on this book, i.e. just 2 blank pages or covers without any title). Many emojis are in fact either very centered to Japanese or US culture, including in their descriptions (this is notable on topics about cooking, beverages, animals, buildings, road signals, vehicles, equipements not much used in other places, imaginary characters/creatures...). The historic origin of cultures is almost ignored around the Mediterrean Sea between Europe, Western Asia and Africa, even if these topics are also existing everywhere else and probably more universal (but just less used). -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun Jun 28 02:43:14 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Sun, 28 Jun 2015 15:43:14 +0800 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: 2015?6?28? ??11:49? "Philippe Verdy" ??? > > 2015-06-28 2:33 GMT+02:00 Doug Ewell : >> >> Noah Slater wrote: >> >>> I found this: >>> >>>> [...] the UTC does not wish to entertain further proposals for >>>> encoding of symbol characters for flags, whether national, state, >>>> regional, international, or otherwise. References to UTC Minutes: >>>> [134-C2], January 28, 2013. 
>>> >>> >>> http://www.unicode.org/alloc/nonapprovals.html >> >> >> I think the phrase "or otherwise" above might have been intended to mean "or otherwise." > > > But this statement of early 2013 was contradicted by the addition of hundreds of new national flags (only because a few national flags were part of some Japanese emojis sets, and it was not admissible to have just a handlful of countries with flags but not all others). > Wouldn't the existence of Regional Indicator Symbols(=those flag symbols) themselves avoided the need of adding new regional/national/international flags already? and the 2013 addition do not add flag themselves to the unicode, just some special form of letters that can be used to form flags. >>> >>> I looked up the minutes, but could not find a more detailed >>> explanation. My guess is that these concerns related to geopolitical >>> issues. Hopefully the same rationale does not apply to the rainbow >>> flag. >> >> >> My guess is that one reason certain rejected requests are added to the Archive of Notices of Non-Approval is so that the UTC doesn't have to haul out their original explanation or re-argue the same points when the same request, or a similar one, is made again. > > > As soon as Unicode accepted the Japanese emojis sets promoted by its local telcos, including the few national flags the argument was dead. In fact there are also lot of redundant emojis from these sets that were accepted or were just minor variants of other existing Dings already encoded. Now we see an explosion of emojis, but less efforts for historic scripts found in our museums and libraries. > > The reason being that popular demand won (e.g. look at the Japanese-specific symbol for newbie: a yellow & blue open book: for most others looking at the symbol it will look just like a bicolor tick vertical arrow and will wonder why it is restricted to those colors which are not even part of the name; others will wonder why they can't just have a neutral symbol for an open book, when we have an open envelope, or why there's no incription on this book, i.e. just 2 blank pages or covers without any title). > > Many emojis are in fact either very centered to Japanese or US culture, including in their descriptions (this is notable on topics about cooking, beverages, animals, buildings, road signals, vehicles, equipements not much used in other places, imaginary characters/creatures...). The historic origin of cultures is almost ignored around the Mediterrean Sea between Europe, Western Asia and Africa, even if these topics are also existing everywhere else and probably more universal (but just less used). -------------- next part -------------- An HTML attachment was scrubbed... URL: From costello at mitre.org Sun Jun 28 07:31:51 2015 From: costello at mitre.org (Costello, Roger L.) Date: Sun, 28 Jun 2015 12:31:51 +0000 Subject: Applying Postel's Law to XML, from a Unicode perspective? Message-ID: Hi Folks, Postel's Law says: Be liberal in what you accept, and conservative in what you send. How might Postel's Law be applied to web services that receive XML and sends out XML? Here's one idea: a web service is willing to receive UTF-8 XML documents containing a pseudo-BOM; the web service sends out UTF-8 XML documents without the pseudo-BOM. Can you think of Unicode errors in inbound XML documents that a web service might be willing to accept? 
/Roger From daniel.buenzli at erratique.ch Sun Jun 28 08:25:24 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sun, 28 Jun 2015 14:25:24 +0100 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: On Sunday, 28 June 2015 at 13:31, Costello, Roger L. wrote: > Can you think of Unicode errors in inbound XML documents that a web service might be willing to accept? It depends a bit on your use case and setting (e.g. on the web, security may need to be taken into account), but one thing that could be done is to not have hard failures on character stream decoding errors but simply notify the user of the problem and continue by replacing the offending bytes by the Unicode replacement character U+FFFD until you manage to resynchronize the UTF-{8,16} byte stream and see if you manage to still get the parsing done. In practice such semi-broken XML documents can be produced by the export procedures of legacy software which fail to correctly encode some of the more special characters they have in another legacy encoding. It's better to eventually correct these documents and as such this should not be done *silently*, but it's nicer to the user if your import procedures are "best-effort" and can recover from these kinds of error conditions. Best, Daniel From verdy_p at wanadoo.fr Sun Jun 28 08:26:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 15:26:22 +0200 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: For XML there's in fact no problem at all: XML (but also JSON) requires for its validity a single root element. If there's a BOM followed by another element, it is not a conforming XML document if that BOM is interpreted as part of a text element. If there's a BOM followed by an XML declaration, it cannot be a text element (the XML declaration must come before any other element). The only possibility of ambiguity is an XML document that consists only of a single text element (possibly embedding comments) and no other element and no XML declaration. Such a document is purely plain text in fact (with the only exception of the predefined named or numeric character entities starting with "&" and terminated by ";"). In summary, there's no problem at all for XML (or JSON, or other text-encoded syntaxes including JavaScript, where a leading ZWNBSP cannot be valid in its syntax). The theoretical ambiguity only exists with (unstructured) plain text (which has no defined syntax to restrict its validity), and for that, plain texts should include a MIME document type in their transport headers to define the behavior of the BOM. And if possible, if there's a leading ZWNBSP starting this text, it should be doubled to make sure it will be interpreted correctly, as part of the transport layer. But in practice, unstructured plain text documents never need to start with ZWNBSP (the only exception being short individual plain text database fields, which are still rarely needed without a container; this includes CSV files, where text fields should be surrounded by quotation marks, or which start with a leading row defining names of columns that never needs any leading ZWNBSP).
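To make the import side of this concrete, here is a minimal Python sketch of the lenient decoding Daniel describes above, combined with tolerating the pseudo-BOM from Roger's original question: malformed bytes become U+FFFD and the user is warned, while output stays conservative (well-formed UTF-8, no BOM). The helper name and the sample bytes are made up for illustration; a real service would also report the error positions so the source document can eventually be corrected rather than silently patched.

import codecs

def lenient_decode(data):
    # Hypothetical helper: drop a leading UTF-8 pseudo-BOM if present, then
    # decode with U+FFFD substitution instead of failing hard.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = data.decode("utf-8", errors="replace")
    return text, "\uFFFD" in text

# Hypothetical inbound payload: pseudo-BOM plus XML containing one stray
# non-UTF-8 byte (0xE9, a bare latin-1 "e with acute").
inbound = codecs.BOM_UTF8 + b"<note>caf\xe9</note>"
text, had_errors = lenient_decode(inbound)
if had_errors:
    print("warning: malformed UTF-8 was replaced with U+FFFD; ask the sender to fix the export")
outbound = text.encode("utf-8")  # conservative output: clean UTF-8, no BOM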
Being liberal does not really introduce a security issue, including for digitally signed texts (signed plain texts also have other requirements related to the interpretation of line breaks and whitespace: the simple fix is to start the text with an empty line, and line breaks and whitespace are collapsed to a single space prior to computing the digital signature (hash / digest)). 2015-06-28 14:31 GMT+02:00 Costello, Roger L. : > Hi Folks, > > Postel's Law says: > > Be liberal in what you accept, and > conservative in what you send. > > How might Postel's Law be applied to web services that receive XML and > send out XML? > > Here's one idea: a web service is willing to receive UTF-8 XML documents > containing a pseudo-BOM; the web service sends out UTF-8 XML documents > without the pseudo-BOM. > > Can you think of Unicode errors in inbound XML documents that a web > service might be willing to accept? > > /Roger > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sun Jun 28 08:49:45 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 28 Jun 2015 15:49:45 +0200 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: On Sun, Jun 28, 2015 at 2:31 PM, Costello, Roger L. wrote: > How might Postel's Law be applied to web services that receive XML and > send out XML? > > Here's one idea: a web service is willing to receive UTF-8 XML documents > containing a pseudo-BOM; the web service sends out UTF-8 XML documents > without the pseudo-BOM. > > Can you think of Unicode errors in inbound XML documents that a web > service might be willing to accept? > Your question is not at all in the scope of Unicode. It is an XML issue, so should be directed to the W3C, not here. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Jun 28 12:43:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 11:43:38 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: <1F5DD19ABA0F4CC6A1A43ADB184EA572@DougEwell> gfb hjjhjh wrote: > Wouldn't the existence of Regional Indicator Symbols (i.e. those flag > symbols) already have avoided the need to add new regional/national/ > international flags? And the 2013 addition did not add flags > themselves to Unicode, just a special form of letters that can > be used to form flags. And in fact, the Regional Indicator Symbols were added in Unicode 6.0 (October 2010), more than a year before the proposal to encode US FLAG as a unitary character was even written. And the non-approval text in 2013 specifically mentioned the RIS as one of the reasons for rejecting the unitary character. There's no contradiction. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From eric.muller at efele.net Sun Jun 28 13:28:12 2015 From: eric.muller at efele.net (Eric Muller) Date: Sun, 28 Jun 2015 11:28:12 -0700 Subject: UDHR in Unicode: 400 translations in text form! Message-ID: <55903CBC.9050900@efele.net> I am pleased to announce that the UDHR in Unicode project (http://unicode.org/udhr) has reached a notable milestone: we now have 400 translations of the Universal Declaration of Human Rights in text form. The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu de Silva and Sascha Brawer. Many thanks to them and to all the contributors.
There is still plenty of work: most translations would benefit from a review, and there are 55 translations for which we have PDFs or images, but not yet the text form (look for stage 2 translations). The site has also been revamped a bit, with a more functional map, and a more functional table of the translations. The mapping to ISO 639-3 and BCP 47 have been updated to take into account the evolution of those standards. Again, thanks to all the contributors, past, present and future, Eric. PS: I believe I have taken care of all the backlog of contributions and comments. If I missed something, sorry, and please ping me again. From doug at ewellic.org Sun Jun 28 13:59:27 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 12:59:27 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: More: >> [...] the UTC does not wish to entertain further proposals for >> encoding of symbol characters for flags, whether national, state, >> regional, international, or otherwise. References to UTC Minutes: >> [134-C2], January 28, 2013. This is also why U+1F3C1 CHEQUERED FLAG doesn't set a precedent for encoding additional flags as single characters: it was also introduced in Unicode 6.0, more than two years earlier. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun Jun 28 14:20:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 21:20:32 +0200 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: Note: The marker icons showing languages in the Leaflet component (over the OSM map) are not working (broken links) GET http://www.unicode.org/udhr/cdn/cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/images/marker-icon.png : HTTP error 404 (Not Found) Also the locations assigned of some international languages is strange: Esperanto is mapped in France at the location where we would expect Picard [pcd], Picard is located in a location just near the border of Belgium, where this is actually the local "ch'ti" variant, spoken in the French Flanders aound Lille. Standard French is located not in Paris but near Orleans where we would expect the Orleanais regional variant of French. In fact the nearer location for French gives us Interlingua instead, whose usage in France is much more rare than in other countries (may be it was created there and there's still some local associations promoting it from there). I was expecting to find Interlingua somewhere between South America and Asia. But in fact I would have placed those international languages somewhere in the middle of an ocean, just aligned vertically in a list along a meridian (across the Atlantic or Pacific for example) ---- Some languages do have an ISO 639-3 code. E.g. - Tetum, official in Timor-Leste, is currently "coded" as "010" (mapped to "und" in ISO 639-3), it should be "tet". - Forro (Saotomense) is a Portuguese-based creole in Sao Tome, currently "coded" as "007" (mapped to "und"), it should use "cri". - Kimbundu should also use "kmb" and not "009" - Umbundo (Umbundu) should also use "umb" and not "011" 2015-06-28 20:28 GMT+02:00 Eric Muller : > I am pleased to announce that the UDHR in Unicode project ( > http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu > de Silva and Sascha Brawer. 
Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken.shirriff at gmail.com Sun Jun 28 14:30:07 2015 From: ken.shirriff at gmail.com (Ken Shirriff) Date: Sun, 28 Jun 2015 12:30:07 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: I don't mean to be critical, but I find the UDHR page is really hard to use. Observed behavior: I click on the map. I get circles. I click on a circle and get more circles. I click again and get more circles. Keep clicking and I get weird image icons with letters. I click on one and I get a popup with mysterious sh X C T H OHCHR. I click on a language name and get a description of the language. I hit back and need to go through the entire circle thing again. I click on sh and get "Status: no known problems". I hit back and go through the circle thing again. I click on X and get an XML file. I do the back and circle thing again. I click C and get a list of Unicode characters. I click on the list of tables and get the same thing, except without the multiple layers of circles. After several minutes of clicking, I haven't seen any translations. Expected behavior: I click on the map and see a translation of the UDHR into an interesting language with a cool font. Ken On Sun, Jun 28, 2015 at 11:28 AM, Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project ( > http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu > de Silva and Sascha Brawer. Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sun Jun 28 14:51:22 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 19:51:22 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Sorry to be a pain. I mentioned I looked up the minutes and couldn't find anything apropos. 
Could someone explain the rational behind 134-C2 and how it might apply to the rainbow flag proposal ? On Sun, 28 Jun 2015 at 20:04 Doug Ewell wrote: > More: > > >> [...] the UTC does not wish to entertain further proposals for > >> encoding of symbol characters for flags, whether national, state, > >> regional, international, or otherwise. References to UTC Minutes: > >> [134-C2], January 28, 2013. > > This is also why U+1F3C1 CHEQUERED FLAG doesn't set a precedent for > encoding additional flags as single characters: it was also introduced > in Unicode 6.0, more than two years earlier. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun Jun 28 15:16:29 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 29 Jun 2015 04:16:29 +0800 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> Message-ID: 2015?5?30? ??5:19? "Andrew West" wrote? > > On 30 May 2015 at 02:50, Ken Whistler wrote: > > > > 1. I have seen a chinese character ??? from a Vietnamese dictionary NHAT > > DUNG THUONG DAM DICTIONARY > > > > Extension F is harder to track down, because it has not yet been > > approved by the UTC, and comes in two pieces, with different > > progression so far in the ISO committee. Perhaps somebody on this list > > who has better access to the relevant documents can let you > > know whether ??? can be found in those sets. > > It's not in my lists of F1 and F2 characters. oh and by the way, could you (or someone else) please help look for the character ??? also? Just seen a Chinese Wikipedia article introducing an ethnic group with the character as partvof its name https://zh.m.wikipedia.org/wiki/(??)?? but without a proper character for so. The article sourced a CCTV program for ots origin. And there seem to be a dozen more wikipedia article that contain unencoded han characters, as listed in https://zh.wikipedia.org/wiki/Category:?????????? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Jun 28 15:23:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 14:23:33 -0600 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: <84968C090B5F47409EF2006CF5309985@DougEwell> Noah Slater wrote: > Sorry to be a pain. I mentioned I looked up the minutes and couldn't > find anything apropos. > > Could someone explain the rational behind 134-C2 and how it might > apply to the rainbow flag proposal ? The following is informal and dilettante, since only a UTC officer can give a formal rationale for what happened in this 2013 meeting. According to the minutes, consensus decision 134-C2, by itself, says only: "Consensus: The Unicode Technical Committee does not approve encoding a United States flag symbol." That refers only to the one symbol proposed in L2/12-094. But the same discussion also led to an action item, 134-A5: "Action Item for Ken Whistler: Add the United States Flag symbol to notices of non-approval." And that notice says, in full (not elided): "Disposition: The UTC rejected the proposal. The mapping to an existing emoji symbol for the US flag is already possible by using pairs of regional indicator symbols. 
Additionally, the domain of flags is generally not amenable to representation by encoded characters, and the UTC does not wish to entertain further proposals for encoding of symbol characters for flags, whether national, state, regional, international, or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." The last clause is the relevant one here: "whether national, state, regional, international, or otherwise." The words "or otherwise" could be interpreted as saying that no *specific* flag of any kind will be encoded in the future as a single character, partly because the domain of flags is so open-ended. That would include flags associated with or representing specific groups of individuals or social causes. Now, we know that this is all flexible and subject to momentary change. Trying to predict what will and will not be considered "in scope" is more difficult today than ever. Perhaps your best bet is simply to write and submit a proposal, and see what happens. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From nslater at tumbolia.org Sun Jun 28 17:36:19 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 22:36:19 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <84968C090B5F47409EF2006CF5309985@DougEwell> References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: Thanks for summarising that in an email, Doug. I really wish they'd provided a justification for this statement! :) I guess that this is the right list for a UTC officer to give some sort of feedback. On Sun, 28 Jun 2015 at 21:23 Doug Ewell wrote: > Noah Slater wrote: > > > Sorry to be a pain. I mentioned I looked up the minutes and couldn't > > find anything apropos. > > > > Could someone explain the rational behind 134-C2 and how it might > > apply to the rainbow flag proposal ? > > The following is informal and dilettante, since only a UTC officer can > give a formal rationale for what happened in this 2013 meeting. > > According to the minutes, consensus decision 134-C2, by itself, says > only: "Consensus: The Unicode Technical Committee does not approve > encoding a United States flag symbol." That refers only to the one > symbol proposed in L2/12-094. > > But the same discussion also led to an action item, 134-A5: "Action Item > for Ken Whistler: Add the United States Flag symbol to notices of > non-approval." > > And that notice says, in full (not elided): > > "Disposition: The UTC rejected the proposal. The mapping to an existing > emoji symbol for the US flag is already possible by using pairs of > regional indicator symbols. Additionally, the domain of flags is > generally not amenable to representation by encoded characters, and the > UTC does not wish to entertain further proposals for encoding of symbol > characters for flags, whether national, state, regional, international, > or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." > > The last clause is the relevant one here: "whether national, state, > regional, international, or otherwise." The words "or otherwise" could > be interpreted as saying that no *specific* flag of any kind will be > encoded in the future as a single character, partly because the domain > of flags is so open-ended. That would include flags associated with or > representing specific groups of individuals or social causes. > > Now, we know that this is all flexible and subject to momentary change. > Trying to predict what will and will not be considered "in scope" is > more difficult today than ever. 
Perhaps your best bet is simply to write > and submit a proposal, and see what happens. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at swales.us Sun Jun 28 17:02:21 2015 From: steve at swales.us (Steve Swales) Date: Sun, 28 Jun 2015 15:02:21 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <84968C090B5F47409EF2006CF5309985@DougEwell> References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. Sent from my iPhone From nslater at tumbolia.org Sun Jun 28 18:14:13 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 23:14:13 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: QR for... ? Queer Rainbow? :) On Sun, 28 Jun 2015 at 23:52 Steve Swales wrote: > Another way the Pride Flag might be mapped into Unicode without adding > code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding > to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for > example, might be an appropriate choice. > > > Sent from my iPhone > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Jun 28 18:20:18 2015 From: everson at evertype.com (Michael Everson) Date: Mon, 29 Jun 2015 00:20:18 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: On 28 Jun 2015, at 23:02, Steve Swales wrote: > > Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. It would be poor standardization to do this. Nothing would prevent the 3166 MA from assigning any unassigned code. Michael Everson * http://www.evertype.com/ From doug at ewellic.org Sun Jun 28 18:53:47 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 17:53:47 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: Michael Everson wrote: > On 28 Jun 2015, at 23:02, Steve Swales wrote: > >> Another way the Pride Flag might be mapped into Unicode without >> adding code points would be to use a REGIONAL INDICATOR SYMBOL pair >> corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 >> + U+1F1F7, for example, might be an appropriate choice. > > It would be poor standardization to do this. Nothing would prevent the > 3166 MA from assigning any unassigned code. QM through QZ (among others) are user-assigned code elements. But I'm not sure whether the RIS are defined to use them that way. At the least, it would probably call for some sort of private agreement, similar to using the Unicode PUAs. 
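For reference, the Regional Indicator Symbol mechanism discussed throughout this thread is purely generative: a two-letter code is mapped onto two symbols in the U+1F1E6..U+1F1FF range, and whether a given pair is shown as a flag is entirely up to the implementation. A small Python sketch, given here only as an illustration of the mapping:

    REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

    def regional_indicator_pair(alpha2: str) -> str:
        # Map a two-letter code (e.g. ISO 3166-1 alpha-2) to a pair of
        # Regional Indicator Symbols by offsetting from LETTER A.
        if len(alpha2) != 2 or not (alpha2.isascii() and alpha2.isalpha()):
            raise ValueError("expected a two-letter code")
        return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in alpha2.upper())

    print(regional_indicator_pair("US"))  # U+1F1FA U+1F1F8, shown as the US flag where supported
    print(regional_indicator_pair("QR"))  # two valid symbols, but no flag is defined for this pair

This is why a user-assigned pair such as "QR" would only work by private agreement: the characters themselves are well formed, but nothing in the standard ties them to any particular image.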
In general, Michael's right; assuming that it's OK to use an "unallocated" 3166-1 sequence would be like assuming it's OK to use, say, U+0530 for a privately defined character, just because it's currently unassigned. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From steve at swales.us Sun Jun 28 18:56:42 2015 From: steve at swales.us (Steve Swales) Date: Sun, 28 Jun 2015 16:56:42 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: <925CE970-D1EF-42D4-8666-A4E5D3285196@swales.us> QR is actually in the so called "user-assigned" area, so unlikely it will be officially assigned, but also hard to standardize as anything in particular. -steve Sent from my iPhone > On Jun 28, 2015, at 4:20 PM, Michael Everson wrote: > >> On 28 Jun 2015, at 23:02, Steve Swales wrote: >> >> Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. > > It would be poor standardization to do this. Nothing would prevent the 3166 MA from assigning any unassigned code. > > Michael Everson * http://www.evertype.com/ > > From leob at mailcom.com Mon Jun 29 00:24:33 2015 From: leob at mailcom.com (Leo Broukhis) Date: Sun, 28 Jun 2015 22:24:33 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: Ukrainian is in Estonia, Estonian is in the Baltic sea. On Sun, Jun 28, 2015 at 11:28 AM, Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project > (http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu de > Silva and Sascha Brawer. Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. From andrewcwest at gmail.com Mon Jun 29 03:33:14 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 29 Jun 2015 09:33:14 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> Message-ID: On 28 June 2015 at 21:16, gfb hjjhjh wrote: > > oh and by the way, could you (or someone else) please help look for the > character ??? also? Not in the pipeline as far as I can see. > Just seen a Chinese Wikipedia article introducing an > ethnic group with the character as partvof its name > https://zh.m.wikipedia.org/wiki/(??)?? but without a proper character for > so. The article sourced a CCTV program for ots origin. ... which calls them "??", and so is not evidence for the existence of the character "??" 
(I don't doubt that the character exists, but neither the Wikipedia article nor the CCTV web page are sufficient evidence for it). > And there seem to be a dozen more wikipedia article that contain unencoded > han characters, as listed in > https://zh.wikipedia.org/wiki/Category:?????????? There are some 60 unencoded CJK characters in use on Wikimedia projects (see https://commons.wikimedia.org/wiki/Category:Chinese_characters_not_in_Unicode), which I include in my "BabelStone Han PUA" font (see U+F2D6..U+F2EF, U+F2FD..U+F2FF, U+F3E0, U+F4C0..U+F4E1 listed at http://www.babelstone.co.uk/Fonts/PUA.html). The problem with most of these characters is that Wikipedia is not a suitable source for encoding, and evidence for use of these characters in printed sources needs to be presented to the UTC and IRG for them to have any chance of being encoded. For an example of what you should do to get these characters encoded see the latest revision of Ming Fan's "Proposal to add 94 Chinese characters to UAX #45" (http://www.unicode.org/L2/L2015/15098r3-chinese.pdf). Andrew From eric.muller at efele.net Mon Jun 29 08:26:30 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:26:30 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914786.5070805@efele.net> On 6/28/2015 10:24 PM, Leo Broukhis wrote: > Ukrainian is in Estonia, Estonian is in the Baltic sea. I took the locations from glottolog.org. The first error is mine, I mistyped a value. The second error comes from Glottolog, I corrected and reported to them. Will appear in the next update. Thanks, Eric. From eric.muller at efele.net Mon Jun 29 08:49:22 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:49:22 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914CE2.8040700@efele.net> On 6/28/2015 12:20 PM, Philippe Verdy wrote: > Note: The marker icons showing languages in the Leaflet component > (over the OSM map) are not working (broken links) Fixed, I believe. > Also the locations assigned of some international languages is strange: > > Esperanto ... Picard ... Standard French These locations for those come from http://glottolog.org. Unless those locations are obviously wrong, I'd prefer to keep them aligned. > But in fact I would have placed those international languages > somewhere in the middle of an ocean, just aligned vertically in a list > along a meridian (across the Atlantic or Pacific for example) A few are already in Antarctica. I'll move Esperanto and Interlingua there. > > Some languages do have an ISO 639-3 code. E.g. > - Tetum, official in Timor-Leste, is currently "coded" as "010" > (mapped to "und" in ISO 639-3), it should be "tet". In general, identification of the language of the translations is not trivial. I have learned to not trust just the names provided with the translations. For this one, there is another translation, [tet], which most likely is tet/Tetun. [010] looks like a fairly different language and it is not clear to me that it is Tetun. I'd rather have some informed recommendation before assigning a language to [010]. It does not help that the source site does not seem accessible right now. > - Forro (Saotomense) is a Portuguese-based creole in Sao Tome, > currently "coded" as "007" (mapped to "und"), it should use "cri". 
The OHCHR site warns: "not to confuse Crioulo Santomense with Santomense (a variety and dialect of Portuguese in S?o Tom? and Pr?ncipe)" Again, I'd prefer some informed recommendation. > - Kimbundu should also use "kmb" and not "009" > - Umbundo (Umbundu) should also use "umb" and not "011" According to the Ethnologue, both Kimbundu and Umbundu are used both as language names and as family names. Given that I don't really trust the sources of those names, I'd prefer some informed recommendation. Thanks, Eric. From eric.muller at efele.net Mon Jun 29 08:58:10 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:58:10 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914EF2.9090607@efele.net> On 6/28/2015 12:30 PM, Ken Shirriff wrote: > I don't mean to be critical, but I find the UDHR page is really hard > to use. > > Thanks for the observations. I'll try to find a better organization. Eric. From kenwhistler at att.net Mon Jun 29 09:50:20 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 29 Jun 2015 07:50:20 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: <55915B2C.3060809@att.net> Noah, Additional information you should have is that the UTC is about to publish a new Public Review Issue on the topic of an extended mechanism for the representation of more flag emoji with sequences of tag characters. (Note: *not* representation as encoded single character symbols.) That PRI, when it is available (should be quite soon -- early this week), will be explicitly addressing concerns about state, regional, and international flags. I don't think it will explicitly address "or otherwise", but additional flag emoji that don't happen to be covered by the regional and sub-regional tag mechanisms in the PRI would certainly be in scope for discussion and feedback on the PRI. Other short notes on comments in this long thread: 1. The claim that Twitter is including a RAINBOW FLAG would be taken into consideration by the Emoji Subcommittee. Compatibility with existing systems in wide use is a strong factor in favor of additions: http://www.unicode.org/reports/tr51/#Selection_Factors_Compatibility 2. But on the other hand the offhand note: "When I mentioned my email to a queer friend, they asked if I might propose other pride flags (*as there are many*)." (emphasis added) illustrates the fundamental problem here. There is no effective end to the "or otherwise" case for flags as symbols, and that is why they are "generally not amenable to representation by encoded characters". Any simple image search for "pride flag" or "pride flag list" illustrates the problem amply: https://s-media-cache-ak0.pinimg.com/236x/69/83/f3/6983f3b9a4f68468bb101383006aa565.jpg https://s-media-cache-ak0.pinimg.com/236x/61/88/95/618895059533cb5b52c55cecd641881d.jpg That is not the realm of *characters* -- it is the realm of graphic design of flags, emblems, and frankly, at this point, heraldry. ;-) So, to sum up, I suggest that this thread about the RAINBOW FLAG be directed to the soon-to-be-posted Public Review Issue about extending the generative mechanisms for representing emoji symbols for flags, but that that feedback carefully consider how such an addition would coexist with other mechanisms for extensions of flag representation *and* how it could be reasonably limited to one instead of 28 (... or 500) more flags. --Ken P.S. 
While I do think there might be a strong case made for the RAINBOW FLAG to be added to the list of emoji flags representable by *some* kind of extension mechanism in Unicode, there really, really is no end to the "or otherwise" case. I happen to live in the city of Oakland, California. Try an image search on "Oakland flag". You start with a more-or-less official City flag, which kind of fits in the city as sub-region of region paradigm, and which can be spotted flying at the Oakland City Hall, but this quickly tails off into a gazillion variants, and various flags as sports memorabilia. I'm quite certain that an Oakland A's flag emoji would be locally quite popular if it were available on people's phones, for example. On 6/28/2015 3:36 PM, Noah Slater wrote: > > I really wish they'd provided a justification for this statement! :) I > guess that this is the right list for a UTC officer to give some sort > of feedback. > > On Sun, 28 Jun 2015 at 21:23 Doug Ewell > wrote: > > > Additionally, the domain of flags is > generally not amenable to representation by encoded characters, > and the > UTC does not wish to entertain further proposals for encoding of > symbol > characters for flags, whether national, state, regional, > international, > or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." > > The last clause is the relevant one here: "whether national, state, > regional, international, or otherwise." The words "or otherwise" could > be interpreted as saying that no *specific* flag of any kind will be > encoded in the future as a single character, partly because the domain > of flags is so open-ended. That would include flags associated with or > representing specific groups of individuals or social causes. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 29 09:58:00 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 29 Jun 2015 07:58:00 -0700 Subject: UDHR in Unicode: 400 translations in text form! Message-ID: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project > (http://unicode.org/udhr) has reached a notable milestone: we now have > 400 translations of the Universal Declaration of Human Rights in text > form. I'd like to congratulate Eric and his contributors for this achievement. It's a large and complex project, at least 9 years in the making. I use this data (with attribution) in my BCP 47 language-tagging application, to display Article I of the UDHR as sample text in the language denoted by a user-created tag. The extensive language coverage of the UDHR data and its correlation to BCP 47 tags via the XML index are especially helpful. With the complexity of this project, including trying to associate constructed languages with geographical locations and relying on third-party data that may be conflicting or simply wrong, there's bound to be room for improvement. I'm sure the issues reported on this list will be ironed out over time, and in the meantime I hope Eric's announcement encourages even more contributors and translations as well as bug reports. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
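One simple way to pick sample text for a user-created tag is longest-prefix matching over the BCP 47 subtags; the sketch below is purely illustrative and assumes a plain dictionary keyed by lowercased tags, not the actual structure of Doug's application or of the UDHR XML index:

    def lookup_sample(tag, samples):
        # Walk from the full tag down to its primary language subtag,
        # e.g. "pt-BR" falls back to "pt" if no regional entry exists.
        subtags = tag.lower().split("-")
        while subtags:
            key = "-".join(subtags)
            if key in samples:
                return samples[key]
            subtags.pop()
        return None

    samples = {"en": "All human beings are born free and equal in dignity and rights."}
    print(lookup_sample("en-GB", samples))  # falls back to the "en" entry

BCP 47 tags are case-insensitive, which is why both the lookup key and the dictionary keys are lowercased here.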
From rscook at wenlin.com Mon Jun 29 10:57:07 2015 From: rscook at wenlin.com (Richard Cook) Date: Mon, 29 Jun 2015 08:57:07 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <55915B2C.3060809@att.net> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: Ken, I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells me so ... T C UTF-8 Codepoint : Name : Annotations 1 ?? C2_A0 1F308 RAINBOW ... but could ?? also be a 'rainbow (flag)'? -Richard [? iMM (iPhone Mangled Message)] -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 29 11:38:00 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 29 Jun 2015 18:38:00 +0200 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> References: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> Message-ID: Absolutely; this takes a lot of work, and Eric has done a stellar job of managing the details. (I'm sure he also appreciates any and all of the feedback on items to fix!) Mark *? Il meglio ? l?inimico del bene ?* On Mon, Jun 29, 2015 at 4:58 PM, Doug Ewell wrote: > Eric Muller wrote: > > > I am pleased to announce that the UDHR in Unicode project > > (http://unicode.org/udhr) has reached a notable milestone: we now have > > 400 translations of the Universal Declaration of Human Rights in text > > form. > > I'd like to congratulate Eric and his contributors for this achievement. > It's a large and complex project, at least 9 years in the making. > > I use this data (with attribution) in my BCP 47 language-tagging > application, to display Article I of the UDHR as sample text in the > language denoted by a user-created tag. The extensive language coverage > of the UDHR data and its correlation to BCP 47 tags via the XML index > are especially helpful. > > With the complexity of this project, including trying to associate > constructed languages with geographical locations and relying on > third-party data that may be conflicting or simply wrong, there's bound > to be room for improvement. I'm sure the issues reported on this list > will be ironed out over time, and in the meantime I hope Eric's > announcement encourages even more contributors and translations as well > as bug reports. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Mon Jun 29 12:06:42 2015 From: nslater at tumbolia.org (Noah Slater) Date: Mon, 29 Jun 2015 17:06:42 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <55915B2C.3060809@att.net> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: Thanks for the reply, Ken! Comments inline. On Mon, 29 Jun 2015 at 15:50 Ken Whistler wrote: > There is no effective end to the "or otherwise" case for flags as symbols, > and that is why they are "generally not amenable to representation by > encoded characters". > Well. Arguably, Unicode represents food, and there is no effective end to the "or otherwise" case for food items either. (As I'm sure you're all aware of given the popularity of requests in this category.) 
As mentioned earlier in the thread, it seems to me that the Consortium has a rigorous (and notoriously hard to satisfy) process for guarding against such things. The rainbow flag is ubiquitous, so much so that it's even become a compat issue with existing communications platforms. The same is most likely not true for the less common flags. It seems to me that the correct thing to do here is to apply the existing process to this proposal (and any subsequent ones, should they occur). I similarly doubt that there is a particularly strong case for the Oakland flag, in accordance with Annex C. That is not the realm of *characters* -- it is the realm of graphic design > of > flags, emblems, and frankly, at this point, heraldry. ;-) > Well, you could say the same about all the emojis. Emojis blur the line between characters (in a typographical sense) and iconography. Again, I would simply point out that Annex C seems to be designed to handle exactly this domain of concerns. > So, to sum up, I suggest that this thread about the RAINBOW FLAG be > directed to the soon-to-be-posted Public Review Issue about extending > the generative mechanisms for representing emoji symbols for flags > How do we/I do that? I will restate that I think that if a RAINBOW FLAG emoji is added to Unicode, I expect wide use. And I am concerned that an alternate proposal would run the risk of not seeing wide use. (Though I have no actual experience here that informs that. I welcome feedback on the topic.) To reply to Richard: I mention this in my first email :) > While it can be argued that the RAINBOW emoji itself is usable as a stand-in (as above), it usually requires some sort of additional context to work. There is a clear need for a rainbow flag that unambiguously symbolises queer pride. -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Mon Jun 29 14:04:11 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 30 Jun 2015 03:04:11 +0800 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: 2015?6?30? ??1:13? "Noah Slater" wrote? > > Thanks for the reply, Ken! Comments inline. > > On Mon, 29 Jun 2015 at 15:50 Ken Whistler wrote: >> >> There is no effective end to the "or otherwise" case for flags as symbols, and that is why they are "generally not amenable to representation by encoded characters". > > > Well. Arguably, Unicode represents food, and there is no effective end to the "or otherwise" case for food items either. (As I'm sure you're all aware of given the popularity of requests in this category.) > > As mentioned earlier in the thread, it seems to me that the Consortium has a rigorous (and notoriously hard to satisfy) process for guarding against such things. The rainbow flag is ubiquitous, so much so that it's even become a compat issue with existing communications platforms. The same is most likely not true for the less common flags. > > It seems to me that the correct thing to do here is to apply the existing process to this proposal (and any subsequent ones, should they occur). I similarly doubt that there is a particularly strong case for the Oakland flag, in accordance with Annex C. > As an outsider, In my opinion, it is very common for people to write sentences like "?? Really sorry!" or "?? let's meet there tomorroe" or "The ?? 
is tasty" even before unicode's introduction of these characters, but I can't think of different usecases that the rainbow flag would be used in this way. >> That is not the realm of *characters* -- it is the realm of graphic design of >> flags, emblems, and frankly, at this point, heraldry. ;-) > > > Well, you could say the same about all the emojis. Emojis blur the line between characters (in a typographical sense) and iconography. Again, I would simply point out that Annex C seems to be designed to handle exactly this domain of concerns. > As i typed above. >> >> So, to sum up, I suggest that this thread about the RAINBOW FLAG be >> directed to the soon-to-be-posted Public Review Issue about extending >> the generative mechanisms for representing emoji symbols for flags > > > How do we/I do that? > > I will restate that I think that if a RAINBOW FLAG emoji is added to Unicode, I expect wide use. And I am concerned that an alternate proposal would run the risk of not seeing wide use. (Though I have no actual experience here that informs that. I welcome feedback on the topic.) > As long as an incorporated solution is made like how those US or UK flag currently is presented in unicode emoji, I don't think different mechanism would matter too much as you see people using them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 29 14:14:42 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 29 Jun 2015 21:14:42 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: The way I see U+1F308 drawn in my browser (using the image linked from Google in the HTML below), is that it represents a rainbow sat on two clouds (not evident at small sizes to se that these are clouds as they just look like blue open curves) This is also strange because rainbows are normally not *above* clouds, but below them (or partly within them near their surface, is they are not too dense) Anyway these clouds and the sky around it are certainly not wanted on the flag itself, but are appropriate for the meteoric object in the sky. 2015-06-29 17:57 GMT+02:00 Richard Cook : > Ken, > > I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells > me so ... > > TCUTF-8Codepoint : Name : Annotations1[image: ??]C2_A01F308 RAINBOW > > > > ... but could [image: ??] also be a 'rainbow (flag)'? > > -Richard > > > [? iMM (iPhone Mangled Message)] > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From nslater at tumbolia.org Mon Jun 29 19:53:30 2015 From: nslater at tumbolia.org (Noah Slater) Date: Tue, 30 Jun 2015 01:53:30 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: On 29 June 2015 at 20:04, gfb hjjhjh wrote: > > As an outsider, In my opinion, it is very common for people to write > sentences like "[image: ??] Really sorry!" or "[image: ??] let's meet > there tomorroe" or "The [image: ??] is tasty" even before unicode's > introduction of these characters, but I can't think of different usecases > that the rainbow flag would be used in this way. 
> Do you mean, people were using these emojis on a platform that supported them before Unicode standardised them? To that I would respond that you only need to search Twitter for people using the #pride hashtag to see how it's used there. Unfortunately, as Slack is a private communication platform, it is hard to get usage examples. All we can state for sure is that people are using the rainbow flag in running text like any other emoji [image: ??] > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f354.png Type: image/png Size: 3280 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f44d.png Type: image/png Size: 2557 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f44c.png Type: image/png Size: 2525 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f647.png Type: image/png Size: 1684 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Tue Jun 30 01:47:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 30 Jun 2015 07:47:46 +0100 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> Message-ID: <20150630074746.79ff7cf7@JRWUBU2> On Sat, 27 Jun 2015 17:48:41 +0200 (CEST) Marcel Schneider wrote: > On Fri, Jun 26, Richard Wordingham wrote: > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote: >>> Still in French, the letter apostrophe, when used as current >>> apostrophe, prevents the following word from being identified as a >>> word because of the missing word boundary and, subsequently, >>> prevents the autoexpand from working. This can be fixed by adding >>> a word joiner after the apostrophe, thanks to an autocorrect entry >>> that replaces U+02BC inserted by default in typographic mode, with >>> U+02BC U+2060. >> No, this doesn't work. While the primary purpose of U+2060 is to >> prevent line breaks, it is also used to overrule word boundary >> detectors in scriptio continua. (It works quite well for >> spell-checking Thai in LibreOffice). It's name implies to me that it >> is intended to prevent a word boundary being deduced, through the >> strong correlation between word boundaries and line break opportunities. >> There doesn't seem to be a code for 'zero-width word boundary at >> which lines should not normally be broken'. > Well, I extrapolated from U+FEFF, which works fine for me, even in > this particular context. Does the tool misinterpret U+FEFF between Thai characters as a word boundary? Incidentally, which tool are you talking of? Richard. From charupdate at orange.fr Tue Jun 30 04:02:18 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 30 Jun 2015 11:02:18 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18> On Sun, Jun 28, 2015, Peter Constable wrote: > Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. On my netbook, which is running Windows 7 Starter, U+2060 is not a part of any of the shipped fonts. 
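For context on the font question in this subthread: U+2060 and U+FEFF are format controls and default-ignorable code points, so rendering systems are expected to make them invisible rather than require a glyph, and glyph coverage in shipped fonts is not by itself a test of support. A quick check with Python's standard unicodedata module shows the relevant properties:

    import unicodedata

    # U+2060 and U+FEFF report General_Category=Cf (format controls), so no
    # visible glyph is needed; U+00A0 and U+202F are real spaces (Zs) that do
    # need font support.
    for cp in (0x2060, 0xFEFF, 0x00A0, 0x202F):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.name(ch)} -> {unicodedata.category(ch)}")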
Arial Unicode MS?does not contain U+2060 because this is posterior to Unicode 2.0 (WJ has been encoded in 3.2), and unfortunately Arial Unicode MS despite of being one of the completest fonts worlwide, seems not to have been updated since its release based on Unicode 2.0. Consolas which is so complete it contains even U+202F while lt's a monospaced font, and which does contain also U+02BC MODIFIER LETTER APOSTROHPE, has no U+2060, but well U+FEFF. Knowing that whenever an unavailable character occurs, Windows searches for it in all fonts that are installed on the machine, I deduced that U+2060 is not a part of Windows 7 Starter and, by legitimate extrapolation, of Windows 7 on the whole. In any case, when your computer is a netbook, you couldn't choose to get another Windows version since Windows Starter is designed for netbooks. I know that other operating systems, say other Windows versions, are shipped with a significantly bigger number of fonts, but I won't program a keyboard layout which cannot work on every machine running any Windows version from 7 upwards. I guess I've been suspected still to blame Microsoft even when there's no reason, so I underscore that I do not have the least need of U+2060, because for word processing purposes, U+FEFF works very well for me and surely for everybody who is using Windows. I?add this precision because outside I use another OS which seemingly does not support U+02BC, by not having this character in any current font. Additionally I?mention that I've read that my netbook does not run well under the other OS, so I've little temptation of using other than Windows. There might be however a reason to prefer U+2060 over U+FEFF, which I?cannot test. The issue is the following: Further tests showed that U+FEFF is an unstable character, even more unstable than U+00A0 which at least is replaced with something (U+0020) when formatted text is converted to plain text, while U+FEFF simply disappears. This phenomenon is observed as well inside a word processor as between this and a text editor (whether the file format be UTF-8 or Unicode). If you wish to reproduce the tests, you may need the information that I used Microsoft Word Starter 2010 and Windows NotePad. Indeed I believe that we are in front of a widespread general misfunctioning. U+00A0 is currently used in French as a punctuation space (by that I mean, current word processors add U+00A0 before?????!?;?:?and?after??. [I know that the Unicode Punctuation Space is U+2008, that this is not designed for use with French punctuations, that U+202F is preferred with punctuations, that U+202F is not present in all fonts, therefore word processors cannot insert it by default, therefore U+00A0 stays in use and readers are accustomed to it.] When such text files with plenty of U+00A0, turning around between processes, end up to be converted to plain text, they become unusable. I mean that before using them, all instances where U+00A0 had be replaced with U+0020, must be corrected, whether by replacing U+0020 with the preferred U+202F, or with U+00A0 again (e.g. inside of names). Well, U+FEFF is roughly the same thing, it must be readded, which may prove much harder to achieve. In my tests, even if not recognized, U+2060 proved to be stable, but I?wonder what would be its fate if the system knew i'is "just" a word joiner. Regards, Marcel Schneider ? > Message du 28/06/15 01:38 > De : "Peter Constable" > A : "Petr Tomasek" , "Marcel Schneider" > Copie ? 
: "Unicode Mailing List" > Objet : RE: WORD JOINER vs ZWNBSP > > Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. > > > Peter > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Petr Tomasek > Sent: Friday, June 26, 2015 4:48 PM > To: Marcel Schneider > Cc: Unicode Mailing List > Subject: Re: WORD JOINER vs ZWNBSP > > On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. > > Therefore you should complain by Microsoft, not here. > > > Supposing that Microsoft choose not to implement U+2060?WJ > > Then you should probably choose another operating system which does... > > Petr Tomasek [If you read this, please refer to my reply at: http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0216.html ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Jun 30 04:25:43 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 30 Jun 2015 11:25:43 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150630074746.79ff7cf7@JRWUBU2> References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> <20150630074746.79ff7cf7@JRWUBU2> Message-ID: <1430770470.10024.1435656344025.JavaMail.www@wwinf1m18> On Mon, Jun 30, 2015, Richard Wordingham wrote: > On Sat, 27 Jun 2015 17:48:41 +0200 (CEST) > Marcel Schneider wrote: > > > On Fri, Jun 26, Richard Wordingham wrote: > > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote: > > >>> Still in French, the letter apostrophe, when used as current > >>> apostrophe, prevents the following word from being identified as a > >>> word because of the missing word boundary and, subsequently, > >>> prevents the autoexpand from working. This can be fixed by adding > >>> a word joiner after the apostrophe, thanks to an autocorrect entry > >>> that replaces U+02BC inserted by default in typographic mode, with > >>> U+02BC U+2060. > > >> No, this doesn't work. While the primary purpose of U+2060 is to > >> prevent line breaks, it is also used to overrule word boundary > >> detectors in scriptio continua. (It works quite well for > >> spell-checking Thai in LibreOffice). It's name implies to me that it > >> is intended to prevent a word boundary being deduced, through the > >> strong correlation between word boundaries and line break opportunities. > >> There doesn't seem to be a code for 'zero-width word boundary at > >> which lines should not normally be broken'. > > > Well, I extrapolated from U+FEFF, which works fine for me, even in > > this particular context. > > Does the tool misinterpret U+FEFF between Thai characters as a word > boundary? Incidentally, which tool are you talking of? I tested on Microsoft Word 2010 Starter running on Windows 7 Starter, on a netbook. This software being based on the full versions, the interpretation of U+FEFF must be the standard behavior. I?tested in Latin script. You may wish to redo the tests, so please open a new document, input two words, replace the blank with whatever character the word boundaries behavior is to be checked of, and search for one of the two words with the 'whole word' option enabled. 
If the result is none, the test character indicates the absence of word boundaries; if there is a result, the test character indicates the presence of word boundaries. > >> No, this doesn't work. Right. The letter apostrophe cannot trigger the autocorrect for itself. I must keep U+0027 in the forefront, and get it replaced with U+02BC U+FEFF to keep the autocorrect/autoexpand working for what follows. Or even better, with U+FEFF U+02BC U+FEFF to clarify word boundaries. When there is no autoexpand, we?ll input the apostrophe as U+0027 and the single quotes as U+2018, U+2019, then replace all U+0027 with U+02BC. In the Windows Notepad that works, because the close-quote is presumably not in the equivalence class for the straight apostrophe, so it replaces the U+0027s with U+02BC and lets the U+2019s alone. Given the instability of U+FEFF but also of U+00A0, as I wrote to Peter Constable a few moments ago, it seems as if we were unfortunately reaching the limits of text encoding. The purpose of the encoding design was, if I?m well informed, to get readible text files, and to allow users to mark them up for local printing or PDF conversion. Other usages must have been let out of scope, because today, you cannot exchange and process plain text files as one may wish. As soon as you must use plain text as a raw material for publishing, as you must convert British English quotation marks to US?English quotation marks, as you must do searches including single quotes, as you must input text (especially with leading apostrophes) on keyboards with legacy drivers, and perhaps a few things more, there seems to be no other solution than to use workarounds, hand-process, look up and correct or convert the instances one by one. The nice thing about this is that you become a craftsman again, that you get in touch with text, and you may feel like a linotypist or a lead typesetter who takes care of every detail. As a result, the professions of corrector, typesetter, typographer shall not disappear (as it was feared), and good craftmanship will stay thriving. Another side effect is that the need of hand-processing text files lowers the appeal of copying other peoples? work. It?s even harder when copying text from a PDF file. Sometimes you get whole paragraphs in ready-to-use plain text (let aside the NBSPs), and sometimes (e.g. from TUS) it?s all in small pieces and you need to delete a lot of undue line breaks, as well as to text-transform the character identifiers because their uppercasing was just small caps formatting. Finally you may prefer to provide links to the content, but unfortunately there seems to be no way to copy bookmarks?so that you need to browse the contents and be likely to learn much more by the way. If all this was the goal, let?s say it loud. Then this was a good idea. Very good. Regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... 
From charupdate at orange.fr  Tue Jun 30 04:39:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 30 Jun 2015 11:39:26 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
	<20150626110243.GB18139@ebed.etf.cuni.cz>
	<2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
Message-ID: <1517349131.10525.1435657166603.JavaMail.www@wwinf1m18>

A quarter of an hour ago I wrote:

> I add this precision because outside I use another OS which seemingly does
> not support U+02BC, by not having this character in any current font.

Sorry, U+02BC is in the fonts, but not in the Special Characters dialog I
opened to look it up.

Marcel Schneider

From gwalla at gmail.com  Tue Jun 30 11:11:06 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 30 Jun 2015 09:11:06 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On Mon, Jun 29, 2015 at 8:57 AM, Richard Cook wrote:

> Ken,
>
> I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells
> me so ...
>
>     TC    UTF-8    Codepoint : Name : Annotations
>     🌈    ...      1F308 RAINBOW
>
> ... but could 🌈 also be a 'rainbow (flag)'?
>
> -Richard
>
> [? iMM (iPhone Mangled Message)]

I don't think display of U+1F308 as a rainbow flag would be expected
behavior. It risks turning a text like "It's a beautiful day! 🌈" into a
political statement.

From rscook at wenlin.com  Tue Jun 30 11:42:32 2015
From: rscook at wenlin.com (Richard Cook)
Date: Tue, 30 Jun 2015 09:42:32 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

> On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
>
> I don't think display of U+1F308 as a rainbow flag would be expected
> behavior. It risks turning a text like "It's a beautiful day! 🌈" into a
> political statement.

Garth,

Any statement can be a political statement, in the right context. But I
think the main point of my earlier comment was that the specific glyph for
U+1F308 might be indistinguishable from a flag. For example, this is the
glyph in iOS 8:

[attached image: image1.PNG]

Not a cloud in the sky.

?

From nslater at tumbolia.org  Tue Jun 30 12:35:48 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Tue, 30 Jun 2015 17:35:48 +0000
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

That same glyph turns up as this, for me: http://i.imgur.com/3XQ96SA.png

Which is part of the problem. "Rainbow" could mean anything. That the Apple
version happens to look a bit like a flag (a weird square flag that looks
almost nothing like the queer pride flag) is largely immaterial.
On Tue, 30 Jun 2015 at 17:47 Richard Cook wrote:

> > On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
> >
> > I don't think display of U+1F308 as a rainbow flag would be expected
> > behavior. It risks turning a text like "It's a beautiful day! 🌈" into
> > a political statement.
>
> Garth,
>
> Any statement can be a political statement, in the right context. But I
> think the main point of my earlier comment was that the specific glyph
> for U+1F308 might be indistinguishable from a flag. For example, this is
> the glyph in iOS 8:
>
> [image: image1.PNG]
>
> Not a cloud in the sky.
>
> ?

From gwalla at gmail.com  Tue Jun 30 13:38:10 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 30 Jun 2015 11:38:10 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On Tue, Jun 30, 2015 at 9:42 AM, Richard Cook wrote:

> > On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
> >
> > I don't think display of U+1F308 as a rainbow flag would be expected
> > behavior. It risks turning a text like "It's a beautiful day! 🌈" into
> > a political statement.
>
> Garth,
>
> Any statement can be a political statement, in the right context. But I
> think the main point of my earlier comment was that the specific glyph
> for U+1F308 might be indistinguishable from a flag. For example, this is
> the glyph in iOS 8:

Any statement can be political in the right context, sure, but having a
political message added to your own statements without your knowledge is
usually not appreciated.

> [image: image1.PNG]
>
> Not a cloud in the sky.

It also doesn't look like any version of the gay pride flag that I've seen.

From khaledhosny at eglug.org  Tue Jun 30 14:41:31 2015
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Tue, 30 Jun 2015 21:41:31 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
	<20150626110243.GB18139@ebed.etf.cuni.cz>
	<2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
Message-ID: <20150630194129.GA16879@khaled-laptop>

On Tue, Jun 30, 2015 at 11:02:18AM +0200, Marcel Schneider wrote:
> On Sun, Jun 28, 2015, Peter Constable wrote:
>
> > Marcel: Can you please clarify in what way Windows 7 is not supporting
> > U+2060.
>
> On my netbook, which is running Windows 7 Starter, U+2060 is not a
> part of any of the shipped fonts.

It is a control character, it does not need to have a glyph in the font to
be properly supported.
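Khaled's point can be checked against the character properties themselves: WORD JOINER and ZWNBSP are format characters (General_Category Cf), which a layout engine is expected to render as invisible and zero-width whether or not the selected font has a glyph for them, whereas NO-BREAK SPACE is an ordinary spacing character that does need one. A small sketch with Python's standard unicodedata module, purely as an illustration of the properties involved, not of any particular renderer:

    import unicodedata

    for cp in (0x2060, 0xFEFF, 0x00A0):
        ch = chr(cp)
        print("U+%04X  %-26s  category=%s"
              % (cp, unicodedata.name(ch), unicodedata.category(ch)))
    # U+2060  WORD JOINER                 category=Cf
    # U+FEFF  ZERO WIDTH NO-BREAK SPACE   category=Cf
    # U+00A0  NO-BREAK SPACE              category=Zs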
From c933103 at gmail.com  Tue Jun 30 15:04:48 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Wed, 1 Jul 2015 04:04:48 +0800
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On 30 Jun 2015 at 8:53 pm, "Noah Slater" wrote:
>
> On 29 June 2015 at 20:04, gfb hjjhjh wrote:
>>
>> As an outsider, in my opinion, it is very common for people to write
>> sentences like "[emoji] Really sorry!" or "[emoji] let's meet there
>> tomorrow" or "The [emoji] is tasty" even before Unicode's introduction
>> of these characters, but I can't think of use cases where the rainbow
>> flag would be used in this way.
>
> Do you mean, people were using these emojis on a platform that supported
> them before Unicode standardised them?
>
> To that I would respond that you only need to search Twitter for people
> using the #pride hashtag to see how it's used there. Unfortunately, as
> Slack is a private communication platform, it is hard to get usage
> examples. All we can state for sure is that people are using the rainbow
> flag in running text like any other emoji.

Can you attach some screenshots (probably with names removed and no
sensitive/private info) as examples?

From nslater at tumbolia.org  Tue Jun 30 16:18:54 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Tue, 30 Jun 2015 21:18:54 +0000
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

I already did, in my original mail!

On Tue, 30 Jun 2015 at 21:12 gfb hjjhjh wrote:

> On 30 Jun 2015 at 8:53 pm, "Noah Slater" wrote:
> >
> > On 29 June 2015 at 20:04, gfb hjjhjh wrote:
> >>
> >> As an outsider, in my opinion, it is very common for people to write
> >> sentences like "[emoji] Really sorry!" or "[emoji] let's meet there
> >> tomorrow" or "The [emoji] is tasty" even before Unicode's introduction
> >> of these characters, but I can't think of use cases where the rainbow
> >> flag would be used in this way.
> >
> > Do you mean, people were using these emojis on a platform that
> > supported them before Unicode standardised them?
> >
> > To that I would respond that you only need to search Twitter for people
> > using the #pride hashtag to see how it's used there. Unfortunately, as
> > Slack is a private communication platform, it is hard to get usage
> > examples. All we can state for sure is that people are using the
> > rainbow flag in running text like any other emoji.
>
> Can you attach some screenshots (probably with names removed and no
> sensitive/private info) as examples?

From doug at ewellic.org  Tue Jun 30 16:28:26 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 30 Jun 2015 14:28:26 -0700
Subject: WORD JOINER vs ZWNBSP
Message-ID: <20150630142826.665a7a7059d7ee80bb4d670165c8327d.c8a619afc7.wbe@email03.secureserver.net>

Khaled Hosny wrote:

>> On my netbook, which is running Windows 7 Starter, U+2060 is not a
>> part of any of the shipped fonts.
>
> It is a control character, it does not need to have a glyph in the
> font to be properly supported.

The problem is the word "supported." Marcel is seeing a visible glyph (a
.notdef box) for what is supposed to be an invisible, zero-width character,
and that is leading him to conclude that Windows doesn't "support" this
character.

On my Win 7 machine at work, when I enter the string "one\u2060two" (a
WORD JOINER between the two words) and click on either word, both words
are selected. That is exactly what I would expect WJ to do. This works on
the built-in Notepad as well as Notepad++ and BabelPad (but not on
GoDaddy's Web-based email client).
But out of more than 500 fonts on that machine, the only stock Microsoft
fonts that show WJ with zero width, instead of a .notdef glyph, are
Javanese Text, Myanmar Text, and Segoe UI Symbol. So while it's inaccurate
to extrapolate this to "Microsoft doesn't support WJ," the font support is
definitely lacking.

The bit about characters being converted to other characters, of course,
has nothing to do with Windows and everything to do with particular
applications.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From richard.wordingham at ntlworld.com  Tue Jun 30 16:33:05 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 30 Jun 2015 22:33:05 +0100
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1430770470.10024.1435656344025.JavaMail.www@wwinf1m18>
References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
	<20150630074746.79ff7cf7@JRWUBU2>
	<1430770470.10024.1435656344025.JavaMail.www@wwinf1m18>
Message-ID: <20150630223305.67b8da0f@JRWUBU2>

On Tue, 30 Jun 2015 11:25:43 +0200 (CEST) Marcel Schneider wrote:

> At some time in June 2015, Richard Wordingham wrote:

> I tested on Microsoft Word 2010 Starter running on Windows 7 Starter,
> on a netbook. Since this software is based on the full versions, its
> interpretation of U+FEFF should be the standard behavior. I tested in
> Latin script. You may wish to redo the tests: open a new document,
> input two words, replace the space between them with whatever character
> is to be checked for word-boundary behavior, and search for one of the
> two words with the 'whole word' option enabled. If the search finds
> nothing, the test character does not produce a word boundary; if it
> finds the word, the test character does produce one.

I did my own tests in Word 2010 with Windows 7. Although U+FEFF and U+2060
displayed differently when I enabled the display of 'non-printing'
characters (spaces, inactive soft hyphens, non-breaking hyphens, paragraph
ends etc.), they behaved the same when embedded in French l'eau and Thai ??
- they changed each word into two words, as detected by Ctrl+right-arrow.
However, this is wrong.

>> No, this doesn't work.

Clarification: It doesn't work in correct software. Correct software would
have treated the modified words as single words.

Richard.
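For comparison with the Word behaviour Richard describes, a UAX #29 word segmenter is expected to ignore both characters (their Word_Break property value is Format) and keep the modified words whole. Below is a sketch using the PyICU bindings to ICU's BreakIterator; PyICU availability and the exact wrapper surface are assumptions here, so treat this as an outline of the check rather than tested code:

    from icu import BreakIterator, Locale  # PyICU

    def word_segments(text):
        # UAX #29 word segmentation as implemented by ICU.
        bi = BreakIterator.createWordInstance(Locale("en_US"))
        bi.setText(text)
        bounds = [0]
        while True:
            b = bi.following(bounds[-1])  # next boundary after the last one
            if b == -1:                   # BreakIterator.DONE
                break
            bounds.append(b)
        return [text[i:j] for i, j in zip(bounds, bounds[1:])]

    print(word_segments("one two"))       # ['one', ' ', 'two']
    print(word_segments("one\u2060two"))  # expected: ['one\u2060two'] - no break at WJ
    print(word_segments("one\ufefftwo"))  # expected: ['one\ufefftwo'] - no break at ZWNBSP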
From doug at ewellic.org  Tue Jun 30 16:57:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 30 Jun 2015 14:57:19 -0700
Subject: Representing Additional Types of Flags
Message-ID: <20150630145719.665a7a7059d7ee80bb4d670165c8327d.06f042790e.wbe@email03.secureserver.net>

Re-posting my comments and questions on this PRI to the list. I've already
submitted them as formal feedback.

.

I support this proposal. I have the following questions:

1. The existing RIS-based flag mechanism is based on ISO 3166-1 (TUS 7.0,
§22.10). In this proposal, "valid" tag sequences would instead be
determined by CLDR data and LDML specification. Is there any precedent for
CLDR to define the validity of Unicode character sequences?

2. What is the policy on generating flag tags with deprecated
unicode_region_subtag or unicode_subdivision_subtag values, such as
"[flag]UK"? How "discouraged" would such a tag be? Should tools allow users
to create such a tag?

3. The subdivisions.xml file contains a "subtype" hierarchy, reflecting the
"parent subdivision" relationship in ISO 3166-2. So region 'FR' contains
subdivision 'J' (Île-de-France), which itself contains subdivision '75'
(Paris). Is there any significance to the "subtype" hierarchy as far as
flag tags are concerned, or are "[flag]FRJ" and "[flag]FR75" equally valid?

4. The entry for "001" in subdivisions.xml contains each of the two-letter
codes for regions (countries) that have their own subdivisions. This is
less than the set of all regions; for example, Anguilla (AI) does not have
ISO 3166-2 subdivisions and so is not listed. This implies that a tag like
"[flag]001US" is valid (and equivalent to "US" spelled with RIS, which is
preferred) but "[flag]001AI" is not valid. Is this intended? If not, can it
be clarified?

5. Will any preliminary examples of CLDR 4-character subdivision codes be
made available before any such codes are actually assigned?

.

The PRI #299 mechanism is clearly and intentionally oriented toward
representing flags of well-defined geopolitical entities. Any proposal to
extend the mechanism to cover the many other types of flags -- for
historical regions, NGOs, maritime, sports, or social or political causes
-- must be systematic and well-planned, not ad-hoc or haphazard, to assure
interoperability and extensibility.

The documentation for the PRI #299 mechanism should state clearly that
(e.g.) the Confederate battle flag, the Olympic flag, the Esperanto flag,
the LGBT rainbow flag, and the naval flags used to spell out "ENGLAND
EXPECTS" can be represented only via a proper extension to the mechanism,
not by ad-hoc means such as the use of unassigned or private-use
combinations. This is at least as important as ensuring the stable coding
of geopolitical flags.

--
Doug Ewell | http://ewellic.org | Thornton, CO
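The two mechanisms Doug compares can be spelled out concretely. Below is a short, purely illustrative Python sketch: the first function builds an existing regional-indicator flag, the second builds a tag-character sequence of the kind PRI #299 proposes, with the lowercased code spelled in TAG characters and terminated by CANCEL TAG. Doug writes the base abstractly as "[flag]"; the sketch uses U+1F3F4 WAVING BLACK FLAG, the base that published emoji tag sequences eventually settled on, but that particular choice is incidental here, and code validity would in any case be governed by CLDR data as the PRI describes:

    # Regional indicator flags (TUS 7.0, section 22.10): each ASCII letter of a
    # two-letter region code maps onto U+1F1E6..U+1F1FF.
    def ris_flag(region):
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region.upper())

    # Tag-sequence flags: a base character, then TAG characters (U+E0020..U+E007E,
    # i.e. ASCII + 0xE0000) spelling the lowercased code, then U+E007F CANCEL TAG.
    def tag_flag(code, base="\U0001F3F4"):
        return base + "".join(chr(0xE0000 + ord(c)) for c in code.lower()) + "\U000E007F"

    print(" ".join("U+%04X" % ord(c) for c in ris_flag("US")))
    # U+1F1FA U+1F1F8  (the existing RIS flag for the United States)
    print(" ".join("U+%04X" % ord(c) for c in tag_flag("gbsct")))
    # U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F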