From A.Schappo at lboro.ac.uk Mon Jun 1 06:29:46 2015
From: A.Schappo at lboro.ac.uk (Andre Schappo)
Date: Mon, 1 Jun 2015 11:29:46 +0000
Subject: Some questions about Unicode's CJK Unified Ideograph
In-Reply-To: References: Message-ID: <5685F2CF-041E-4B67-ACF8-CD8CDEE79F21@lboro.ac.uk>

On 30 May 2015, at 01:20, gfb hjjhjh wrote:

2. Are combining characters like U+20DD intended to work with all different types of characters, or is this a problem related to implementation? When I write ゆ⃝ (Japanese Hiragana Letter Yu + Combining Enclosing Circle) the two appear separate in most fonts I use, but if I change the Hiragana Yu to a conventional = sign or some Latin character, most fonts are at least somewhat able to put them together. Or is there any better/alternative representation in Unicode that can show Japanese Hiragana Yu in a circle?

Japanese Hiragana Letter Yu + Combining Enclosing Circle works fine for me using TextEdit on OS X.

André Schappo
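For concreteness, the sequence under discussion is simply the base letter followed by the combining mark; whether the circle actually encloses the base glyph is entirely up to the font and renderer. A minimal Python sketch (illustrative only):

    # HIRAGANA LETTER YU (U+3086) followed by COMBINING ENCLOSING CIRCLE (U+20DD)
    yu_circled = "\u3086\u20DD"
    # The same combining mark on an ASCII base, which, as noted above,
    # more fonts manage to enclose.
    eq_circled = "=\u20DD"
    print(yu_circled, eq_circled)

As for a precomposed alternative: U+32F4 CIRCLED KATAKANA YU exists for the katakana form, but to my knowledge there is no precomposed circled hiragana yu, so the combining-mark sequence (or a higher-level mechanism) is the only plain-text representation.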
From jsbien at mimuw.edu.pl Mon Jun 1 06:49:48 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Mon, 01 Jun 2015 13:49:48 +0200
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: (David Starner's message of "Mon, 01 Jun 2015 01:29:27 +0000")
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl>
Message-ID: <864mmrejhf.fsf_-_@mimuw.edu.pl>

On Mon, Jun 01 2015 at 3:29 CEST, prosfilaes at gmail.com writes:

> On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:
>> The proposal makes me curious about past and present Unicode policy, e.g. would it be accepted if submitted now.
>
> Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf

The document's author states:

Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).

Is this true?

On the other hand, according to Wikipedia

http://en.wikipedia.org/wiki/Saanich_dialect

in 2014 there were "about 5" native speakers of the language.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/

From prosfilaes at gmail.com Mon Jun 1 07:05:34 2015
From: prosfilaes at gmail.com (David Starner)
Date: Mon, 01 Jun 2015 12:05:34 +0000
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: <864mmrejhf.fsf_-_@mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl> <864mmrejhf.fsf_-_@mimuw.edu.pl>
Message-ID:

On Mon, Jun 1, 2015 at 4:49 AM Janusz S. Bień wrote:

> The document's author states:
>
> Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).
>
> Is this true?

As far as I know it's still true. Overlay diacritics don't work well, so they're pretty much ignored in encoding new characters.

> On the other hand, according to Wikipedia
>
> http://en.wikipedia.org/wiki/Saanich_dialect
>
> in 2014 there were "about 5" native speakers of the language.

It's what you get when you stock the committee who chooses what characters to encode with linguists. In the most general case, there is text in that language, and someone will want to digitize it.

From eik at iki.fi Mon Jun 1 12:07:24 2015
From: eik at iki.fi (Erkki I Kolehmainen)
Date: Mon, 1 Jun 2015 20:07:24 +0300
Subject: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)
In-Reply-To: <864mmrejhf.fsf_-_@mimuw.edu.pl>
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <20150531200549.65196yuxvuorqyrh@mail.mimuw.edu.pl> <864mmrejhf.fsf_-_@mimuw.edu.pl>
Message-ID: <001501d09c8d$6ea3ba50$4beb2ef0$@fi>

Please note that overlaid diacritics are not used in decomposition of characters in the Unicode Standard, unless they are used for the indication of negation of mathematical rules (see TUS 7.0, sections 7.9 Combining Marks and 2.12 Equivalent Sequences).

Sincerely, Erkki I. Kolehmainen

-----Alkuperäinen viesti-----
Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Janusz S. "Bien"
Lähetetty: 1. kesäkuuta 2015 14:50
Vastaanottaja: David Starner
Kopio: unicode at unicode.org
Aihe: Sencoten and Unicode policy (was: the usage of LATIN SMALL LETTER A WITH STROKE)

On Mon, Jun 01 2015 at 3:29 CEST, prosfilaes at gmail.com writes:

> On Sun, May 31, 2015 at 11:09 AM Janusz S. Bien wrote:
>> The proposal makes me curious about past and present Unicode policy, e.g. would it be accepted if submitted now.
>
> Why wouldn't it? Unicode has, if anything, seemed to become more flexible about adding characters that see any sort of use.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf

The document's author states:

Although they could be made up of Letter + overlay diacritic, it is my understanding that the Unicode Consortium would prefer to create unique code points for these types of letters (e.g. recent acceptance of LATIN LETTER SMALL C WITH STROKE).

Is this true?

On the other hand, according to Wikipedia

http://en.wikipedia.org/wiki/Saanich_dialect

in 2014 there were "about 5" native speakers of the language.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
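As a quick illustration of that distinction (a minimal sketch using Python's unicodedata module): letters encoded with overlaid strokes carry no canonical decomposition, whereas letters with above or below diacritics decompose as usual.

    import unicodedata

    # A letter with an overlaid stroke has no canonical decomposition...
    print(unicodedata.decomposition("\u023C"))    # LATIN SMALL LETTER C WITH STROKE -> ''
    # ...whereas a letter with an above diacritic decomposes normally.
    print(unicodedata.decomposition("\u00E9"))    # LATIN SMALL LETTER E WITH ACUTE -> '0065 0301'
    print(unicodedata.normalize("NFD", "\u00E9")) # 'e' + U+0301
    print(unicodedata.normalize("NFD", "\u023C")) # unchanged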
From public at khwilliamson.com Mon Jun 1 13:23:20 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 01 Jun 2015 12:23:20 -0600
Subject: The Oral History Of The Poop Emoji
Message-ID: <556CA318.5060705@khwilliamson.com>

https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

From mark at macchiato.com Mon Jun 1 13:57:44 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 1 Jun 2015 20:57:44 +0200
Subject: The Oral History Of The Poop Emoji
In-Reply-To: <556CA318.5060705@khwilliamson.com>
References: <556CA318.5060705@khwilliamson.com>
Message-ID:

One of many on http://unicode.org/press/emoji.html

Mark

*« Il meglio è l'inimico del bene »*

On Mon, Jun 1, 2015 at 8:23 PM, Karl Williamson wrote:
> https://www.fastcompany.com/3037803/the-oral-history-of-the-poop-emoji-or-how-google-brought-poop-to-america

From doug at ewellic.org Mon Jun 1 17:42:00 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 01 Jun 2015 15:42:00 -0700
Subject: The Oral History Of The Poop Emoji
Message-ID: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>

I agree with one of the commenters that certain words just should not be used together in headlines.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue Jun 2 00:07:58 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 2 Jun 2015 07:07:58 +0200
Subject: The Oral History Of The Poop Emoji
In-Reply-To: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>
References: <20150601154200.665a7a7059d7ee80bb4d670165c8327d.72c20e83b1.wbe@email03.secureserver.net>
Message-ID:

Article de "merde" ? (Not an insult: it is a genuine French word, appropriate to the subject.) Bon appétit ! (if you think about orality...)

2015-06-02 0:42 GMT+02:00 Doug Ewell :
> I agree with one of the commenters that certain words just should not be used together in headlines.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

From verdy_p at wanadoo.fr Tue Jun 2 01:01:25 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Tue, 2 Jun 2015 08:01:25 +0200
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>
References: <556AEAE6.2040203@ix.netcom.com> <1433075623556.38b645ad@Nodemailer> <556B2DAD.6050204@ix.netcom.com> <2FF69E18-C2E6-4EA2-89D6-323D416EF459@gmail.com>
Message-ID:

2015-06-01 1:33 GMT+02:00 Chris :
> Of course, anyone can invent a character set. The difficult bit is having a standard way of combining custom character sets. That's why a standard would be useful.
>
> And while stuff like this can, to some extent, be recognised by magic numbers and unique strings in headers, such things are unreliable. Just because example.net/mycharset/ appears near the start of a document doesn't necessarily mean it was meant to define a character set. Maybe it was a document discussing character sets.

That's not what I described.
I spoke about using a MIME-compatible private charset identifier, and how such private identifier can be made reasonnably unique by binding it to a domain name or URI. If you had read more carefully I also said that it was absolutely not necessary to dereference that URL: there are many XML schemas binding their namespaces to a URI which is itself not a webpage or to any downloadable DTD or XML schema or XML stylesheet. Google and Microsoft are using this a lot in lots of schemas (which are not described and documented at this URL if they are documented). The URI by itself is just an identifier, it becomes a webpage only when you use it in a web page with an href attribute to create an hyperlink, or to perform some query to a service returning some data. An identifier for a private charset does not need to perform any request to be usable by itself, we just have the identifier which is sufficient by itself. The URI can be also only a base URI for a collection of resources (whose URLs start by this base URI, with conventional extensions appended to get the character properties, or a font; but the best way is to embed this data in your document, in some header or footer, if your document using the private charset is not part of a collection of docs using the same private charset) In that case, you don't need a new UTF: UTF-8 remains usable and you can map your private charset to standard PUAs (and/or to "hacked" characters) according to the private charset needs. The charset indicated in your document (by some meta header) should be sufficient to avoid collisions with other private conventions, it will define the scope of your private charset as the document itself, which will then be interchangeable (and possibly mixable with other documents with some renumbering if there a collisions of assignments between two distinct private charsets: in the document header; add to the charset identifier the range of PUAs which is used, then with two documents colling on this range, you can reencode one automatically by creating a compound charset with subranges of PUAs remapped differently to other ranges). -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Tue Jun 2 04:01:01 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 2 Jun 2015 10:01:01 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> Message-ID: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows. ---- Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: 3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels 3h here local glyph 3 is being used 3z7r means this is local glyph 3 being defined, though not used, at the start of the document as 7 red pixels More than one local glyph could be defined at the start of the document, as desired. ---- This would mean that use of such a glyph within the document would be by just using the quite short base character followed by tag characters sequence using the h request. This would enable document editing to be easier to accomplish. 
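As a minimal sketch of how such a run-length sequence could be decoded (purely illustrative; the single-letter colour codes are only the ones proposed in this thread, not anything standardized), a reader for a sequence such as 7r5y-3b (7 red pixels, 5 yellow pixels, new line, 3 blue pixels, as in the original message quoted below) might look like this in Python:

    # Decodes the compact pixel syntax discussed in this thread:
    # "7r5y-3b" = 7 red pixels, 5 yellow pixels, new line, 3 blue pixels.
    COLOURS = {"k": "black", "r": "red", "y": "yellow",
               "g": "green", "b": "blue", "w": "white"}

    def decode(spec):
        rows, count = [[]], ""
        for ch in spec:
            if ch.isdigit():
                count += ch            # accumulate the pixel count
            elif ch == "-":
                rows.append([])        # next line request
            elif ch in COLOURS:
                rows[-1].extend([COLOURS[ch]] * int(count or "1"))
                count = ""             # a bare colour letter means one pixel
        return rows

    print(decode("7r5y-3b"))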
---- A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. ---- May I mention something that I forgot to mention earlier please? When only one pixel of a particular colour is being specified, it can be specified using just the code for the colour. For example, for 1 red pixel please use r on its own, there is no need to use 1r though 1r should be made to work just in case anyone does use that format. There was a time when I used to use the FORTH programming language and this format of first inputting the number then the operator is based on the way that the FORTH programming language works. William Overington 2 June 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 27/05/2015 - 17:26 (GMTST) To : unicode at unicode.org Subject : Tag characters and in-line graphics (from Tag characters) Tag characters and in-line graphics (from Tag characters) This document suggests a way to use the method of a base character together with tag characters to produce a graphic. The approach is theoretical and has not, at this time, been tried in practice. The application in mind is to enable the graphic for an emoji character to be included within a plain text stream, though there will hopefully be other applications. The base character could be either an existing character, such as U+1F5BC FRAME WITH PICTURE, or a new character as decided. Tests could be carried out using a Private Use Area character as the base character. The explanation here is intended to explain the suggested technique by examples, as a basis for discussion. In each example, please consider for each example that the characters listed are each the tag version of the character used here and that they all as a group follow one base character. The examples are deliberately short so as to explain the idea. A real use example might have around two hundred or so tag characters following the base character, maybe more, sometimes fewer. Examples of displays: Each example is left to right along the line then lines down the page from upper to lower. 7r means 7 pixels red 7r5y means 7 pixels red then 5 pixels yellow 7r5y-3b means 7 pixels red then 5 pixels yellow then next line then 3 pixels blue Examples of colours available: k black n brown r red o orange y yellow g green (0, 255, 0) b blue m magenta e grey w white c cyan p pink d dark grey i light grey (thus avoiding using lowercase l so as to avoid confusion with figure 1) f deeper green (foliage colour) (0, 128, 0) Next line request: - moves to the next line Local palette requests: 192R224G64B2s means store as local palette colour 2 the colour (R=192, G=224, B=64) 7,2u means 7 pixels using local palette colour 2 Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document: 3t7r means this is local glyph 3 being defined at its first use in the document as 7 red pixels 3h here local glyph 3 is being used The above is for bitmaps. It would be possible to use a similar technique to specify a vector glyph as used in fontmaking using on-curve and off-curve points specified as X, Y coordinates together with N for on-curve and F for off-curve. There would need to be a few other commands so as to specify places in the tag character stream where definition of a contour starts and so as to separate the definitions of the glyphs for a colour font and so on. 
This could be made OpenType compatible so that a received glyph could be added into a font. Please feel free to suggest improvements. One improvement could be as to how to build a Unicode code point into a picture so that a font could be transmitted. William Overington 27 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Tue Jun 2 04:40:18 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 2 Jun 2015 11:40:18 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: Once again no ! Unicode is a standard for encoding characters, not for encoding some syntaxic element of a glyph definition ! Your project is out of scope. You still want to reinvent the wheel. For creating syntax, define it within a language (which does not need new characters (you're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes: they are just like mathematic symbols). Programming languages and data languages (Javascript, XML, JOSN, HTML...) and their syntax are encoded themselves in plain text documents using standard characters) and don't need new characters, APL being an exception only because computers or keyboards were produced to facilitate the input (those that don't have such keyboards used specific editors or the APL runtime envitonment that offer an input method for entering programs in this APL input mode). Anf again you want the chicken before the egg: have you only ever read the encoding policy ? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. Even for currency symbols (which are an exception to the demonstrated use, only because once they are created they are extremely rapidly needed by lot of people, in fact most people of a region as large as a country, and many other countries that will reference or use it it). But even in this case, what is encoded is the character itself, not the glyph or new characters used to defined the glyph ! Can you stop proposing out of topic subjects like this on this list ? You are not speaking about Unicode or characters. Another list will be more appropriate. You help no one here because all you want is to change radically the goals of TUS. 2015-06-02 11:01 GMT+02:00 William_J_G Overington : > Perhaps the solution to at least some of the various issues that have been > discussed in this thread is to define a tag letter z as a code within the > local glyph memory requests, as follows. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Tue Jun 2 05:37:06 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 2 Jun 2015 11:37:06 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> Responding to Philippe Verdy: > Nothing has been published. It has been published. 
It is published in this thread for discussion prior to a possible submission to the Unicode Technical Committee, which could take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document.

http://www.unicode.org/reports/tr51/tr51-2.html

Direct link to 8 Longer Term Solutions:

http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term

William Overington

2 June 2015

From jcb+unicode at inf.ed.ac.uk Tue Jun 2 06:45:31 2015
From: jcb+unicode at inf.ed.ac.uk (Julian Bradfield)
Date: Tue, 2 Jun 2015 12:45:31 +0100
Subject: Tag characters and in-line graphics (from Tag characters)
References: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost>
Message-ID:

On 2015-06-02, William_J_G Overington wrote:
> take place if people on this mailing list feel that it is a good solution to the problem raised in section 8 of the following document.
> http://www.unicode.org/reports/tr51/tr51-2.html

That section does not raise a problem. It says what the solution to the emoji problem is: namely that people who want to embed graphics in text should fix their protocols to allow it, instead of subverting Unicode to do it.

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

From kenwhistler at att.net Tue Jun 2 09:38:30 2015
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 02 Jun 2015 07:38:30 -0700
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
Message-ID: <556DBFE6.3060800@att.net>

On 6/2/2015 2:01 AM, William_J_G Overington wrote:
> Local glyph memory, for use in compressing a document where the same glyph is used two or more times in the document:

Um, that technology already exists. It is called a "font".

> A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character.

And that mechanism has also already been defined. It is called a "cmap":

http://www.microsoft.com/typography/otspec/cmap.htm

--Ken

From jsbien at mimuw.edu.pl Tue Jun 2 14:38:47 2015
From: jsbien at mimuw.edu.pl (Janusz S. Bień)
Date: Tue, 02 Jun 2015 21:38:47 +0200
Subject: reversed Polish-hook o
References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com>
Message-ID: <863829gat4.fsf@mimuw.edu.pl>

I've just noticed the comment quoted in the subject in the description of 'LATIN SMALL LETTER TURNED DELTA' (U+018D) and I'm intrigued how it got into the standard.

On Sun, May 31 2015 at 18:20 CEST, frederic.grosshans at gmail.com writes:

[...]

> The upper case was introduced for Sencoten, and the proposal is here
> http://www.unicode.org/L2/L2004/04170-sencoten.pdf
>
> (found by googling sencoten site:unicode.org)

I tried to google for the relevant document on both unicode.org and std.dkuug.dk, but without any success.

Actually I intend to look up the history of all the Polonica in Unicode and I would very much appreciate your advice on the best way to search for such information.

Best regards

Janusz

--
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
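For reference, the formal character name is easy to confirm programmatically (a minimal sketch); the "reversed Polish-hook o" note itself is, as far as I know, one of the informative aliases published in the NamesList.txt annotations that accompany the code charts, so the version history of NamesList.txt and the L2 document register are probably the best places to search.

    import unicodedata
    print(unicodedata.name("\u018D"))   # LATIN SMALL LETTER TURNED DELTA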
From idou747 at gmail.com Tue Jun 2 17:55:27 2015
From: idou747 at gmail.com (Chris)
Date: Wed, 3 Jun 2015 08:55:27 +1000
Subject: Tag characters and in-line graphics (from Tag characters)
In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost>
Message-ID:

I was asking why the glyphs for right arrow → are inconsistent in many sources, through a couple of iterations of Unicode. Perhaps I might observe that one of the reasons is that there is no technical link between the code and the glyph. I can't realistically write a display engine that goes to unicode.org or wherever and dynamically finds the right standard glyph for unknown codes. This is also manifest in my seeing empty squares for characters my platform doesn't know about. This isn't the case with XML, where I can send someone a random XML document and there is a standard way to go out there on the internet and check whether that XML is conformant. Why shouldn't there be a standard way to go out on the net and find the canonical glyph for a code? If there was, then non-standard glyphs would fall out of that technology naturally.

So people are talking about all these technologies that are out there (HTML5, cmap, fonts and so forth), but there is no standard way to construct a list of "characters", some of which might be non-standard, and be able to embed that ANYWHERE one might reasonably expect characters, have it processed in a normal way as characters, and have it sent anywhere and understood.

As you point out, "The UCS will not encode characters without a demonstrated usage." But there are use cases for characters that don't meet the UCS's criteria for a worldwide standard, but are necessary for more specific use cases, like specialised regional, business, or domain-specific situations. My question is: given that Unicode can't realistically (and doesn't aim to) encode every possible symbol in the world, why shouldn't there be an EXTENSIBLE method for encoding, so that people don't have to totally rearchitect their computing universe because they want ONE non-standard character in their documents?

Right now, what happens if you have a domain or locale requirement for a special character? Most likely you suffer without it, because even though you could get it to render in some situations (like hand-coding some IMGs into your web site), you just know you won't be able to realistically input it into emails, word documents, spreadsheets, and whatever other random applications on a daily basis.

What I'm asking is: is it really beyond the Unicode Consortium's scope, and/or would it really be a redundant technology, to, for example, define a UTF-64 coding format, where 32 bits allow 4 billion businesses and individuals to define their own character sets (each of up to 4 billion characters), and then have standard places on the internet (similar to DNS lookup servers) that can provide anyone with glyphs and fonts for them?

Right now, yes there are cmaps, but no standard way to combine characters from different encodings. No standard way to find the cmap for an unknown encoding. There is HTML5, but that doesn't produce something that is recognisable as a list of characters that can be processed as such. (If there is an IMG in text, is it a "character"
or an illustration in the text? How can you refer to a particular set of characters without having your own web server? How you render that text bigger, with the standard reference glyph without manually searching the internet where to find it? There is a host of problems here). All these problems look unsolved to me, and they also look like encoding technology problems to me too. What other consortium is out there are working on character encoding problems? > On 2 Jun 2015, at 7:40 pm, Philippe Verdy wrote: > > Once again no ! Unicode is a standard for encoding characters, not for encoding some syntaxic element of a glyph definition ! > > Your project is out of scope. You still want to reinvent the wheel. > > For creating syntax, define it within a language (which does not need new characters (you're not creating an APL grammar using specific symbols for some operators more or less based on Greek letters and geometric shapes: they are just like mathematic symbols). Programming languages and data languages (Javascript, XML, JOSN, HTML...) and their syntax are encoded themselves in plain text documents using standard characters) and don't need new characters, APL being an exception only because computers or keyboards were produced to facilitate the input (those that don't have such keyboards used specific editors or the APL runtime envitonment that offer an input method for entering programs in this APL input mode). > > Anf again you want the chicken before the egg: have you only ever read the encoding policy ? The UCS will not encode characters without a demonstrated usage. Nothing in what you propose is really used except being proposed only by you, and used only by you for your private use (or with a few of your unknown friends, but this is invisible and unverifiable). Nothing has been published. > > Even for currency symbols (which are an exception to the demonstrated use, only because once they are created they are extremely rapidly needed by lot of people, in fact most people of a region as large as a country, and many other countries that will reference or use it it). But even in this case, what is encoded is the character itself, not the glyph or new characters used to defined the glyph ! > > Can you stop proposing out of topic subjects like this on this list ? You are not speaking about Unicode or characters. Another list will be more appropriate. You help no one here because all you want is to change radically the goals of TUS. > > 2015-06-02 11:01 GMT+02:00 William_J_G Overington >: > Perhaps the solution to at least some of the various issues that have been discussed in this thread is to define a tag letter z as a code within the local glyph memory requests, as follows. -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Tue Jun 2 20:09:09 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 3 Jun 2015 10:09:09 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: <556E53B5.4080404@it.aoyama.ac.jp> On 2015/06/03 07:55, Chris wrote: > As you point out, "The UCS will not encode characters without a demonstrated usage.?. 
But there are use cases for characters that don?t meet UCS?s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations. > My question is, given that unicode can?t realistically (and doesn?t aim to) encode every possible symbol in the world, why shouldn?t there be an EXTENSIBLE method for encoding, so that people don?t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's were everybody is converging to. From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline" text may be something similar. The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. Up to now, such as third level hasn't emerged, among else because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols that gain reasonable popularity, so every time somebody has a "real good use case" for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts. No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon. Regards, Martin. 
From duerst at it.aoyama.ac.jp Tue Jun 2 20:22:52 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Wed, 3 Jun 2015 10:22:52 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1432867044809.9dc7c15b@Nodemailer> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> Message-ID: <556E56EC.8010402@it.aoyama.ac.jp> On 2015/05/29 11:37, John wrote: > If I had a large document that reused a particular character thousands of times, Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. > would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will be dealing with your case in a very efficient way. > Given that its been agreed that private use ranges are a good thing, That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > and given that we can agree that exchanging data is a good thing, Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. > maybe something should bring those two things together. Just a thought. Just a 'non sequitur'. Regards, Martin. From idou747 at gmail.com Tue Jun 2 20:50:19 2015 From: idou747 at gmail.com (Chris) Date: Wed, 3 Jun 2015 11:50:19 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556E53B5.4080404@it.aoyama.ac.jp> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> <556E53B5.4080404@it.aoyama.ac.jp> Message-ID: Martin, you seem to be labouring under the impression that HTML5 is a substitute for character encoding. If it is, why do we need unicode? We could just have documents laden with On 3 Jun 2015, at 11:09 am, Martin J. D?rst wrote: > > On 2015/06/03 07:55, Chris wrote: > >> As you point out, "The UCS will not encode characters without a demonstrated usage.?. But there are use cases for characters that don?t meet UCS?s criteria for a world wide standard, but are necessary for more specific use cases, like specialised regional, business, or domain specific situations. > > Unicode contains *a lot* of characters for specialized regional, business, or domain specific situations. > >> My question is, given that unicode can?t realistically (and doesn?t aim to) encode every possible symbol in the world, why shouldn?t there be an EXTENSIBLE method for encoding, so that people don?t have to totally rearchitect their computing universe because they want ONE non-standard character in their documents? > > As has been explained, there are technologies that allow you to do (more or less) that. Information technology, like many other technologies, works best when finding common cases used by many people. Let's look at some examples: > > Character encodings work best when they are used widely and uniformly. I don't know anybody who actually uses all the characters in Unicode (except the guys that work on the standard itself). 
So for each individual, a smaller set would be okay. And there were (and are) smaller sets, not for individuals, but for countries, regions, scripts, and so on. Originally (when memory was very limited), these legacy encodings were more efficient overall, but that's no longer the case. So everything is moving towards Unicode. > > Most Website creators don't use all the features in HTML5. So having different subsets for different use cases may seem to be convenient. But overall, it's much more efficient to have one Hypertext Markup Language, so that's were everybody is converging to. > > From your viewpoint, it looks like having something in between character encodings and HTML is what you want. It would only contain the features you need, and nothing more, and would work in all the places you wanted it to work. Asmus's "inline" text may be something similar. > > The problem is that such an intermediate technology only makes sense if it covers the needs of lots and lots of people. It would add a third technology level (between plain text and marked-up text), which would divert energy from the current two levels and make things more complicated. > > Up to now, such as third level hasn't emerged, among else because both existing technologies were good at absorbing the most important use cases from the middle. Unicode continues to encode whatever symbols that gain reasonable popularity, so every time somebody has a "real good use case" for the middle layer with a symbol that isn't yet in Unicode, that use case gets taken away. HTML (or Web technology in general) also worked to improve the situation, with technologies such as SVG and Web Fonts. > > No technology is perfect, and so there are still some gaps between character encoding and markup, some of which may in due time eventually be filled up, but I don't think a third layer in the middle will emerge soon. > > Regards, Martin. From idou747 at gmail.com Tue Jun 2 21:09:17 2015 From: idou747 at gmail.com (Chris) Date: Wed, 3 Jun 2015 12:09:17 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <556E56EC.8010402@it.aoyama.ac.jp> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: > On 3 Jun 2015, at 11:22 am, Martin J. D?rst wrote: > > On 2015/05/29 11:37, John wrote: > >> If I had a large document that reused a particular character thousands of times, > > Then it would be either a very boring document (containing almost only that same character) or it would be a very large document. If you have a daughter, look at her Facebook messenger, and then get back to me. >> would this HTML markup require embedding that character thousands of times, or could I define the character once at the beginning of the sequence, and then refer back to it in a space efficient way? > > If you want space efficiency, the best thing to do is to use generic compression. Many generic compression methods are available, many of them are widely supported, and all of them will be dealing with your case in a very efficient way You can?t ask the entire computing universe to compress everything all the time. And that is what your comment amounts to. Because the whole point under discussion is how can we encode stuff such that you can hope to universally move it around between different documents, formats, applications, input fields and platforms without any massage. > Given that its been agreed that private use ranges are a good thing, > > That's not agreed upon. 
I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? >> and given that we can agree that exchanging data is a good thing, > > Yes, but there are many other ways to do that besides Unicode. And for many purposes, these other ways are better suited. The point is a universally recognised way. Of course you, me or anybody could design many good ways to solve any problem we might come up with. That doesn?t mean it will interoperate with anybody else though. > >> maybe something should bring those two things together. Just a thought. > > Just a 'non sequitur'. > > Regards, Martin. From verdy_p at wanadoo.fr Wed Jun 3 00:42:31 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 07:42:31 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> References: <18855945.28188.1433241427017.JavaMail.defaultUser@defaultHost> Message-ID: No, nothing about what you propose, which is to encode graphics directly with a custom syntax using specific Unicode characters for this syntax itself. There's no such statement in the UTR, even for "longer term". What is proposed instead is a way to *reference* (not "define") graphics. For the rest, you need a rich-text format to embed graphics (using the syntax of this rich-text format, such as HTML), but this syntax remains out of scope of Unicode which will not standardize any graphic format, or any language by its syntax. Even for CLDR, you will use some JSON or XML rich-text format to create references, or embed some small graphics. But CLDR is NOT part of the Unicode Standard itself, and does not encode new characters (and I've not seen the CLDR requesing additions in the UCS for its own use, instead it uses its own assignments for PUAs where needed, als also for its own private locale tags for internal references within the CLDR data itself). 2015-06-02 12:37 GMT+02:00 William_J_G Overington : > Responding to Philippe Verdy: > > > Nothing has been published. > > It has been published. It is published in this thread for discussion prior > to a possible submission to the Unicode Technical Committee that could > take place if people on this mailing list feel that it is a good solution > to the problem raised in section 8 of the following document. > > http://www.unicode.org/reports/tr51/tr51-2.html > > Direct link to > > 8 Longer Term Solutions > > http://www.unicode.org/reports/tr51/tr51-2.html#Longer_Term > > > William Overington > > 2 June 2015 > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Wed Jun 3 03:26:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 3 Jun 2015 09:26:05 +0100 (BST) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) Message-ID: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) >> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. For example, http://forum.high-logic.com/viewtopic.php?f=10&t=2957 http://forum.high-logic.com/viewtopic.php?f=10&t=2672 William Overington 3 June 2015 From frederic.grosshans at gmail.com Wed Jun 3 04:28:32 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Wed, 03 Jun 2015 11:28:32 +0200 Subject: reversed Polish-hook o In-Reply-To: <863829gat4.fsf@mimuw.edu.pl> References: <86lhg43ji3.fsf@mimuw.edu.pl> <20150531170332.17444mfm30p68wxw@mail.mimuw.edu.pl> <556B34CF.2040106@gmail.com> <863829gat4.fsf@mimuw.edu.pl> Message-ID: <556EC8C0.1060907@gmail.com> An HTML attachment was scrubbed... URL: From idou747 at gmail.com Wed Jun 3 06:38:02 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 04:38:02 -0700 (PDT) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Message-ID: <1433331480845.7b37573e@Nodemailer> Yep, I clicked on your document and saw an empty square where your character should be. F = FAIL. ? Chris On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington wrote: > Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) >>> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). >> They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? > Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. 
> I have made fonts that include Private Use Area encodings using the High-Logic FontCreator program and then used those fonts in Serif PagePlus, both to produce PDF documents and PNG graphics, as needed for my particular project at the time. > For example, > http://forum.high-logic.com/viewtopic.php?f=10&t=2957 > http://forum.high-logic.com/viewtopic.php?f=10&t=2672 > William Overington > 3 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 08:03:30 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 15:03:30 +0200 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <1433331480845.7b37573e@Nodemailer> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> <1433331480845.7b37573e@Nodemailer> Message-ID: This possibly fails because William possibly forgot to embed his font in the document itself (or Serif PagePlus forgets to do it when it creates the PDF document, and refuses to embed glyphs from the font that are bound to Unicode PUAs when it creates the embeded font). However no such problem when creating PDFs with MS Office, or via the Adobe Acrobat "printer" driver or other printer drivers generating PDF files, including Google Cloud Print). So this could be a misuse of Serif PagePlus when creating the PDF (I don't know this software, may be there are options set up that ells it to not embed fonts from a list of fonts that the recipient is supposed to have installed locally, to save storage space for the document, byt evoiding such embedding). Another reason may be that the font is marked as "not embeddable" within its exposed properties. Another reason may be that John tries to open the document with a software that does not handle embedded fonts, or that ignores it to use only the fonts preinstalled by John in his preferences. And in such case the result depends only on fonts preinstalled on his local system (that does not include the fonts created by William), or his software is setup to use exclusively a specific local "Unicode" font for all PUAs. (Softwares that behaved in this bad way was old versions of Internet Explorer, due to limitation of his text renderers, however this should not happen with PDFs, provided you have used a correct plugion version for displaying PDF in the browser : if this fails in the browser, download the document and view it with Adobe Reader instead of view the plugin: there are many PDF plugins on markets that do not support essential features and just built to display PDF containing scanned bitmaps, but with very poor support of text or vector graphics, or tuned specifically to change the document for another device or paper format). Without citing which softwares are used (and which PDF in the list does not load correctly), it is difficult to tell, but for me I have no problems with a few docs I saw created by William. So: NO F = NO FAIL for me. 2015-06-03 13:38 GMT+02:00 John : > Yep, I clicked on your document and saw an empty square where your > character should be. > > F = FAIL. > > ? > Chris > > > On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < > wjgo_10009 at btinternet.com> wrote: > >> Private Use Area in Use (from Tag characters and in-line graphics (from >> Tag characters)) >> >> >> >> That's not agreed upon. 
I'd say that the general agreement is that the >> private ranges are of limited usefulness for some very limited use cases >> (such as designing encodings for new scripts). >> >> >> > They are of limited usefulness precisely because it is pathologically >> hard to make use of them in their current state of technological evolution. >> If they were easy to make use of, people would be using them all the time. >> I?d bet good money that if you surveyed a lot of applications where custom >> characters are being used, they are not using private use ranges. Now why >> would that be? >> >> >> Actually, I have used Private Use Area characters a lot, and, once I had >> got used to them, I found them incredibly straightforward to use. >> >> >> I have made fonts that include Private Use Area encodings using the >> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >> both to produce PDF documents and PNG graphics, as needed for my particular >> project at the time. >> >> >> For example, >> >> >> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >> >> >> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >> >> >> William Overington >> >> >> >> >> 3 June 2015 >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 08:20:14 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 15:20:14 +0200 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> <1433331480845.7b37573e@Nodemailer> Message-ID: Note that copy-pasting from a PDF to another document is very tricky, the PDF format requires that embedded fonts use precise glyph naming conventions to map glyphs back to characters, otherwise the Unicode characters sequences associated to a glyph (or multiple glyphs if they are ligatured or in complex layouts or with uncommon decorations, or rendered on a non uniform background, or with glyphs filled with pattern, such as labels over a photograph or cartographic map) will not be recognized. This remark about PDFs is also applicable to PostScript documents. Some PDF readers in that case attempt to perform some OCR (plus dictionary lookups to fix mis readings) for common glyph forms, but will almost always fail if the glyphs are too specific such as when they include swashes, ligatures, or unknown scripts and scripts with complex layouts (such as the invented script created by William for noting sentences with specific "characters" with new glyphs, and a specific syntax and specific layout rules. In other casesn the PDF reader will jsut put in the clipboard only a bitmap for the selection, and it will be another software that will attempt to interpret the bitmap with OCR. The glyph naming conventions are documented in PDF specifications, but many PDF creators do not follow these rules, and copying text from these PDFs fails 2015-06-03 15:03 GMT+02:00 Philippe Verdy : > This possibly fails because William possibly forgot to embed his font in > the document itself (or Serif PagePlus forgets to do it when it creates the > PDF document, and refuses to embed glyphs from the font that are bound to > Unicode PUAs when it creates the embeded font). However no such problem > when creating PDFs with MS Office, or via the Adobe Acrobat "printer" > driver or other printer drivers generating PDF files, including Google > Cloud Print). 
> > So this could be a misuse of Serif PagePlus when creating the PDF (I don't > know this software, may be there are options set up that ells it to not > embed fonts from a list of fonts that the recipient is supposed to have > installed locally, to save storage space for the document, byt evoiding > such embedding). Another reason may be that the font is marked as "not > embeddable" within its exposed properties. > > Another reason may be that John tries to open the document with a software > that does not handle embedded fonts, or that ignores it to use only the > fonts preinstalled by John in his preferences. And in such case the result > depends only on fonts preinstalled on his local system (that does not > include the fonts created by William), or his software is setup to use > exclusively a specific local "Unicode" font for all PUAs. > > (Softwares that behaved in this bad way was old versions of Internet > Explorer, due to limitation of his text renderers, however this should not > happen with PDFs, provided you have used a correct plugion version for > displaying PDF in the browser : if this fails in the browser, download the > document and view it with Adobe Reader instead of view the plugin: there > are many PDF plugins on markets that do not support essential features and > just built to display PDF containing scanned bitmaps, but with very poor > support of text or vector graphics, or tuned specifically to change the > document for another device or paper format). > > Without citing which softwares are used (and which PDF in the list does > not load correctly), it is difficult to tell, but for me I have no problems > with a few docs I saw created by William. So: > > NO F = NO FAIL for me. > > 2015-06-03 13:38 GMT+02:00 John : > >> Yep, I clicked on your document and saw an empty square where your >> character should be. >> >> F = FAIL. >> >> ? >> Chris >> >> >> On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < >> wjgo_10009 at btinternet.com> wrote: >> >>> Private Use Area in Use (from Tag characters and in-line graphics (from >>> Tag characters)) >>> >>> >>> >> That's not agreed upon. I'd say that the general agreement is that >>> the private ranges are of limited usefulness for some very limited use >>> cases (such as designing encodings for new scripts). >>> >>> >>> > They are of limited usefulness precisely because it is pathologically >>> hard to make use of them in their current state of technological evolution. >>> If they were easy to make use of, people would be using them all the time. >>> I?d bet good money that if you surveyed a lot of applications where custom >>> characters are being used, they are not using private use ranges. Now why >>> would that be? >>> >>> >>> Actually, I have used Private Use Area characters a lot, and, once I had >>> got used to them, I found them incredibly straightforward to use. >>> >>> >>> I have made fonts that include Private Use Area encodings using the >>> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >>> both to produce PDF documents and PNG graphics, as needed for my particular >>> project at the time. >>> >>> >>> For example, >>> >>> >>> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >>> >>> >>> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >>> >>> >>> William Overington >>> >>> >>> >>> >>> 3 June 2015 >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From prosfilaes at gmail.com Wed Jun 3 08:24:04 2015 From: prosfilaes at gmail.com (David Starner) Date: Wed, 03 Jun 2015 13:24:04 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: Chris wrote: > There is no way to compare 2 HTML elements and know they are talking about the same character That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ Note that even in Unicode, the set ? ? ? ? s S ? may be considered the same character or up to seven different characters, depending on case-folding, canonization and accent dropping. > Similarly, there is no way to search or index html elements. If a HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. On Tue, Jun 2, 2015 at 7:11 PM Chris wrote: > You can?t ask the entire computing universe to compress everything all the > time. Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Jun 3 08:53:34 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 3 Jun 2015 14:53:34 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> References: <10653675.53720.1432743967958.JavaMail.defaultUser@defaultHost> <28916074.15093.1433235661971.JavaMail.defaultUser@defaultHost> Message-ID: <1565119.42918.1433339614901.JavaMail.defaultUser@defaultHost> Earlier in this thread, on 2 June 2015, I wrote as follows: > A mechanism to be able to use the method to define a glyph linked to a Unicode code point would be a useful facility to add for use in a situation where the glyph is for a regular Unicode character. I have now thought of a mechanism to use. Please imagine the base character followed by a sequence of tag characters, the tag characters here represented by ordinary letters and digits. Here is an example of the mechanism for defining the glyph for U+E702 in a particular document as 7 red pixels. HE702U7r The tag H character switches to hexadecimal input mode, then there are as many tag characters as necessary to express in hexadecimal notation the code point of the character for which the definition is being made, then there is a tag U character to action the definition and go out of hexadecimal input mode. The tag 7r is to express 7 red pixels. In practice the number of tag characters after the tag U character might be around 200, the above tag 7r is just a minimal example so as to explain the concept. ---- While posting, may I mention please one other matter? Previously I mentioned using tag R, tag G and tag B is defining colours. 
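(To make the mechanism just described concrete, here is a toy decoder for sequences of the HE702U7r kind. It follows the description above: the tag characters are stood in for by ordinary ASCII letters and digits, tag H opens hexadecimal input mode, tag U actions the definition, and the run-length colour tags after it, such as 7r for seven red pixels, are read in order. The extra colour letter g used in the demo is an assumption made purely for this illustration.)

    # Toy decoder for the sketched glyph-definition sequence, e.g. "HE702U7r".
    # Tag characters are represented here by plain ASCII, as in the example above.
    import re

    def decode(sequence):
        match = re.fullmatch(r"H([0-9A-Fa-f]+)U(.*)", sequence)
        if not match:
            raise ValueError("expected H<hex code point>U<glyph data>")
        code_point = int(match.group(1), 16)        # character being defined
        runs = [(int(count), colour)                 # e.g. (7, 'r') = 7 red pixels
                for count, colour in re.findall(r"(\d+)([a-z])", match.group(2))]
        return code_point, runs

    print(decode("HE702U7r3g"))   # (59138, [(7, 'r'), (3, 'g')]), i.e. U+E702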
I now add tag A into that colour definition so as to define opacity, that is, what is sometimes called transparency, where 0 means totally transparent and 255 means totally opaque. If no value is stated for A then it should be presumed to have a value of 255, so that the default situation is to define opaque colours. ---- I feel that the information in this thread is now a good basis for assessing whether this suggested format could be a useful open source system with good interoperability potential that could usefully be submitted to the Unicode Technical Committee. William Overington 3 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 09:04:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 3 Jun 2015 16:04:34 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: Compression is even more important today on mobile networks: mobile apps are very verbose over the net, and you can easily end up paying for the extra volume. In addition, mobile networks are frequently much slower than advertised; even if you pay the extra subscription to get 3G/4G, you depend on the antennas and on the number of people around you. In my home, 3G/4G in fact does not work at all, and this is the case in many places around my city, even though they are sold as having full coverage. For example, just downloading an application or updating it is simply impossible: I have to be at home, connected to my Wifi router, and when its internet link fails (this happens sometimes for several hours) I have extremely slow connections on 3G/4G, which is also overcrowded at the same time and only delivers 2G speeds. Lots of people frequently have to put up with low bandwidth on mobile networks, independently of the price they paid for their subscription. So compressing data is still extremely important (even for texts or for the smallest web requests). Thankfully, compression is now part of the web transport, but this does not mean that apps need not learn to represent their interchanged data efficiently and develop less verbose protocols and APIs. There are more and more people using mobile networks now than fixed landline internet accesses (or home wifi routers connected to them), and even for the latter, fiber access is still just for a minority of people in dense areas; the others don't get more than a handful of megabit/s on their DSL access. If you look at worldwide internet connections, a large majority of people don't get more than 2 megabit/s: this is enough for reading/sending SMS, phone calls, or exchanging emails, but not if you need frequent updates to your apps, your apps are too verbose, and there are too many apps in the background. Many people cannot view videos on their mobile access, or only with very poor quality if they view them "live" (and they cannot download them slowly either, due to lack of storage space on their mobile device, so videos have to remain short in total volume and duration). So I disagree: compression is absolutely needed, even more today than it was in the past, when mobile Internet accesses were still for a minority. Mobile networks are not really faster today (their bandwidth does not double every three years like the local performance of devices!)
But with this extra local performance you can support more complex compression schemes that require more CPU/GPU power, which is no longer a bottleneck; the real bottleneck is the effectively available bandwidth of the mobile network (smaller than the connection bandwidth, because that bandwidth is shared... and expensive). 2015-06-03 15:24 GMT+02:00 David Starner : > Chris wrote: > > There is no way to compare 2 HTML elements and know they are talking > about the same character > > That's because character identity is a hard problem. Is the emoji TIGER > the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? > > > http://www.engadget.com/2014/04/30/you-may-be-accidentally-sending-friends-a-hairy-heart-emoji/ > > Note that even in Unicode, the set ? ? ? ? s S ? may be considered the > same character or up to seven different characters, depending on > case-folding, canonization and accent dropping. > > > Similarly, there is no way to search or index html elements. If a HTML > document contained an image of a particular custom character, there would > be no way to ask google or whatever to find all the documents with that > character. Different documents would represent it differently. > > You can index links to images. If two documents represent it differently, > then I go back to the above; we can't know that they're the same thing. > > On Tue, Jun 2, 2015 at 7:11 PM Chris wrote: > >> You can't ask the entire computing universe to compress everything all >> the time. > > > Anytime we care about how much space text takes up, it should be > compressed. It compresses very well. On the other hand, it's rare that > anyone cares anymore; what's a few hundred kilobytes between friends? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 3 09:56:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 03 Jun 2015 07:56:33 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> Chris wrote: > Right now, what happens if you have a domain or locale requirement for > a special character? That's what the PUA is for. Assign a PUA code point to your special character, create a font which implements the PUA character, create a brief "private agreement" which states that this code point refers to that character and which mentions the font, put the private agreement on the web, and publish your document with a reference to the agreement. For most non-professionals, creating the font is the tricky part. Also see Section 23.5 of TUS. Note that I am disagreeing with Martin about the PUA being useful only as a scratch area for standardization. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Wed Jun 3 10:14:39 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 03 Jun 2015 08:14:39 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150603081439.665a7a7059d7ee80bb4d670165c8327d.bec9174c59.wbe@email03.secureserver.net> Chris wrote: > Why shouldn't there be a standard way to go out on the net and find > the canonical glyph for a code? Because there isn't one. Glyphs are suggestions, meant to convey the identity of the character. -- Doug Ewell | http://ewellic.org | Thornton, CO ????
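(Doug's recipe is also easy to support programmatically, since the three Private Use Area ranges are fixed by the standard. Below is a minimal sketch that flags code points falling under some private agreement; U+E702 is simply reused from the earlier example in this thread.)

    # Flag code points in one of Unicode's three Private Use Area ranges,
    # i.e. characters whose interpretation depends entirely on a private agreement.
    PUA_RANGES = [
        (0xE000, 0xF8FF),        # Basic Multilingual Plane PUA
        (0xF0000, 0xFFFFD),      # Supplementary PUA-A (plane 15)
        (0x100000, 0x10FFFD),    # Supplementary PUA-B (plane 16)
    ]

    def is_private_use(ch):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in PUA_RANGES)

    for ch in "Example: \uE702":
        if is_private_use(ch):
            print(f"U+{ord(ch):04X} is a Private Use character; see the private agreement.")

(The same test can also be written as unicodedata.category(ch) == 'Co'.)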
From idou747 at gmail.com Wed Jun 3 19:17:34 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 17:17:34 -0700 (PDT) Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: References: Message-ID: <1433377053793.5f2c25d8@Nodemailer> I don?t use old software, I use up to date versions of everything on a Mac. Very standard setup.? There?s a lot of links there. Maybe they do work in PDFs, but they certainly don?t work in the browser, and they don?t work when I click the txt files. Basically what you?re saying is that PDFs have a way to make this work. so what? Unless we are proposing that everything in the universe be PDF, this doesn?t really help. There should be a standard way to put custom characters anywhere that characters belong and have things ?just work?. Clearly right now things don?t just work. And without even bothering to try I know if I tried cutting and pasting from those PDFs into somewhere else, it won?t work. ? Chris On Wed, Jun 3, 2015 at 11:20 PM, Philippe Verdy wrote: > Note that copy-pasting from a PDF to another document is very tricky, the > PDF format requires that embedded fonts use precise glyph naming > conventions to map glyphs back to characters, otherwise the Unicode > characters sequences associated to a glyph (or multiple glyphs if they are > ligatured or in complex layouts or with uncommon decorations, or rendered > on a non uniform background, or with glyphs filled with pattern, such as > labels over a photograph or cartographic map) will not be recognized. This > remark about PDFs is also applicable to PostScript documents. > Some PDF readers in that case attempt to perform some OCR (plus dictionary > lookups to fix mis readings) for common glyph forms, but will almost always > fail if the glyphs are too specific such as when they include swashes, > ligatures, or unknown scripts and scripts with complex layouts (such as the > invented script created by William for noting sentences with specific > "characters" with new glyphs, and a specific syntax and specific layout > rules. In other casesn the PDF reader will jsut put in the clipboard only a > bitmap for the selection, and it will be another software that will attempt > to interpret the bitmap with OCR. > The glyph naming conventions are documented in PDF specifications, but many > PDF creators do not follow these rules, and copying text from these PDFs > fails > 2015-06-03 15:03 GMT+02:00 Philippe Verdy : >> This possibly fails because William possibly forgot to embed his font in >> the document itself (or Serif PagePlus forgets to do it when it creates the >> PDF document, and refuses to embed glyphs from the font that are bound to >> Unicode PUAs when it creates the embeded font). However no such problem >> when creating PDFs with MS Office, or via the Adobe Acrobat "printer" >> driver or other printer drivers generating PDF files, including Google >> Cloud Print). >> >> So this could be a misuse of Serif PagePlus when creating the PDF (I don't >> know this software, may be there are options set up that ells it to not >> embed fonts from a list of fonts that the recipient is supposed to have >> installed locally, to save storage space for the document, byt evoiding >> such embedding). Another reason may be that the font is marked as "not >> embeddable" within its exposed properties. 
>> >> Another reason may be that John tries to open the document with a software >> that does not handle embedded fonts, or that ignores it to use only the >> fonts preinstalled by John in his preferences. And in such case the result >> depends only on fonts preinstalled on his local system (that does not >> include the fonts created by William), or his software is setup to use >> exclusively a specific local "Unicode" font for all PUAs. >> >> (Softwares that behaved in this bad way was old versions of Internet >> Explorer, due to limitation of his text renderers, however this should not >> happen with PDFs, provided you have used a correct plugion version for >> displaying PDF in the browser : if this fails in the browser, download the >> document and view it with Adobe Reader instead of view the plugin: there >> are many PDF plugins on markets that do not support essential features and >> just built to display PDF containing scanned bitmaps, but with very poor >> support of text or vector graphics, or tuned specifically to change the >> document for another device or paper format). >> >> Without citing which softwares are used (and which PDF in the list does >> not load correctly), it is difficult to tell, but for me I have no problems >> with a few docs I saw created by William. So: >> >> NO F = NO FAIL for me. >> >> 2015-06-03 13:38 GMT+02:00 John : >> >>> Yep, I clicked on your document and saw an empty square where your >>> character should be. >>> >>> F = FAIL. >>> >>> ? >>> Chris >>> >>> >>> On Wed, Jun 3, 2015 at 6:30 PM, William_J_G Overington < >>> wjgo_10009 at btinternet.com> wrote: >>> >>>> Private Use Area in Use (from Tag characters and in-line graphics (from >>>> Tag characters)) >>>> >>>> >>>> >> That's not agreed upon. I'd say that the general agreement is that >>>> the private ranges are of limited usefulness for some very limited use >>>> cases (such as designing encodings for new scripts). >>>> >>>> >>>> > They are of limited usefulness precisely because it is pathologically >>>> hard to make use of them in their current state of technological evolution. >>>> If they were easy to make use of, people would be using them all the time. >>>> I?d bet good money that if you surveyed a lot of applications where custom >>>> characters are being used, they are not using private use ranges. Now why >>>> would that be? >>>> >>>> >>>> Actually, I have used Private Use Area characters a lot, and, once I had >>>> got used to them, I found them incredibly straightforward to use. >>>> >>>> >>>> I have made fonts that include Private Use Area encodings using the >>>> High-Logic FontCreator program and then used those fonts in Serif PagePlus, >>>> both to produce PDF documents and PNG graphics, as needed for my particular >>>> project at the time. >>>> >>>> >>>> For example, >>>> >>>> >>>> http://forum.high-logic.com/viewtopic.php?f=10&t=2957 >>>> >>>> >>>> http://forum.high-logic.com/viewtopic.php?f=10&t=2672 >>>> >>>> >>>> William Overington >>>> >>>> >>>> >>>> >>>> 3 June 2015 >>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From idou747 at gmail.com Wed Jun 3 19:21:00 2015 From: idou747 at gmail.com (John) Date: Wed, 03 Jun 2015 17:21:00 -0700 (PDT) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> Message-ID: <1433377259559.1a60883d@Nodemailer> So what you?re saying is that the current situation where you see an empty square ? for unknown characters is better than seeing something useful? ? Chris On Thu, Jun 4, 2015 at 12:59 AM, Doug Ewell wrote: > Chris wrote: >> Right now, what happens if you have a domain or locale requirement for >> a special character? > That's what the PUA is for. Assign a PUA code point to your special > character, create a font which implements the PUA character, create a > brief "private agreement" which states that this code point refers to > that character and which mentions the font, put the private agreement on > the web, and publish your document with a reference to the agreement. > For most non-professionals, creating the font is the tricky part. > Also see Section 23.5 of TUS. > Note that I am disagreeing with Martin about the PUA being useful only > as a scratch area for standardization. > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Wed Jun 3 19:46:26 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 10:46:26 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> Message-ID: <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> > On 3 Jun 2015, at 11:24 pm, David Starner wrote: > > Chris wrote: > > There is no way to compare 2 HTML elements and know they are talking about the same character > > That's because character identity is a hard problem. Is the emoji TIGER the same as TONY THE TIGER or as TONY THE TIGER GIVING THE VICTORY SIGN? I personally think emoji should have one, single definitive representation for this exact reason. The subtley of different emotion between one happy face and another can be miles apart. Emoji are a little different to other symbols in that respect. Symbols that are purely symbolic can be changed as much as you like as long as they are recognisable. Emoji have too many shades of meaning for allowing change. Both of these scenarios are an argument that there should be custom characters with at least one official representation. Emoji because you don?t really want variation. Symbols because if you don?t have a local representation, then something is better than nothing. If you don?t have a local Snow Flake for example, any old snow flake will be fine. This is not a hard problem at all. Is one tony the tiger the same as another? The community interested in tony the tiger can make decisions like that. But having made that decision there needs to be a way for generic computer programs that don?t know about that community to do reasonable things with tony the tiger characters. > > You can index links to images. If two documents represent it differently, then I go back to the above; we can't know that they're the same thing. You can?t know because they?re images. That?s my exact point. 
Anybody talking about HTML5 and images as a solution to custom characters is not proposing a valid solution. > > On Tue, Jun 2, 2015 at 7:11 PM Chris > wrote: > You can?t ask the entire computing universe to compress everything all the time. > > Anytime we care about how much space text takes up, it should be compressed. It compresses very well. On the other hand, it's rare that anyone cares anymore; what's a few hundred kilobytes between friends? You compress things when they are on the move. Between computers and as you are writing it to a file. But you can?t compress generically while it is in memory. You can?t iterate over compressed bits. You can?t process them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenwhistler at att.net Wed Jun 3 19:57:45 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 03 Jun 2015 17:57:45 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <1433377053793.5f2c25d8@Nodemailer> References: <1433377053793.5f2c25d8@Nodemailer> Message-ID: <556FA289.7070703@att.net> On 6/3/2015 5:17 PM, John wrote: > > > > > so what? > > There should be a standard way to put custom characters anywhere that > characters belong and have things ?just work?. > > Well, that's the rub, isn't it? We (in IT) are still working pretty dang hard on the simpler problem, to wit: There should be a way to put *standard characters* anywhere that characters belong and have things "just work". And even *that* is a hard problem that has taken over 25 years -- and is still a work in progress. What you are asking for is not much removed from: There should be a *standard *way to put "*stuff-I-just-made-up*" anywhere that characters belong and have things "just work". See, the first barrier to getting anywhere with this goal is to get everybody concerned with text in IT (or perhaps even worse, all the hundreds of millions of people who *use* characters in their devices) to agree what a "custom character" is. And if the rollicking "discussions" underway about emoji have taught us much of anything, it includes the fact that people do *not* all agree about what characters are or what should be a candidate for "just working" -- or even what "just work" might mean for them, in any case. So before declaring that your position is self-evidently correct about how things should just work, it might be a good idea to put some real thought into how one would define and standardize the concept of a "custom character" sufficiently precisely that there would be a snowball's chance in hell that all the implementations of text out there would a) know what it was, b) know how it should display and render, c) know how it should be input, stored, and transmitted and d) know how it should be interpreted universally. --Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Wed Jun 3 19:59:21 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 00:59:21 +0000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: On Wed, Jun 3, 2015 at 5:46 PM Chris wrote: > > I personally think emoji should have one, single definitive representation > for this exact reason. > Then you want an image. I don't see what's hard about that. 
> The community interested in tony the tiger can make decisions like that. > That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one a decision that most communities won't bother trying to make. > You can?t know because they?re images. > You can't know because the only obvious equivalence relation is exact image identity. You can?t iterate over compressed bits. You can?t process them. Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War in Peace in the on-CPU cache. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 3 20:27:27 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 4 Jun 2015 03:27:27 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: 2015-06-04 2:59 GMT+02:00 David Starner : > You can?t iterate over compressed bits. You can?t process them. >> > > Why not? In any language I know of that has iterators, there would be no > problem writing one that iterates over compressed input. If you need to > mutate them, that is hard in compressed formats, but a new CPU can store > War in Peace in the on-CPU cache. > You're right, today the CPU is no longer the bottleneck, which is now * the speed of long buses and communcaition links, with their limited (and costly) bandwidth as this is a shared medium used by more and more people but requiring mssive infrastures, or physical constraints even on the fastest serial buses, both implying transmission roundtrip times (limiting random access, which is a severe problem now that we have to access to extremely large volumes of data distributed over multiple devices or over a full network * the storage capacity for the fastest storage medium (such as flash memory, which is the only option for mobile devices, but also the most expensive). In both cases you need compression (the second bottleneck on storage volumes will fade out in a few years, but not the bandwidth constraints). It really pays now to use compression schemes (even the most complex ones such as those used to transmit live video: locally a CPU or GPU will easily handle the compression scheme. 
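(David's earlier remark that iterating over compressed input is unremarkable is easy to demonstrate: Python's gzip module exposes a compressed file as an ordinary text stream, so character-by-character processing never needs an uncompressed copy on disk. A small sketch, with war_and_peace.txt.gz as a hypothetical input file.)

    # Iterate over the characters of a gzip-compressed text file; decompression
    # happens transparently, block by block, as the stream is read.
    import gzip

    counts = {}
    with gzip.open("war_and_peace.txt.gz", "rt", encoding="utf-8") as stream:
        for line in stream:          # yields ordinary str lines
            for ch in line:
                counts[ch] = counts.get(ch, 0) + 1

    print(sorted(counts.items(), key=lambda kv: -kv[1])[:10])  # ten most common characters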
Research on compression schemes is really not finished; it has never been as active as it is today, including for text, because of the explosion in data volumes, even if the volume of text is now largely overwhelmed by the volume of images, video and audio. (You can't compute a lot of things from audio/image/video data sources; we still need text to give semantics to these media, from which you can derive data or perform searches. There is still a lot to do in handling images and speech audio and detecting some semantics in them, but you won't get as much information from audio/video as can be represented by text: OCR, for example, is a very heuristic process that produces lots of false guesses, still far more than human brains make across the broad range of variations that we call "cultures"; computers are still very poor at recognizing cultures with as many variations as those we recognize through social interactions and years of education and *personal* experience.) -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 3 21:22:22 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 3 Jun 2015 20:22:22 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <1433377259559.1a60883d@Nodemailer> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: Chris (John) wrote: > So what you're saying is that the current situation where you see an > empty square ? for unknown characters is better than seeing something > useful? No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From idou747 at gmail.com Thu Jun 4 02:43:48 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 17:43:48 +1000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <556FA289.7070703@att.net> References: <1433377053793.5f2c25d8@Nodemailer> <556FA289.7070703@att.net> Message-ID: > > Well, that's the rub, isn't it? > > We (in IT) are still working pretty dang hard on the simpler problem, to wit: > > There should be a way to put standard characters anywhere that characters belong > and have things "just work". > > And even *that* is a hard problem that has taken over 25 years -- and is still a work in > progress. Unicode is 2 things. (1) A binary format: the technology bit. (2) The social part: agreeing what the characters should be. (1) is, relatively speaking, super easy. Roughly speaking, 16 bit unique numbers in a row. (2) is hard because coming to an agreement is hard. What I'm saying is that we can totally bypass (2) for many use cases if people had the power to make their own characters. Yes, it is hard to meet in committee and agree on stuff. Don't force people to do that. You do that by putting more work into (1), and less hand-wringing about (2). > See, the first barrier to getting anywhere with this goal is to get everybody concerned > with text in IT (or perhaps even worse, all the hundreds of millions of people who > *use* characters in their devices) to agree what a "custom character" is. There is no need for such a thing. Everybody knows roughly what the concept of a custom character is. What is needed is the technology to do it so that everyone can seamlessly enjoy it.
> And if > the rollicking "discussions" underway about emoji have taught us much of anything, > it includes the fact that people do *not* all agree about what characters are or > what should be a candidate for "just working" -- or even what "just work" might > mean for them, in any case. That?s because you?re immersed in (2), which is a different kind of problem. You don?t have to agree on details if everybody has the power to create new characters. > So before declaring that your position is self-evidently correct about how things > should just work, it might be a good idea to put some real thought into how > one would define and standardize the concept of a "custom character" sufficiently > precisely that there would be a snowball's chance in hell that all the implementations > of text out there would a) know what it was, b) know how it should display and > render, c) know how it should be input, stored, and transmitted and d) know how it > should be interpreted universally. I already gave several possible implementation suggestions. I?ll repeat one of them again merely to illustrate that it is possible. Characters are 64 bit. 32 bits are stripped off as the ?character set provider ID?. That is sent to one of many canonical servers akin to DNS servers to find the URL owner of those characters. At that location you?d find a number of representations of the character whether TrueType, vector graphics, bitmaps or whatever. The rendering engine would download the representation and display it to the user. All without the user having to know anything about character sets, custom fonts or whatever. So you come across character 12340000000017. The OS asks charset server who owns charset 1234. They reply ?facebook.com/charsets?. The OS asks facebook.com/charsets for facebook.com/charsets/17/truetype/pointsize12 representation. All this happens invisible to the user. Of course if it is already cached on their machine, then it wouldn?t happen. -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 02:57:33 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 17:57:33 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <5567A7D6.6060102@kli.org> <1432867044809.9dc7c15b@Nodemailer> <556E56EC.8010402@it.aoyama.ac.jp> <8E2E3F18-A4D6-4E1E-B751-B8A794AA17B2@gmail.com> Message-ID: <4831F779-0B20-4B54-A85F-40308EEA4D57@gmail.com> > On 4 Jun 2015, at 10:59 am, David Starner wrote: > > On Wed, Jun 3, 2015 at 5:46 PM Chris > wrote: > > I personally think emoji should have one, single definitive representation for this exact reason. > > Then you want an image. I don't see what's hard about that. I already explained why an image and/or HTML5 is not a character. I?ll repeat again. And the world of characters is not limited to emoji. 1. HTML5 doesn?t separate one particular representation (font, size, etc) from the actual meaning of the character. So you can?t paste it somewhere and expect to increase its point size or change its font. 2. It?s highly inefficient in space to drop multi-kilobyte strings into a document to represent one character. 3. The entire design of HTML has nothing to do with characters. So there is no way to process a string of characters interspersed with HTML elements and know which of those elements are a ?character?. 
This makes programatic manipulation impossible, and means most computer applications simply will not allow HTML in scenarios where they expect a list of ?characters?. 4. There is no way to compare 2 HTML elements and know they are talking about the same character. I could put some HTML representation of a character in my document, you could put a different one in, and there would absolutely no way to know that they are the same character. Even if we are in the same community and agree on the existence of this character. 5. Similarly, there is no way to search or index html elements. If a HTML document contained an image of a particular custom character, there would be no way to ask google or whatever to find all the documents with that character. Different documents would represent it differently. HTML is a rendering technology. It makes things LOOK a particular way, without actually ENCODING anything about it. The only part of of HTML that is actually searchable in a deterministic fashion is the part that is encoded - the unicode part. > > The community interested in tony the tiger can make decisions like that. > > That is a hell of a handwave. In practice, you've got a complex decision that's always going to be a bit controversial, and one a decision that most communities won't bother trying to make. Apparently the world makes decisions all the time without meeting in committee. Strange but true. It?s called making a decision. Facebook have created a lot of emoji characters without consulting any committee and it seems to work fine, albeit restricted to the facebook universe because of a lack of a standard. > > > You can?t know because they?re images. > > You can't know because the only obvious equivalence relation is exact image identity. Because? there is no standard!! If facebook wants to define 2 emoji images, maybe one is bigger than the other, and yet basically the same, to mean the same thing, then that would be their choice. Since I expect they have a lot of smart people working there, I expect it would work rather well. Just like Microsoft issues courier fonts in different point sizes and we all feel they have made that work fairly well. You seem to be arguing the nonsense position that if someone for example, made a snowflake glyph slightly different to the unicode official one, that it is wrong. That of course is nonsense. People can make sensible decisions about this without the unicode committee. > > You can?t iterate over compressed bits. You can?t process them. > > Why not? In any language I know of that has iterators, there would be no problem writing one that iterates over compressed input. If you need to mutate them, that is hard in compressed formats, but a new CPU can store War in Peace in the on-CPU cache. You can?t do it because no standard library, programming language, or operating system is set up to iterate over characters of compressed data. So if you want to shift compressed bits around in your app, it will take an awful lot of work, and the bits won?t be recognised by anyone else. Now if someone wants to define the next version of unicode to be a compressed format, and every platform supports that with standard libraries, computer languages etc, then fine that could work. Yet again I point out, lots of things MIGHT be possible in the real world IF that is how a standard is formulated. But all the chatter about this or that technology is pie in the sky without that standard. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From idou747 at gmail.com Thu Jun 4 03:03:12 2015 From: idou747 at gmail.com (Chris) Date: Thu, 4 Jun 2015 18:03:12 +1000 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: > > No, that's why you include a reference to the font in the private agreement, so that interested parties can install it and see the special character(s). People with their iphones and ipads and so forth don?t want to have ?private agreements?, they don?t want to ?install character sets?. The want it to ?just work?. I wish Steve Jobs was here to give this lecture. I highly doubt actually that it is even possible to install a private character set font on an iphone such that it would be available to all applications. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. From wjgo_10009 at btinternet.com Thu Jun 4 03:46:05 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Thu, 4 Jun 2015 09:46:05 +0100 (BST) Subject: Custom characters (was: Re: Private Use Area in Use) Message-ID: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Chris expressed an idea, hypothetically starting: > Characters are 64 bit. The following posts might be helpful. http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0277.html http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0307.html For 64 bits, or somewhere in that region, maybe just a few bits less, a longer sequence of high surrogate characters followed by a low surrogate character could possibly be used. I did also find the following post. http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0256.html I thought that I would mention it, though I cannot quite at the moment understand the issue. William Overington 4 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 08:04:49 2015 From: idou747 at gmail.com (John) Date: Thu, 04 Jun 2015 06:04:49 -0700 (PDT) Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Message-ID: <1433423088288.48975bc8@Nodemailer> It occurs to me that the existing DNS system was designed to map 32bit numbers to domain names. So a hypothetical UTF64 format, with 32 bits of provider ID could be co-opted into the DNS system under a different record domain (Similar to how there is A records, and MX records, there could be UTF records.) Then all that would need defining would be some kind of directory hierarchy convention. Like /codepoint-number/font-type/font-size or whatever, and rendering engines could automatically lookup DNS, download from the web site via HTTP the font or bitmap or whatever, and seamlessly show you the right character. 
It wouldn?t be overly hard to implement, and a format without headers like this one, in the same general style as UTF-16 and UTF-32, wouldn?t upset the normal programming style of working with characters, so programming languages and existing apps wouldn?t have that much difficulty in upgrading. Mostly just a matter of upgrading the character size. I think this stuff could be relatively easy to define and standardise. You could basically define the entire technology in 1 A4 document. People have just got to want it badly enough to agree on it, and give it the imprimatur of the consortium. ? Chris On Thu, Jun 4, 2015 at 6:49 PM, William_J_G Overington wrote: > Chris expressed an idea, hypothetically starting: >> Characters are 64 bit. > The following posts might be helpful. > http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0277.html > http://www.unicode.org/mail-arch/unicode-ml/y2011-m08/0307.html > For 64 bits, or somewhere in that region, maybe just a few bits less, a longer sequence of high surrogate characters followed by a low surrogate character could possibly be used. > I did also find the following post. > http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0256.html > I thought that I would mention it, though I cannot quite at the moment understand the issue. > William Overington > 4 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Jun 4 09:39:27 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 14:39:27 +0000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <1433423088288.48975bc8@Nodemailer> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: On Thu, Jun 4, 2015 at 6:09 AM John wrote: > Mostly just a matter of upgrading the character size. Which totally blows any concern with text size out of the water. Using 30 bytes to define certain very rare characters and 1 byte to define ASCII is way better then using 8 bytes to define all characters. I think this stuff could be relatively easy to define and standardise. You > could basically define the entire technology in 1 A4 document. People have > just got to want it badly enough to agree on it, and give it the imprimatur > of the consortium. > > Then define it. It doesn't need Unicode involved at all, unless nobody really wants it enough to use it without it getting tossed into the Unicode package. -------------- next part -------------- An HTML attachment was scrubbed... URL: From parker at parkerhiggins.net Thu Jun 4 11:43:20 2015 From: parker at parkerhiggins.net (Parker Higgins) Date: Thu, 4 Jun 2015 09:43:20 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <1433377053793.5f2c25d8@Nodemailer> <556FA289.7070703@att.net> Message-ID: On Thu, Jun 4, 2015 at 12:43 AM, Chris wrote: > > Characters are 64 bit. 32 bits are stripped off as the ?character set > provider ID?. That is sent to one of many canonical servers akin to DNS > servers to find the URL owner of those characters. At that location you?d > find a number of representations of the character whether TrueType, vector > graphics, bitmaps or whatever. The rendering engine would download the > representation and display it to the user. All without the user having to > know anything about character sets, custom fonts or whatever. > > So you come across character 12340000000017. The OS asks charset server > who owns charset 1234. 
They reply ?facebook.com/charsets?. The OS asks > facebook.com/charsets for facebook.com/charsets/17/truetype/pointsize12 > representation. > > All this happens invisible to the user. Of course if it is already cached > on their machine, then it wouldn?t happen. > Just in case you haven't considered this, there are LOTS of circumstances where this could be a problem from a user's perspective, or even abused by the provider. We've already moved largely from automatically displaying *images* of remote origin in email for privacy concerns?I don't really need Facebook (in your example, but substitute for an abusive spouse or a repressive government if it makes you feel better) knowing when I am reading plaintext documents on my own local machine. Thanks, Parker -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmus-inc at ix.netcom.com Thu Jun 4 14:30:31 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 04 Jun 2015 12:30:31 -0700 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> Message-ID: <5570A757.1010301@ix.netcom.com> On 6/4/2015 1:46 AM, William_J_G Overington wrote: > I thought that I would mention it, though I cannot quite at the moment > understand the issue. I'm long past where I'm sure I understand what the issue is. :) A./ -------------- next part -------------- An HTML attachment was scrubbed... URL: From richard.wordingham at ntlworld.com Thu Jun 4 14:36:26 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Thu, 4 Jun 2015 20:36:26 +0100 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: <20150604203626.64f88aa9@JRWUBU2> On Thu, 04 Jun 2015 14:39:27 +0000 David Starner wrote: > On Thu, Jun 4, 2015 at 6:09 AM John wrote: > > > Mostly just a matter of upgrading the character size. > > > Which totally blows any concern with text size out of the water. > Using 30 bytes to define certain very rare characters and 1 byte to > define ASCII is way better then using 8 bytes to define all > characters. The character size can be increased to 64 bits in such a way that no new surrogates are required, current UTF-8 text remains UTF-8, current UTF-16 text remains UTF-16 and current UTF-32 remains UTF-32, the extended UTF-8 still has 8-bit code units, the extended UTF-16 still has 16-bit units, and the extended UTF-32 still has 32-bit code units. In fact, the character size can be made unbounded. The trick is to extend UTF-8 indefinitely, and then for UTF-16 and UTF-32 repeat the idea of the UTF-8 scheme using sequences of two or more low surrogates (or two or more high surrogates - one must chose) much as UTF-8 uses bytes. Tom Bishop publicised the idea. Richard. 
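(The UTF-8 half of that trick is just the familiar prefix-code pattern allowed to continue past four bytes. The sketch below only illustrates that general idea; it is not the specific scheme Tom Bishop published, the sequences it emits beyond U+10FFFF are of course not valid UTF-8, and it stops at seven-byte sequences, 36 bits of payload, rather than being unbounded.)

    # Generalised UTF-8-style encoder: the same prefix code as UTF-8, allowed to
    # continue past the 4-byte / U+10FFFF limit. Sequences longer than 4 bytes
    # are NOT valid UTF-8; this only shows how the pattern extends.
    def encode_extended(value):
        if value < 0x80:                       # 1 byte: 0xxxxxxx, plain ASCII
            return bytes([value])
        n = 1
        while True:
            n += 1                             # try an n-byte sequence
            payload_bits = 6 * (n - 1) + (7 - n)
            if value < (1 << payload_bits):
                break
            if n == 7:
                raise ValueError("value too large for this 7-byte sketch")
        out = []
        v = value
        for _ in range(n - 1):                 # continuation bytes 10xxxxxx, low bits first
            out.append(0x80 | (v & 0x3F))
            v >>= 6
        out.append(((0xFF << (8 - n)) & 0xFF) | v)   # lead byte: n ones, a zero, top bits
        return bytes(reversed(out))

    print(encode_extended(0x41).hex())         # 41             -- unchanged ASCII
    print(encode_extended(0x1F600).hex())      # f09f9880       -- still ordinary UTF-8
    print(encode_extended(0x200000000).hex())  # fe888080808080 -- 7 bytes, past U+10FFFF

(The UTF-16 and UTF-32 extensions Richard describes would apply the same pattern to sequences of surrogates or of 32-bit code units.)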
From frederic.grosshans at gmail.com Thu Jun 4 15:05:33 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Thu, 04 Jun 2015 20:05:33 +0000 Subject: Another take on the English apostrophe in Unicode Message-ID: An interesting argument for U+02BC MODIFIER LETTER APOSTROPHE as English apostrophe : https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ Fr?d?ric -------------- next part -------------- An HTML attachment was scrubbed... URL: From markus.icu at gmail.com Thu Jun 4 16:34:27 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 4 Jun 2015 14:34:27 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Looks all wrong to me. "don?t" is a contraction of two words, it is not one word. English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawai?ian ?Okina .) You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Thu Jun 4 20:31:09 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 01:31:09 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote: > "don?t" is a contraction of two words, it is not one word. > But as he points out, it's not a contraction of don and t; it is, at best, a contraction of do and n't. It's eliding, not punctuating. In the comments, he also brings up the examples of "Don?t you mind?" being okay but not *"Do not you mind?", and "fo?c?sle". > You can't use simple regular expressions to find word boundaries. Who uses _simple_ regular expressions? You can't use any code to reliably find word boundaries in English, and that's a problem. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Thu Jun 4 21:01:56 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 4 Jun 2015 19:01:56 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the work ack-ack isn't decomposable into words, or even morphemes, "ack" and "ack". Leo On Thu, Jun 4, 2015 at 6:31 PM, David Starner wrote: > On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer > wrote: > >> "don?t" is a contraction of two words, it is not one word. >> > > But as he points out, it's not a contraction of don and t; it is, at best, > a contraction of do and n't. It's eliding, not punctuating. In the > comments, he also brings up the examples of "Don?t you mind?" being okay > but not *"Do not you mind?", and "fo?c?sle". > > > You can't use simple regular expressions to find word boundaries. > > Who uses _simple_ regular expressions? You can't use any code to reliably > find word boundaries in English, and that's a problem. 
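(The practical difference between the two candidates is easy to see with a deliberately naive tokenizer: U+2019 has General_Category Pf, final punctuation, while U+02BC is Lm, a modifier letter, so only the latter survives inside a \w+ "word". The behaviour shown is Python's re module; a UAX #29 word-boundary implementation treats an apostrophe between letters more carefully, which is Markus's point about not relying on simple regular expressions.)

    # Contrast the two candidate apostrophes under a naive \w+ tokenizer.
    import re
    import unicodedata

    for apostrophe in ("\u2019", "\u02BC"):
        word = "don" + apostrophe + "t"
        print(
            f"U+{ord(apostrophe):04X}",
            unicodedata.category(apostrophe),   # 'Pf' for U+2019, 'Lm' for U+02BC
            re.findall(r"\w+", word),           # ['don', 't'] vs ['donʼt']
        )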
> -------------- next part -------------- An HTML attachment was scrubbed... URL: From idou747 at gmail.com Thu Jun 4 22:26:33 2015 From: idou747 at gmail.com (Chris) Date: Fri, 5 Jun 2015 13:26:33 +1000 Subject: Custom characters (was: Re: Private Use Area in Use) In-Reply-To: References: <24574715.13355.1433407565678.JavaMail.defaultUser@defaultHost> <1433423088288.48975bc8@Nodemailer> Message-ID: <1A29B678-92FC-42E4-9B00-4C0F0078112C@gmail.com> > > I think this stuff could be relatively easy to define and standardise. You could basically define the entire technology in 1 A4 document. People have just got to want it badly enough to agree on it, and give it the imprimatur of the consortium. > > Then define it. It doesn't need Unicode involved at all, unless nobody really wants it enough to use it without it getting tossed into the Unicode package. That?s like saying that nobody really wanted anything Unicode published because they could have done it themselves. That?s what the anti-custom character arguments around here claim, so why not disband? The problem at hand is that everybody out there who does have some kind of requirement IS defining their own proprietary solution, which is different to everybody else?s solution. Even on this very thread people can?t decide if the right way to address this is PUAs and custom character maps, or HTML5 snippets, and we?ve had a few other suggestions too! > I don't really need Facebook (in your example, but substitute for an abusive spouse or a repressive government if it makes you feel better) knowing when I am reading plaintext documents on my own local machine. Well? I would think in the vast majority of circumstances there would be no downloading involved. A typical scenario would be you use a Facebook app, or access a Facebook web site for example. That would cause downloading all the associated custom characters. Then you might do something like copy your text into say Microsoft word. No downloading because it?s already on your machine. I would anticipate most apps would choose to, if appropriate, cache them in their file format. So if you send the word document to someone else they also would have no downloading. Maybe then if that person decided to cut the characters out of that document into another app, maybe increase the font size or something, maybe then a download would be required. OK, so what about in that situation, the user takes some action that results in the rendering engine finding an unknown character? I can think of a lot of ways to address that and solve privacy. Here is one possibility. All unknown characters are rendered like this: Then when you click on the character, the OS?s font engine will locate and download it, and display it to the user. So the user had the choice, leave them unrendered, or download. Pretty simple for the user to learn, and gives them the choice. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: attachment.jpeg Type: image/jpeg Size: 3026 bytes Desc: not available URL: From prosfilaes at gmail.com Thu Jun 4 23:25:52 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 04:25:52 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. 
On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis wrote: > Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for > example, the work ack-ack isn't decomposable into words, or even morphemes, > "ack" and "ack". > > Leo > > On Thu, Jun 4, 2015 at 6:31 PM, David Starner > wrote: > >> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer >> wrote: >> >>> "don?t" is a contraction of two words, it is not one word. >>> >> >> But as he points out, it's not a contraction of don and t; it is, at >> best, a contraction of do and n't. It's eliding, not punctuating. In the >> comments, he also brings up the examples of "Don?t you mind?" being okay >> but not *"Do not you mind?", and "fo?c?sle". >> >> > You can't use simple regular expressions to find word boundaries. >> >> Who uses _simple_ regular expressions? You can't use any code to reliably >> find word boundaries in English, and that's a problem. >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Fri Jun 5 01:01:53 2015 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 4 Jun 2015 23:01:53 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: > Hyphens generally make multiple words into one anyway. There's not really > multiple hyphens the way there's separate quotes and apostrophes. > Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. Leo > On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis wrote: > >> Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, >> for example, the work ack-ack isn't decomposable into words, or even >> morphemes, "ack" and "ack". >> >> Leo >> >> On Thu, Jun 4, 2015 at 6:31 PM, David Starner >> wrote: >> >>> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer >>> wrote: >>> >>>> "don?t" is a contraction of two words, it is not one word. >>>> >>> >>> But as he points out, it's not a contraction of don and t; it is, at >>> best, a contraction of do and n't. It's eliding, not punctuating. In the >>> comments, he also brings up the examples of "Don?t you mind?" being okay >>> but not *"Do not you mind?", and "fo?c?sle". >>> >>> > You can't use simple regular expressions to find word boundaries. >>> >>> Who uses _simple_ regular expressions? You can't use any code to >>> reliably find word boundaries in English, and that's a problem. >>> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 01:58:24 2015 From: prosfilaes at gmail.com (David Starner) Date: Thu, 04 Jun 2015 23:58:24 -0700 Subject: Another take on the English apostrophe in Unicode Message-ID: On June 4, 2015, at 11:01 PM, Leo Broukhis wrote: > > >On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: > >Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. > >Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From leob at mailcom.com Fri Jun 5 02:16:07 2015 From: leob at mailcom.com (Leo Broukhis) Date: Fri, 5 Jun 2015 00:16:07 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: > But the point was that treating hyphens as parts of words is not generally a wrong thing. That brings us back to my original question: where's MODIFIER LETTER HYPHEN, then? A word is a sequence of letters, isn't it? :) I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Leo On Thu, Jun 4, 2015 at 11:58 PM, David Starner wrote: > On June 4, 2015, at 11:01 PM, Leo Broukhis wrote: > >> >> >>On Thu, Jun 4, 2015 at 9:25 PM, David Starner wrote: >> >>Hyphens generally make multiple words into one anyway. There's not really >> multiple hyphens the way there's separate quotes and apostrophes. >> >>Generally, but not always, just as apostrophes aren't always at a >> contracted word boundary. There is only one hyphen because no language >> (AFAIK) claims it as part of its alphabet. > > But the point was that treating hyphens as parts of words is not generally a > wrong thing. There is one generally consistent rule for hyphens. When > apostrophes and quotes are conflated, there is no one generally acceptable > rule. From qsjn4ukr at gmail.com Fri Jun 5 04:43:49 2015 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Fri, 5 Jun 2015 12:43:49 +0300 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: The conflict is between linguists and programmers. In plain text apostrophe is a punctuation used instead letters (unreadable, one or more) or as separator for avoid connecting letters into ligature or syllable, between parts of composite word as well as inside the simple word, or finally, as quotation mark. Yes it is ambiguous! It is. It just is! Linguists say "It is. We see that. We know that". And programmers say "That's wrong! We can't understand that". Just are you so stupid if you can't! Modifier letter apostrophe is a letter that used as itself and means itself (ejective sound e.g.) only. Don't use it else. It just make more confusion. From wjgo_10009 at btinternet.com Fri Jun 5 04:48:01 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 10:48:01 +0100 (BST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> Markus Scherer wrote: > How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a "show in colour mode" where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively. If people want this facility, maybe it could become published in a Unicode Technical Report so that standardization and interoperability could be achieved. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From duerst at it.aoyama.ac.jp Fri Jun 5 04:49:14 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?B?Ik1hcnRpbiBKLiBEw7xyc3Qi?=) Date: Fri, 5 Jun 2015 18:49:14 +0900 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> Message-ID: <5571709A.4010801@it.aoyama.ac.jp> On 2015/06/04 17:03, Chris wrote: > I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution. I'm not sure he would come to the same conclusion as you. > This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. > > Now if the unicode consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. You are right that it would be strictly technically possible. Not only that, it has been so for 10 or 20 years. As an example, in 1996 at the WWW Conference in Paris I was participating in a workshop on internationalization for the Web, and by chance I was sitting between the participant from Adobe and the participant from Microsoft. These were the main companies working on font technology at that time, and I asked them how small it would be possible to make a font for a single character using their technologies (the purpose of such a font, as people on this thread should be able to guess, would be as part of a solution to exchange single, "user-defined" characters). I don't even remember their answers. The important thing here that the idea, and the technology, have been around for a long time. So why didn't it take on? Maybe the demand is just not as big as some contributors on this list claim. Also, maybe while the technology itself isn't rocket science, the responsible people at the relevant companies have enough experience with technology deployment to hold back. To give an example of why the deployment aspect is important, there were various Web-like hypertext technologies around when the Web took off in the 1990. One of them was called HyperG. It was technologically 'better' than the Web, in that it avoided broken links. But it was much more difficult to deploy, and so it is forgotten, whereas the Web took off. Regards, Martin. From asmus-inc at ix.netcom.com Fri Jun 5 05:46:10 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Fri, 05 Jun 2015 03:46:10 -0700 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <5571709A.4010801@it.aoyama.ac.jp> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> <5571709A.4010801@it.aoyama.ac.jp> Message-ID: <55717DF2.4030705@ix.netcom.com> On 6/4/2015 17:03 , "Chris" wrote: > This whole discussion is about the fact that it would be technically > possible to have private character sets and private agreements that > your OS downloads without the user being aware of it. 
The sticky issues are not the questions of how to make available fonts or images for use by the OS. Instead, they concern the fact that any such a model violates some pretty basic guarantees of plain text that the entire net infrastructure relies on. There are very obvious security issues. The start with tracking; every time you access a custom code point, that fact potentially results in a trackable interaction. This problem affects even the "sticker" solution that people are hoping for for emoji. (On my system, no external resources are displayed when I first open any message, and there is a reason for that). Beyond tracking, and beyond stickers (that is pictures that look like pictures) a generalized custom character set would allow "text" that is no longer really stable. You would be able to deliver identical e-mails to people that display differently, because when you serve the custom fonts, you would be able to customize what you deliver under the same custom character set designator. While this would be a wonderful way to circumvent censorship (other than the "man in the middle" version), you would likewise seriously undermine the ability to filter unwanted or undesirable texts, because the custom character set engine might recognize when a request comes from a filter and not the end user. (Just the other day, I came across a hacked website that responded differently to search engined than to live users, making the hack effective for one and invisible to the other. Custom character sets would seem to just add to the hackers' arsenal here). Finally, custom character sets sound like a great idea when thinking of an extension of an existing character set. But that's not where the issues are. The issues come in when you use the same technology to provide aliases for existing code points or for other custom characters. Aliasing undermines the ability to do search (or any other content-focused processing, from sorting to spell-check). At that point, the circle closes. When Unicode was created, the alternative then was ISO 2022, which was a standard that addressed the issue of how to switch among (albeit pre-defined) character sets to achieve, in principle, coverage equal to the union of these character sets. Unicode was created to address two main deficiencies of that situation. Unification addressed the aliasing issue, so that code points were no longer "opaque" but could be interpreted by software (other than display), which was the second big drawback of the patchwork of character sets. A processing model for opaque code points is possible to define, but it isn't very practical and in the late eighties people had had enough were glad to be quit of it. Seen from this perspective, the discussion about custom character sets presents itself as a giant step backward, undermining the very advances that underlie the rapid acceptance and spread of Unicode. A./ -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From wjgo_10009 at btinternet.com Fri Jun 5 06:20:33 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 12:20:33 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <55717DF2.4030705@ix.netcom.com> References: <20150603075633.665a7a7059d7ee80bb4d670165c8327d.6a2bca217c.wbe@email03.secureserver.net> <1433377259559.1a60883d@Nodemailer> <5571709A.4010801@it.aoyama.ac.jp> <55717DF2.4030705@ix.netcom.com> Message-ID: <31341075.26231.1433503233504.JavaMail.defaultUser@defaultHost> Asmus Freytag wrote about security issues. This is interesting reading and I have learned a lot from the post about various security issues. Whilst the post is in this thread and follows from a post in this thread, the topic has seemed to moved to the Custom characters thread. I note that what you write about seems to me that it would not apply to my suggestion in my original post: is that correct? http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html Also the following two posts. http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html Whilst the ideas raised by Chris are interesting, they do seem to be distinctly different from what I suggested. So, for clarity, do you regard my suggested format as having any security issues, and if so, what please? I know that some people have opined that my suggested format is out of scope for Unicode, yet the scope of Unicode is what the Unicode Technical Committee decides is the scope of Unicode, and my suggested format does provide a way to include custom glyphs within a Unicode plain text document by using the new base character followed by tag characters method. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 08:15:09 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 13:15:09 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis wrote: > I agree that conflating apostrophes and quotes is a source of > problems, however, existence of the MODIFIER LETTER [same glyph as > used for English contractions] in Unicode is a coincidence which > should not have an effect on usage of apostrophes in English. Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely then creating a new one. -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Jun 5 08:51:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 5 Jun 2015 14:51:31 +0100 (BST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> References: <6828292.17140.1433497681745.JavaMail.defaultUser@defaultHost> Message-ID: <13990923.38289.1433512291110.JavaMail.defaultUser@defaultHost> Markus Scherer wrote: >> How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? 
I replied: > Would it be possible to have wordprocessing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and could there be a "show in colour mode" where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? I am wondering whether some existing software packages might be able to be used for the character inputting part using customized keyboard short cuts. https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts I realize that the cyan and red colours cannot be done at present, yet I have now thought of the alternative for now of being able to test what is in the text by using a special version of an open source font where there are distinctive glyphs one from the other for the two characters. William Overington 5 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From prosfilaes at gmail.com Fri Jun 5 09:13:04 2015 From: prosfilaes at gmail.com (David Starner) Date: Fri, 05 Jun 2015 14:13:04 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR wrote: > The conflict is between linguists and programmers. No, it's not. > Yes it is ambiguous! > It is. It just is! Linguists say "It is. We see that. We know that". > "Now you programmers find some way to deal with that so you can produce useful corpuses for linguistic work." Which is what this is all about, is producing good linguistic interpretations of plain text, for, among others, linguists whose supply of scanned text has exceeded their ability to hand-process it. > Modifier letter apostrophe is a letter that used as itself and means > itself (ejective sound e.g.) only. Don't use it else. It just make > more confusion. > If you don't know what language a text is in, you can't tell what sounds letters make. Adding this character to English's repertoire won't change that. -------------- next part -------------- An HTML attachment was scrubbed... URL: From KalvesmakiJ at doaks.org Fri Jun 5 09:26:50 2015 From: KalvesmakiJ at doaks.org (Kalvesmaki, Joel) Date: Fri, 5 Jun 2015 14:26:50 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I don?t have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N?T."Language59, no. 3 (1983): 502?513. It?s nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435 From mark at macchiato.com Fri Jun 5 09:47:15 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 5 Jun 2015 16:47:15 +0200 Subject: =?UTF-8?B?aHR0cDovL+KciPCfjrDwn5K4Lndz?= Message-ID: -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Fri Jun 5 09:48:09 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 5 Jun 2015 16:48:09 +0200 Subject: =?UTF-8?B?UmU6IGh0dHA6Ly/inIjwn46w8J+SuC53cw==?= In-Reply-To: References: Message-ID: Whoops, sent too soon. A surprise: http://?????.ws Mark *? 
Il meglio ? l?inimico del bene ?* On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ?? wrote: > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Jun 5 10:36:27 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 05 Jun 2015 08:36:27 -0700 Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> I wrote, crumpled up, and threw away about three different responses. I thought about ISO 2022 and about accessing the web for every PUA character, as Asmus mentioned, and about the size of the user base, as Martin mentioned. I thought about character properties and about ephemerality. I didn't think of the spoofing implications that Asmus described, which would affect both the automatic PUA font download and the inline drawing language. Either of these could be used to spell out, let's say, "paypal.com" rather convincingly and with minimal effort. I might have more experience with the PUA than many list members, having transcribed the 27,000-word "Alice's Adventures in Wonderland" into my constructed alphabet two years ago, in a PUA encoding, so that Michael Everson could publish it in book form. One of the many learning experiences of this project was finding out which software tools play nicely with the PUA and which don't. Some tools "just worked" while others would not give acceptable results with any amount of effort. At no point, however, did I suppose that a font with my alphabet, or any of the jillions of others that have been invented "during a boring day in class" (see Omniglot for tons of examples), should be silently downloaded to a user's computer, consuming bandwidth and disk space, without her knowledge. That's practically malware. Maybe I'm just not enough of a Distinguished Visionary to understand how insanely great this would be (unfortunately, celebrity name-dropping doesn't work with me). Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like "strongly discouraged" and "not interoperable" even in the presence of an agreement. Given this, and given that no system I'm aware of magically downloads fonts for *regularly encoded characters* (I still have no font for Arabic math symbols), I personally would not expect Unicode to perform a 180 on this. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Fri Jun 5 10:40:37 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 05 Jun 2015 08:40:37 -0700 Subject: Another take on the English apostrophe in Unicode Message-ID: <20150605084037.665a7a7059d7ee80bb4d670165c8327d.08a0959f19.wbe@email03.secureserver.net> QSJN 4 UKR wrote: > And programmers say "That's wrong! We can't understand that". Just are > you so stupid if you can't! You know, we really aren't all like that. Some of us actually try to meet user needs. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From daniel.buenzli at erratique.ch Fri Jun 5 10:48:13 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 5 Jun 2015 16:48:13 +0100 Subject: ucd beta, stable filenames Message-ID: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Hello, Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like http://www.unicode.org/Public/beta/ and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it hard for implementers to automate file downloads for testing with the beta. Thanks, Daniel From daniel.buenzli at erratique.ch Fri Jun 5 10:53:44 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Fri, 5 Jun 2015 16:53:44 +0100 Subject: ucd beta, stable filenames In-Reply-To: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> References: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Message-ID: <60241C8F9FA14A49B5D230EF3900DC4D@erratique.ch> Le vendredi, 5 juin 2015 ? 16:48, Daniel B?nzli a ?crit : > and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). Or both with and without the suffix of course. Daniel From john at mitre.org Fri Jun 5 12:29:10 2015 From: john at mitre.org (John D. Burger) Date: Fri, 5 Jun 2015 13:29:10 -0400 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: > On Jun 4, 2015, at 17:34 , Markus Scherer wrote: > > Looks all wrong to me. > > "don?t" is a contraction of two words, it is not one word. Yes it is. Is "keyboard" two words? How about "newspaper"? If "don't" is two words, please tell me what two words make up "won't"? (Hint, neither of them is "will".) Linguistically, "don't" and friends pass all the diagnostics that indicate they're single words. - John Burger > English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawai?ian ?Okina.) > > You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. > > Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. > > If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? > > markus From mark at kli.org Fri Jun 5 17:32:08 2015 From: mark at kli.org (Mark E. Shoulson) Date: Fri, 05 Jun 2015 18:32:08 -0400 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> References: <20150605083626.665a7a7059d7ee80bb4d670165c8327d.915f05c4c9.wbe@email03.secureserver.net> Message-ID: <55722368.40703@kli.org> On 06/05/2015 11:36 AM, Doug Ewell wrote: > At no point, however, did I suppose that a font with my alphabet, or any > of the jillions of others that have been invented "during a boring day > in class" (see Omniglot for tons of examples), should be silently > downloaded to a user's computer, consuming bandwidth and disk space, > without her knowledge. That's practically malware. Maybe I'm just not > enough of a Distinguished Visionary to understand how insanely great > this would be (unfortunately, celebrity name-dropping doesn't work with > me). 
> > Unicode has stated consistently for at least 23 years that it would not > ever standardize PUA usage, and over the years some UTC members have > used terms like "strongly discouraged" and "not interoperable" even in > the presence of an agreement. Given this, and given that no system I'm > aware of magically downloads fonts for *regularly encoded characters* (I > still have no font for Arabic math symbols), I personally would not > expect Unicode to perform a 180 on this.
Isn't this what webfonts are all about? You specify a font in the stylesheet, give it a URL, and your browser goes and downloads it and displays the text in it. That seems to me to be a perfectly reasonable use of this sort of "evil font trick" in the PUA (and who knows, even in encoded text? No, I can think of some Bad Things that could result). There isn't anything to stop you from making a page with webfonts that looks like it says one thing but when you copy/paste the text it's something completely different. I should do that someday, just for demonstration purposes... ~mark
From eric.muller at efele.net Fri Jun 5 20:06:06 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 05 Jun 2015 18:06:06 -0700 Subject: ucd beta, stable filenames In-Reply-To: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> References: <1657354AE6CA4AFE993ED0985B6D5F4A@erratique.ch> Message-ID: <5572477E.6000009@efele.net>
On 6/5/2015 8:48 AM, Daniel Bünzli wrote: > Hello, > > Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like > > http://www.unicode.org/Public/beta/ > > and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it hard for implementers to automate file downloads for testing with the beta. > > +1000 Eric.
From eric.muller at efele.net Fri Jun 5 20:08:02 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 05 Jun 2015 18:08:02 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <557247F2.9050902@efele.net>
On 6/5/2015 10:29 AM, John D. Burger wrote: > Linguistically, "don't" and friends pass all the diagnostics that indicate they're single words. If I am not mistaken, the French "pomme de terre" also passes the diagnostics. So we need a new space character. Eric.
From wjgo_10009 at btinternet.com Sat Jun 6 09:37:28 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 6 Jun 2015 15:37:28 +0100 (BST) Subject: Tag characters and in-line graphics (from Tag characters) Message-ID: <3037447.25167.1433601448238.JavaMail.defaultUser@defaultHost>
Doug Ewell wrote: > Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like "strongly discouraged" and "not interoperable" even in the presence of an agreement.
I know that Doug and many others on this mailing list will well understand the following already, yet I feel that it is helpful to emphasise that Unicode does standardize PUA (Private Use Area) usage to the extent that Unicode standardizes which code points are designated as being in the three Private Use Areas and some default properties, such as being left to right. So, whilst Unicode does not standardize which glyphs are used for each code point in any situation, Unicode does standardize the infrastructure so that the Private Use Area can be successfully used.
So if, say, a much larger code space were needed wherein end users could among themselves agree how assignments could be made, it would not be unreasonable for Unicode to define the underlying infrastructure. There is a precedent in the way that the alt.* newsgroup hierarchy was incorporated into the Usenet email newsgroups in the time before the world wide web was invented. A person wishing to start a new alt.* newsgroup could post to alt.config and there was discussion for around a week, often with useful advice as to what name to have for the new newsgroup and the new newsgroup was then started. Regular Usenet newsgroups had a long process of votes to get a new newsgroup started, yet the alt.* newsgroups were different, allowing someone to start a new newsgroup on his or her own initiative. That was a very useful facility. William Overington 6 June 2015 From doug at ewellic.org Sun Jun 7 11:39:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 7 Jun 2015 10:39:38 -0600 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: References: Message-ID: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> "Mark E. Shoulson" wrote: > Isn't this what webfonts are all about? You specify a font in the > stylesheet, give it a URL, and your browser goes and downloads it and > displays the text in it. That's great if you have a stylesheet, a URL, and a browser. HTML is fancy text, and pretty much implies some sort of online connection. I thought we were talking about plain text, and apologize if we weren't or if that important detail was not clear. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun Jun 7 22:36:38 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 8 Jun 2015 05:36:38 +0200 Subject: Tag characters and in-line graphics (from Tag characters) In-Reply-To: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> References: <709DCA4BD4764121B6F9FECF99362CD1@DougEwell> Message-ID: 2015-06-07 18:39 GMT+02:00 Doug Ewell : > "Mark E. Shoulson" wrote: > > Isn't this what webfonts are all about? You specify a font in the >> stylesheet, give it a URL, and your browser goes and downloads it and >> displays the text in it. >> > > That's great if you have a stylesheet, a URL, and a browser. HTML is fancy > text, and pretty much implies some sort of online connection. Everything in HTML is embeddable in a standalone document, including graphics. HTML does not imply any online connection. HTML is independant of HTTP or other transports. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gilbert.lozano at gmail.com Mon Jun 8 14:59:50 2015 From: gilbert.lozano at gmail.com (Gilbert Lozano) Date: Mon, 8 Jun 2015 15:59:50 -0400 Subject: Small (minuscule) Message-ID: Can someone help me find the code for the small (minuscule) p with macron above? Many thanks, Gilbert Lozano -------------- next part -------------- An HTML attachment was scrubbed... URL: From gansmann at uni-bonn.de Mon Jun 8 15:40:36 2015 From: gansmann at uni-bonn.de (Gerrit Ansmann) Date: Mon, 08 Jun 2015 22:40:36 +0200 Subject: Small (minuscule) In-Reply-To: References: Message-ID: On Mon, 08 Jun 2015 21:59:50 +0200, Gilbert Lozano wrote: > Can someone help me find the code for the small (minuscule) p with macron above? U+0070: p U+0304: combining macron Put those two characters after each other and you get: p?. 
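As a minimal Python sketch of that answer (standard library only; the variable name is mine), the sequence is U+0070 followed by U+0304, and it has to stay decomposed, since the Unicode Character Database defines no precomposed "p with macron":

    import unicodedata

    # "p" followed by U+0304 COMBINING MACRON renders as p with a macron above.
    p_macron = "p\u0304"
    print([unicodedata.name(c) for c in p_macron])
    # ['LATIN SMALL LETTER P', 'COMBINING MACRON']

    # "a" + U+0304 composes to the precomposed U+0101 under NFC ...
    print(len(unicodedata.normalize("NFC", "a\u0304")))  # 1
    # ... but there is no precomposed "p with macron", so the pair stays as is.
    print(len(unicodedata.normalize("NFC", p_macron)))   # 2

Fonts with proper mark positioning will center the macron over the p; fonts without it may show the two side by side.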
From wjgo_10009 at btinternet.com Tue Jun 9 03:17:26 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Tue, 9 Jun 2015 09:17:26 +0100 (BST) Subject: Small (minuscule) In-Reply-To: References: Message-ID: <24946800.8782.1433837846291.JavaMail.defaultUser@defaultHost> Just in case this will help. Years ago I made a font that included a small p with a macron, the glyph for the small p with a macron being located in the plane 0 Private Use Area at U+E727. The font is Quest text. It is available free from the following web page. http://www.users.globalnet.co.uk/~ngo/fonts.htm http://forum.high-logic.com/viewtopic.php?f=10&t=682 The glyph is one of a number for special characters and ligatures in the font. Please note specifically that this is not the official Unicode encoding for the character. I simply mention this font just in case you are wanting to get a print out quickly for some reason. William Overington 9 June 2015 ----Original message---- >From : gilbert.lozano at gmail.com Date : 08/06/2015 - 20:59 (GMTST) To : unicode at unicode.org Subject : Small (minuscule) Can someone help me find the code for the small (minuscule) p with macron above? Many thanks, Gilbert Lozano -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: p_macron.png Type: image/png Size: 6326 bytes Desc: not available URL: From pandey at umich.edu Tue Jun 9 17:07:19 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 15:07:19 -0700 Subject: Accessing the WG2 document register Message-ID: Hello all, I learned today that the WG2 document register is not publicly accessible. This means that I, as a proposal author, have no means of accessing the documents that I contribute. Can someone associated with WG2 or anyone else in the know please tell me why these documents are under lock and key? All the best, Anshuman From pandey at umich.edu Tue Jun 9 18:11:26 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 19:11:26 -0400 Subject: Accessing the WG2 document register In-Reply-To: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> References: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> Message-ID: <29A7ACF0-789D-4B28-BD45-77A7C06568E8@umich.edu> Hi Ken, > On Jun 9, 2015, at 6:38 PM, Ken Lunde wrote: > > Welcome to ISO. ? I think I'll skip that party. ?? I've already started to add copyright statements to my proposals. Now I'll add another statement that says: "This document is intended for encoding the XYZ script in The Unicode Standard. If it and its contents are appropriated for encoding XYZ in ISO 10646, then ISO must make this document openly and publicly accessible to all." Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... All the best, Anshu From pandey at umich.edu Tue Jun 9 18:26:16 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Tue, 9 Jun 2015 19:26:16 -0400 Subject: Accessing the WG2 document register In-Reply-To: <5577745B.2000100@htpassport.com> References: <22110880-ACF0-49D0-86B5-B778F264D7BC@adobe.com> <29A7ACF0-789D-4B28-BD45-77A7C06568E8@umich.edu> <5577745B.2000100@htpassport.com> Message-ID: Shervin, > On Jun 9, 2015, at 7:18 PM, Shervin Afshar wrote: > > Anshuman Pandey observed: > > > Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... > > Hear, hear! 
I really wanted to punctuate my statement with a STAG emoji, or REINDEER at the very least. But, the closest thing I found was ??. Pragmatically on the dot, but unforch not semantically... Anshu From mailinglists at ngalt.com Tue Jun 9 18:46:06 2015 From: mailinglists at ngalt.com (Nathan Sharfi) Date: Tue, 9 Jun 2015 16:46:06 -0700 Subject: Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) In-Reply-To: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> References: <16867779.11307.1433319965960.JavaMail.defaultUser@defaultHost> Message-ID: <97C666FC-FFC0-42FC-BCAA-F5E01F93BE15@ngalt.com> > On Jun 3, 2015, at 1:26 AM, William_J_G Overington wrote: > > Private Use Area in Use (from Tag characters and in-line graphics (from Tag characters)) > > >>> That's not agreed upon. I'd say that the general agreement is that the private ranges are of limited usefulness for some very limited use cases (such as designing encodings for new scripts). > > >> They are of limited usefulness precisely because it is pathologically hard to make use of them in their current state of technological evolution. If they were easy to make use of, people would be using them all the time. I?d bet good money that if you surveyed a lot of applications where custom characters are being used, they are not using private use ranges. Now why would that be? > > > Actually, I have used Private Use Area characters a lot, and, once I had got used to them, I found them incredibly straightforward to use. That's nice; I've found some persistent annoyances when I use PUA codepoints. A while back I learned Quikscript, an alternate English orthography. Since May 2013, my blog's been in Quikscript using PUA codepoints. I've also joined the Shavian mailing list, sent e-mails in Shavian, and wrote an "I'm switching my Quikscript blog to Shavian" blog post in Shavian for April Fool's Day. To do all this typing, I made both Quikscript and Shavian keyboard layouts for OS X, as well as a Quikscript font. All of my Quikscript stuff is linked to from https://www.frogorbits.com/qs/ if you're interested. I'm something of a Johnny-come-lately to Shavian, so I've only used it in the SMP with fonts others have made. So, how much nicer is dealing with Shavian? - The Keyboard Viewer and input-source preview know what font to use for each key for Shavian; Quikscript keyboard layouts display boxes for the letters because there's no way for the system to guess which font to use for a particular codepoint. - Double-tapping a Shavian word in my browser will select the word; double-tapping a Quikscript word will select just one letter. - Internet Explorer will happily break Quikscript text in the middle of a word; Shavian gets broken at word boundaries just like English. While IE's behavior is unlike other browsers' and Not What I Want, I can't fault the IE team; I could be using PUA code points for a language that doesn't use spaces much, like Japanese. - I can read and write Shavian posts on Twitter on the desktop in a reasonable font for both Shavian and other scripts; if I wanted to do the same in Quikscript, I'd have to have a custom user-supplied stylesheet to override Twitter's own font suggestions. - Scripts already in Unicode attract the attention of talented completionist organizations that PUA communities generally can't attract beforehand. Everson Mono, Noto, and Segoe UI Historic (as of Windows 10) ? all great typefaces ? support Shavian and not Quikscript. 
This tends to be because: - I could have multiple fonts that have wildly differing meanings and glyphs mapped to the same code point; the OS can't guess which I might mean. - All the information that the OS needs to detect word breaks is in character properties data supplied by the Consortium and handled by the OS. ~ ~ ~ Specialists like us might be able to put up with these things, but we can't control everything about the reading and writing experience online unless we're all resigned to taking pictures of handwritten text. From samjnaa at gmail.com Tue Jun 9 21:18:26 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 07:48:26 +0530 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: On 6/10/15, Anshuman Pandey wrote: > I learned today that the WG2 document register is not publicly > accessible. Seems that the page http://std.dkuug.dk/jtc1/sc2/wg2/ or the repo it points to ftp://std.dkuug.dk/ftp.anonymous/JTC1/SC2/WG2/docs/ haven't been updated after 2014-10-29. At least there should be a notice saying this is no longer the active register, if this is being maintained for historical purposes! > This means that I, as a proposal author, have no means of > accessing the documents that I contribute. But why would you want to do that? I suppose everyone who submits Unicode proposals would have their own copies of their documents, and certainly the ISO doesn't modify the contents of any of these documents. > I've already started to add copyright statements to my proposals. Now I'll > add another statement that says: "This document is intended for encoding > the XYZ script in The Unicode Standard. If it and its contents are > appropriated for encoding XYZ in ISO 10646, then ISO must make this document > openly and publicly accessible to all." Hm -- I'd be interested to see how they respond. Re your wording: 1) "appropriated"? 2) Unicode and ISO 10646 are only nominally two different standards and effectively (i.e. apart from all those procedural details) the same, no? Now does the UTC still require us proposal authors to forward our docs to WG2 after UTC approval? I fail to see the point in that if whatever is part of Unicode is going to become part of ISO 10646, except for that if by closing its doors to proposal authors, the ISO is going to communicate only with the UTC, then the UTC would have to take upon itself the onus of forwarding all proposals to the ISO saying -- I'm sure the UTC doesn't want that. -- Shriramana Sharma ???????????? ???????????? From wjgo_10009 at btinternet.com Wed Jun 10 02:35:17 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 08:35:17 +0100 (BST) Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <5882613.5526.1433921717427.JavaMail.defaultUser@defaultHost> > This means that I, as a proposal author, have no means of accessing the documents that I contribute. I sent in a document some years ago and it was not even allowed to go into the list for discussion. It was said that it was out of scope. It is not clear whether people on the committees that discuss submitted documents were even aware that it had been submitted. I have submitted documents to the Unicode Technical Committee and some have been added to the list and one has been not added as it was said to be out of scope: however it was passed to the Chair of another Unicode Committee and it was considered. 
William Overington 10 June 2015 ----Original message---- >From : pandey at umich.edu Date : 09/06/2015 - 23:07 (GMTST) To : unicore at unicode.org Cc : unicode at unicode.org Subject : Accessing the WG2 document register Hello all, I learned today that the WG2 document register is not publicly accessible. This means that I, as a proposal author, have no means of accessing the documents that I contribute. Can someone associated with WG2 or anyone else in the know please tell me why these documents are under lock and key? All the best, Anshuman From wjgo_10009 at btinternet.com Wed Jun 10 03:25:19 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 09:25:19 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> > Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard. The fact that Unicode Inc. provides a valuable public service in making documents and encoding charts freely available to all who access the www.unicode.org website is not in any way the same as the provenance that ISO has of being recognised by governments around the world as providing standards for technological matters. I am not a lawyer, yet as I understand it, the underlying theory of standards work is that it is a legally permitted exception to a general legal prohibition of businesses meeting together to decide and agree what will be applied in industrial activity. Thus, for example, it is fine for businesses to agree that one particular code point will be used for the symbol for the Indian Rupee, as that helps consumers in that a message between computers of different brands can be passed and read successfully. Yet, for example, it is not permitted for businesses to meet together to decide that all computers will be in a grey plastic box, as that hinders choice for consumers. William Overington 10 June 2015 From jsbien at mimuw.edu.pl Wed Jun 10 04:07:32 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Wed, 10 Jun 2015 11:07:32 +0200 Subject: Accessing the WG2 document register In-Reply-To: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> Message-ID: <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> Quote/Cytat - William_J_G Overington (Wed 10 Jun 2015 10:25:19 AM CEST): >> Remind me why Unicode is still taking ISO to the dance? Sometimes >> going stag has its benefits... > > > As I understand it, Unicode Inc. is a recognised guest of ISO in > participating in ISO producing an International Standard. Cf. http://www.unicode.org/L2/L2014/14286-wg2-liaison.pdf Regards Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? 
- University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/
From pandey at umich.edu Wed Jun 10 04:19:02 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 05:19:02 -0400 Subject: Accessing the WG2 document register In-Reply-To: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> Message-ID: <7970040E-6547-4ECE-9B86-4DCB09D20C6A@umich.edu>
On Jun 10, 2015, at 4:25 AM, William_J_G Overington wrote: >> Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... > > > As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard.
Does Unicode need ISO to exist? Or does ISO need Unicode?
> The fact that Unicode Inc. provides a valuable public service in making documents and encoding charts freely available to all who access the www.unicode.org website is not in any way the same as the provenance that ISO has of being recognised by governments around the world as providing standards for technological matters.
ISO is a profit-making business. I worked on an ISO standard for the transliteration of Indic scripts two decades ago and I have yet to see the published standard. Back then I couldn't afford to buy the document and ISO didn't have the heart to give me a copy as a contributor. So, to this day, I have yet to see the official standard that I helped to develop. ISO needs to function as a non-profit organization with open access to all of its activities and publications.
> I am not a lawyer, yet as I understand it, the underlying theory of standards work is that it is a legally permitted exception to a general legal prohibition of businesses meeting together to decide and agree what will be applied in industrial activity.
And so ISO functions by relying upon contributions made by the public without granting either authorship or compensation to those who actually build their standards. And now they want to claim ownership of contributed documents...
> Thus, for example, it is fine for businesses to agree that one particular code point will be used for the symbol for the Indian Rupee, as that helps consumers in that a message between computers of different brands can be passed and read successfully.
This can be done without ISO...
> Yet, for example, it is not permitted for businesses to meet together to decide that all computers will be in a grey plastic box, as that hinders choice for consumers.
Who exactly is imposing these restrictions? Restriction of choice is an issue for political economy, not standards bodies.
All the best, Anshuman
From pandey at umich.edu Wed Jun 10 04:49:13 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 05:49:13 -0400 Subject: Accessing the WG2 document register In-Reply-To: <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> References: <29425841.9455.1433924719889.JavaMail.defaultUser@defaultHost> <20150610110732.12656g65zg7v3sk4@mail.mimuw.edu.pl> Message-ID:
> On Jun 10, 2015, at 5:07 AM, Janusz S. Bien wrote: > > Quote/Cytat - William_J_G Overington (Wed 10 Jun 2015 10:25:19 AM CEST): > >>> Remind me why Unicode is still taking ISO to the dance? Sometimes going stag has its benefits... >> >> >> As I understand it, Unicode Inc. is a recognised guest of ISO in participating in ISO producing an International Standard. > > Cf.
http://www.unicode.org/L2/L2014/14286-wg2-liaison.pdf This document provides further evidence of the irrelevance of ISO in the Unicode world. Deference. Janusz, what was your intention in providing a link to this document? All the best, Anshuman From pandey at umich.edu Wed Jun 10 05:01:44 2015 From: pandey at umich.edu (Anshuman Pandey) Date: Wed, 10 Jun 2015 06:01:44 -0400 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From eik at iki.fi Wed Jun 10 06:51:13 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Wed, 10 Jun 2015 14:51:13 +0300 Subject: Accessing the WG2 document register In-Reply-To: References: Message-ID: <001101d0a373$c06ab6d0$41402470$@fi> Andrew! I honestly believe that Michel as the WG2 Convener has little choice but to follow the JTC1 rules - and work actively to change them (hopefully having to spend less time on this than the years Mike had to spend to achieve the publicly available status for WG2-originated standards). Actually, I believe that a feasible solution would be to make Unicode a JTC1 PAS (Publicly Available Specification) submitter, and thus give the chance for the ISO/IEC JTC1/SC2 National Bodies to vote on the approval of TUS as an ISO standard. IRG (with possibly a somewhat expanded role, could/should still work under SC2 and co-operate with Unicode). Anshuman, I'd recommend that you withdraw your request to withdraw your contributions, because that would be of no help to the user communities involved. Sincerely Erkki I. Kolehmainen Tilkankatu 12 A 3, 00300 Helsinki, Finland Mob: +358400825943, Tel / Fax (by arr.): +358943682643 -----Alkuper?inen viesti----- L?hett?j?: Unicore [mailto:unicore-bounces at unicode.org] Puolesta Andrew West L?hetetty: 10. kes?kuuta 2015 12:18 Vastaanottaja: Anshuman Pandey Kopio: UnicoRe List; unicode Unicode Discussion Aihe: Re: Accessing the WG2 document register In the LiveLink system some document types are open and some document types are restricted, and you can see this in the SC2 document registry where some documents have a key icon against them and some do not. In the case of the WG2 document registry which is what Anshu is referring to, the list of documents is not even visible unless you are logged on to the system, which I believe to be completely unacceptable, and something I have questioned Michel about on several occasions. But even if the list of documents was to be visible to the public, they would all be password protected because of their document type ("Contributions"). I have suggested to Michel that a simple workaround would be to change the document type to one that is open to the public, even if the document type would not accurately reflect what sort of documents they are. 
The new restrictive rules for committee participation and document access have been forced on the committees by JTC1 (see JTC1 N12468 -- not publicly available, but there is a Google cache of the document if you search), and has caused considerable consternation among experts on the WG2 committee as well as in some national bodies. If you follow the new rules to the letter then WG2 is not allowed to even accept contributions from individuals who are not members of the relevant committee, which is quite ridiculous, and a severe handicap to many JTC1 working groups. I know that the BSI (representing the UK) is very unhappy with the restrictions on who can submit and access documents, and I hope (with little expectation) that the issue of document access will be raised at the next JTC1 plenary, and the rules changed. But in the meantime the rules are alienating experts such as Anshu, which is a great shame. Andrew On 9 June 2015 at 23:07, Anshuman Pandey wrote: > Hello all, > > I learned today that the WG2 document register is not publicly > accessible. This means that I, as a proposal author, have no means of > accessing the documents that I contribute. > > Can someone associated with WG2 or anyone else in the know please tell > me why these documents are under lock and key? > > All the best, > Anshuman From wjgo_10009 at btinternet.com Wed Jun 10 07:33:32 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 13:33:32 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> As I am not on the Unicore list, just the public mailing list, I am only picking up bits of what is going on. However, I make the following observations. I followed the link to http://linguistics.berkeley.edu/~pandey/ and from there, having looked at some of the items on that page, to http://unicode.org/conference/bulldog.html where there are some very nice things said about you. > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > http://linguistics.berkeley.edu/~pandey/ > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. The problem is that if you withdraw your contributions, then Unicode will not be as good as it otherwise would have been. May I ask you to reconsider please? You have made a very effective protest in that it has caused people to wonder what is going on. Whether your protest will have any effect on changing the rules is not yet known. Yet even if it has no effect at all on the rules, if you allow your contributions to stand there will be people who are not yet born who will benefit from your contributions. So, will you reconsider please? William Overington 10 June 2015 ----Original message---- >From : pandey at umich.edu Date : 10/06/2015 - 11:01 (GMTST) To : babelstone at gmail.com Cc : unicore at unicode.org, unicode at unicode.org Subject : Re: Accessing the WG2 document register Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. 
A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From everson at evertype.com Wed Jun 10 07:46:43 2015 From: everson at evertype.com (Michael Everson) Date: Wed, 10 Jun 2015 13:46:43 +0100 Subject: Accessing the WG2 document register In-Reply-To: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <624263BD-D22C-4CBE-A448-18AEBEF7DDC0@evertype.com> Anshu, This level of idealism does nobody any good. On 10 Jun 2015, at 11:01, Anshuman Pandey wrote: > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > Michael Everson * http://www.evertype.com/ From samjnaa at gmail.com Wed Jun 10 10:09:06 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 20:39:06 +0530 Subject: Accessing the WG2 document register In-Reply-To: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: On 6/10/15, Anshuman Pandey wrote: > withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A > list of the contributions that I withdraw is given at: > http://linguistics.berkeley.edu/~pandey/ > ... > Whoever has the task of coordinating with ISO, is that you Michel?, please > withdraw all of my contributions. Since a lot of currently encoded scripts owe their encoding to you, it seems the stability policy makes it impossible for your to withdraw *all* of your contributions. Frankly, while you *are* making a point by raising the issue, I don't think this is so serious a problem for you to consider such a drastic step. The ISO hasn't claimed "ownership" of your document, as you mention in another mail. They merely restrict public access to it. Your document is publicly available in another (probably better maintained, thanks to Rick) place -- so where's your worry? I agree that the ISO should have the courtesy to accord contributors special status, but such big organizations are often steeped in bureaucracy, and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... -- Shriramana Sharma ???????????? ???????????? From costello at mitre.org Wed Jun 10 10:10:28 2015 From: costello at mitre.org (Costello, Roger L.) Date: Wed, 10 Jun 2015 15:10:28 +0000 Subject: Unicode Expert's way of Writing Data Specifications? Message-ID: Hi Folks, I seek recommendations from the Unicode experts on how to write data specifications that are precise, from a Unicode perspective. Let's take an example. A (fictitious) data specification says this: The name of the airplane's flight path must take this form: FLTPATH xx, where xx = two digits. Even as a non-expert in Unicode I can see impreciseness: 1. What are the codepoints of these symbols: FLTPATH? Presumably you mean U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048. 2. What are the range of codepoints for the two digits? 
Presumably you mean U+0030 - U+0039. Here is a revised version of the data specification: The name of the airplane's flight path must take this form: FLTPATH (U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048) xx, where xx = two digits in the range U+0030 - U+0039. Is that revised version precise, from a Unicode expert's perspective? Is there a better way of phrasing it, so that it is more readable? As it stands, reading it is kind of a bumpy ride. /Roger From doug at ewellic.org Wed Jun 10 10:50:55 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 10 Jun 2015 08:50:55 -0700 Subject: Unicode Expert's way of Writing Data =?UTF-8?Q?Specifications=3F?= Message-ID: <20150610085055.665a7a7059d7ee80bb4d670165c8327d.5e1a87a700.wbe@email03.secureserver.net> Costello, Roger L. wrote: > 1. What are the codepoints of these symbols: FLTPATH? Presumably you > mean U+0046 U+004C U+0054 U+0050 U+0041 U+0054 U+0048. I would specify, in prose or ABNF, that all keywords are encoded as Basic Latin characters (or Basic Latin plus Latin-1, or whatever range is desired). This would then apply to all subsequent specifications that deal with keywords, so there should be no need to specify U+xxxx code points in each one. If you use ABNF to specify the syntax, you can take advantage of keywords like ALPHA and DIGIT in the core rules (RFC 5234, Section B.1), which are predefined to be Basic Latin. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From john at tiro.ca Wed Jun 10 11:42:14 2015 From: john at tiro.ca (John Hudson) Date: Wed, 10 Jun 2015 09:42:14 -0700 Subject: Accessing the WG2 document register In-Reply-To: References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <557868E6.60307@tiro.ca> Anshu, I simply treat WG2 as a bureaucratic exercise bolted onto the actual work that Unicode does. In 20 years, I have never once had occasion to refer to ISO 10646, while I refer to Unicode every day. When I visit clients, none of them talk about implementing ISO 10646; they all talk about implementing Unicode. My recommendation is simply to ignore WG2 and act as if it doesn't exist. It already might as well not, and with its policies is only likely to become more and more irrelevant. JH -- John Hudson Tiro Typeworks Ltd www.tiro.com Salish Sea, BC tiro at tiro.com Getting Spiekermann to not like Helvetica is like training a cat to stay out of water. But I'm impressed that people know who to ask when they want to ask someone to not like Helvetica. That's progress. -- David Berlow From wjgo_10009 at btinternet.com Wed Jun 10 11:56:35 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 10 Jun 2015 17:56:35 +0100 (BST) Subject: Accessing the WG2 document register In-Reply-To: References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> Message-ID: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> > ..., and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... Hmm. I opine that ..., and while bureaucracies are commonly known to seem unconcerned as to individual feelings, they are seldom outright malicious of intent, I feel... would be better, as that would not associate a disability with lack of concern for the feelings of others. Some readers might like to search for the word blind in the following web page. 
http://www.publications.parliament.uk/pa/cm201516/cmhansrd/cm150609/debtext/150609-0001.htm William Overington 10 June 2015 From samjnaa at gmail.com Wed Jun 10 12:11:41 2015 From: samjnaa at gmail.com (Shriramana Sharma) Date: Wed, 10 Jun 2015 22:41:41 +0530 Subject: Accessing the WG2 document register In-Reply-To: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> Message-ID: On 6/10/15, William_J_G Overington wrote: >> ..., and while bureaucracies are commonly known to seem blind > to individual feelings, they are seldom outright malicious of intent, > I feel... > > ..., and while bureaucracies are commonly known to seem unconcerned as to > individual feelings, they are seldom outright malicious of intent, > I feel... > > would be better, as that would not associate a disability with lack of > concern for the feelings of others. While English grammatical debates are out of scope for this list, please note that in my mail the word "blind" only stands in the place of your "unconcerned" and not "unconcerned to individual feelings"... -- Shriramana Sharma ???????????? ???????????? From michel at suignard.com Wed Jun 10 12:45:17 2015 From: michel at suignard.com (Michel Suignard) Date: Wed, 10 Jun 2015 17:45:17 +0000 Subject: Accessing the WG2 document register In-Reply-To: <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <12386511.53032.1433955396005.JavaMail.defaultUser@defaultHost> Message-ID: This is turning into bureaucrat bashing, and those of you interested in that topic should turn your focus to another new mail thread with a different title that I can safely ignore. Concerning access to WG2 documents, I am (as the new WG2 convenor and ongoing project editor for 10646) very unimpressed by the new ISO policies concerning access to documents, which make the WG repository even less accessible than its parent (SC) repository. And they now require ISO Global Directory credentials to get meaningful access to anything within the ISO document system. There are ways for national bodies to nominate experts to have access to the WG, but it is cumbersome. Even I, despite my dual role, had initially no access to the ballots I was creating! I had to be creative to get access. For documents that need to be accessible to both UTC and WG2 I have suggested a new mechanism by which UTC contributions (such as Anshu's) can be referenced by link using simple catch-all WG2 documents (typically done by the UTC liaison or Debbie Anderson). I will always post documents directly if it is the author's wish, but it is not necessary, as long as the UTC link is open and stable (no problem there). I am also considering creating a mirror site of the new WG2 directory under the Unicode server, but it would have to be password protected (the password can be simple and easy to find). I have no intention of withdrawing anything from the old WG2 website (it is now in archive mode); doing so would create awkward situations for repertoires that have been adopted. For the new site, I would respectfully ask Anshu to reconsider; this is not helping my task but instead making it even more complicated (if that's possible). Concerning the usefulness of 10646, please understand that there is still a large portion of constituencies (especially in Asia) that can only contribute to an ISO-blessed entity.
Most of the CJK work, and the yet-to-be-encoded Asian minority repertoires, can only be done by joint work between the UTC and ISO. It is a rather American-centric idea to think that you can totally ignore 10646, especially if you do business in China, Japan, or Korea. Unicode and the UTC are very Bay Area centric (mostly for financial reasons, because no one will fund meetings overseas), but it does create an impediment for other constituencies to participate. ISO, although imperfect, offers these constituencies a voice. For example, the Ideographic Rapporteur Group (IRG) under WG2 is the group where CJK content can be either updated or augmented. Furthermore, many folks in Europe still cherish the additional forum that ISO provides. Unicode officers (of whom I am also one) are looking at ways to improve the situation on their side by creating more direct communication with the IRG and Asian constituencies, but it is a complicated process. For most of the Unicode crowd it is not even on their radar (unless you deal with Asian scripts), but don't think all of you can totally ignore ISO at this stage. Luckily for you, some of us carry most of the burden of that complicated situation, so that you can do your work in simpler ways. Best Michel From everson at evertype.com Wed Jun 10 14:45:20 2015 From: everson at evertype.com (Michael Everson) Date: Wed, 10 Jun 2015 20:45:20 +0100 Subject: Accessing the WG2 document register In-Reply-To: <55785C69.3050305@htpassport.com> References: <4F7C34BE-DD7C-4C63-90BA-23677D0B76FC@umich.edu> <55785C69.3050305@htpassport.com> Message-ID: On 10 Jun 2015, at 16:48, Shervin Afshar wrote: > From: Shriramana Sharma > >> The ISO hasn't claimed "ownership" of your document, as you >> mention in another mail. They merely restrict public access to it. > > This is no justification and we should not trivialize an organizational behavior which is not acceptable in this day and age of open access and collaboration. We should also not trash the whole idea of collaboration because people at a higher level in ISO have made poor decisions which are not the fault of the relevant technical committee (SC2). >> I agree that the ISO should have the courtesy to accord contributors special status, but such big organizations are often steeped in bureaucracy, and while bureaucracies are commonly known to seem blind to individual feelings, they are seldom outright malicious of intent, I feel... > > These seem to me as reasons why ISO is of little relevance to Unicode going forward. The SC2/UTC relationship is important because corporate and commercial concerns are not the only concerns worth taking into account. We cooperate and collaborate, and it's not right to pretend that only the UTC has valuable input into the UCS. > I don't think the concern here is malicious intent; it's rather the bloated bureaucracy of such organizations which makes it virtually impossible to have that "courtesy" you are talking about for individual contributors. The bureaucracy hasn't changed in size. Some specific decisions were taken at a high level about document distribution and participation. Those weren't useful for our line of work. Not at all, and none of us in SC2 or WG2 are defending those. Maybe those procedures work well for some sorts of standards; I couldn't say. But I don't think it damns the whole ISO process forever, either.
All the best, Michael Everson From tclancy at mozilla.com Wed Jun 10 16:10:43 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Wed, 10 Jun 2015 17:10:43 -0400 Subject: Another take on the English apostrophe in Unicode Message-ID: On 4/Jun/2015 14:34 PM, Markus Scherer wrote: > > Looks all wrong to me. > Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your points below. > You can't use simple regular expressions to find word boundaries. That's > why we have UAX #29. > And UAX #29 doesn't work for words which begin or end with apostrophes, whether represented by U+0027 or U+2019. It erroneously thinks there's a word boundary between the apostrophe and the rest of the word. But UAX #29 *would* work if the apostrophes were represented by U+02BC, which is what I'm suggesting. > Confusion between apostrophe and quoting -- blame the scribe who came up > with the ambiguous use, not the people who gave it a number. > I'm not trying to blame anyone. I'm trying to fix the problem. I know this problem has a long history. > English is taught as that squiggle being punctuation, not a letter. > I think we need to make a distinction between the colloquial usage of the word "punctuation" and the Unicode general category "punctuation", which has specific technical implications. I somewhat wish that Unicode had a separate category for "Things that look like punctuation but behave like letters", which might clear up this taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are actually modifiers, into that category too.) But we don't. And the English apostrophe behaves like a letter, regardless of what your primary school teacher might have told you, so with the options available in Unicode, it needs to be classed as a letter. > "don't" is a contraction of two words, it is not one word. > This is utter nonsense. Should my spell-checker recognise "hasn't" as a valid word? Or should it consider "hasn't" to be the word "hasn" followed by the word "t", and then flag both of them as spelling errors? Is "fo'c'sle" the three separate words "fo", "c", and "sle"? The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. > If anything, Unicode might have made a mistake in encoding two of these > that look identical. How are normal users supposed to find both U+2019 and > U+02BC on their keyboards, and how are they supposed to deal with > incorrect > usage? > Yeah, and there are fonts where I can't tell the difference between capital I and lower-case l. But my spell-checker will underline a word where I erroneously use an I instead of an l, and I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL:
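A quick way to see the word-boundary effect Ted describes is to compare how the same contraction segments when the apostrophe is U+0027, U+2019, or U+02BC. The following is only a rough Python 3 sketch (it uses the regex pattern \w+ as a crude stand-in for UAX #29 word segmentation, not an implementation of it), but it illustrates why the choice of code point matters to segmentation code:

    # Rough illustration only: \w+ approximates word segmentation; it is not UAX #29.
    import re
    import unicodedata

    for ch in ("\u0027", "\u2019", "\u02BC"):
        word = "don" + ch + "t"
        tokens = re.findall(r"\w+", word)
        print("U+%04X %s: category=%s, tokens=%r"
              % (ord(ch), unicodedata.name(ch), unicodedata.category(ch), tokens))

    # Expected output:
    #   U+0027 APOSTROPHE: category=Po, tokens=['don', 't']
    #   U+2019 RIGHT SINGLE QUOTATION MARK: category=Pf, tokens=['don', 't']
    #   U+02BC MODIFIER LETTER APOSTROPHE: category=Lm, tokens=['donʼt']

Only U+02BC, whose general category is Lm (modifier letter), keeps the contraction together, which is exactly the behaviour Ted argues the English apostrophe should have.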
From tclancy at mozilla.com Wed Jun 10 17:51:45 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Wed, 10 Jun 2015 18:51:45 -0400 Subject: Another take on the English apostrophe in Unicode Message-ID: On 4/Jun/2015 19:01, Leo Broukhis wrote: > > Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for > example, the word ack-ack isn't decomposable into words, or even > morphemes, > "ack" and "ack". > I do think that U+2010 (HYPHEN) is miscategorised. I think it should have General Category = Pc, not Pd. (That is, hyphens are connectors, not dashes.) That would make it a "word" character. Or, at the very least, U+2010 should have Word Break = MidNumLet (meaning it can occur in the middle of numbers or letters). UAX #29 says that U+2010 deliberately does *not* have Word Break = MidNumLet, though an implementation may treat it as if it did. (UAX #29 doesn't give any reasons for this decision. I can understand why U+002D (HYPHEN-MINUS) doesn't have Word Break = MidNumLet, due to its history of being used as a dash or minus sign, but U+2010 should never be used as a dash or minus sign, so I don't see the problem.) But luckily, the miscategorisation of U+2010 hasn't led to any pressing practical problems, unlike the misuse of U+2019 for the apostrophe. - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 10 23:37:28 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 06:37:28 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <557247F2.9050902@efele.net> References: <557247F2.9050902@efele.net> Message-ID: The French "pomme de terre" ("potato" in English; the informal French synonym is "patate") is a single lemma in dictionaries, but it is still 3 separate words (only the first one takes the plural mark); it is not considered a "nom composé" (so there are no hyphens). And they are separated by standard spaces (that are breakable, and expansible/compressible like all others in case of justified text)... The lemma is still recognized if there is extra punctuation in the middle, such as: « pomme » de terre. We don't need any new space character. What you want is to insert markup to exhibit the structure of sentences for grouping words semantically or grammatically. But nobody, including grammarians, will use this "new" space; what they'll use is in fact some additional symbols or presentation features (enclosing boxes, braces above or below, colors...) if they want to exhibit it on top of the standard text. 2015-06-06 3:08 GMT+02:00 Eric Muller : > On 6/5/2015 10:29 AM, John D. Burger wrote: > >> Linguistically, "don't" and friends pass all the diagnostics that >> indicate they're single words. >> > > If I am not mistaken, the french "pomme de terre" also passes the > diagnostics. So we need a new space character. > > Eric. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 11 00:17:11 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 07:17:11 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: The ASCII punctuation characters have been overridden for a lot of different roles. There's simply no way to map them to a category that matches their semantic role. So the ASCII hyphen and apostrophe-quote can only be given a very weak category that just exhibits their visual role. "Pd" (dash) is then appropriate for the ASCII hyphen-minus.
You can't really tell from the character alone if it is a punctuation mark or a minus sign. If it is a minus sign you can re-encode it better using the more specific mathematical minus sign. Otherwise, even if it is not a minus sign, it can be: - a connector between words in compound words (hyphen) - a trailing mark at the end of lines indicating that a word has been broken in the middle (but remember that I asked previously for another character for that role, because this word-breaking hyphen is not necessarily a horizontal hyphen: in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in "pocket books" with very narrow columns and minimized spacing) - a bullet leading items in a vertical list (this should be an en dash, followed by some spacing) - a punctuation mark (not necessarily at the beginning of a line) marking the change of person speaking (very common in literature, notably in theatre). As a connector between words, there's a demonstrated need for differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), and long hyphens between two separate names that are joined (for example in proper names after marriage; there's an example in France, where INSEE encodes it for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). ---- Still nobody has replied to my past comment (about 1 month ago) about the various forms of the word-breaking hyphen / line-wrapping symbol: * I'm not speaking about the SHY control, but about the real character whose glyph appears when SHY is materialized at the end of lines (and which should be neither a minus nor an en dash, but also not the same as the orthographic hyphen used between words in a compound word). * This character can also be found (and is needed) for breaking long mathematical formulas, and must be clearly distinct from the regular minus. * This character is also needed for rendering long lines of programming code or textual data (it is something that must not be entered in programs but that must be rendered, because these programs or codes have significant line breaks: the glyph indicates that the following rendered line break is to be discarded). Not all programming languages have a syntax allowing the use of an escape before the line break (such escaping varies; it may be a backslash in C/C++, or an underscore in Basic, but in data dumps such as CSV files, it is impossible to note such an escape in the data language itself, and we need to render some specific glyph). * This character is absolutely needed when rendering on a static medium (i.e. printing or broadcasting); for a dynamic medium (such as personal displays with a personal UI) we could still use scrolling, but users don't like horizontal scrolls and highly prefer reading the text directly.
So they expect to see a distinctive glyph (or icon) to see the distinction between line breaks where they are significant and where they just wrap too-long lines, and still see the distinction with other regular hyphens and minus signs (which are also significant and very frequently distinct). -------------- next part -------------- An HTML attachment was scrubbed... URL: From tclancy at mozilla.com Thu Jun 11 01:08:42 2015 From: tclancy at mozilla.com (Ted Clancy) Date: Thu, 11 Jun 2015 02:08:42 -0400 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: On Thu, Jun 11, 2015 at 1:17 AM, Philippe Verdy wrote: > The ASCII punctuation characters have been overridden for a lot of different roles. > There's simply no way to map them to a category that matches their semantic > role. [...] "Pd" (dash) is then appropriate for the ASCII hyphen-minus. > I agree, but I wasn't talking about the ASCII hyphen, U+002D (HYPHEN-MINUS). I was talking about U+2010 (HYPHEN). I also wasn't talking about changing the properties of U+0027 (APOSTROPHE). > in dictionaries I've seen small slanted tildes, or slanted small equal > signs, to make the distinction with true hyphens used in compound words > This is drifting off-topic, but I wanted to address the thing you just said above. Firstly, in the dictionaries I've seen, the slanted double hyphen is only used when a line break happens to occur at the same place as a "true hyphen". It replaces the "true hyphen". When a line is broken at a hyphenation point between letters, an ordinary-looking hyphen is displayed. Secondly, this character is encoded in Unicode at U+2E17 (DOUBLE OBLIQUE HYPHEN). - Ted -------------- next part -------------- An HTML attachment was scrubbed... URL:
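For reference, the general category assignments that Ted and Philippe are debating can be read straight out of the Unicode Character Database. A minimal Python 3 sketch using the standard unicodedata module (note that unicodedata does not expose the Word_Break property, so only the general categories are shown):

    import unicodedata

    # ASCII hyphen-minus, the dedicated hyphen, the mathematical minus sign,
    # and the double oblique hyphen mentioned above.
    for cp in (0x002D, 0x2010, 0x2212, 0x2E17):
        ch = chr(cp)
        print("U+%04X %-22s general category = %s"
              % (cp, unicodedata.name(ch), unicodedata.category(ch)))

    # Expected output:
    #   U+002D HYPHEN-MINUS           general category = Pd
    #   U+2010 HYPHEN                 general category = Pd
    #   U+2212 MINUS SIGN             general category = Sm
    #   U+2E17 DOUBLE OBLIQUE HYPHEN  general category = Pd

As of Unicode 7.0, current at the time of this thread, U+2010 is indeed Pd rather than Pc, which is the categorisation Ted is questioning.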
From verdy_p at wanadoo.fr Thu Jun 11 01:05:24 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 08:05:24 +0200 Subject: Accessing the WG2 document register In-Reply-To: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: As far as I have seen, you cannot withdraw the irrevocable licence you gave to ISO when submitting the document. ISO requires that you grant such a licence, otherwise your document will be rejected. ISO however does not take the ownership (or authorship), and you keep the right to grant licences yourself to other people (with possibly different licensing terms). ISO just requires that the licence you grant also expose all other proprietary rights that you claim (including patents), and that you sign it with your name (you take on yourself the risks for the claims you make) and a way to contact you in case of problems. All you can do then is to instruct ISO that your past submission should no longer be considered as relevant in future discussions, but all past discussions and decisions that were based on your document will remain valid and will carry the terms of your licence, which must clearly state the terms of use and which will allow anyone to request a valid licence (not necessarily a free licence, because you could ask "reasonable" fees). Unfortunately ISO does not define clearly what a reasonable fee is that you can claim from those that will request a licence. ISO will sell its published standard and will not give you back any dime when it does (and it is a fact that the fees requested by ISO to get a copy of its standards are not adequate, as they are really far too expensive for individual users or small organizations and non-profits). This has a consequence: ISO standards can only be defined and used by large organizations (and this casts severe doubts on ISO's claim that they are building "international standards" for everyone). Even governments in small countries cannot participate; everyone has to pay the same expensive fees to ISO even if their use of the standard will not generate (proportionally) the same revenues (or savings) as those generated by large organizations or big governments that use the standard (for them the fee requested by ISO is ridiculously low, and ISO is then still lacking money to finance its activities).
---- Personally I think that Unicode does a much better job of opening its standard to many more people, by offering different levels of participation and opening a large area to every individual without paying considerable fees. I consider that the only standard that defines the UCS is TUS, not ISO/IEC 10646 (which is just a piece of junk, badly administered, and inaccessible to most people). If you want examples of really bad standards published by ISO, just consider the MPEG-related standards or the standards related to "open" documents. Really I don't trust ISO in those domains, and most people prefer what the W3C does. I just hope that ISO will withdraw its MPEG and open-document standards, to be replaced by those made by other standards bodies (W3C, IETF, CEN, IEEE... As for ITU, UPU, IATA, many of their standards are also full of patent restrictions and published with very restrictive terms and very expensive fees just to get a copy of a single document). MPEG should be completely withdrawn too, replaced by really open encodings (such as OGG). And frankly, the Linux community can also create its own standards body (there will be an immediate market for that, notably in mobile and embedded devices where Linux is present almost everywhere, including in Android and significant parts of Apple iOS) and coordinate with other foundations working in the same area of open standards. It is the Linux/Unix world that really promoted and developed the UCS to allow it to reach its current state (before that there were lots of proprietary standards approved by ISO and incorrectly labeled "international standards", even if most of them were incompatible with each other). I can even remember the time when Microsoft did not believe in the Internet and wanted to create "The Microsoft Network" (it was withdrawn, including the ISP service using MS protocols, and replaced by MSN services based on the Internet and IETF standards).
-------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Thu Jun 11 03:49:51 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 09:49:51 +0100 Subject: Accessing the WG2 document register In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: On 11 June 2015 at 07:05, Philippe Verdy wrote: > > Personally I think that Unicode does a much better job of opening its standard > to many more people by offering different levels of participation and > opening a large area to every individual without paying considerable > fees. I consider that the only standard that defines the UCS is TUS, not > ISO/IEC 10646 (which is just a piece of junk, badly administered, and > inaccessible to most people). You do realise that by insulting ISO/IEC 10646 you are also insulting a number of prominent members of the UTC and officers of the Unicode Consortium who actively participate in the production and editing of ISO/IEC 10646? The latest version of ISO/IEC 10646 is not inaccessible to most people, as it is (and has been since 2006) available for free download from ISO at . Whilst I agree that the standard itself is irrelevant to the vast majority of users, who can get by quite happily just knowing about the Unicode Standard, I believe that the great importance of ISO/IEC 10646 lies in the process that goes into producing it, not in the resultant standard. The Unicode Consortium is largely controlled by a few large American corporations, but ISO is open to participation by standards organizations representing countries across the globe, and there are currently thirty participating members of SC2, the committee which is responsible for ISO/IEC 10646 . The ISO ballot process allows stakeholders in scripts from these countries to participate in the encoding process, and make the views of their experts heard. The ballot process also applies important checks on the encoding process, and prevents scripts and characters being encoded with undue haste if an encoding proposal is not yet mature enough or if there is insufficient consensus among stakeholders. Not least, the ballot process allows for multiple stages of review and correction of errors. If Unicode were to go it alone, professional encoders such as Anshu and Michael, who do not have an inherent stake in most of the scripts they work on, would present their proposals to the UTC, who do not have any expertise in such minority or historic scripts, but on the basis that the proposal seems plausible they would approve it, and six months later it would be in the next version of Unicode.
Yes, this speeds up the encoding process enormously (which is usually at least two years), but at what cost? What happens when a couple of years later, users of the script in question in Africa or Asia discover that it has been encoded in Unicode but has a serious flaw or shortcoming that no-one from the user community had an opportunity to correct (and due to stability policies it is now too late to correct)? So whilst ISO/IEC 10646 is certainly irrelevent to most people, I strongly believe that the process whereby the standard is produced is extremely beneficial to the Unicode Standard, and I would urge Anshu and others to support the work of SC2 and WG2 rather than dismiss it as a hindrance or irrelevance. Andrew From jsbien at mimuw.edu.pl Thu Jun 11 04:12:15 2015 From: jsbien at mimuw.edu.pl (Janusz S. =?utf-8?Q?Bie=C5=84?=) Date: Thu, 11 Jun 2015 11:12:15 +0200 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: <86vbeua9sg.fsf@mimuw.edu.pl> On Thu, Jun 11 2015 at 10:49 CEST, andrewcwest at gmail.com writes: [...] > The latest version of ISO/IEC 10646 is not inaccessible to most > people, as it is (and has been since 2006) available for free download > from ISO at . The page states clearly The following standards are made freely available for standardization purposes. In consequence I don't feel entitled to download it. Not only my curiosity is not a standarization purpose, but even teaching students about standards also doesn't qualify. I just show them the link and tell them to decide themselves to download or not :-) Best regards Janusz -- , Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From andrewcwest at gmail.com Thu Jun 11 04:38:41 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 10:38:41 +0100 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: <86vbeua9sg.fsf@mimuw.edu.pl> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: On 11 June 2015 at 10:12, Janusz S. Bie? wrote: > >> The latest version of ISO/IEC 10646 is not inaccessible to most >> people, as it is (and has been since 2006) available for free download >> from ISO at . > > The page states clearly > > The following standards are made freely available for standardization > purposes. > > In consequence I don't feel entitled to download it. Not only my > curiosity is not a standarization purpose, but even teaching students > about standards also doesn't qualify. I just show them the link and tell > them to decide themselves to download or not :-) I think you are reading far too much into the phrase "for standardization purposes". The license states that you are allowed to store a copy on your personal computer and print off a single copy, but says nothing about what purposes you may use the standards for. In my opinion it is ridiculous to claim that you are not entitled to download the documents. 
The Unicode terms of use are far more restrictive, and state that "Any person is hereby authorized, without fee, to view, use, reproduce, and distribute all documents and files solely for informational purposes in the creation of products supporting the Unicode Standard, subject to the Terms and Conditions herein." So if you are not planning to create a product supporting the Unicode Standard, you are not legally allowed to view or download any of the files comprising the Unicode Standard ! Andrew From andrewcwest at gmail.com Thu Jun 11 05:00:29 2015 From: andrewcwest at gmail.com (Andrew West) Date: Thu, 11 Jun 2015 11:00:29 +0100 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: On 11 June 2015 at 10:38, Andrew West wrote: > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! My apologies, according to the "Unicode Consortium and Trademark Usage Policy" I should always refer to "The Unicode? Standard". I hope that everyone on this list will take note of this important policy in future messages. Andrew From billposer2 at gmail.com Thu Jun 11 12:47:39 2015 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 11 Jun 2015 10:47:39 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: To add a factor that I think hasn't been mentioned, there are languages in which apostrophe is used both as a letter by itself and as part of a complex letter. Most of the native languages of British Columbia write glottalized consonants as C+', e.g. for an ejective alveolar stop, and many use apostrophe by itself for the glottal stop. (Another common convention, which produces other difficulties, is to use the number <7> for glottal stop.) Bill On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: > On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >> >> Looks all wrong to me. >> > Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your > points below. > > > >> You can't use simple regular expressions to find word boundaries. That's >> why we have UAX #29. >> > > And UAX #29 doesn't work for words which begin or end with apostrophes, > whether represented by U+0027 or U+2019. It erroneously thinks there's a > word boundary between the apostrophe and the rest of the word. > > But UAX #29 *would* work if the apostrophes were represented by U+02BC, > which is what I'm suggesting. > > Confusion between apostrophe and quoting -- blame the scribe who came up >> with the ambiguous use, not the people who gave it a number. >> > I'm not trying to blame anyone. I'm trying to fix the problem. > > I know this problem has a long history. > > English is taught as that squiggle being punctuation, not a letter. >> > I think we need make a distinction between the colloquial usage of the > word "punctuation" and the Unicode general category "punctuation" which has > specific technical implications. 
> > I somewhat wish that Unicode had a separate category for "Things that look > like punctuation but behave like letters", which might clear up this > taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF > RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are > actually modifiers, into that category too.) But we don't. And the English > apostrophe behaves like a letter, regardless of what your primary school > teacher might have told you, so with the options available in Unicode, it > needs to be classed as a letter. > > "don?t" is a contraction of two words, it is not one word. >> > This is utter nonsense. Should my spell-checker recognise "hasn't" as a > valid word? Or should it consider "hasn't" to be the word "hasn" followed > by the word "t", and then flag both of them as spelling errors? > > Is "fo'c'sle" the three separate words "fo", "c", and "sle"? > > The idea that words with apostrophes aren't valid words is a regrettable > myth that exists in English, which has repeatedly led to the apostrophe > being an afterthought in computing, leading to situations like this one. > > If anything, Unicode might have made a mistake in encoding two of these >> that look identical. How are normal users supposed to find both U+2019 >> and >> U+02BC on their keyboards, and how are they supposed to deal with >> incorrect >> usage? >> > Yeah, and there are fonts where I can't tell the difference between > capital I and lower-case l. But my spell-checker will underline a word > where I erroneously use an I instead of an l, and I imagine spell-checkers > of the future could underline a word where I erroneously use a closing > quote instead of an apostrophe, or vice versa. > > There are other possible solutions too, but I don't want to get into a > discussion about UI design. I'll leave that to UI designers. > > - Ted > -------------- next part -------------- An HTML attachment was scrubbed... URL: From eik at iki.fi Thu Jun 11 13:13:26 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Thu, 11 Jun 2015 21:13:26 +0300 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: <000301d0a472$520969c0$f61c3d40$@fi> Andrew, I fail to understand what constructive goal you are supposedly aiming at, in general and especially in your most recent postings. Sincerely Erkki -----Alkuper?inen viesti----- L?hett?j?: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Andrew West L?hetetty: 11. kes?kuuta 2015 13:00 Vastaanottaja: Unicode Discussion Aihe: Re: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) On 11 June 2015 at 10:38, Andrew West wrote: > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! My apologies, according to the "Unicode Consortium and Trademark Usage Policy" I should always refer to "The Unicode? Standard". I hope that everyone on this list will take note of this important policy in future messages. 
Andrew From mark at macchiato.com Thu Jun 11 13:39:17 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 11 Jun 2015 20:39:17 +0200 Subject: free download of ISO/IEC 10646 (was: Accessing the WG2 document register) In-Reply-To: <000301d0a472$520969c0$f61c3d40$@fi> References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> <000301d0a472$520969c0$f61c3d40$@fi> Message-ID: ?I think the whole thread got overheated, and Andrew was just responding to other heated ?comments. So it might be time to let this thread cool off a bit. The collaboration over the years between the Unicode Consortium and ISO has been, on the whole, a remarkable success. There have been frictions?as in any human enterprise?but the parties have worked to smooth those over, and to operate in good faith to incorporate the characters that are important to each side. The rising bureaucracy on the ISO side has made progress and collaboration increasingly difficult, but that did not originate with the SC2 or WG2 participants, who are often just as frustrated by it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 11 13:39:41 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Jun 2015 20:39:41 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Also used in the Breton trigram c?h (considered as a single letter of the Breton alphabet, but actually entered as two letters with a diacritic-like apostrophe in the middle (which in this case is still not a letter of the alphabet...): the trigram c?h is distinct from the digram ch. Breton **also** uses a regular apostrophe for elision. In fact what you note for the ejective in native american languages is effectively a right-combining diacritic, and still not a letter by itself. However, given its position and the fact it is "spacing", this is the spacing form of the apostrophe diacritic that should be used, and that form is then to choose between: * U+00B4 (acute, most often ugly, located too high, and too much horizontal), * U+02B9 (prime, nearly good, but still too high), * U+02BC (apostrophe), * U+02C8 (vertical high tick, but confusable with the mark of stress in IPA before a phonetic syllable), and * U+02CA (acute/2nd tone, which for me is not distinct from 00B4, only used with sinograms in Mandarin Chinese, with its metrics distinct from U+00B4 that match the Latin metrics). In my opinion 02BC is the best choice for the diacritic apostrophe. The other character for the **elision** apostrophe is a punctuation mark U+2019 (just like the full stop punctuation is also used as an abbreviation mark). There's no confusion with its alternate role as a right-side single quote because U+2019 is used in languages that normally never use the single quotes, but chevrons (or other punctuation signs in East-Asian scripts). But in English where single quote are used for small quotations, there's still a problem to represent this elision apostrophe when it does not occur between two letters where it also marks a gluing of two morphemes (as in "don't" or "Peter's"), but at the begining or end of a word. But elisions at end of words is also invalid when this is the final word of a quoted sentence. If you really want to cite a single English word terminated by an elision apostrophe, the single quotes won't be usable and you'll use chevrons like in this ?demo?? 
and not single or double quotes which are difficult to discriminate. 2015-06-11 19:47 GMT+02:00 Bill Poser : > To add a factor that I think hasn't been mentioned, there are languages in > which apostrophe is used both as a letter by itself and as part of a > complex letter. Most of the native languages of British Columbia write > glottalized consonants as C+', e.g. for an ejective alveolar stop, and > many use apostrophe by itself for the glottal stop. (Another common > convention, which produces other difficulties, is to use the number <7> for > glottal stop.) > > Bill > > On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: > >> On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >>> >>> Looks all wrong to me. >>> >> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your >> points below. >> >> >> >>> You can't use simple regular expressions to find word boundaries. That's >>> why we have UAX #29. >>> >> >> And UAX #29 doesn't work for words which begin or end with apostrophes, >> whether represented by U+0027 or U+2019. It erroneously thinks there's a >> word boundary between the apostrophe and the rest of the word. >> >> But UAX #29 *would* work if the apostrophes were represented by U+02BC, >> which is what I'm suggesting. >> >> Confusion between apostrophe and quoting -- blame the scribe who came up >>> with the ambiguous use, not the people who gave it a number. >>> >> I'm not trying to blame anyone. I'm trying to fix the problem. >> >> I know this problem has a long history. >> >> English is taught as that squiggle being punctuation, not a letter. >>> >> I think we need make a distinction between the colloquial usage of the >> word "punctuation" and the Unicode general category "punctuation" which has >> specific technical implications. >> >> I somewhat wish that Unicode had a separate category for "Things that >> look like punctuation but behave like letters", which might clear up this >> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF >> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are >> actually modifiers, into that category too.) But we don't. And the English >> apostrophe behaves like a letter, regardless of what your primary school >> teacher might have told you, so with the options available in Unicode, it >> needs to be classed as a letter. >> >> "don?t" is a contraction of two words, it is not one word. >>> >> This is utter nonsense. Should my spell-checker recognise "hasn't" as a >> valid word? Or should it consider "hasn't" to be the word "hasn" followed >> by the word "t", and then flag both of them as spelling errors? >> >> Is "fo'c'sle" the three separate words "fo", "c", and "sle"? >> >> The idea that words with apostrophes aren't valid words is a regrettable >> myth that exists in English, which has repeatedly led to the apostrophe >> being an afterthought in computing, leading to situations like this one. >> >> If anything, Unicode might have made a mistake in encoding two of these >>> that look identical. How are normal users supposed to find both U+2019 >>> and >>> U+02BC on their keyboards, and how are they supposed to deal with >>> incorrect >>> usage? >>> >> Yeah, and there are fonts where I can't tell the difference between >> capital I and lower-case l. But my spell-checker will underline a word >> where I erroneously use an I instead of an l, and I imagine spell-checkers >> of the future could underline a word where I erroneously use a closing >> quote instead of an apostrophe, or vice versa. 
>> >> There are other possible solutions too, but I don't want to get into a >> discussion about UI design. I'll leave that to UI designers. >> >> - Ted >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From billposer2 at gmail.com Thu Jun 11 13:46:01 2015 From: billposer2 at gmail.com (Bill Poser) Date: Thu, 11 Jun 2015 11:46:01 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I agree with the recommendation of U+02BC. However, it is in fact rarely used because most of the people who write these languages or create supporting infrastructure are unawre of such issues. A small point: it isn't always the spacing diacritic that is used. In some languages, e.g. Halkomelem, people use the spacing apostrophe if they have to but prefer the non-spacing version. On Thu, Jun 11, 2015 at 11:39 AM, Philippe Verdy wrote: > Also used in the Breton trigram c?h (considered as a single letter of the > Breton alphabet, but actually entered as two letters with a diacritic-like > apostrophe in the middle (which in this case is still not a letter of the > alphabet...): the trigram c?h is distinct from the digram ch. > Breton **also** uses a regular apostrophe for elision. > > In fact what you note for the ejective in native american languages is > effectively a right-combining diacritic, and still not a letter by itself. > However, given its position and the fact it is "spacing", this is the > spacing form of the apostrophe diacritic that should be used, and that form > is then to choose between: > > * U+00B4 (acute, most often ugly, located too high, and too much > horizontal), > * U+02B9 (prime, nearly good, but still too high), > * U+02BC (apostrophe), > * U+02C8 (vertical high tick, but confusable with the mark of stress in > IPA before a phonetic syllable), and > * U+02CA (acute/2nd tone, which for me is not distinct from 00B4, only > used with sinograms in Mandarin Chinese, with its metrics distinct from > U+00B4 that match the Latin metrics). > > In my opinion 02BC is the best choice for the diacritic apostrophe. > > The other character for the **elision** apostrophe is a punctuation mark > U+2019 (just like the full stop punctuation is also used as an abbreviation > mark). There's no confusion with its alternate role as a right-side single > quote because U+2019 is used in languages that normally never use the > single quotes, but chevrons (or other punctuation signs in East-Asian > scripts). > > But in English where single quote are used for small quotations, there's > still a problem to represent this elision apostrophe when it does not occur > between two letters where it also marks a gluing of two morphemes (as in > "don't" or "Peter's"), but at the begining or end of a word. But elisions > at end of words is also invalid when this is the final word of a quoted > sentence. If you really want to cite a single English word terminated by an > elision apostrophe, the single quotes won't be usable and you'll use > chevrons like in this ?demo?? and not single or double quotes which are > difficult to discriminate. > > > 2015-06-11 19:47 GMT+02:00 Bill Poser : > >> To add a factor that I think hasn't been mentioned, there are languages >> in which apostrophe is used both as a letter by itself and as part of a >> complex letter. Most of the native languages of British Columbia write >> glottalized consonants as C+', e.g. for an ejective alveolar stop, and >> many use apostrophe by itself for the glottal stop. 
(Another common >> convention, which produces other difficulties, is to use the number <7> for >> glottal stop.) >> >> Bill >> >> On Wed, Jun 10, 2015 at 2:10 PM, Ted Clancy wrote: >> >>> On 4/Jun/2015 14:34 PM, Markus Scherer wrote: >>>> >>>> Looks all wrong to me. >>>> >>> Hi, Markus. I'm the guy who wrote the blog post. I'll respond to your >>> points below. >>> >>> >>> >>>> You can't use simple regular expressions to find word boundaries. >>>> That's why we have UAX #29. >>>> >>> >>> And UAX #29 doesn't work for words which begin or end with apostrophes, >>> whether represented by U+0027 or U+2019. It erroneously thinks there's a >>> word boundary between the apostrophe and the rest of the word. >>> >>> But UAX #29 *would* work if the apostrophes were represented by U+02BC, >>> which is what I'm suggesting. >>> >>> Confusion between apostrophe and quoting -- blame the scribe who came up >>>> with the ambiguous use, not the people who gave it a number. >>>> >>> I'm not trying to blame anyone. I'm trying to fix the problem. >>> >>> I know this problem has a long history. >>> >>> English is taught as that squiggle being punctuation, not a letter. >>>> >>> I think we need make a distinction between the colloquial usage of the >>> word "punctuation" and the Unicode general category "punctuation" which has >>> specific technical implications. >>> >>> I somewhat wish that Unicode had a separate category for "Things that >>> look like punctuation but behave like letters", which might clear up this >>> taxonomic confusion. (I would throw U+02BE (MODIFIER LETTER RIGHT HALF >>> RING) and U+02BF (MODIFIER LETTER LEFT HALF RING), neither of which are >>> actually modifiers, into that category too.) But we don't. And the English >>> apostrophe behaves like a letter, regardless of what your primary school >>> teacher might have told you, so with the options available in Unicode, it >>> needs to be classed as a letter. >>> >>> "don?t" is a contraction of two words, it is not one word. >>>> >>> This is utter nonsense. Should my spell-checker recognise "hasn't" as a >>> valid word? Or should it consider "hasn't" to be the word "hasn" followed >>> by the word "t", and then flag both of them as spelling errors? >>> >>> Is "fo'c'sle" the three separate words "fo", "c", and "sle"? >>> >>> The idea that words with apostrophes aren't valid words is a regrettable >>> myth that exists in English, which has repeatedly led to the apostrophe >>> being an afterthought in computing, leading to situations like this one. >>> >>> If anything, Unicode might have made a mistake in encoding two of these >>>> that look identical. How are normal users supposed to find both U+2019 >>>> and >>>> U+02BC on their keyboards, and how are they supposed to deal with >>>> incorrect >>>> usage? >>>> >>> Yeah, and there are fonts where I can't tell the difference between >>> capital I and lower-case l. But my spell-checker will underline a word >>> where I erroneously use an I instead of an l, and I imagine spell-checkers >>> of the future could underline a word where I erroneously use a closing >>> quote instead of an apostrophe, or vice versa. >>> >>> There are other possible solutions too, but I don't want to get into a >>> discussion about UI design. I'll leave that to UI designers. >>> >>> - Ted >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kenwhistler at att.net Thu Jun 11 13:47:52 2015 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 11 Jun 2015 11:47:52 -0700 Subject: Unicode Terms of Use Clarification (was: Re: free download of ISO/IEC 10646) In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> <86vbeua9sg.fsf@mimuw.edu.pl> Message-ID: <5579D7D8.4050108@att.net> Andrew, Fixed. Please refresh your cached copy of http://www.unicode.org/copyright.html For others who have been following this discussion, I'd like to make it clear that the Unicode terms of use have *never* been intended to be construed as legally disallowing people from viewing or downloading any publicly available content of the Unicode website or the various standards specifications and other documents posted there. The "for informational purposes" part of the Unicode terms of use is intended to discourage anyone from engaging in commercial resale of the content of the Unicode website or its standards, misrepresenting themselves either as the Unicode Consortium or as somehow licensed by the Unicode Consortium to do so, etc. The "in the creation of products supporting the Unicode Standard" part of the Unicode terms of use is intended to *permit* free use of the data and specifications in the development of products, but to discourage attempts to use the data in nonconformant or otherwise misleading implementations that would undermine the intended open interoperability of the Unicode Standard for all. Clear? --Ken Whistler, Technical Director, Unicode, Inc. On 6/11/2015 2:38 AM, Andrew West wrote: > > > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! > > From shervinafshar at gmail.com Thu Jun 11 14:04:09 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Thu, 11 Jun 2015 12:04:09 -0700 Subject: Accessing the WG2 document register In-Reply-To: References: <19545653.31817.1433939612900.JavaMail.defaultUser@defaultHost> Message-ID: On Thu, Jun 11, 2015 at 1:49 AM, Andrew West wrote: > The Unicode Consortium is largely controlled by a few large > American corporations, but ISO is open to participation by standards > organizations representing countries across the globe, and there are > currently thirty participating members of SC2, > Of course, a visit to Unicode Consortium Members page[1] would prove otherwise; of ten full members, three of them are not American entities. Of four institutional members (which are voting, just like full members), only one is American (UC Berkeley). Of twenty associate members, seven of them are non-American. Not to mentioning the long list of liaison members[2] from all over the world. [1]: http://www.unicode.org/consortium/members.html [2]: http://www.unicode.org/consortium/liaison-members.html -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From doug at ewellic.org Thu Jun 11 14:28:34 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 11 Jun 2015 12:28:34 -0700 Subject: =?UTF-8?Q?Unicode=C2=AE=20terms=20of=20use=20=28was=3A=20Re=3A=20free=20d?= =?UTF-8?Q?ownload=20of=20ISO/IEC=20=31=30=36=34=36=29?= Message-ID: <20150611122834.665a7a7059d7ee80bb4d670165c8327d.cfc5e3e70e.wbe@email03.secureserver.net> Andrew West wrote: > The Unicode terms of use are far > more restrictive, and state that "Any person is hereby authorized, > without fee, to view, use, reproduce, and distribute all documents and > files solely for informational purposes in the creation of products > supporting the Unicode Standard, subject to the Terms and Conditions > herein." So if you are not planning to create a product supporting > the Unicode Standard, you are not legally allowed to view or download > any of the files comprising the Unicode Standard ! It looks to me like item A.3 says: "... solely for informational purposes *and* in the creation of products supporting the Unicode Standard..." (emphasis mine). I read the word "and" as "and/or", meaning that one could compliantly use the files just for personal information, OR to inform the creation of products. That is just my interpretation, and in theory the "and" might be intentionally inclusive and imply that the only compliant use of the files is to create products. But that seems unlikely, given the similar phrasing "view, use, reproduce, and distribute all documents and files," which wouldn't strictly require the reproducing and distributing parts, or the involvement of "all" documents and files. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From doug at ewellic.org Thu Jun 11 14:34:36 2015 From: doug at ewellic.org (Doug Ewell) Date: Thu, 11 Jun 2015 12:34:36 -0700 Subject: =?UTF-8?Q?Unicode=C2=AE=20terms=20of=20use=20=28was=3A=20Re=3A=20free?= =?UTF-8?Q?=20download=20of=20ISO/IEC=20=31=30=36=34=36=29?= Message-ID: <20150611123436.665a7a7059d7ee80bb4d670165c8327d.b750eb7245.wbe@email03.secureserver.net> Oops. I guess Ken fixed the wording between Andrew's post and mine. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Thu Jun 11 21:02:39 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 12 Jun 2015 04:02:39 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: 2015-06-11 20:46 GMT+02:00 Bill Poser : > I agree with the recommendation of U+02BC. However, it is in fact rarely > used because most of the people who write these languages or create > supporting infrastructure are unawre of such issues. > > A small point: it isn't always the spacing diacritic that is used. In some > languages, e.g. Halkomelem, people use the spacing apostrophe if they have > to but prefer the non-spacing version. > True but on the examples I gave, spacing is needed: the apostrophe is intended to not collide with the previous or next letter, including when writing capital letters. In the Breton trigram "c?h" where it it plays a diacritic role, but as well in the English elision "don?t", the collision would occur after the apostrophe with the ascenders. The only alternative would have been to use a diacritic above one of the two letters for the diacritic apostrophe (and the best diacritic that would have been used for Breton or English would have been an acute accent over the first consonnant. But such usage of combining characters is non conforming for its use as an elision mark. 
An elision alone is not supposed to change the pronunciation of the remaining letters.So it would have not been appropriate for the elisions in English "don?t", or in French "j?ai" or "s?est" (this is not a strict rule, French or English also have exceptions where some combinations are used and written that change the way the letters are effectively phonetically realized, including with elisions: "don?t" is a perfect example where "n" looses its consonnant value as it is glued with the previous vowel to nasalize it and slightly stress it and in other contexts the following t is also muted as in "you don't have to do that" in fast speech: this is still the same contraction/elision and it is justified to keep the elision mark separate without noting how the following or next letter are contextually realized, but in all case the elision glues two syllables into only one and the apostrophe is written between the remaining letters of morphemes on each side). If you use a non-spacing version, this can in fact only occur graphically when the following letter is a small letter without ascenders : I still think that this is the spacing version, but what happens is just the effect of some contextual typographic kerning (the same thing that happens in pairs like "AV", "fi", "ij", "To"...) ---- Also you claim that U+02BC is rarely used for the elision apostrophe. This is plain wrong for French at least, even if people only have an ASCII apostrophe on their native keyboard (there are many word processors that will correctly enter the appropriate "curly" apostrophe as U+02BC instead of the ugly ASCII vertical quote. Even in English when you look at correctly typeset documents the ASCII quote is replaced by U+2BC (look at large section headings, book titles). U+02BC is also prefered in English for the elision apostrophe. For English you may want to read this: http://www.creativebloq.com/typography/mistakes-everyone-makes-21514129 ASCII and the computer keyboards just perpetuate the limited charset that was supported by old mechanical typewriters. I don't understand why PC keyboards could be extended to add many "multimedia" control keys or function keys, but not the traditional quotes that are needed (and even sometimes letters still missing in all "standard" physical keyboard leyouts for French, such as ?/?, ?/?, or frequent capitals with accents such as ?, which is however present on virtual onscreen keyboards for smartphones and tablets). It's high time to restore these letters (and also campaign so that manufacturer of physical keyboards will add a few more keys for national letters (they did it for Japanese only, why not for French or even English, to have more punctuation signs and missing letters or diacritics). It is perfectly possible to find a place for them on physical keyboards just above the numeric key (F1..F12 keys can be compacted if needed, and a couple of dead keys can also be mapped to the right of the Return key without reducing the size of the space bar or the Return/Backspace keys or other modifier keys). Some notebook manufacturers have used two additional preprogrammed keys (e.g. Acer, stupidly, for an unneeded additional Euro symbol whose location on AltGr+E or AltGr+4 in UK is standard, the second one being bound to the dollar symbol aslo not needed !). What is needed is 5 standard keys with standard keycodes, different from keycodes used for user-programmable keys (generally labelled PF1, PF2... 
but sometimes unlabelled) and different from application-dependent function keys (e.g. generic color keys, like on TV remote controls for navigation in menus: red, green, yellow, blue) Note that this is different from the existing feature on some keyboards defining programmable keys, whose layout is not programmable by the driver itself but by individual settings of the user, independently of the selected keyboard layout): adding about 5 or 6 keys would be very helpful and could greatly simplify layouts for languages/scripts that are complicated to input. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 12 09:57:06 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 12 Jun 2015 16:57:06 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <634725215.12783.1434121026204.JavaMail.www@wwinf2229> On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote: > Confusion between apostrophe and quoting -- > blame the scribe who came up with the ambiguous use, > not the people who gave it a number. There’s a lot of confusion in writing, especially since this job was done on typewriters, from which computer keyboards are derived, while the narrowing of the character sets shifted from mechanics to code pages. This is all over, thanks to Unicode and its principle defined in TUS 1.3: > “The Unicode Standard does not define glyph images. That is, the standard defines how characters are interpreted, not how glyphs are rendered.” Unfortunately the new precision and differentiation has sometimes been refused by sticking with legacy practice and for backwards compatibility’s sake. The use of a paired quotation mark (U+2019) as an English apostrophe, against the UTC’s initial successful attempt to disambiguate the two by recommending U+02BC (same glyph) for use as apostrophe, is a leading example of how the hard labor of ordering and clarification, aiming at what in ancient Greek is called “Kosmos”, can at any time be thrown back to chaos by applying short views and doubtful considerations. There’s been a discussion on this Mailing List in July of 1999, that was before the release of the 3.0.0 version of the Standard: “Apostrophes, quotation marks, keyboards and typography”, when the demand for simplification was already addressed with the corrections published as version 2.1: > Couldn't Unicode follow Microsoft and just remove the > recommendation that U+02BC be the recommended apostrophe character and > instead give U+2019 the dual meaning that it de-facto has already today? http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html [The quoted UTR#8 is now located at: http://www.unicode.org/reports/tr8/tr8-3.html] (The shift, as viewed at NamesList level, is now highlighted at http://charupdate.info#ambiguation.) On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer wrote further: > If anything, Unicode might have made a mistake in > encoding two of these that look identical. > How are normal users supposed to > find both U+2019 and U+02BC on their keyboards, > and how are they supposed to deal with incorrect usage? I never believed it could have been a mistake, since we know that Unicode encodes semantics, not glyphs. Were there no modifier letters at all, Unicode would have had to introduce an apostrophe character, because an apostrophe is not at all the same as a quotation mark and does not work the same way either.
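That distinction between U+2019 and U+02BC is visible directly in the character properties. A minimal Python sketch (standard library only; the sample strings are made up, and the plain \w regex is only an illustration of why the Lm/Pf difference matters, not the full UAX #29 segmentation):

    import re
    import unicodedata

    for ch in ("\u2019", "\u02BC"):
        # Pf (final punctuation) for U+2019 versus Lm (modifier letter) for U+02BC
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category={unicodedata.category(ch)}, isalpha={ch.isalpha()}")

    # A word-character regex keeps U+02BC inside the word but splits on U+2019.
    print(re.findall(r"\w+", "don\u2019t"))   # ['don', 't']
    print(re.findall(r"\w+", "don\u02BCt"))   # ['donʼt']

This is why a spell-checker or word-boundary pass sees "don" plus "t" when the close-quote character is used, but a single token when the modifier letter is used.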
By handling text, not theories, Ted?Clancy at Mozilla clearly shows us that ambiguating the apostrophe with a close-quote brings up counterproductive complications that impact severely the productivity of the users. What, now, about ?normal users?? To fix the issue, consider that wishing to stay all the life long with one and the same keyboard layout while at the same time, changing for a new smartphone every year or two, needs some explanation. I guess it is because keyboards don't display anyhing by themselves except keycap labels, so you're never pretty sure about them.?? We should consider, too, that before being a matter of finding on keyboard, the matter is about using. How are we supposed to choose the right one out of four apostrophe/quotes (U+0027, U+02BC, U+2019, U+2018) while many of us seem not to know or not to bother about where to place it? But supposed we do, it would effectively be much more useful to tell the machine whether we want to type an apostrophe or a quotation mark, and as about that, the existing key is enough (see T.?Clancy?s blog). Is managing nested quotes already implemented in word processing? I never heard it is. Definitely, here?s a point where the simplification wished for a widespread word processing software worsened considerably the working conditions of all demanding people. The gap between word processing and desktop publishing is the smaller. Adding characters on your preferred keyboard on Windows is very easy using the Microsoft Keyboard Layout Creator, which has an end-user UI. As the compiled drivers are not even Windows-versioned (from NT-4 upwards), you can deploy them in your company and share among your friends without precautions. That is what users are supposed to do. If they don?t, Microsoft is not supposed to force upon. By contrast, if you want a Kana toggle to toggle the apostrophe key between U+0027 and U+02BC (and the quotation mark between U+0022 and a dead key for all quotation marks), you must use the Windows Driver Kit (along with some other resources) plus the MSKLC. If you wish to see it working, you may download an experimental keyboard layout on the unfinished webpage http://charupdate.info. It exemplifies also the Third level solution and the Compose key solution. I hope that helps. Marcel Schneider? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 12 10:02:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 12 Jun 2015 17:02:31 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> On Fri, June 5, William_J_G Overington wrote: > Markus Scherer wrote: >>> How are normal users supposed to find both U+2019 and U+02BC on their >>> keyboards, and how are they supposed to deal with incorrect usage? > I replied: >> Would it be possible to have wordprocessing software where one uses >> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC >> for input and could there be a "show in colour mode" where U+2019 is >> displayed in cyan and U+02BC is displayed in red, while >> everything else is displayed in black? > I am wondering whether some existing software packages > might be able to be used for the character inputting part using customized > keyboard short cuts. 
> https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts > I realize that the cyan and red colours cannot be done at present, > yet I have now thought of the alternative for now of being able to test what is > in the text by using a special version of an open source font > where there are distinctive glyphs one from the other > for the two characters. If your goal is to check right now what apostrophes are in a given text, an easy way is to do a search for U+02BC and to ask the software to highlight all. Of PagePlus I?ve got only an expired demo version, but I can assure that on Word, a side pane may even show you the pages with all instances highlighted, and allows you to browse them. To start, press Ctrl+F and type a modifier letter apostrophe into the search bar, or select one in your text and then press Ctrl+F. Getting the apostrophes colored and with a distinctive glyph is possible too. As you are talking about changing the font, I suppose you are in front of raw text. In this case you can do a search-and-replace which gives all U+02BC a red color and another font, say Tahoma when the text is in Arial. Again I speak for Word, where a Plus button shows a Formatting button for the replacing text (replace by the same but with a font formatting on typeface and color), but I suppose PagePlus allows the same proceeding. About suggesting options, one might think about a blinking markup which would allow to find the problematic apostrophes even faster. As a shortcut for U+02BC I?d prefer CONTROL APOSTROPHE because it may occur more often. However, adding something on your keyboard using Right Alt (that is AltGr) is much more efficient because: ? You add whenever you want and nearly what you want (if no Kana and no chained dead keys, you get the needed characters on your keyboard the time you write to lists and fora). ? You are not bound to a given high-end software (the driver works whenever you type on your keyboard). ? You go on to be an active part of your communities (Unicode, Serif, ...) by sharing the resulting drivers with other people. Definitely, any shortcut for an apostrophe would slow down the writing speed, therefore Apostrophe is preferred on Base shift state. So you may design a variant keyboard layout with U+02BC instead of U+0027, even if that be the only change, and toggle between the new one and the usual one by means of your OS's facilities. Or you may choose to add a Kana toggle to toggle the apostrophe key directly inside the driver, but achieving this is somewhat longer. For an example, you may look at the unfinished page http://charupdate.info where there is already an experimental keyboard layout for download. With U+02BC MODIFIER LETTER APOSTROPHE. I hope that helps. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Fri Jun 12 11:07:31 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 12 Jun 2015 17:07:31 +0100 (BST) Subject: ISO committees (from Re: Tag characters and localizable sentence technology (from Tag characters)) In-Reply-To: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> References: <32759766.22530.1432292473336.JavaMail.defaultUser@defaultHost> Message-ID: <18980482.52510.1434125251835.JavaMail.defaultUser@defaultHost> In my post of 22 May 2015, reproduced below, is the following. > ... 
and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. As there has been discussion of ISO committees in this mailing list recently and it is clear that there are a number of people involved with ISO on this mailing list who have expert knowledge of the structures and rules of ISO committees, I write to ask advice. Regarding my idea that localizable sentence technology could be implemented in Unicode by reference to detailed codes in an ISO document (not yet written), which would be the best ISO committee to become in charge of producing that document please? William Overington 12 June 2015 ----Original message---- >From : wjgo_10009 at btinternet.com Date : 22/05/2015 - 12:01 (GMTST) To : unicode at unicode.org Subject : Tag characters and localizable sentence technology (from Tag characters) Tag characters and localizable sentence technology (from Tag characters) I refer to the following documents, the first about localizable sentences and the second about, amongst other matters, applying tag characters using a new encoding format. http://www.unicode.org/L2/L2013/13079-loc-sentance.pdf http://www.unicode.org/L2/L2015/15145r-add-regional-ind.pdf Starting from the idea of the markup bubble from the first document and applying the tag method and the ISO standard document method from the second document, there arises the following possibility for the future for localizable sentence technology. A single character would be added into Unicode, the name of the character being LOCALIZABLE SENTENCE BASE CHARACTER and then the plain text encoding of a particular localizable sentence would be defined as being expressed as the LOCALIZABLE SENTENCE BASE CHARACTER character followed by the code for the localizable sentence specified in the ISO [number] document, the code being expressed using tag characters. Please find attached a design for the glyph for the LOCALIZABLE SENTENCE BASE CHARACTER character. I designed the glyph by adapting and then combining the designs for localizable sentence markup bubble brackets from the first of the two documents referenced earlier in this text. Each localizable sentence, carefully written so as to avoid in use any reliance as to meaning on any sentence previously used in the same document, would have a meaning expressed in words and possibly also have a glyph: more commonly used localizable sentences each having a glyph yet not all other localizable sentences necessarily having a glyph, though some could have a glyph, as desired. William Overington 22 May 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Fri Jun 12 11:18:54 2015 From: doug at ewellic.org (Doug Ewell) Date: Fri, 12 Jun 2015 09:18:54 -0700 Subject: ISO committees Message-ID: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> William_J_G Overington wrote: > Regarding my idea that localizable sentence technology could be > implemented in Unicode by reference to detailed codes in an ISO > document (not yet written), which would be the best ISO committee to > become in charge of producing that document please? 
Sounds like something TC 37 might enjoy: http://www.iso.org/iso/iso_technical_committee.html%3Fcommid%3D48104 https://en.wikipedia.org/wiki/ISO/TC_37 -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From petercon at microsoft.com Fri Jun 12 11:51:29 2015 From: petercon at microsoft.com (Peter Constable) Date: Fri, 12 Jun 2015 16:51:29 +0000 Subject: ISO committees In-Reply-To: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> References: <20150612091854.665a7a7059d7ee80bb4d670165c8327d.fe61efd3d4.wbe@email03.secureserver.net> Message-ID: William (who, IIRC, lives in the UK) would need to start by engaging with BSA. People can't engage directly as individuals with TC 37 or any other ISO committee. ISO membership is not composed of individuals, but of countries, and representation is from each country's authorized standards organizations. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Doug Ewell Sent: Friday, June 12, 2015 9:19 AM To: Unicode Mailing List Subject: Re: ISO committees William_J_G Overington wrote: > Regarding my idea that localizable sentence technology could be > implemented in Unicode by reference to detailed codes in an ISO > document (not yet written), which would be the best ISO committee to > become in charge of producing that document please? Sounds like something TC 37 might enjoy: http://www.iso.org/iso/iso_technical_committee.html%3Fcommid%3D48104 https://en.wikipedia.org/wiki/ISO/TC_37 -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Fri Jun 12 15:13:15 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 12 Jun 2015 22:13:15 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> References: <1544763019.12884.1434121351347.JavaMail.www@wwinf2229> Message-ID: 2015-06-12 17:02 GMT+02:00 Marcel Schneider : > >> Would it be possible to have wordprocessing software where one uses > >> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC > > CONTROL and CONTROL+SHIFT cannot work on French keyboards where the > existing ASCII apostrophe is on the numeric row where there are also ascii > controls mapped matching the ASCII open brace that is itself mapped on > ALTGR (or CTRL+ALT) in order to generate instead the C0 control. > In general it is a bad idea to map any printable character or combining character or dead key with the CTRL or CTRL+SHIFT modifiers associated to any position in the alphanumerica part of the keyboard: this should remain reserved to map function keys or C0/C1 controls only, that local applications will use to assign them application-specific application functions. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Jun 12 22:11:49 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 12 Jun 2015 20:11:49 -0700 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <557247F2.9050902@efele.net> Message-ID: <557B9F75.7040107@efele.net> An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Sat Jun 13 01:24:11 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 13 Jun 2015 08:24:11 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <557B9F75.7040107@efele.net> References: <557247F2.9050902@efele.net> <557B9F75.7040107@efele.net> Message-ID: I don't agree with this Grévisse definition (and I'm not alone, other grammarians and dictionaries don't follow Grévisse, and even the French Academy disagrees). Maybe this is a form of composition, but the correct view is not that it creates a new word; it just means that words take new semantics in specific contexts (here, idiomatic expressions where the term "pomme" is a minor shift of meaning, which also occurs in "pomme de pin" = "pineapple", or "chou pomme", and as well in the alternate semantic of "pomme" related only to its round shape to designate a human head and by extension a person, also used in idiomatic expressions like "c'est pour ma pomme"). But the word itself is not different and in fact the etymology is the same; this was only a progressive extension of semantics that finally created an idiomatic expression, but not a new word. A compound word ("mot composé") needs a clear gluing, by a hyphen, or apostrophe, or absence of space and punctuation. Grévisse still records much good advice that is too frequently forgotten today, but here it goes too far into details that are not needed to preserve the semantics of the language. Another proof is the cuisine expression "pomme frite", which does not mean a fried apple but a fried potato: "pomme de terre" has been abbreviated to only "pomme", and this term even disappears now when the participle verb "frite", used as an epithetic adjective, is then substantivated. The idiomatic expression "pomme de terre" is not so much idiomatic; this is just an extension lemma added to the term "pomme" (apple). The composition has in fact never been clearly attested, but if it had been, hyphens would have been used long ago (many hyphens are now starting to disappear in compound words, replaced by direct gluing, which is admitted in most cases). 2015-06-13 5:11 GMT+02:00 Eric Muller : > On 6/10/2015 9:37 PM, Philippe Verdy wrote: > The French "pomme de terre" ("potato" in English, French vulgar synonym : > "patate") is a single lemma in dictionaries, but is still 3 separate words > (only the first one takes the plural mark), it is not considered a "nom > composé" (so there's no hyphens). > > Grevisse, Le bon usage, 11th edition, 1980, page 118, part 1 Elements of > the language, chapter 7 The words, section 3 Formation of new words, > article 2, Composition, very first paragraph (179 overall): > > --- > By *composition*, language creates new words, either by combining simple > words with existing words, or by preceding these simple words with > syllables that have no independent existence: > > *Chou-fleur, gendarme, pomme de terre, contredire, désunir, paratonnerre. * > > A word, despite being formed of graphically independent elements, is > *composed* as soon as it brings to mind, not the distinct images of each > of the words from which it is composed, but a single image. Thus the > composites *hôtel de ville, pomme de terre, arc de triomphe* each remind > of a unique image, and not of the distinct images of *hôtel* and of > *ville*, of *pomme* and of *terre*, of *arc* and of > *triomphe.
* > > *---* > > *(h?tel de ville* = city hall; *pomme* = apple, *de* = of, *terre* = > earth) > > Paragraph 181, 3rd remark: > > --- > Sometimes the elements composing [the word] are welded in a simple word: > *Bonheur**, contredire, entracte; *sometimes they are connected by an > hyphen: *chou-fleur, coffre-fort;* sometimes they stay independent > graphically: > > > > *Moyen ?ge, pomme de terre. --- *(?Le Gr?visse? as we affectionately call > it, or *Le bon usage / French Grammar with remarks on today?s french > language*, is a must-have for the student of French. It is encyclopedic > in its depth, and has tons of examples and counter-examples. Interestingly, > the French wikipedia page says ?a descriptive grammar of French?, while the > English wikipedia page says ?a prescriptive grammar?; it?s both!) > > I agree that we don?t need a new space coded character. I was just > pointing out that some of the arguments for a new coded character for the > apostrophe in *don?t* apply equally well to the spaces in the word *pomme > de terre*. > > Eric. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 13 01:28:03 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 13 Jun 2015 06:28:03 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: Nice article, as I recall. (Been a long time.) Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Kalvesmaki, Joel Sent: Friday, June 5, 2015 7:27 AM To: Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N'T."Language59, no. 3 (1983): 502-513. It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435 From verdy_p at wanadoo.fr Sat Jun 13 02:02:40 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 13 Jun 2015 09:02:40 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). 
- and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. So there remains in practice U+02BC and U+02C8 for this apostrophe letter (which one you'll use is a matter of preference but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. in Polynesian languages where the distinction was made even more clearer by using right or left rings U+02BE/U+02BF, or glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonnant with two variants like in Arabic or even U+02B0 when this is just a breath without stop: the full range range U+02B0-U+02C1 offers much enough variations for this letter if you need slight phonetic distinctions). 2015-06-13 8:28 GMT+02:00 Peter Constable : > Nice article, as I recall. (Been a long time.) > > > Peter > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of > Kalvesmaki, Joel > Sent: Friday, June 5, 2015 7:27 AM > To: Unicode Mailing List > Subject: Re: Another take on the English apostrophe in Unicode > > I don't have a particular position staked out. But to this discussion > should be added the very interesting work done by Zwicky and Pullum arguing > that the apostrophe is the 27th letter of the Latin alphabet. Neither > U+2019 nor U+02BC would satisfy that position. See: > > Zwicky and Pullum 1983 Zwicky, Arnold M., and Geoffrey K. Pullum. > "Cliticization vs. Inflection: English N'T."Language59, no. 3 (1983): > 502-513. > > It's nicely summarized and discussed here: > http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ > > jk > -- > Joel Kalvesmaki > Editor in Byzantine Studies > Dumbarton Oaks > 202 339 6435 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:05:15 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:05:15 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1245207844.9505.1434204315176.JavaMail.www@wwinf1d14> On Fri, Jun 5, 2015, David Starner wrote: > On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis wrote: >> I agree that conflating apostrophes and quotes is a source of >> problems, however, existence of the MODIFIER LETTER [same glyph as >> used for English contractions] in Unicode is a coincidence which >> should not have an effect on usage of apostrophes in English. > Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely then creating a new one. In fact this would be a return to the state until version 2.0.0. 
http://www.unicode.org/Public/2.0-Update/NamesList-1.txt Since version 3.0.0 (or more precisely, since update 2.1), U+2019 is preferred for apostrophe, not U+02BC any longer. http://www.unicode.org/Public/3.0-Update/NamesList-3.0.0.txt Prior to this discovery, I supposed it could have been later ISO prescriptions which triggered it the wrong way, but now it's impossible ISO initiated the move of preferred apostrophe from U+02BC to U+2019. This change took place not sooner than in update 2.1, whereas the merger was at 1.1 and ISO stands for stability. So ISO could never agree that the preferred character for English apostrophe stopped to be U+02BC and started to be U+2019, against the Stability Policy, and presumably using a gap in this policy which possibly don?t cover usage recommendations... I must do some more research in the Archives to find out more about why the apostrophe and the single close quote were ambiguated?a process that needs even a new word to put on it, as ordinarily everybody works for disambiguation... However, the 1999 Mail Archive already shows it was for simplification's sake, in word processing software. Could anybody tell us more about this issue? IMHO, the mischievous apostrophe that we use today, is due to a shortcut, narrowed design, and uncomplete check-ups. Briefly, the disconnect was between Unicode whose global approach lead to complete solutions including all you need for text handling and word processing, and Microsoft whose industrial approach prioritized the ready make-up of output appearance, letting out of scope the subsequent lifestages of text. The Windows code page 1252 apostrophe-close-quote looks nice on screen and in the documents, but as soon as you need to convert quotes from British to American or from free to nested, the only way to prevent your text from becoming unusable is to hand-process the quotes one by one. The money you saved when purchasing the software, is lost thousandfold at use. Microsoft?s choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:21:14 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:21:14 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1936848241.9690.1434205274475.JavaMail.www@wwinf1d14> On Sun, Jul 18, 1999, Markus Kuhn wrote: > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0557.html > I addition, I feel that the current ISO 8859 oriented national keyboard > standards are not adequate for modern Unicode-era word processing > practices, as they put obsolete typewriter characters such as U+0027 on > too prominent keys, while they have no key positions for the extremely > frequently needed typesetting characters that are for instance supported > by CP1252 (directional single and double quotes, en and em dashes, > etc.). Software either has to use shaky algorithms to make educated > guesses on which character the user might have meant (such as Word tries > to do), or sequences of ASCII characters are interpreted with new > semantics (such as both TeX and Word do), in order to give typists some > compromise access to these characters. > > I think it is urgent time to revise national keyboard standards here. 
We > really need standardized ways to easily enter say at least > > 2018 LEFT SINGLE QUOTATION MARK > 2019 RIGHT SINGLE QUOTATION MARK > 201C LEFT DOUBLE QUOTATION MARK > 201D RIGHT DOUBLE QUOTATION MARK > 2013 EN DASH > 2014 EM DASH > > on keyboards for English language users, and corresponding extensions on > other national keyboard standards. This might be a good opportunity to > introduce on US keyboards the Level 2 Select key (AltGr), while on > European keyboards is is probably sufficient to just add appropriate > labels to a number of new Level 2 Select positions. > On Sun, Jul 18, 1999, Mark Davis wrote: > http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML017/0558.html > However, I agree that having the curly quotes (single and double) on the > standard keyboard would be handy. I switch back and forth between a Mac and > Windows. On the Mac, the option key (a second level shift) has always made > this easy. The installable Windows international keyboard is not nearly so > useful, since you can't just leave it on all the time (it messes up your > used of quotation marks). On Thu, Jun 4, 2015 at 2:38 PM, Markus Scherer wrote: > How are normal users supposed to > find both U+2019 and U+02BC on their keyboards, > Yes this may be the main issue, how to get at hand U+20BC, U+2019 and U+2018 as well, plus the actual U+0027, on keyboards that are derived from typewriters? ones. Word processors are overasked with management of all four, while many users whish to stay typing ?apostrophe? for all of them. And not to change for another keyboard driver(?). A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout, for example in the deadlist of apostrophe on the US International keyboard, a layout where U+2019 is already found, along with U+2018. You may choose a double stroke on Apostrophe to generate the modifier letter. But as this layout obviously is not so useful, you?ll prefer to get them on the US Standard layout, or depending on where you live, on the UK standard or extended or any other layout. A more achieved solution is obtained with the Windows Driver Kit, a free development kit which allows to implement a Kana toggle, to toggle Apostrophe on the US Standard keyboard between U+0027 and U+02BC *or* U+2019. The least used among all three will be put into the deadlist, when adding one dead key on this layout, say Grave. Then, [Grave] [Apostrophe] will result in the missing apostrophe character. > how are they supposed to deal with incorrect usage? If the document is already incorrect, there will be nothing to do IMHO than check them one by one. Theoretically, word processors could integrate an exhaustive checking algorithm with an exhaustive dictionary. Which such a tool, there would be no ?Apostrophe Catastrophe? as it has been called: > http://www.newrepublic.com/article/113101/smart-quotes-are-killing-apostrophe > (found by a search engine). So, on actual keyboard layouts, avoiding the Apostrophe Catastrophe would then have been unfeasible?the like as with actual consumption habits, avoiding a number of other catastrophes is unfeasible as well... Nevertheless, this morning I opened once more the Microsoft Keyboard Layout Creator. Ten minutes later I got the finished complete package of the US American keyboard layout with U+02BC MODIFIER LETTER APOSTROPHE and all English quotation marks in one dead key on ?Grave?, that is key number E00 (ISO/IEC?9995-1). 
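Conceptually, such a dead key is just a lookup from the next keystroke to an output character. The sketch below models that in Python; the particular pairs are assumptions chosen to match the description that follows, not the actual contents of the published drivers:

    # Rough model of a dead-key table; the pairs below are illustrative
    # assumptions, not the real mappings of the layouts mentioned here.
    DEAD_GRAVE = {
        " ": "\u0060",   # dead key + space gives the grave accent itself
        "'": "\u02BC",   # MODIFIER LETTER APOSTROPHE
        "[": "\u201C",   # LEFT DOUBLE QUOTATION MARK
        "]": "\u201D",   # RIGHT DOUBLE QUOTATION MARK
        "9": "\u2018",   # LEFT SINGLE QUOTATION MARK
        "0": "\u2019",   # RIGHT SINGLE QUOTATION MARK
    }

    def apply_dead_key(next_key: str) -> str:
        # Undefined combinations typically fall back to the dead-key
        # character followed by the pressed key.
        return DEAD_GRAVE.get(next_key, "\u0060" + next_key)

    print(apply_dead_key("'"))   # ʼ  (U+02BC)
    print(apply_dead_key("["))   # “  (U+201C)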
The same way I made up the keyboard layout for the United Kingdom, which uses AltGr, so the apostrophe and all quotes are also on AltGr. Ten minutes, again. If you don?t use the grave accent (or AltGr), there is strictly no change on these keyboard layouts, because I loaded the original Windows US and UK layouts into the MSKLC. If you use the grave accent, you must type a whitespace after hitting the grave key to get the grave accent (in conformance to the standard behavior of dead keys). ? To get the modifier letter apostrophe, type ?grave? - ?apostrophe?. ? To get the first level quotation marks (UK: single quotes; US: double quotes), type ?grave? followed by a square bracket, opening or closing (left or right). ? To get the nested quotation marks (UK: double quotes; US: single quotes), type ?grave? followed by whether ?9? or ?0?, that is, the key where you have the opening or closing parenthesis. For United Kingdom only: Additionally to the above, you may type also the following: ? To get the modifier letter apostrophe, type AltGr?+??apostrophe?. ? To get the single quotation marks, type AltGr?+?a?square bracket, opening or closing (left or right). ? To get the double quotation marks, type AltGr?+??9? or AltGr?+??0?. These MSKLC-generated keyboard drivers can be used on > all Windows versions (from NT?4.0 upwards) and > all system architectures (32?bit and 64?bit). > In the documentation of the MSKLC, Microsoft writes: ?On Vista [and later], the keyboard layout will automatically be added to the language bar on install and removed on uninstall.? > You may define a shortcut to switch, add a nice icon in the language bar / Task bar, or set it as default. They are named ?kbdenukw? and ?kbdenusw?. ?w? stands for ?Wholesome?. This is completely free and unlicensed software. It can be downloaded from now on at the following short URLs: kbdenukw: http://bit.ly/1QVeby6 kbdenusw: http://bit.ly/1MRxGab However, this being the 1.0 version, and the goal was to be fast, I forgot the En and Em dashes, and that I would add U+00B1, U+2260, U+00A0... If there will be a number of downloads, a 2.0?version could follow and be announced here. To get the MODIFIER LETTER APOSTROPHE ready on the apostrophe key like today the close-quote as smart apostrophe, you must disable the smart quotes in your word processor, and then add an autocorrect to convert U+0027 to U+02BC. If your word processor allows you to set apart the smart single and double quotes, you may enable the double quotes. I hope that helps. ? Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 13 09:31:02 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Jun 2015 16:31:02 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <271675363.9860.1434205862181.JavaMail.www@wwinf1d14> On June 3, 2015, Ted Clancy wrote: > https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/ I wish to thank you personally for having brought up this issue, as well as Mr?Grosshans for having posted the URL launching this thread. However, your solution is not complete, and I don?t agree fully with all your statements. So let?s try to check up what?s the matter, and then look what might be done. First, the Unicode Technical Committee is *not* very wrong. 
A look in the Standard 2.0.0 or, even simpler, a glance at the first NamesList in the UCD, that is the source code for the Version 2.0.0 Code Charts, shows that originally, the UTC recommended the use of U+02BC MODIFIER LETTER APOSTROPHE for the English apostrophe as well as for the apostrophe on the whole, and to reserve the use of U+2019 RIGHT SINGLE QUOTATION MARK for what it is: close-quote. It wasn’t until the 2.1 update that the preferred character for apostrophe was shifted from U+02BC to U+2019, to conform with the usage (and presumably at the demand) of Microsoft, which did not comply with the Standard, despite being a full member of the Unicode Consortium (and having thus agreed at the beginning that apostrophe should be U+02BC). I’m pretty sure that when they moved the apostrophe preference from U+02BC to U+2019, the Unicode Technical Committee and the Unicode Editorial Committee acted against their will. My opinion is inferred from the original UTC position and from comparing two versioned NamesList extracts among those displayed at charupdate.info#ambiguation. Second, your solution is *not* complete. Even if word-processors managed nested quotes, one single key for all occurring quotation marks of a given locale, as British English or US English, would scarcely be sufficient. Here’s why. Everybody knows that quotes are used not only to quote, but also to delimit, to warn or generally to flag otherwise than as a quotation. The latter occurs commonly when the writer (and by transposition, the speaker, making a quotes gesture) wants to flag a word or an expression as being controversial, not true, not in his belief, or ironical. From this they are sometimes called “irony quotes”. Languages that use angle quotation marks (chevrons) to quote, use comma quotation marks to flag. In English, I suppose that you need to use the “other” quotation marks to flag. So in US English you would flag using single quotes, while in British English you would use double quotes, as in French. However, I don’t know how that works inside quotations (while in languages such as French and German this is no problem). Therefore, the user should always have means to type exactly the quotes he wishes to type. This will result in the need of at least one dead key or some supplemental dead list entries, and/or supplemental AltGr positions, or even supplemental shift states (Kana). No single key position can ever do the whole job. Third (but this is an off-topic discussion in this thread and is set aside in your blog post), the close-quote as an apostrophe is not good for French either, regardless of how many words are around. The use of U+2019 as apostrophe hasn’t led in French to any “Apostrophe Catastrophe” only because in French, few people use single comma quotes (in rare cases or for special purposes), and because properly leading apostrophes are often placed otherwise, as in “Y’a” for “Il y a”, instead of “’Y a”. What shall we do? As you draw it, the so-called smart quotes algorithm must be reengineered and cannot stay working as it does, so users must be informed that to type “unexpected” quotes, they’ve to hit the key two times, or to type another character just after. But users must also make an effort by themselves instead of wishing to stay with the inherited keyboard layout regardless of what changes are on-going, and at the same time, to get more Unicode characters as reasonably supportable on this old keyboard.
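One possible shape for such a reworked rule (purely a sketch, assuming that an ASCII apostrophe typed between two letters should become U+02BC and that every other occurrence is left to the quote-pairing logic; no existing word processor is claimed to do exactly this):

    import re

    APOSTROPHE = "\u02BC"   # MODIFIER LETTER APOSTROPHE

    def smart_apostrophe(text: str) -> str:
        # Replace a typed ASCII ' with U+02BC only when flanked by letters;
        # leave all other occurrences alone for the quote-pairing logic.
        return re.sub(r"(?<=[^\W\d_])'(?=[^\W\d_])", APOSTROPHE, text)

    print(smart_apostrophe("don't say 'maybe'"))
    # donʼt say 'maybe'  (only the intra-word apostrophe is rewritten)

A rule this simple still misses leading and trailing elisions such as “’twas” or “runnin’”, which are exactly the cases the thread identifies as hard, so it is at best a starting point.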
In other words, the gap between the expected rendering and the actually conceded input must be filled up whether by using a set of customised (or perhaps one day, standardised) autocorrect entries (see one suggestion at charupdate.info#curly) or by typing appropriate characters on extended keyboard layouts (which don?t lead to change for another hardware, except for special purposes). Thanks again, because without this discussion, I?would have released more keyboard layouts with the wrong apostrophe! Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 13 10:10:06 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 13 Jun 2015 15:10:06 +0000 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: I should qualify my statement. The Zwicky and Pullum article was a nice piece of linguistic analysis regarding the morphological characteristics of ?n?t?. Their remark about apostrophe, however, was not so much about orthography ? which was not the focus of their article ? but was rather a way of putting an exclamation on their findings. When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That?s because there isn?t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: Di Sciullo, Anna Maria, and Edwin Williams. 1987. On the definition of word. (Linguistic Inquiry monograph fourteen.) Cambridge, Massachusetts, USA: The MIT Press. Peter From: verdyp at gmail.com [mailto:verdyp at gmail.com] On Behalf Of Philippe Verdy Sent: Saturday, June 13, 2015 12:03 AM To: Peter Constable Cc: Kalvesmaki, Joel; Unicode Mailing List Subject: Re: Another take on the English apostrophe in Unicode I disagree: U+02BC already qualifies as a letter (even if it is not specific to the Latin script and is not dual-cased). It is perfectly integrable in language-specific alphabets and we don't need another character to encode it once again as a letter. So the only question is about choosing between: - on one side, U+02BC (the existing apostrophe letter), and other possible candidate letters for alternate forms (including U+02C8 for the vertical form, and the common fallback letter U+00B4 present in many legacy fonts for systems built before the UCS was standardized and using legacy 8-bit charsets such as ISO 8859-1). - and on the other side, U+2019 where it is encoded as a quotation punctuation mark (like also the legacy ASCII single quote) Note that U+00B4 (from ISO 8859-1) has also been used in association with U+0074 (from ASCII) to replace the more ambiguous ASCII quote U+0027 by assigning an orientation: the exact shape of these two is variable, between a thin rectangle, or a wedge, or a curly comma (shaped like 6 and 9 digits), as well as the exact angle when it is a wedge or thin rectangle (these characters however have been used since long in overstriking mode to add accents over Latin capital letters, so the curly comma shapes are very uncommon and they are more horizontal than vertical and U+00B4 will be a very poor cantidate for the apostrophe that should have a narrow advance width. 
So there remains in practice U+02BC and U+02C8 for this apostrophe letter. Which one you’ll use is a matter of preference, but U+02C8 will not be used if there are two distinct apostrophes in the language (e.g. in Polynesian languages, where the distinction was made even clearer by using the right or left half rings U+02BE/U+02BF, or the glottal letters U+02C0/U+02C1 if that letter has a very distinctive phonetic realisation as a plain consonant with two variants, as in Arabic, or even U+02B0 when it is just a breath without a stop: the full range U+02B0–U+02C1 offers more than enough variation for this letter if you need slight phonetic distinctions).

2015-06-13 8:28 GMT+02:00 Peter Constable >:
Nice article, as I recall. (Been a long time.)
Peter
-----Original Message-----
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Kalvesmaki, Joel
Sent: Friday, June 5, 2015 7:27 AM
To: Unicode Mailing List
Subject: Re: Another take on the English apostrophe in Unicode
I don't have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See:
Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: English N'T." Language 59, no. 3 (1983): 502–513.
It's nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/
jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From mark at macchiato.com Sat Jun 13 10:27:31 2015
From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=)
Date: Sat, 13 Jun 2015 17:27:31 +0200
Subject: Another take on the English apostrophe in Unicode
In-Reply-To:
References:
Message-ID:
On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable wrote:
> When it comes to orthography, the notion of what comprise words of a
> language is generally pure convention. That’s because there isn’t any
> single *_linguistic_* definition of word that gives the same answer when
> phonological vs. morphological or syntactic criteria are applied. There are
> book-length works on just this topic, such as this:
In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)
Mark
*« Il meglio è l’inimico del bene »*
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From charupdate at orange.fr Mon Jun 15 01:23:46 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 15 Jun 2015 08:23:46 +0200 (CEST)
Subject: Another take on the English Apostrophe in Unicode
Message-ID: <39908195.1592.1434349426297.JavaMail.www@wwinf2229>
On Fri, Jun 12, 2015, Philippe Verdy wrote:
> 2015-06-12 17:02 GMT+02:00 Marcel Schneider :
>>> Would it be possible to have wordprocessing software where one uses
>>> CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC
>> CONTROL and CONTROL+SHIFT cannot work on French keyboards where
>> the existing ASCII apostrophe is on the numeric row where there are
>> also ascii controls mapped matching the ASCII open brace that is itself mapped
>> on ALTGR (or CTRL+ALT) in order to generate instead the C0 control.
> In general it is a bad idea to map any printable character or combining character or dead key with
> the CTRL or CTRL+SHIFT modifiers associated to any position in the alphanumeric part
> of the keyboard: this should remain reserved to map function keys or C0/C1 controls only,
> that local applications will use to assign them application-specific functions.

Even the Language bar uses the upper row to define shortcuts with Control, Shift+Control and Shift+Alt to switch between keyboard layouts, and these are prioritized. So to test the shortcuts with Clavier+, I first had to remove the shortcuts in the Language bar. Then the way was free to test Mr Overington’s shortcuts for curly apostrophes (I will send the result just after). When I deleted the shortcuts in Clavier+ to test your advice, I found no application shortcuts for Ctrl+4, while the keys 1, 2, 5 and 0 are usually mapped as Word shortcuts with CONTROL, and the heading formatting is with ALT. But indeed, among ASCII controls I found eight on the French keyboard:

//VirtualKey |ScanCd |ISO_# |Ctrl
{VK_ESCAPE /*T01 */ ,0x001b
{VK_CANCEL /*X46 */ ,0x0003
{VK_BACK /*T0E E13*/ ,0x007f
{VK_OEM_6 /*T1A D11*/ ,0x001b
{VK_OEM_1 /*T1B D12*/ ,0x001d
{VK_OEM_5 /*T2B C12*/ ,0x001c
{VK_RETURN /*T1C C13*/ ,'\n'
{VK_OEM_102 /*T56 B00*/ ,0x001c

On the alphanumerical block, there are always the same five, three among them near the Enter key. The British-American apostrophe key is exempt from Controls too. This is probably why Mr Overington wants to use CONTROL and SHIFT+CONTROL for U+2019 and U+02BC, as custom application shortcuts.

I had once defined a universal Latin layout in the MSKLC, but as there is neither Kana nor chained dead keys, I allocated some dead keys (among a total of about 25) on CONTROL positions where I supposed there wouldn’t be any shortcuts in any application, as on ?, ^, and even the high digits on the upper row. It must be at http://dispoclavier.monsite-orange.fr, and somebody was very astonished, precisely because this may become buggy. Even worse, it is disabled: Winwordc.exe did not process these dead keys. Other applications did, as I remember. But the layout was far too hard to remember, as I filled in the double-diacritic letters at the next free positions in the alphabet. This way I could allocate 1,921 Unicode characters (by editing the KLC source in spreadsheets), but now that I know and use the WDK, I won’t make such a layout again. Now I’m trying to put even more characters, but with chained dead keys, for double-diacritic letters and for easy-to-remember compose sequences. For example, you will enter U+01BF LATIN LETTER WYNN by typing simply COMPOSE, w, y, n, n, or less if not needed to disambiguate. Same for digraphs and ligatures.
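To make the compose idea concrete, here is a minimal sketch (my own Python illustration, not Marcel’s actual driver code; the sequence table is invented for the example) of how a chain of dead-key strokes can be resolved against a table of compose sequences, accepting a shorter prefix as soon as it is unambiguous:

    # Minimal sketch (Python) of compose-sequence resolution; the table
    # entries are invented for the example.
    COMPOSE_TABLE = {
        "wynn": "\u01BF",   # LATIN LETTER WYNN
        "thorn": "\u00FE",  # LATIN SMALL LETTER THORN
        "oe": "\u0153",     # LATIN SMALL LIGATURE OE
    }

    def compose(keys):
        """Return the composed character once `keys` matches unambiguously, else None."""
        matches = [seq for seq in COMPOSE_TABLE if seq.startswith(keys)]
        if len(matches) == 1:
            return COMPOSE_TABLE[matches[0]]   # "wyn" already identifies "wynn"
        if keys in COMPOSE_TABLE:
            return COMPOSE_TABLE[keys]         # complete sequence typed in full
        return None                            # still ambiguous: keep collecting keys

    assert compose("wyn") == "\u01BF"          # COMPOSE, w, y, n suffices here

Two distinct apostrophe letters, say U+02BC and U+2019, could be given separate, equally memorable sequences in exactly the same way.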
The test version I use is now adapted to type the letter apostrophe U+02BC (I?ll send after to the List some news about). Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 01:28:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:28:44 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <385492682.1634.1434349724328.JavaMail.www@wwinf2229> On Fri, June 5, William_J_G Overington wrote: > I replied: >> Would it be possible to have wordprocessing software where one >> uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input [...] > I am wondering whether some existing software packages > might be able to be used for the character inputting part using customized > keyboard short cuts. There is a very good shortcut utility for Windows which doesn?t modify the registry except to launch the app automatically: http://utilfr42.free.fr/util/Clavier.php Using this software, I tried, you can define CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input.?After defining the shortcut by typing it, you will have to paste the character into the text editing field. You can specify that these shortcuts work only in the word processing software you use, as you wish to. To achieve this, pick the ?target?icon, drag and drop it into an open window of the target application, its name will be added in the bar and you?ll have to choose that the shortcut be enabled in this software. You may even define that the shortcuts work with LEFT CONTROL only, in order to keep RIGHT CONTROL for other shortcuts with APOSTROPHE. As CONTROL SHIFT is not easy enough to type for character input, I?d suggest to define CONTROL L for U+2019, and to add CONTROL SEMICOLON for U+2018. This is because on the square bracket keys, there are already control characters allocated on CONTROL shift state. On these keys you may however choose LEFT ALT or RIGHT ALT for a shortcut. BTW: Clavier+ allows even to command the pointer and to enter mouse clicks, so that a shortcut can execute an action on the graphic interface of the app. This is very useful to add app shortcuts in apps that don?t allow customising. It?s free, and the interface can be switched to English. To download your copy: http://utilfr42.free.fr/util/Clavier.php > I have now thought of the alternative for now of being able to test what is > in the text by using a special version of an open source font where there are > distinctive glyphs one from the other for the two characters. I discovered that when U+02BC is input by autocorrect in replacement of U+0027, and the current font does not contain U+02BC (for example Lucida Console), then U+02BC is displayed in the fall-back font (Courier New) and the font-setting is *not* altered. This way, you have the MODIFIER LETTER APOSTROPHE displayed in a distinctive font at input. This is observed in Microsoft Word Starter, where every out-of-font character typed as such triggers the font-setting to fall-back, which is very annoying. Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Mon Jun 15 01:40:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:40:57 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> On Wed, Jun 10, 2015, Ted Clancy wrote: > The idea that words with apostrophes aren't valid words is a regrettable myth that exists in English, > which has repeatedly led to the apostrophe being an afterthought in computing, leading to situations like this one. [...] > I imagine spell-checkers of the future could underline a word where I erroneously use a closing quote instead of an apostrophe, or vice versa. > There are other possible solutions too, but I don't want to get into a discussion about UI design. I'll leave that to UI designers. There?s however one UI whose design is a matter of everybody, and every typist should be interested in, that is, we all, since everybody does at least partly a typist?s work. We?re all typists, and we?re all invited to help design that UI for ourselves and for our relations, friends, colleagues. This week-end I?switched my current apostrophe from U+2019 to U+02BC by updating my (already customised, but still unfinished) French keyboard layout. As we?ve already one prominent dead key, I?d added two others on Base shift state. From now on, I type GRAVE ? APOSTROPHE / QUOTATION MARK for a single or double opening quote, and get the closing one by using the ACUTE dead key. This recalls some legacy practice where spacing accents were used. The typographic apostrophe U+02BC is CIRCUMFLEX ? APOSTROPHE. (I?d U+2019 on the apostrophe key when Kana was toggled off!) In addition, I?ve added an autocorrect for U+0027 to be replaced with U+02BC when writing text on Microsoft Word Starter. The idea that we can?t touch at our keyboard except on keycaps as they?re labeled, or that we can at most change for another predefined layout which often doesn?t match these labels, is another regrettable widespread myth. As users, we confine ourselves in a receptive and waiting position, wishing and suggesting, and doing all imaginable and improbable things except adding a handful of characters on our keyboard straight before us, while in the meantime, in obliging anticipation, the world?s biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at http://www.microsoft.com/en-us/download/details.aspx?id=22339 If this call were taken serious, all these discussions about keyboards would take another turn. Every corporate manager would make sure that his employees use appropriate keyboard layouts to save time and enhance output quality. To achieve this, he would not hesitate one minute to put himself at the place of a UI designer and to get that poor keyboard UI molted to a performative worktool. And to deploy the result at corporate level. The MSKLC is worth spending a day to get started with and to create a completed keyboard layout from one?s preferred one, because this will save much time and anger. You may design one where apostrophe and single quotes are far one from another (as on Saturday?s kbdenusw), to avoid mistyping and spelling errors without having to wait for any better on-screen UI. However, I won?t hide that the MSKLC does not allow to chain dead keys, nor does it support Kana shift states, things that are useful for a number of languages using latin or other scripts and to emulate a compose functionality. 
But all this plus a Kana toggle ends up to be rather simple with additional resources to program and compile the driver in C, all free of charge as well, namely a DDK or WDK https://www.microsoft.com/en-us/download/details.aspx?id=11800 The ?kbdenukw? and ?kbdenusw? of Saturday, no matter whether they were downloaded or not, are now available in their 2.0?version, which differs from the previous by including the two missing dashes. The goal of this exercise is to prove that at this funny speed, and with such a facility of adding characters on the keyboard, there is no more reason to deprive oneself of the Unicode non-ASCII characters one needs. You may open the included *.klc source?a file format which Microsoft designed for sharing?in the Microsoft Keyboard Layout Creator and in a text editor. For more information, please see my related previous mail. (The AltGr views of the US version show the dead key content.) kbdenukw: http://bit.ly/1dFMFb1 kbdenusw: http://bit.ly/1IWO8aJ Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 01:45:25 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 08:45:25 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <1903098365.1884.1434350725622.JavaMail.www@wwinf2229> At the following URL, a forum page illustrates the way users struggle since a decade (and more) against the chaotic confusion Microsoft perpetuated despite of Unicode, forcing the Committee to adopt its short views: http://painintheenglish.com/case/383 Please note Persephone?s workaround, which is a way to avoid the Apostrophe Catastrophe without turning off the ?smart quotes?. This is the smartest thing I?ve ever read about ?smart quotes?. This workaround, which I ignored, might explain why Microsoft refused to reengineer the smart quotes algorithm: Users have just to type two quotes and to delete one! However, the problem of *handling* and *processing* such text stays unresolved. Users are conscious about a quote not being an apostrophe, this page shows. But they are compelled to use close-quotes for simulation of curly apostrophes. This works on the spot, but it brings bad quality text files. Regardless of whether this matches Microsoft?s business model or not, there is no right of dissuading font-designers from publishing complete fonts! Allocating the same glyph (U+2019) to a supplemental code point (U+02BC) is very easy when creating a font, but as Microsoft compelled Unicode to tell eveybody that there is no need of U+02BC in English and that our text files must not contain U+02BC, we lost sixteen years and thousands of fonts (including Arial Unicode MS, which surprisingly is lacking U+02BC!) are nearly unusable with correct text files because they don?t include any typographical apostrophe. Except that U+0027 is curly in many ornamental fonts, to meet users? expectations. A ready workaround would thus be to disable the smart quotes and keep U+0027 as apostrophe (only), while entering U+2018/U+2019 by any means, and to replace eventually all instances of U+0027 by U+02BC. Or by U+2019 but only just before printing, never to publish in PDF and even less to send as a file or to publish on the internet! 
As usual, the status quo, which originated from legacy code pages (already considerably enriched compared to ISO 8859-1, be it said to the honor of Microsoft), has been justified a posteriori with a lot of mostly biased arguments:

– The approval of U+2019 as apostrophe is based on glyphs and rendering and on a static view of text, excluding from its scope the further word processing across documents and languages.
– Unicode’s principles are misapplied and even misinterpreted. The fact that different meanings across languages do not need different code points is applied inside a given language, to argue that distinguishing semantics by different code points is not needed.
– Some arguments have become obsolete since they were uttered, such as U+02BC being a “spacing clone of Greek smooth breathing mark” (a note removed in 5.1) and thus never slanted, while in most fonts it has the same shape as U+2019, slanted or curly.
– Another fallacy cites as a proof the use of U+2019 as apostrophe in some locales, while this is already based on CP1252-inspired practice against the spirit of Unicode.
– Blurring the issue by enumerating the various values of the English apostrophe, which sometimes leads to including the close-quote function as a punctuation apostrophe...

Whatever, there is nothing of the status quo worth saving. Unfortunately, the mass of wrongly encoded text goes on increasing while discussions follow one another. At least, that does not hinder publishing good books and newspapers and sending nice mails (on paper, where nobody is asking what the code point is, because there is no need). About other media, it must be said that hand-processing wrong text files increases the job volume, which is :( for managers, but :) for workers, provided they are really paid for it.

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
Otherwise, even if it is not a minus sign, it can be: > - a connector between words in compound words (hyphen) > - a trailing mark at end of lines for indicating a word has been broken in the middle (but remember that I asked previously for another character for that role because this word-breaking hyphen is not necessarily an horisontal hyphen (in dictionaries I've seen small slanted tildes, or slanted small equal signs, to make the distinction with true hyphens used in compound words, also because sometimes these breaks are not necessarily between two syllables in "pocket books" with very narrow columns and minimized spacing) > - a bullet leading items in a vertical list (this should be an en dash, follwoed by some spacing) > - a punctuation (not necessarily at begining of line) marking the change of person speaking (very common in litterature, notably in theatre). > As a connector between words, there's a demonstrated need of differentiating regular hyphens, longer hyphens (preferably surrounded by thin spaces) for noting intervals (we can use the EN DASH for that), long hyphens between two separate names that are joined (example in propers names, after mariage, there's an example in France, where INSEE encodes it for now using TWO successive hyphens, which are also used in French identity cards, passports, social security green cards...). In most fonts, the glyph of the hyphen-minus U+002D is the same as the one of the hyphen U+2010, while the minus sign U+2212 is longer and higher, at half-height of digits, to match between or before, as opposed to the hyphen and hyphen-minus which are positioned at half height of lowercase letters. As a minus sign, these work well only with Elzevir digits. This is why, in most fonts, the hyphen-minus U+002D is very unpleasant when used as a minus sign, especially when the plus sign, equals sign and other operators are present too. In this, the hyphen differs from the apostrophe U+0027, whose differenciated characters (apostrophe U+02BC and single close-quote U+2019) have exactly the same glyph. But hyphen and apostrophe resemble in the fact that in many fonts, only the paired or assorted character is present, while the other is missing. So even in Arial, where the letter apostrophe U+02BC is present, the hyphen U+2010 is missing. The user is supposed to use U+002D as a hyphen and U+2212 as the minus sign. The system hyphen displayed in automatic word break at line end, is converted to U+002D for PDF. This isn?t ideal, as you point out, because to reverse the word break, one can?t simply replace all U+002D by nothing. Word processors allow to remove all instances of (U+002D, EOL), but this can delete some orthographic hyphens. The solution would be to use U+2010 for orthographic hyphens (with compatible fonts) and to let the system place its U+002D. The letter apostrophe U+02BC is indispensable because the glyph of U+0027 is unfit for typography. We are also told that U+0027 is unstable, but this is mainly due to the autocorrect smart quotes, which can be turned off at input. I use the autocorrect from now on to convert U+0027 to U+02BC. Another difference between apostrophes and hyphens, and perhaps the main difference, is that except if they are used for word break, hyphens generally don?t need to be replaced at further stages. At input, the user will replace U+002D with U+2212 where appropriate, and the autocorrect may replace two hyphens with an en dash U+2013. In some fonts, U+002D will need to be replaced with U+2010 for glyphic reasons. 
By contrast, quotes are to be converted, Ted?Clancy points out in his paper. Ambiguating one of them with the apostrophe was a very bad idea. Well, I still believe it was *not* the idea of any Unicode Committee, nor of any Standards Body at all. Marcel -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 02:17:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 09:17:32 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: Message-ID: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> ?On Sat, Jun 13, 2015, Mark Davis wrote: > In particular, I see no need to change our recommendation on the character used > in contractions for English and many other languages (U+2019). Similarly, we wouldn't > recommend use of anything but the colon for marking abbreviations in Swedish, or > propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". > (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. ? Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. ? Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. ? ? Marcel Schneider ? ? > Message du 13/06/15 17:36 > De : "Mark Davis ??" > A : "Peter Constable" > Copie ? : "verdy_p at wanadoo.fr" , "Kalvesmaki, Joel" , "Unicode Mailing List" > Objet : Re: Another take on the English apostrophe in Unicode > > > On Sat, Jun 13, 2015 at 5:10 PM, Peter Constable wrote: > When it comes to orthography, the notion of what comprise words of a language is generally pure convention. That?s because there isn?t any single _linguistic_ definition of word that gives the same answer when phonological vs. morphological or syntactic criteria are applied. There are book-length works on just this topic, such as this: ? > In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". > (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.) ? > Mark > ? Il meglio ? l?inimico del bene ? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Mon Jun 15 02:44:22 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Mon, 15 Jun 2015 08:44:22 +0100 (BST) Subject: Accessing the WG2 document register Message-ID: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. 
I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. William Overington 15 June 2015 ----Original message---- >From : pandey at umich.edu Date : 10/06/2015 - 11:01 (GMTST) To : babelstone at gmail.com Cc : unicore at unicode.org, unicode at unicode.org Subject : Re: Accessing the WG2 document register Andrew, Thank you for this detailed investigation. It is truly informative. As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: http://linguistics.berkeley.edu/~pandey/ Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. All the best, Anshuman From mark at macchiato.com Mon Jun 15 03:10:10 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 15 Jun 2015 10:10:10 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: > When we take the topic down again from linguistics to the core mission of > Unicode, that is character encoding and text processing standardisation, > ellipsis and Swedish abbreviation colon differ from the single closing > quotation mark in this, that they are not to be processed. > > > > Linguistics, however, delivered the foundation on which Unicode issued its > first recommendation on what character to use for apostrophe. The result > was neither a matter of opinion, nor of probabilities. > > > > Actually, the choice is between perpetuating confusion in word processing, > and get people confused for a little time when announcing that U+2019 for > apostrophe was a mistake. 
> > ?Quite nice of you to inform me of the core mission of Unicode?I must have somehow missed that. More seriously, it is not all so black and white. As we developed? Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. In practice, whenever characters are essentially identical?and by that I mean that the overlap between the acceptable glyphs for each character is very high?people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. So we only separated essentially identical characters in limited cases: such as letters from different scripts. Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 15 03:11:34 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 10:11:34 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <39908195.1592.1434349426297.JavaMail.www@wwinf2229> References: <39908195.1592.1434349426297.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 8:23 GMT+02:00 Marcel Schneider : > On Fri, Jun 12, 2015, Philippe Verdy wrote: > Even the Language bar uses the upper row to define shortcuts with Control, > Shift+Control, Shift+Alt to switch between keyboard layouts, which are > prioritized. > These are application shortcuts, but these modifier keys combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independantly of the capslock state) with non-control characters if the same keys are used to type non-control ASCII characters in range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcut for application menus or navigation in web forms won't work correctly, or will take the priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs this "safety checks" and will issue warnings if you do so. This is not just "my" advaice but documented in the ISO standard. > So to test the shortcuts with Clavier+, I must first remove shortcuts in > the Language bar. Then the way was free to test Mr Overington?s shortcuts > for curly apostrophes (I will send the result just after). When I deleted > the shortcuts in Clavier+ to test your advice, I found no application > shortcuts for Ctrl+4 while the keys 1, 2, 5 and 0 are usually mapped as > Word shortcut with CONTROL, while the heading formatting is with ALT. But > indeed among ASCII controls I found eight on the French keyboard: > > > //VirtualKey |ScanCd |ISO_# |Ctrl > {VK_ESCAPE /*T01 */ ,0x001b > {VK_CANCEL /*X46 */ ,0x0003 > {VK_BACK /*T0E E13*/ ,0x007f > {VK_OEM_6 /*T1A D11*/ ,0x001b > {VK_OEM_1 /*T1B D12*/ ,0x001d > {VK_OEM_5 /*T2B C12*/ ,0x001c > {VK_RETURN /*T1C C13*/ ,'\n' > {VK_OEM_102 /*T56 B00*/ ,0x001c > > On the alphanumerical block, there are always the same five, three among > them near the Enter key. 
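For readers who wonder why precisely that range matters, here is a small illustrative sketch (my own addition, in Python, not Windows driver code): the traditional convention maps CTRL plus a key producing a character in U+0040..U+005F to the C0 control obtained by clearing bit 6, which is why those 32 positions are customarily kept free of other assignments.

    # Illustrative sketch (Python): the classic CTRL-to-C0 mapping clears
    # bit 6 (0x40) of a character in the range U+0040..U+005F.
    def ctrl(ch):
        code = ord(ch.upper())
        assert 0x40 <= code <= 0x5F, "only @, A-Z, [, \\, ], ^, _ map to C0 controls"
        return chr(code & 0x1F)

    assert ctrl("[") == "\x1b"   # CTRL+[ gives ESC (0x1B)
    assert ctrl("m") == "\r"     # CTRL+M gives CR
    assert ctrl("@") == "\x00"   # CTRL+@ gives NUL, rarely useful in applications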
The British-American Apostrophe key is exempt of > Controls too. This is probably why Mr Overington wants to use CONTROL and > SHIFT+CONTROL for U+2019 and U+02BC, as custom applications shortcuts. > Assigning characters to positions defined for application shortcuts is a bad idea. Keyboard layouts should map characters in positions that are independant of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personnally I have chosen the Application key for this instead of the right control, because the Application key is rarely needed, but I frequently type control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row ("5([", "6-|", "7?`", "8_\", "9?^", "0?@", "?)]"), they are needed to get ASCII controls However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will be almost always filtered out silently, only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim which have a "binary editing" mode that avoids altering the encoding of newlines, but displays all controls explicitly, and that does not limit the "line length". Personally I prefer not using text editors to edit binary files, this is too much unsafe with their "insertion" working mode, it is highly preferable and much simpler to use an hexadecimal editor). This means that CONTROL+"0?@" may be assigned something else more useful (even if the MSKLC compiler warns about it). But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row ("?", "1&", "2?~", "3"#", "4'{" on the left side, and "+=}" on the last position to the right). This means that CONTRL+4 can be safely assigned to U+02BC for the apostrophe letter, but the most common encoding of the French apostrophe is U+2019 (the closing single quote) as French normally does not use single quotation marks, or if it does, it cannot be followed by a letter and cannot be confused with a French apostrophe that is always followed by a letter (or number 1). ---- For now I've not seen any specific need of U+02BC in French (U+2019 is enough, even if it represents two distinct things in French, but in distinct non-colliding contexts). But of course U+02BC is needed for English that needs the distinction with single quotes, because the English apostrophes are used more permissively including at end of words just before a space or punctuation or end of line In French this is not valid to use the apostrophe for elisions at end of words, you need to use instead some abbreviation mark or style.. or no mark at all. ---- The French abbreviation mark can simply be a dot (same as the ASCII full stop punctuation), or writing the last letter in superscript with styles: it is highly recommended not to use any Unicode superscript letters, the only exception being the superscript letter o used to abbreviate "primo" as "1?" or "num?ro" as "n?", but this letter is also missing on standard French keyboards that assign a degree symbol and many French documents are using a degree sign for "n?" and "1?" (however mechanical typewriters assigned a key for typing "N?" 
as a single keystroke (where it was narrower that typing N and degree, and with the letter o generally underlined), it was on the first row, and some PC keyboards are displaying it in the shift position of the first key "?"). Underlining superscripted letters for abbreviations is deprecated in French, except for "N?" where it is still frequently seen. It is no longer recommended to use any dots (or hyphens) for abbreviations (except for abbreviations using only one letter such as "M." for "monsieur") : "S.N.C.F." which was common in the 1960's and 1970's, is now just "SNCF" (and the capitalization of non-initial letters is dropped if this becomes an acronym as in "Insee", which was the ugly "I.N.S.E.E." or "I.N.S.?.?."in the 1960's; some people want also the restoration of accents when decapitalizing acronyms, so they write "Ins??"; and they also want accents on capitalized letters of non-acronym abbreviations such as "?AU" for the Arab Emirates in order to avoid the confusion with "EAU", the capitalization of the French word meaning water; some old abbreviations like "?.-U." for the English "U.S." are no longer used, it would become "?U" with the new rule and would be too much confusable with the European Union: instead we use now "US" or "USA" that have been lexicalized since long, and preferably "UE" for the European Union, but "EU" is still very common). ---- The remaining cases in French are then just the elision apostrophe which only occurs between two letters, and U+2019 is now its most common encoding, generated by spell checkers (if this is not the ASCII single quote). U+02BC cannot be found anywhere (it won't make any semantic difference though and if ever spell checkers change their autocorrector to use U+02BC, no French user will really complain, provided that it is supported in the same fonts mapping U+2019; Winword knows which fonts it is using so it should not be a problem, but it should be simple to patch the spell checker so that it will accept U+02BC or U+2019 as equivalent in French to avoid unnecessary warnings, and then suggest U+02BC instead of U+2019 to replace the ASCII quote). Unfortunately, spell checkers in web browsers are still ignoring both U+2019 and U+02BC (e.g. Chrome, IE, Firefox... and in all Android IMEs that only propose the ASCII quote in their visual layouts... I don't know what Safari does on MacOS): they still only recognize the ASCII vertical quote, and incorrectly signal an "error" in the text editor (with red wavy underlining ? which is also unnecessarily warning us almost everywhere in a way that cannot be disabled when entering texts in another language that the default locale set in the Browser, and when there's no locale selector for this spell checker enabled by default). -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 03:29:41 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 10:29:41 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> Message-ID: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> ?On Mon, Jun 15, 2015, William_J_G Overington wrote: ?> I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. 
> > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > ... > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > I'm shocked that there is still any discrimination, even against individuals, in ISO, and worse, that such discrimination has been newly introduced. ? This makes me remember the idea I got about ISO when I considered the ISO/IEC 9995 standard. This standard specifies that on all keyboards, there should be a so-called common secondary group, and that this secondary group should contain all the characters that are on the keyboard but aren't for a so-called strictly national use.? This sounds to me as if it were fascistic or neofascistic. The way this secondary group is accessed seems rather complicated and been engineered in disconnect from actual OSs and keyboard drivers. The result was that when it went on to be implemented on Windows, the secondary group was not accessed like specified but as Kana levels, which is very consistent with a real keyboard. But in the meantime, this ISO/IEC 9995 standard wastes a whole shift state by excluding it simply from use, on the pretext that you need to press more than two keys: Shift + AltGr + another key. This restriction to a maximum number of two simultaneously pressed keys was so fancy Microsoft didn't bother about. Really, to enter a character from the second level of the secondary group, you need to press Shift + Kana + another key.? That's all OK, but the ISO/IEC 9995 standard is *not*. ? I won't repeat what I already wrote on this List. Sincerely I thought that the International Association for Standardization is today a real international organization which cares for all nations on the earth, whether the proposals come from individuals or collectivities. I dimly recall that in the nineties, ISO was even likely to refuse demands made by its own national members. Reports and results showed that it even dit not consult anybody of the nations it was encoding the characters of, except a few people who were not always reliable, ISO 8859-1 showed. ? To read such things today makes me furious again. I personally wish that you, Mr Pandey, Mr West and Mr Overington, be fully heard at ISO and that *all* proposals are treated equally, fully, and successfully. What are we going to do? What are you going to do? I repeat, I'm shocked, and I hate ISO again. ? ? Best regards, Marcel Schneider ? > Message du 15/06/15 09:53 > De : "William_J_G Overington" > A : pandey at umich.edu, unicode at unicode.org, babelstone at gmail.com > Copie ? 
: > Objet : Re: Accessing the WG2 document register > > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > > Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. > > There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. > > I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. > > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > > William Overington > > 15 June 2015 > > > > ----Original message---- > From : pandey at umich.edu > Date : 10/06/2015 - 11:01 (GMTST) > To : babelstone at gmail.com > Cc : unicore at unicode.org, unicode at unicode.org > Subject : Re: Accessing the WG2 document register > > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 04:49:59 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 11:49:59 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: <147344343.5449.1434361799774.JavaMail.www@wwinf2229> On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis ?? 
wrote: > On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote: >> When we take the topic down again from linguistics to the core mission of Unicode, that is character encoding and text processing standardisation, ellipsis and Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed. >> Linguistics, however, delivered the foundation on which Unicode issued its first recommendation on what character to use for apostrophe. The result was neither a matter of opinion, nor of probabilities. >> Actually, the choice is between perpetuating confusion in word processing, and get people confused for a little time when announcing that U+2019 for apostrophe was a mistake. > Quite nice of you to inform me of the core mission of Unicode?I must have somehow missed that. > More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, eg, an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIARASIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits. >In practice, whenever characters are essentially identical?and by that I mean that the overlap between the acceptable glyphs for each character is very high?people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes. >So we only separated essentially identical characters in limited cases: such as letters from different scripts. ? It was a very good idea to disambiguate also apostrophe and single quote, and I feel it's not paid too much because it simplified greatly the processing of quotation marks in English. I mean, the replacement of each pair of one kind by a pair of another kind. When I search for quotes in a text, I don't want to be distracted by apostrophes. Don't worry about equivalence classes, they already present to us a word without apostrophe as equivalent to the same letters with an apostrophe/quote between. It's every time better the computer knows what a character is exactly, even when at output it doesn't need to let us know, than that it comes up with a useless mixup. ? You just brought up another good idea too: Period-terminated abbreviations are listed as exceptions in word processors. Another list could contain all words with leading apostrophe and all words with trailing apostrophe. This might allow to filter search results and to separate definitely apostrophes and single comma quotation marks. And at input, the smart quotes algorithms will become even smarter. Say, really smart. ? I don't believe working people would mix up letter apostrophe and close-quote if they were on keyboard. And even now that they aren't, people don't, because people just hit the apostrophe key, which without any dumb smart quotes algorithm leads always to visually satisfying results, as shown in the Unicode documentation. For good desktop publishing, people must work hard anyway, so it would be nice to give them the means, and not to overburden them with routine tasks due to deficient text encoding. ? The way things are working today is not satisfying concerning the English apostrophe. I still can't believe that the Unicode Committees were wrong when recommending U+02BC. 
Restoring this advantage today, will be at the honor of all involved parties, and we and future generations will thank you very much. ? If they'll exist. ? Best regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From jsbien at mimuw.edu.pl Mon Jun 15 05:38:25 2015 From: jsbien at mimuw.edu.pl (Janusz S. Bien) Date: Mon, 15 Jun 2015 12:38:25 +0200 Subject: ISO (was Re: Accessing the WG2 document register) In-Reply-To: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> Message-ID: <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> Quote/Cytat - Marcel Schneider (Mon 15 Jun 2015 10:29:41 AM CEST): > What are we going to do? What are you going to do? I repeat, I'm > shocked, and I hate ISO again. Please remember that your government supports ISO through your national standard body. So contact AFNOR and persuade them to take an appropriate action. Good luck! Janusz -- Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej) Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ From charupdate at orange.fr Mon Jun 15 06:50:03 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 13:50:03 +0200 (CEST) Subject: ISO (was Re: Accessing the WG2 document register) In-Reply-To: <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> <20150615123825.24207jh3dsvfbbq9@mail.mimuw.edu.pl> Message-ID: <1117500033.7189.1434369003403.JavaMail.www@wwinf2229> Thank you. That's done. ? I'd finished by thinking seriously that today, the ISO'd improved itself. The case of how Mr Anshuman Pandey is treated by ISO proves that it did not. This sheltered documents access policy and practice makes ISO appear like sheer moonless night, I experienced myself. And as there is no transparency, you even don't know what's about. ? I hope that Mr Pandey's work will be fully honored and be taken into account. Well, I don't understand much of these processes, but it's clear to me since a pretty long time that there's a problem with ISO somewhat. ? Best regards, Marcel > Message du 15/06/15 12:51 > De : "Janusz S. Bien" > A : "Marcel Schneider" > Copie ? : unicode at unicode.org > Objet : ISO (was Re: Accessing the WG2 document register) > > Quote/Cytat - Marcel Schneider (Mon 15 Jun 2015 > 10:29:41 AM CEST): > > > > What are we going to do? What are you going to do? I repeat, I'm > > shocked, and I hate ISO again. > > Please remember that your government supports ISO through your > national standard body. So contact AFNOR and persuade them to take an > appropriate action. > > Good luck! > > Janusz > > -- > Prof. dr hab. Janusz S. Bie? - Uniwersytet Warszawski (Katedra > Lingwistyki Formalnej) > Prof. Janusz S. Bie? - University of Warsaw (Formal Linguistics Department) > jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/ > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Mon Jun 15 08:19:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 15:19:26 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode Message-ID: <1921702526.8879.1434374366809.JavaMail.www@wwinf2229> On Tue Mar 26 2002 - 10:01:43 EST, Mark Davis ?? wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0598.html > Apostrophe, hyphen, and various other puncutation by default continue > a word, but this behavior may be overriden on a per-language basis. > Heuristics or more sophisticated engines may be needed when the > apostrophe is at the end of a word, as in ?the peoples' choice?, since > it is ambiguous. The modifier letter apostrophe, on the other hand, is > always treated as a letter. ? [I replaced '<' '>' with '?' '?' to prevent confusion with a tag by the user agent.] ? On Tue Mar 26 2002 - 11:44:28 EST, Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0604.html ? > Mark Davis wrote: >> Apostrophe, hyphen, and various other puncutation by default continue >> a word, but this behavior may be overriden on a per-language basis. > This may work for things such as finding word boundaries, but not for > identifiers. > According to the ID_Start and ID_Continue properties in > , neither > U+0027 (APOSTROPHE) nor U+2019 (RIGHT SINGLE QUOTATION MARK) are allowed in > an identifier. And this is not surprising, since they are primarily > quotation marks. > On the other hand, U+02BC (MODIFIER LETTER APOSTROPHE) is allowed in any > position within an identifier. Using U+02BC as the apostrophe, would allow > to use words such as: , or <'em> in identifiers. > But this hits against the fact that Unicode's own suggestion is to use > U+2019 for the apostrophe. ? On Tue Mar 26 2002 - 12:08:41 EST , Marco Cimarosti wrote: http://www.unicode.org/mail-arch/unicode-ml/y2002-m03/0608.html > But, as you say, the apostrophe is legitimate and sometimes mandatory in the > orthography of English and many other languages. So, it seems to me that its > preferred encoding should make it possible to use it in identifiers, > filenames, URI(')s, and so on. ? ? Don't we fall back into the times of all-0x27 and stay in front of on-going confusion when English apostrophe is ambiguated with closing-quote? As you told us, having both U+02BC and U+2019 in use will need some supplemental algorithms. But as you told in 2002, this is true when both are confused in only one character, too. ? I suspect that the cost of using MODIFIER LETTER APOSTROPHE for English apostrophe (and as apostrophe on the whole) today would mainly be the cost of updating implementations and text files. If this cost is too high, we would have to consider that text has not to be quoted nor to be converted between British and US English. I hope people will stay communicating and exchanging. ? Marcel Schneider ? ? ? ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From qsjn4ukr at gmail.com Mon Jun 15 08:20:05 2015 From: qsjn4ukr at gmail.com (QSJN 4 UKR) Date: Mon, 15 Jun 2015 16:20:05 +0300 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: By the way, about smart quotes. I am using that for long time. My keyboard layout generates two characters on one key-press (so I have to enter [??][?]{sth}[?] instead of [?]{sth}[?]). 
It's not that good, but I'm not afraid neither to lose quotation marks or parentheses nor become a victim of artificial intelligence :) About what is one word. Do you know the German prefixes? "... ... macht ... ... ... ... ... ... auf". Let me ask if double-quotes are parts of word or not? For example, in this sentence "not" is a noun, not particle? Was "Titanic" titanic? From verdy_p at wanadoo.fr Mon Jun 15 09:00:51 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 16:00:51 +0200 Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 15:20 GMT+02:00 QSJN 4 UKR : > By the way, about smart quotes. I am using that for long time. My > keyboard layout generates two characters on one key-press (so I have > to enter [??][?]{sth}[?] instead of [?]{sth}[?]). It's not that good, > You could generate three keystrokes [?][?][?] from a single keypress to get the same effect. Various editors already do that when you press the first key for the opening quote, and all you have to type then is the [?] key (instead of the key for a closing quote) after typing the word. Such system is used in many IDE or text editors for programmers when they enter the opening parenthese, or square bracket, or single/double quotes, or braces, or block comment prefixes, or any paired symbols or keywords used in the programming language (e.g. "begin | end" in Pascal, "#if |\n#endif" in C/C++ preprocessor directives : the pipe here notes the position of the cursor after typing what is just before it, what is after the pipe is inserted after the cursor position). If you disagree with those automatic insertions after the cursor, you can immediately press CTRL+Z to cancel this added suffix but keep what you just entered. another CTRL+Z will undo your previous keypress(es) for the character(s) just before the cursor position. Some editors are even smarter before the cursor position is not just a single position but a selected range and as long as you continue typing just before this range, the selection is preserved, and when you press [?] it will skip over this whole selection and you an also press then the backspace key to delete that autoinserted selected range. If you move your cursor elsewhere, the selection is unselected and you get back to the normal insertion cursor with an empty selection. Such system is used for example in Notepad++ (for Windows), or Eclipse (you can disable this automatic insertion in your preferences). This editor feature does not depend on the character layout but depends on the selected language for matching pairs: it does not have to be limited to programming languages and can be used as well for natural human languages, including in advanced word processors. It can also be used to insert automatically some additional space when you just press an initial quote: entering only [?] when editing French text, what you would get is [?][NNBSP]|[NNBSP][?] (with the cursor selection over the last two characters). These editors normally have a way to edit their automatic insertion rules (with the text to match before, the text to add jut after it, the new cursor position, and the text to insert just after it (and to hopefully preselect in such a way that when continuing entering text without moving the insertion position, it is not overwritten but just preseves this selected text). Such rules can be part of the parameters for the spell checker. 
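As a rough illustration of the auto-pairing behaviour described above, here is a minimal Python sketch. The rule table, the NNBSP padding for French guillemets and the function names are illustrative assumptions, not the configuration of any real editor:

    # Pressing an opening character inserts its closing counterpart after the caret;
    # French guillemets also receive NARROW NO-BREAK SPACE (U+202F) padding.
    NNBSP = "\u202F"

    # typed character -> (text placed before the caret, text placed after the caret)
    AUTO_PAIR_RULES = {
        "(": ("(", ")"),
        "[": ("[", "]"),
        "{": ("{", "}"),
        "\u201C": ("\u201C", "\u201D"),                   # English double quotes
        "\u00AB": ("\u00AB" + NNBSP, NNBSP + "\u00BB"),   # French « ... » with padding
    }

    def auto_pair(buffer, caret, typed):
        """Insert `typed` at `caret`, applying the auto-pairing rules above."""
        before, after = AUTO_PAIR_RULES.get(typed, (typed, ""))
        new_buffer = buffer[:caret] + before + after + buffer[caret:]
        return new_buffer, caret + len(before)   # the caret ends up between the pair

    text, caret = auto_pair("", 0, "\u00AB")
    print(repr(text), caret)   # '«\u202f\u202f»' 2

A real editor would additionally keep the auto-inserted suffix selected, so that it can be skipped over or undone, as described above.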
-------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Mon Jun 15 09:49:45 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 16:49:45 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> On Fri, Jun 12, 2015, Philippe Verdy wrote: > These are application shortcuts, but these modifier keys combinations are used with base function keys (F1...F12), not with keys on the alphanumeric parts of the keyboard. So there's no conflict. Thank you for your advice. It'll be very useful. I was not precise enough, the upper row of the alphanumerical block is used with Ctrl, Shift+Ctrl, Shift+Alt by the language bar but optionally only. > It is normal then to not assign CTRL+keys or CONTROL+shift+keys (independantly of the capslock state) with non-control characters if the same keys are used to type non-control ASCII characters in range U+0040..U+005F. This means that 32 positions on the keyboard must not be used for any assignment. > The same remark applies to ALT+digit and ALT+letter (otherwise keyboard shortcut for application menus or navigation in web forms won't work correctly, or will take the priority when you intended to type a valid character, forcing these application functions instead of accepting your character input). MSKLC performs this "safety checks" and will issue warnings if you do so. The Alt shift state is unassignable in the MSKLC. When used for shortcuts with Clavier+, these are prioritized and work fine. > This is not just "my" advaice but documented in the ISO standard. That depends on which ISO Standard you refer to. If it's ISO/IEC 9995, then beware! IMHO this standard isn't to be taken seriously, otherwise you'll have to stay away from using the Shift + AltGr shift state, to take just one outstanding example. > Assigning characters to positions defined for application shortcuts is a bad idea. Keyboard layouts should map characters in positions that are independant of applications (but layouts may be specific to an OS if the OS interface defines some standard shortcuts: this is a problem when using virtualized OSes, as there's a conflict with shortcuts used to switch from the guest to the host: personnally I have chosen the Application key for this instead of the right control, because the Application key is rarely needed, but I frequently type control with the right hand or two hands, notably CTRL+A, CTRL+C, CTRL+X, CTRL+V). It's indeed very useful to keep two Control modifiers. Because the modifiers at the left and right border of the block are acted with the little finger and should thus be symetrical. This does not apply to the Alt keys and other keys more or less centered around the space bar, which are acted with the thumbs. As Alt is less used than Kana (when there is a Kana key), Kana should be on left Alt, symetrical to the (on many keyboards already implemented) AltGr key. The Alt key comes then on the Applications key, which is mnemonic because of the contextual menu icon. Internally, indeed, the Alt keys (left and right) are called Menu keys (Virtual key Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked pressing the right Windows key, which is consistently missing on laptops. 
Laptops must however have an Applications key to prevent the AltGr key from being positioned too far rightwards, beside of a space bar too long, because this hardware layout has some negative impact on ergonomics, specialists say. On the US keyboard layout at http://charupdate.info however, Applications is a Kana toggle, while Right Windows is a Compose key. For laptops this shifts rightwards to get Compose on Applications, and Kana toggle on, well, Right Control. Because there are laptops with nothing between Right Alt and Right Control, so I even thought at mapping the Kana toggle on Pause, but this turned out to be buggy, besides that keyboards without Applications (Menu) often are lacking the Pause key too. > On the French keyboard, CONTROL and SHIFT+CONTROL must be reserved on 7 successive keys of the first row ("5([", "6-|", "7?`", "8_\", "9?^", "0?@", "?)]"), they are needed to get ASCII controls > However CONTROL+@ is extremely rarely needed in applications to enter a NULL control that will be almost always filtered out silently, only some editors that allow loading and editing binary files will use it, e.g. Emacs or Vim which have a "binary editing" mode that avoids altering the encoding of newlines, but displays all controls explicitly, and that does not limit the "line length". Personally I prefer not using text editors to edit binary files, this is too much unsafe with their "insertion" working mode, it is highly preferable and much simpler to use an hexadecimal editor). > This means that CONTROL+"0?@" may be assigned something else more useful (even if the MSKLC compiler warns about it). > But you can assign characters with CONTROL and CONTROL+SHIFT for the 6 other keys of the first row ("?", "1&", "2?~", "3"#", "4'{" on the left side, and "+=}" on the last position to the right). I ended up assigning no characters on Control shift states at all any more. To get the most of a keyboard, the best is to use the Kana shift states. Their disadvantage is that the Caps Lock never can act on them. At least for me. Perhaps somebody can program a driver where it does? That would mean one should add some new attributes. BTW there are still unknown entities, like the mysterious GRPSELTAP. > This means that CONTRL+4 can be safely assigned to U+02BC for the apostrophe letter, but the most common encoding of the French apostrophe is U+2019 (the closing single quote) as French normally does not use single quotation marks, or if it does, it cannot be followed by a letter and cannot be confused with a French apostrophe that is always followed by a letter (or number 1). In German even less, where the single close-quote is the English open-quote, and the single open-quote looks like a comma. However, for quotations and nested quotations, the use of chevrons (angle quotation marks) is widespread. So you have U+2019 never mean anything else than an apostrophe. The problem of shortcuts is their relative clumsiness, that is, for an apostrophe I'd prefer to hit just two keys than to press Control. Ctrl + 4 would be less ergonomical for apostrophe than to have the apostrophe on Shift, which on certain keyboards lead to typos already. We must put much more into our dead key registries. U+02BC is an example of what to add on the CIRCUMFLEX dead key. ---- > For now I've not seen any specific need of U+02BC in French (U+2019 is enough, even if it represents two distinct things in French, but in distinct non-colliding contexts). 
> But of course U+02BC is needed for English that needs the distinction with single quotes, because the English apostrophes are used more permissively including at end of words just before a space or punctuation or end of line > In French this is not valid to use the apostrophe for elisions at end of words, you need to use instead some abbreviation mark or style.. or no mark at all. This is why in French there's no Apostrophe Catastrophe. Should we rely on this chance? IMO, no. Because this would lead us to: ? Avoid single quotation marks, which are very nice and useful as delimiters in texts for publishing, where U+0027 would look clumsy. ? Stay moving apostrophes to ?secure? places instead of putting them properly at the beginning, like in _?Y a_. ---- > The French abbreviation mark can simply be a dot (same as the ASCII full stop punctuation), or writing the last letter in superscript with styles: it is highly recommended not to use any Unicode superscript letters, the only exception being the superscript letter o used to abbreviate "primo" as "1?" or "num?ro" as "n?", but this letter is also missing on standard French keyboards that assign a degree symbol and many French documents are using a degree sign for "n?" and "1?" (however mechanical typewriters assigned a key for typing "N?" as a single keystroke (where it was narrower that typing N and degree, and with the letter o generally underlined), it was on the first row, and some PC keyboards are displaying it in the shift position of the first key "?"). Underlining superscripted letters for abbreviations is deprecated in French, except for "N?" where it is still frequently seen. > It is no longer recommended to use any dots (or hyphens) for abbreviations (except for abbreviations using only one letter such as "M." for "monsieur") : "S.N.C.F." which was common in the 1960's and 1970's, is now just "SNCF" (and the capitalization of non-initial letters is dropped if this becomes an acronym as in "Insee", which was the ugly "I.N.S.E.E." or "I.N.S.?.?."in the 1960's; some people want also the restoration of accents when decapitalizing acronyms, so they write "Ins??"; and they also want accents on capitalized letters of non-acronym abbreviations such as "?AU" for the Arab Emirates in order to avoid the confusion with "EAU", the capitalization of the French word meaning water; some old abbreviations like "?.-U." for the English "U.S." are no longer used, it would become "?U" with the new rule and would be too much confusable with the European Union: instead we use now "US" or "USA" that have been lexicalized since long, and preferably "UE" for the European Union, but "EU" is still very common). ---- > The remaining cases in French are then just the elision apostrophe which only occurs between two letters, and U+2019 is now its most common encoding, generated by spell checkers (if this is not the ASCII single quote). U+02BC cannot be found anywhere (it won't make any semantic difference though and if ever spell checkers change their autocorrector to use U+02BC, no French user will really complain, provided that it is supported in the same fonts mapping U+2019; Winword knows which fonts it is using so it should not be a problem, but it should be simple to patch the spell checker so that it will accept U+02BC or U+2019 as equivalent in French to avoid unnecessary warnings, and then suggest U+02BC instead of U+2019 to replace the ASCII quote). > Unfortunately, spell checkers in web browsers are still ignoring both U+2019 and U+02BC (e.g. 
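The positional difference described above (in French the elision apostrophe only ever stands between two letters, while English also allows a word-final one, as in "the peoples' choice") can be sketched with two toy regular expressions in Python. This is only an illustration, not a UAX #29 word segmenter; APOS accepts U+0027, U+2019 or U+02BC so that any of the three spellings matches:

    import re

    APOS = "['\u2019\u02BC]"   # ' (U+0027), ’ (U+2019), ʼ (U+02BC)

    # French: an apostrophe must be flanked by letters on both sides.
    FRENCH_WORD = re.compile(rf"[^\W\d_]+(?:{APOS}[^\W\d_]+)*")
    # English: additionally tolerate a word-final apostrophe.
    ENGLISH_WORD = re.compile(rf"[^\W\d_]+(?:{APOS}[^\W\d_]+)*{APOS}?")

    print(FRENCH_WORD.findall("l\u2019apostrophe de l\u2019été"))
    # ['l’apostrophe', 'de', 'l’été']
    print(ENGLISH_WORD.findall("the peoples\u2019 choice"))
    # ['the', 'peoples’', 'choice']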
Chrome, IE, Firefox... and in all Android IMEs that only propose the ASCII quote in their visual layouts... I don't know what Safari does on MacOS): they still only recognize the ASCII vertical quote, and incorrectly signal an "error" in the text editor (with red wavy underlining ? which is also unnecessarily warning us almost everywhere in a way that cannot be disabled when entering texts in another language that the default locale set in the Browser, and when there's no locale selector for this spell checker enabled by default). I agree. These spell-checkers bug me more than anything, even if they're useful. Yes, it should be simple. Thanks again for this useful advice. Sorry, sometimes I shifted somewhat off the topic :) -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 10:12:59 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 08:12:59 -0700 Subject: Another take on the English Apostrophe in Unicode Message-ID: <20150615081259.665a7a7059d7ee80bb4d670165c8327d.2b2882039d.wbe@email03.secureserver.net> Marcel Schneider wrote: > A free tool, the Microsoft Keyboard Layout Creator, allows every user > to add U+02BC on his preferred keyboard layout I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion: http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways. Elsewhere: > ISO stands for stability We wish. Several of us on this list have worked on standards and standard-like activities that correct for, and defend against, instability in ISO standards. > Microsoft?s choice of mashing up apostrophe and close-quote to end up > with an unprocessable hybrid was wrong. Very wrong. Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From petercon at microsoft.com Mon Jun 15 10:18:47 2015 From: petercon at microsoft.com (Peter Constable) Date: Mon, 15 Jun 2015 15:18:47 +0000 Subject: Accessing the WG2 document register In-Reply-To: <784833623.3822.1434356981870.JavaMail.www@wwinf2229> References: <26278440.5522.1434354262207.JavaMail.defaultUser@defaultHost> <784833623.3822.1434356981870.JavaMail.www@wwinf2229> Message-ID: I suggest that people on this list that have not personally engaged directly in ISO process via their country?s designated standards bodies should stop opining and editorializing on that body. ISO isn?t perfect by any means, but in the many years I have been directly involved in ISO process I can?t say I?ve ever seen discrimination other than appropriate discrimination of ideas on technical merits. 
Peter From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Marcel Schneider Sent: Monday, June 15, 2015 1:30 AM To: wjgo_10009 at btinternet.com; pandey at umich.edu; unicode at unicode.org; babelstone at gmail.com Subject: Re: Accessing the WG2 document register On Mon, Jun 15, 2015, William_J_G Overington > wrote: > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > ... > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > I'm shocked that there is still any discrimination, even against individuals, in ISO, and worse, that such discrimination has been newly introduced. This makes me remember the idea I got about ISO when I considered the ISO/IEC 9995 standard. This standard specifies that on all keyboards, there should be a so-called common secondary group, and that this secondary group should contain all the characters that are on the keyboard but aren't for a so-called strictly national use. This sounds to me as if it were fascistic or neofascistic. The way this secondary group is accessed seems rather complicated and been engineered in disconnect from actual OSs and keyboard drivers. The result was that when it went on to be implemented on Windows, the secondary group was not accessed like specified but as Kana levels, which is very consistent with a real keyboard. But in the meantime, this ISO/IEC 9995 standard wastes a whole shift state by excluding it simply from use, on the pretext that you need to press more than two keys: Shift + AltGr + another key. This restriction to a maximum number of two simultaneously pressed keys was so fancy Microsoft didn't bother about. Really, to enter a character from the second level of the secondary group, you need to press Shift + Kana + another key. That's all OK, but the ISO/IEC 9995 standard is *not*. I won't repeat what I already wrote on this List. Sincerely I thought that the International Association for Standardization is today a real international organization which cares for all nations on the earth, whether the proposals come from individuals or collectivities. I dimly recall that in the nineties, ISO was even likely to refuse demands made by its own national members. Reports and results showed that it even dit not consult anybody of the nations it was encoding the characters of, except a few people who were not always reliable, ISO 8859-1 showed. 
To read such things today makes me furious again. I personally wish that you, Mr Pandey, Mr West and Mr Overington, be fully heard at ISO and that *all* proposals are treated equally, fully, and successfully. What are we going to do? What are you going to do? I repeat, I'm shocked, and I hate ISO again. Best regards, Marcel Schneider > Message du 15/06/15 09:53 > De : "William_J_G Overington" > > A : pandey at umich.edu, unicode at unicode.org, babelstone at gmail.com > Copie ? : > Objet : Re: Accessing the WG2 document register > > I have been thinking about the current discussion in the Unicode mailing list about a particular ISO committee no longer being allowed to accept proposal documents from individuals, because of a rule change from a higher level within ISO. > > I am thinking of how the committee meetings might be different from how they would be if the rules had not been changed and what might not get encoded that might have been encoded had the rule change not happened. > > In the short term, the individual contributor is hurt, yet in the long term the document encoding process is hurt and the whole world of information technology may be hurt as potentially good content has been ignored due to discrimination, and a standards document produced that is not as good as it could have been had there not been the discrimination. > > Thinking of this I remembered that some years ago, possibly on Channel 4 television news in the UK, there was an item about a lady who had that year won the Nobel Prize for Literature. I am trying to trace who it was and a particular work by her, thus far without success. > > There was a work, either a poem or a narrative, about what happened differently at a railway station because she was not there as a passenger that day, as to how what happened was different from what would have happened had she been there. > > I cannot be sure but I think that Hungary came into it somewhere, either as a Hungarian lady or a Hungarian railway station. > > I opine that it is important when deciding what will be considered for encoding that there is no discrimination about considering encoding proposals. Not only does ignoring contributions cause immediate problems but also there can be second order effects and so on as potential later contributions will not be made as they will not have the original contribution to build upon, and many people may not even realize that the second order effects have taken place. > > William Overington > > 15 June 2015 > > > > ----Original message---- > From : pandey at umich.edu > Date : 10/06/2015 - 11:01 (GMTST) > To : babelstone at gmail.com > Cc : unicore at unicode.org, unicode at unicode.org > Subject : Re: Accessing the WG2 document register > > Andrew, > > Thank you for this detailed investigation. It is truly informative. > > As I am considered an ineligible contributor by ISO, um, standards, I hereby withdraw all of my contributions to Unicode, and reflexively to ISO 10646. A list of the contributions that I withdraw is given at: > > http://linguistics.berkeley.edu/~pandey/ > > Whoever has the task of coordinating with ISO, is that you Michel?, please withdraw all of my contributions. > > All the best, > Anshuman > > > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Mon Jun 15 10:28:58 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 15 Jun 2015 17:28:58 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> References: <309976106.10807.1434379785899.JavaMail.www@wwinf2229> Message-ID: 2015-06-15 16:49 GMT+02:00 Marcel Schneider : > It's indeed very useful to keep two Control modifiers. Because the > modifiers at the left and right border of the block are acted with the > little finger and should thus be symetrical. This does not apply to the Alt > keys and other keys more or less centered around the space bar, which are > acted with the thumbs. As Alt is less used than Kana (when there is a Kana > key), Kana should be on left Alt, symetrical to the (on many keyboards > already implemented) AltGr key. The Alt key comes then on the Applications > key, which is mnemonic because of the contextual menu icon. Internally, > indeed, the Alt keys (left and right) are called Menu keys (Virtual key > Left Menu or VK_LMENU, and VK_RMENU). This contextual menu is then invoked > pressing the right Windows key, which is consistently missing on laptops. > Not just laptops. My desktop PC only has a single Windows key, on the left. Anyway there's little use of the Windows key that was introduced lately (and there are still lot of keyboards that don't have this key). The same remark applies to the ScrollLock key (which is now frequently remapped to Fn+Pause/SysAttn or other similar combination using the single Windows key when there's no Fn key which is typical of notebooks). However I disagree with your opinion about AltGr+Shift combinations: it works perfectly including with the ISO 9995 definitions: the unshifted and shifted position are in the same "group". However ISO 9995 allows CapsLock to be used to create other groups instead of just reproducing the shifted/unshifted layout. It can be very useful for users in India to switch between Latin and local abugidas. It could be used as well by users writing in Arabic and Hebrew abjads, or with African (Ethiopic) or North-American syllabary scripts that are complex to map on a usable keyboard. But I think that keyboard should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two windows keys and so the space bar can remain with a cumfortable width (as well for the Shift key or Backspace which is too narrow on many keyboards). On the last row therre should never be more than 7 keys on both sides of the space bar, and the most external keys (Ctrl) have to remain wide). If a Kana key or present, in fact it should be to the right of the right control, or ro the right of the right Shift AltGr needs to keep some width extension compared to letter keys, and in fact could be larger than the left Alt, because it is used for entering text. The Application key is too large for me, just like the left Windows key (its extra width should be better given to the left Control key to make it a bit more central). Those that design keyboard almost never test them for real usability: they prefer slling them with many packed multimedia functions (or buttons for Calc, Mail, Web or swtiching windows, and that are rarely used). Only keyboards for gamers have some attention, but only to give them additional programmable function keys for specific games... Keyboards on notebooks are extremely poorly designed, a complete nonsense. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 10:46:02 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 08:46:02 -0700 Subject: Accessing the WG2 document register Message-ID: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> Marcel Schneider wrote: > This makes me remember the idea I got about ISO when I considered the > ISO/IEC 9995 standard. This standard specifies that on all keyboards, > there should be a so-called common secondary group, and that this > secondary group should contain all the characters that are on the > keyboard but aren't for a so-called strictly national use. This > sounds to me as if it were fascistic or neofascistic. Please read the history of attempts to standardize keyboard layouts across national boundaries. National standard bodies have always insisted on their particular differences in layout (Q/A, W/Z, Y/Z) and convenient access to characters specific to their languages. This is not imposed from the outside. > The way this secondary group is accessed seems rather complicated and > been engineered in disconnect from actual OSs and keyboard drivers. > The result was that when it went on to be implemented on Windows, the > secondary group was not accessed like specified but as Kana levels, ? which is very consistent with a real keyboard. But in the meantime, > this ISO/IEC 9995 standard wastes a whole shift state by excluding it > simply from use, on the pretext that you need to press more than two > keys: Shift + AltGr + another key. This restriction to a maximum > number of two simultaneously pressed keys was so fancy Microsoft > didn't bother about. Really, to enter a character from the second > level of the secondary group, you need to press Shift + Kana + another > key. That's all OK, but the ISO/IEC 9995 standard is *not*. At least it was possible to implement the old ISO 9995-3 standard on Windows, treating Group 2, Levels 1 and 2 as if they were Group 1, Levels 3 and 4 -- in other words, by using AltGr and Shift+AltGr. The new ISO 9995-3 standard isn't implemented anywhere, and can't be as long as no specification exists to access the additional groups and shift states without adding more physical keys. "Figure it out for yourself" is not a specification. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Jun 15 11:11:27 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 18:11:27 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> References: <20150615084602.665a7a7059d7ee80bb4d670165c8327d.2a7bf52e83.wbe@email03.secureserver.net> Message-ID: <1340538145.12811.1434384687773.JavaMail.www@wwinf2229> On Mon, Jun 15, 2015, Doug Ewell wrote: > At least it was possible to implement the old ISO 9995-3 standard on > Windows, treating Group 2, Levels 1 and 2 as if they were Group 1, > Levels 3 and 4 -- in other words, by using AltGr and Shift+AltGr. The US International keyboard layout indeed conforms to ISO/IEC?9995. AFAIK it was preexistent, and was validated for conformance by considering that the AltGr and Shift + AltGr shift states contain the secondary group. I did not think about it as an _implementation_ of ISO/IEC 9995. 
> The new ISO 9995-3 standard isn't implemented anywhere, and can't be as > long as no specification exists to access the additional groups and > shift states without adding more physical keys. "Figure it out for > yourself" is not a specification. The new German standard keyboard layouts T2 and T3 are ISO/IEC 9995. Other national keyboard layouts before them are, too. There is exactly a Group 1 with three levels and a Group 2 with two. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 15 11:28:22 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 15 Jun 2015 09:28:22 -0700 Subject: Accessing the WG2 document register Message-ID: <20150615092822.665a7a7059d7ee80bb4d670165c8327d.ead563bedf.wbe@email03.secureserver.net> Marcel Schneider wrote: > The US International keyboard layout indeed conforms to ISO/IEC 9995. > AFAIK it was preexistent, and was validated for conformance by > considering that the AltGr and Shift + AltGr shift states contain the > secondary group. > I did not think about it as an _implementation_ of ISO/IEC 9995. "ISO/IEC 9995" is a multi-part standard that covers many different aspects of keyboards. US International certainly conforms to many of the parts: ? it has alphanumeric, numeric, and editing zones with keys which can be referenced by "E01" notation, as per 9995-1 ? it has shifting keys which are used to select levels ? the primary layout (Levels 1 and 2) conforms to 9995-2, as does practically any Latin-script keyboard ? it has Escape and cursor keys in conformance with 9995-5 ? and so on. The Level 3 and "Level 4" (Shift+AltGr) allocations of US International do not conform in any way to the common secondary layout of either 9995-3:2002 or 9995-3:2010. For example, there is no ohm sign on US International in any group or level, either at D01 (2002) or D02 (2010). Perhaps we are not talking about the same thing when we say "conforms to ISO/IEC 9995." -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Mon Jun 15 12:38:33 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 15 Jun 2015 19:38:33 +0200 (CEST) Subject: Accessing the WG2 document register Message-ID: <593184955.19635.1434389913782.JavaMail.www@wwinf1n18> On Mon, Jun 15, 2015, 18:36, Doug Ewell wrote: ? > The Level 3 and "Level 4" (Shift+AltGr) allocations of US International > do not conform in any way to the common secondary layout of either > 9995-3:2002 or 9995-3:2010. For example, there is no ohm sign on US > International in any group or level, either at D01 (2002) or D02 (2010). > Perhaps we are not talking about the same thing when we say "conforms to > ISO/IEC 9995." ? I don't measure exactly the implications of a keyboard compliance to a given standard when this standard is developed "on the paper" and without taking into consideration all needs and preferences of end-users. The Ohm sign you mention reminds me that ISO perpetuated on keyboard some deprecated legacy characters that end up anyway to be replaced with their canonical equivalent, that in this example is Greek capital omega. That's? another disconnect. ? And standardizing the dead key registries to exclude all characters that are not composed ones, is a counterproductive constraint based on the belief that the only way to get aware of the content of a layout is to read the keycap labels. This is a way of never getting curly quotes and apostrophe. ? On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote: ? 
> I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100%
> compatible with the AltGr-less US keyboard and supports almost 900 other
> characters, including all of the apostrophes and quotes and dashes and
> other characters under discussion:
> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
> I spent years designing and updating my own keyboard layout and studying
> other layouts. I've ended this quest since I started using Moby Latin;
> it's the best I've seen in numerous ways.

I'm very glad to learn that there is this good keyboard layout for the USA and for the UK, and I very much wonder what is missing for everybody to use it. Thank you very much; I just downloaded the two drivers, and I'm curious about how to map nine hundred characters on two levels without chaining dead keys! Well, I hadn't looked for it, because at the beginning I was searching for a French keyboard.

>> Microsoft's choice of mashing up apostrophe and close-quote to end up
>> with an unprocessable hybrid was wrong. Very wrong.

> Windows-1252 and the other Windows code pages were developed during the
> 1980s, before Unicode, when almost all non-Asian character sets were
> limited to 256 code points. The distinctions between apostrophe and
> right-single-quote, weighed against the confusion caused by encoding two
> identical-looking characters, would never have been sufficient back then
> to justify separate encoding in this limited space.

The problem is not about code pages; it is about keeping them vividly in users' minds and letting them impact the Unicode Standard while Unicode has been around for a quarter of a century.

The amazing chance of being able to disambiguate apostrophe and close-quote was purposely overridden after Unicode had published clearly that U+02BC is the apostrophe. Nothing was simpler than leaving this recommendation as it was and tackling the job of implementing Unicode on Windows, in Microsoft Office, and in the offices. There is so much communication about word processing that there would have been a little place to introduce the difference between an apostrophe and a single closing quotation mark, but instead of that, Microsoft urged Unicode to remove the recommendation and to restore the chaos.

I can't believe that was OK. Never, never.
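As a toy model of the chained dead keys mentioned above (plain Python, not MSKLC syntax; the particular chains, including putting U+02BC on the CIRCUMFLEX dead key as suggested earlier in this thread, are only illustrative assumptions):

    # Each dead key opens another table; chaining tables multiplies the number of
    # characters reachable from two physical levels.
    DEAD_KEYS = {
        "^": {                      # circumflex dead key
            "a": "\u00E2",          # â
            "e": "\u00EA",          # ê
            "'": "\u02BC",          # hypothetical: circumflex, then ', gives U+02BC
            "^": {                  # pressing the dead key again chains a second table
                "a": "\u1EAD",      # ậ, just an example of a deeper target
            },
        },
    }

    def resolve(sequence):
        """Walk a key sequence through the dead-key tables; return None if invalid."""
        node = DEAD_KEYS
        for key in sequence:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                return None
        return node if isinstance(node, str) else None

    print(resolve("^a"), resolve("^'"), resolve("^^a"))   # â ʼ ậ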
Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From doug at ewellic.org Mon Jun 15 13:14:22 2015
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 15 Jun 2015 11:14:22 -0700
Subject: Accessing the WG2 document register
Message-ID: <20150615111422.665a7a7059d7ee80bb4d670165c8327d.42e4fd3395.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> I don't measure exactly the implications of a keyboard compliance to
> a given standard when this standard is developed "on the paper" and
> without taking into consideration all needs and preferences of end-
> users.

ISO did not come up with the 2010 revision to 9995-3 on their own. It originated with the German NB.

> The Ohm sign you mention reminds me that ISO perpetuated on
> keyboard some deprecated legacy characters that end up anyway to be
> replaced with their canonical equivalent, that in this example is
> Greek capital omega. That's another disconnect.

The relationship between U+2126 OHM SIGN and U+03A9 GREEK CAPITAL LETTER OMEGA is not at issue here. Neither of these characters is present on US International.
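For reference, the canonical relationship mentioned in the quoted text can be checked directly in Python:

    import unicodedata

    ohm, omega = "\u2126", "\u03A9"   # OHM SIGN, GREEK CAPITAL LETTER OMEGA

    print(unicodedata.decomposition(ohm))              # '03A9'  (a canonical singleton)
    print(unicodedata.normalize("NFC", ohm) == omega)  # True: NFC folds U+2126 to U+03A9

This is why, as remarked above, an ohm sign entered from a keyboard tends to end up normalized away to the Greek capital omega.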
> And standardizing the dead key registries to exclude all characters > that are not composed ones, is a counterproductive constraint based on > the belief that the only way to get aware of the content of a layout > is to read the keycap labels. This is a way of never getting curly > quotes and apostrophe. Dead keys under Windows are not constrained in the way you describe. As I said earlier today, I use a keyboard on Windows on which all of these characters are available via dead keys: ? ? ? ? ? > I'm very glad to learn there is this good keyboard layout for the USA > and for the UK, and I wonder very much what's missing for everybody to > use it. > Thank you very much, I just downloaded the two drivers and I'm curious > about how to map nine hundred characters on two levels without > chaining dead keys! > Well I didn't look for, because at the beginning I searched for the > French keyboard. Since John made the .klc source file available with the download, I'm sure it would not be too difficult to adapt it to a French-based layout. > The problem is not about code pages, it is about keeping them vividly > in users' minds and letting them impact the Unicode Standard while > since a quarter of a century, Unicode is on. I'd guess there are very few users who consciously see the use of U+2019 as both apostrophe and right-single-quote as a vestige of code pages, or as a conscious effort by Evil Microsoft? to force them into anything. > There's so much communication about word processing, that there would > have been a little place to introduce the difference between an > apostrophe and a single closing quotation mark, but instead of that, > Microsoft urged Unicode to remove the recommendation and to restore > the chaos. Perhaps a UTC member can confirm whether this is fact or speculation. Markus Kuhn's comment from 1999 about "couldn't Unicode follow Microsoft...?" doesn't prove that Unicode was in fact strong-armed by Microsoft. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From charupdate at orange.fr Tue Jun 16 12:02:26 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:02:26 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote: > Marcel Schneider wrote: [...] >> Microsoft?s choice of mashing up apostrophe and close-quote to end up >> with an unprocessable hybrid was wrong. Very wrong. > Windows-1252 and the other Windows code pages were developed during the > 1980s, before Unicode, when almost all non-Asian character sets were > limited to 256 code points. The distinctions between apostrophe and > right-single-quote, weighed against the confusion caused by encoding two > identical-looking characters, would never have been sufficient back then > to justify separate encoding in this limited space. I replied: > The problem is not about code pages [...] I thank you for your answers and I'll come back upon some of them below. There's some new fact to bring first. I concede that my last reply yesterday in the evening was incorrect. Additionally to Microsoft?s action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is in my scope, indeed. 
Since I learned that there are very good and outweighing reasons to use U+02BC in English, and that Unicode's respective recommendation was withdrawn out of regard for a widespread practice founded on the Windows-1252 code page, I soon suspected there would have been means to get the apostrophe into this code page. Here I need to recall that I always liked Windows-1252 for completing the ISO 8859-1 charset (which was so useless* it had to be replaced with ISO 8859-15).

* Please read this paper (in French): http://cahiers.gutenberg.eu.org/cg-bin/article/CG_1996___25_65_0.pdf

Now that I have examined CP1252's layout closely, I found five empty code points, five code points left out, in the C1 range that Microsoft allocated to complete ISO 8859-1. Further, in this range, I found two MODIFIER LETTERS, CIRCUMFLEX ACCENT (136, 0x88, later U+02C6) and SMALL TILDE (152, 0x98, U+02DC). Obviously these two were added to disambiguate the extensively used spacing characters ^ (94, 0x5E) and ~ (126, 0x7E) on one side, and the diacritics on the other side. It must be said that when Windows was first released, the left and right single quotes were the only printable characters in these two ranges. All other characters plus ? and ? came later. However, CP1252 has remained stable since Windows 98, for which the euro sign and the Ž/ž pair were added. And five places were left empty.

From this I got convinced that it would have been very easy to place the letter apostrophe, for example, at code point 144 (0x90), near the single turned comma quotation mark at 0x91 and the single comma quotation mark (right single quote) at 0x92, which Microsoft recommended for use as the apostrophe. About the "confusion" everybody refers to, it must be said that the only way to get people confused is to do things and not explain anything to anybody. The core problem would have been that code pages were designed with glyph-based *character* encoding in mind, not semantics-based *text* encoding.

I repeat that others had done even worse. Others, that is, some of the so-called expert members of the ISO WG designing 8859-1, as two of them did not even aim at encoding all needed characters, deliberately refusing to encode the lower- and uppercase œ digraph, and even the uppercase Ÿ. Microsoft's big merit has been to produce a ready remedy to this bungling, which, as far as the OE digraph is concerned, was meant to match defective peripherals. Unfortunately, Microsoft visibly didn't finish the job, by aiming at encoding characters only, and thus not allocating more than one code point to that squiggle, whilst several places were left. Well, all of that are errors of the past. If I don't see a need, I won't meet it. By leaving ? and ? off the charset, they got ? and ? in, at least.

Where things ran really bad was when Unicode was on, and the Procrustean beds of the code pages were out. At least, they should have been. Whence that survival of CP1252-based confusion? Briefly, today's text processing is suffering from the apostrophe-close-quote confusion. This confusion is firstly out of date, and secondly it was unnecessary from the beginning. Avoiding this confusion at a trivial level (by not confusing users with two similar-looking squiggles) shifts it to the process level, where the damage it causes is far bigger. Trust me, users who find themselves unable to set the apostrophes apart when they are going to replace single quotes won't bless Microsoft for the input simplicity! Ted Clancy's blog post proves it:
https://tedclancy.wordpress.com/2015/06/03/which-unicode-character-should-represent-the-english-apostrophe-and-why-the-unicode-committee-is-very-wrong/

It was time to get rid of that confusion when Unicode recommended U+02BC for the apostrophe. Microsoft's choice not to comply was wrong again. Very wrong.

Let's come back to some of your replies.

On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote:

> I'd guess there are very few users who consciously see the use of U+2019
> as both apostrophe and right-single-quote as a vestige of code pages, or
> as a conscious effort by Evil Microsoft? to force them into anything.

Quite sure. These are habits, not constraints. I don't share such views about a battle between Google and Microsoft, or about ethical prefixes to allocate to companies. The problem is that when the result proves to be bad, so was the idea.

The mismatch between apostrophe and close-quote is now part of our culture. We must get pragmatic again and weigh the advantages and disadvantages of each option (ambiguating, disambiguating), not say "I believe there are no disadvantages in ambiguating" or "there is no reason to disambiguate" or "people will get confused, let them alone" or the like. These are all mere assertions. We must look at real people and listen to what they say to us. Ted Clancy is one of them. When he is worried about that malfunctioning of text processing, who will keep smiling and go on saying "There's no problem, there's no reason to fix that, it's all OK as it is"? That is to despise people; that is to spit in their face.
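The five empty code points mentioned above can be verified with Python's cp1252 codec, which leaves exactly five bytes of the 0x80-0x9F block unassigned (and maps 0x92 to U+2019):

    undefined = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.append(hex(b))

    print(undefined)                        # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']
    print(bytes([0x92]).decode("cp1252"))   # ’  (U+2019, the character discussed here)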
It would be nice if you too, Mr Constable, thanks to your inside experience and the relationships from your ISO activity, would help Mr Pandey to get heard at ISO Working Group 2 and to access the document register. As everybody knows, every person who comes up with proposals deserves full attention, respect and consideration, especially when that person has already done great work and earned merit. ISO managers who persistently keep working groups from acting ethically deserve to be relieved of the responsibilities they do not fulfill. Everybody on the Unicode Mailing List is well placed to know that Unicode publicly reports on its activities and accepts public feedback. Quality assurance seems little reason for ISO not to accept input from outside the national Standards Bodies. What do you know about the reasons why ISO does not, and even recently narrowed its eligibility conditions?

Best wishes,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:08:05 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:08:05 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: References: <893702172.2401.1434352653036.JavaMail.www@wwinf2229> Message-ID: <1302383882.21163.1434474485938.JavaMail.www@wwinf1n18>

On Sat, Jun 13, 2015, Mark Davis wrote:
> In particular, I see no need to change our recommendation on the character used in contractions for English and many other languages (U+2019). Similarly, we wouldn't recommend use of anything but the colon for marking abbreviations in Swedish, or propose a new MODIFIER LETTER ELLIPSIS for "supercali...docious". (IMO, U+02BC was probably just a mistake; the minor benefit is not worth the confusion.)

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis wrote:
> On Mon, Jun 15, 2015 at 9:17 AM, Marcel Schneider wrote:
>> When we take the topic down again from linguistics to the core mission of Unicode, that is, character encoding and text processing standardisation, the ellipsis and the Swedish abbreviation colon differ from the single closing quotation mark in this, that they are not to be processed.
>> [...]
> Quite nice of you to inform me of the core mission of Unicode; I must have somehow missed that.

I was rather astonished and amused when I read that I could have aimed at informing you of Unicode's core. The goal was to check that I'm at the right level. Well, there would have been another way to say it... which didn't come to my mind. However, what surprises me even more as I think about it is that, while knowing everything about Unicode, you have only a weak opinion on which apostrophe recommendation is the right one...

> More seriously, it is not all so black and white. As we developed Unicode, we considered whether to separate characters by function, e.g., an END OF SENTENCE PERIOD, ABBREVIATION PERIOD, DECIMAL PERIOD, NUMERIC GROUPING PERIOD, etc. Or DIAERESIS vs UMLAUT. We quickly concluded that the costs far, far outweighed the benefits.

It is another proof of Unicode's professionalism to have thought about distinguishing DIAERESIS and UMLAUT. Despite being a French-German bilingual and knowing the diacritics, I first encountered that distinction in Microsoft's kbd.h, where the one is called DIARESIS and is mapped to UMLAUT. I'm not a friend of such distinctions (except in vocabulary and grammar), because in writing practice they would be nothing but useless and counterproductive complications.
An abbreviation dot would have been much more useful, but to deploy its benefits it would have needed a supplemental key mapping. Against this background, Unicode's choice of recommending to disambiguate the apostrophe is even more meritorious. I see it as proof that there is really a good reason for people to mind the difference whenever they don't use the ASCII apostrophe for all of them. What would have bugged Microsoft then is that it would have had to implement this difference in its word processing and desktop publishing software, and to tell users about it. Nothing easier for Microsoft, with all the Help and Info: "The new smart quotes help you to check whether you need an apostrophe or a quote. This makes quote conversion easy." Or the like.

> In practice, whenever characters are essentially identical (and by that I mean that the overlap between the acceptable glyphs for each character is very high), people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.

Based on the Unicode principle of encoding characters, not glyphs, I doubt whether two characters may be called _essentially_ identical when they merely look the same. A huge subset of the Code Charts' cross-references is there to help font designers on this point. As for people mixing things up, they are most likely to do so when the keyboard offers only one of the two. This is not the case for U+02BC and U+2019, neither of which is on standard keyboards. Here it is the smart quotes algorithm that will mix them up! And that algorithm is easily helped not to do so, since it is embedded in high-end software with all its display and shortcut capabilities.

Eventually, the only one who wanted to keep mixing them up was (guess who?) Microsoft. The reason? Word processing that depends on the distinction between opening and closing quotation marks, which needs only a very tiny algorithm, is much easier to implement than processing that depends on the distinction between apostrophe and single closing quotation mark, and between apostrophe and single quotation marks on the whole. Informal English word forms are so rich and varied that some are ambiguous, and scarcely any software dictionary can contain them all. But even formal English is not wholly supported, since nested quotes often are not. Why would users not be interested in improved software, even if it cost a little more?

About searching and equivalence classes: there is already plenty of equivalence implemented in the simplest search algorithm: casing! One more class with (U+0027, U+02BC, U+2019) wouldn't change that a lot.

> So we only separated essentially identical characters in limited cases: such as letters from different scripts.

I repeat myself: calling like-looking glyphs "essentially identical characters" is inconsistent with Unicode's encoding characters, not glyphs. But whatever, I repeat myself again: under these circumstances, Unicode's recommendation of preferring U+02BC for the apostrophe weighs all the heavier!

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
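[A minimal sketch, assuming Python, of the kind of search equivalence class discussed in the message above; this is an illustration, not any product's actual search code. It folds U+0027, U+2019 and U+02BC to one representative, much as case folding maps A to a, so "don't" is found whichever of the three squiggles the document used.]

    # Hypothetical equivalence class: fold the three apostrophe-like
    # characters to a single representative before comparing, the same
    # way case folding maps A to a.
    APOSTROPHE_CLASS = {
        "\u0027": "'",   # APOSTROPHE
        "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
        "\u02BC": "'",   # MODIFIER LETTER APOSTROPHE
    }

    def fold(text):
        """Case-fold and collapse the apostrophe class (lengths are
        preserved for these characters, so indices stay comparable)."""
        return "".join(APOSTROPHE_CLASS.get(c, c) for c in text).casefold()

    def find(needle, haystack):
        """Index of needle in haystack under the folding, or -1."""
        return fold(haystack).find(fold(needle))

    print(find("don't", "He said: Don\u2019t"))   # 9 -- matches U+2019
    print(find("don't", "He said: Don\u02BCt"))   # 9 -- matches U+02BC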
From charupdate at orange.fr Tue Jun 16 12:09:40 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:09:40 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <648652693.21249.1434474580468.JavaMail.www@wwinf1n18>

On Mon, Jun 15, Philippe Verdy wrote:
> But I think that keyboards should all have a dedicated Kana key to easily map additional groups without sacrificing other shift keys on the last row: keyboards really don't need two Windows keys and so the space bar can remain with a comfortable width [...].

IMHO the space bar should not exceed five keys in width.

> If a Kana key is present, in fact it should be to the right of the right Control, or to the right of the right Shift

The best is always that the asymmetric modifiers be operated with the thumbs. If I had to choose between AltGr and Kana, I would prefer the latter because it does not interfere with Ctrl+Alt and does not disable dead keys in Word. But alternately we could map the MODIFIER LETTER APOSTROPHE on the right-hand Alt key for fluid input of high-quality text files.

> [...] Keyboards on notebooks are extremely poorly designed, a complete nonsense.

Yes, there are many models from big manufacturers whose key layout I don't like. By contrast, my computer is a netbook where I nevertheless find all the keys I need, in an ergonomic arrangement. I'm not bound, and I'm not paid to advertise; it's just advice. The manufacturer my netbook is from shipped the same model for the United States *with* an Applications key, *with* a Pause key, *with* a second Function modifier key on the right, with up and down keys of the *same size* as left and right, and *with* an overlaid numpad: when you disable the numpad specials on a customised layout, you just press Fn while entering digits (or press the toggle before and after), the same as on MacBooks, as I have read and heard. It's Asus.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:11:07 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:11:07 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1169482423.21305.1434474667939.JavaMail.www@wwinf1n18>

On Mon, Jun 15, 2015, Doug Ewell wrote:
> Marcel Schneider wrote:
>> A free tool, the Microsoft Keyboard Layout Creator, allows every user to add U+02BC on his preferred keyboard layout
> I use John Cowan's Moby Latin keyboard, built with MSKLC, which is 100% compatible with the AltGr-less US keyboard and supports almost 900 other characters, including all of the apostrophes and quotes and dashes and other characters under discussion:
>
> http://recycledknowledge.blogspot.com/2013/09/us-moby-latin-keyboard-for-windows.html
>
> I spent years designing and updating my own keyboard layout and studying other layouts. I've ended this quest since I started using Moby Latin; it's the best I've seen in numerous ways.

Yesterday, late in the evening, I looked up John Cowan's keyboard layouts. They are the best MSKLC-based keyboard layouts I've ever seen. They are mnemonic. I note that they naturally use AltGr (right-hand Alt, or Alt+Ctrl). In my last reply yesterday I mentioned a multilingual layout from a research institute which really does not use more than two shift states. It's not free. Mr Cowan writes that some allocations are temporary until a new MSKLC version with chained dead keys is released.
This MSKLC 2.0 is still not born, and I fear it never will be. IMO this is the result of the disinterest of many people. You and others probably represent exceptions. This goes so far that MSKLC is flagged as "appears very rarely" in the Acronym Finder. Normally the release and updates of MSKLC should have created a buzz on social media, and today nobody would complain about missing characters. Well, I too complained for a whole year without knowing about MSKLC. One year ago today, I installed my copy of MSKLC. Later I tried to define a universal Latin layout too, but when I was at 1,921 Unicode characters, I could never remember it. I gave up that approach; it's hard to get onto one keyboard, among other Unicode characters, all 1,736 characters of 8.0.0 used in the Latin script (if my subset is right). Do you know Ilya Zakharewich's approach?

http://search.cpan.org/~ilyaz/UI-KeyboardLayout-0.64/lib/UI/KeyboardLayout.pm

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Tue Jun 16 12:27:01 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 16 Jun 2015 19:27:01 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID: <975185747.21752.1434475621961.JavaMail.www@wwinf1n18>

Ten minutes ago, I wrote:

> Microsoft's big merit has been to produce a ready remedy to this bungling, which, as far as the OE digraph is concerned, was meant to accommodate defective peripherals.

Too long a sentence, a comma too many, and a big ellipsis... Please read:

Microsoft's big merit has been to produce a ready remedy to this bungling. I call it "bungling", but in fact, as far as the OE digraph is concerned, its exclusion from ISO 8859-1 was meant to accommodate defective peripherals (that couldn't handle the Œ/œ digraph).

Sorry.

Marcel Schneider

> Message du 16/06/15 19:12
> De : "Marcel Schneider"
> A : "Doug Ewell"
> Copie à : "Unicode Mailing List"
> Objet : Re: Another take on the English Apostrophe in Unicode
>
> On Mon, Jun 15, 2015, 17:12, Doug Ewell wrote:
>> Marcel Schneider wrote: [...]
>>> Microsoft's choice of mashing up apostrophe and close-quote to end up with an unprocessable hybrid was wrong. Very wrong.
>> Windows-1252 and the other Windows code pages were developed during the 1980s, before Unicode, when almost all non-Asian character sets were limited to 256 code points. The distinctions between apostrophe and right-single-quote, weighed against the confusion caused by encoding two identical-looking characters, would never have been sufficient back then to justify separate encoding in this limited space.
>
> I replied:
>> The problem is not about code pages [...]
>
> I thank you for your answers and I'll come back to some of them below. There is a new fact to bring up first. I concede that my last reply yesterday evening was incorrect. In addition to Microsoft's action in the late nineties urging Unicode to give up its useful apostrophe recommendation (U+02BC), the design of code page Windows-1252 is indeed in my scope.
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From doug at ewellic.org Tue Jun 16 12:33:50 2015 From: doug at ewellic.org (Doug Ewell) Date: Tue, 16 Jun 2015 10:33:50 -0700 Subject: Another take on the English Apostrophe in Unicode Message-ID: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net>

Marcel Schneider wrote:

> That's to despise people, that's to spit in their face.

You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

I do wish we could put an end to all the accusations of malfeasance.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From michel at suignard.com Tue Jun 16 12:47:58 2015 From: michel at suignard.com (Michel Suignard) Date: Tue, 16 Jun 2015 17:47:58 +0000 Subject: Accessing the WG2 document register In-Reply-To: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID:

> It would be nice if you too, Mr Constable, thanks to your inside experience and the relationships from your ISO activity, would help Mr Pandey to get heard at ISO Working Group 2 and to access the document register. As everybody knows, every person who comes up with proposals deserves full attention, respect and consideration, especially when that person has already done great work and earned merit. ISO managers who persistently keep working groups from acting ethically deserve to be relieved of the responsibilities they do not fulfill.

The ISO WG2 chair is monitoring this discussion (as well as the 10646 project editor) and is very tired of it. ISO on the SC2 side of things is just a group of volunteers who are doing their best at accommodating various needs. Anshuman knows me very well; he has all the consideration he deserves from the WG2 participants where his contributions are made.
Also don't underestimate the role of the Script Encoding Initiative, which is in fact endorsing a lot of Anshuman's work (including his contributions to UTC and WG2). Anshuman, I and a few others have had some private exchanges and I am sure he understands the situation better. There are no ISO "managers" that can act on your demand, unless I am the one, being the newly appointed WG2 convenor. BTW you can have my job if you think you are so much better. I thought I had explained the situation a few days ago. ISO is not a monolithic organization, and most of us are unpaid volunteers who are barely recognized for their contribution. At the same time ISO has an infrastructure which needs to be paid for; we can all argue about the new directions that the overhead has taken, but blaming the peons at the WG level is not doing any good. BTW Peter and I are good friends, he is the Unicode liaison rep for both SC2 and WG2 and we are in frequent contact.

> Everybody on the Unicode Mailing List is well placed to know that Unicode publicly reports on its activities and accepts public feedback. Quality assurance seems little reason for ISO not to accept input from outside the national Standards Bodies. What do you know about the reasons why ISO does not, and even recently narrowed its eligibility conditions?

Sorry, you have no idea what you are talking about. The day you can have a civilized conversation, maybe I will help you.

Michel Suignard
WG2 convenor, ISO/IEC 10646 Project Editor and Unicode Secretary (just to show that we work in some symbiosis); I also do most of the draft chart work for both sides. Been in the trenches on both sides for the last 25 years (and more).
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From mark at macchiato.com Tue Jun 16 13:57:03 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Tue, 16 Jun 2015 20:57:03 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net> References: <20150616103350.665a7a7059d7ee80bb4d670165c8327d.274f72d111.wbe@email03.secureserver.net> Message-ID:

And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Mark
*"Il meglio è l'inimico del bene"*

On Tue, Jun 16, 2015 at 7:33 PM, Doug Ewell wrote:
> Marcel Schneider wrote:
>> That's to despise people, that's to spit in their face.
>
> You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.
>
> I do wish we could put an end to all the accusations of malfeasance.
>
> --
> Doug Ewell | http://ewellic.org | Thornton, CO

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From verdy_p at wanadoo.fr Tue Jun 16 14:08:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Tue, 16 Jun 2015 21:08:22 +0200 Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID:

When ISO 8859-1 was designed (in fact in an early version by Digital for its own variant of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority.
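[For readers following the code-page details in this exchange, here is a small illustration, assuming Python and its bundled codecs (it is not part of the original messages): ISO 8859-1 keeps bytes 0x80-0x9F as C1 control codes, while Windows-1252 reuses that range for printable characters and leaves five byte values unassigned, the five empty code points mentioned earlier in the thread.]

    # Same byte, two interpretations: a C1 control under ISO 8859-1,
    # a printable character under Windows-1252.
    for byte in (0x88, 0x91, 0x92, 0x98):
        raw = bytes([byte])
        as_latin1 = raw.decode("latin-1")   # always the U+0080..U+009F controls
        as_cp1252 = raw.decode("cp1252")    # modifier letters, curly quotes...
        print(f"0x{byte:02X}: latin-1 -> U+{ord(as_latin1):04X}, "
              f"cp1252 -> {as_cp1252!r} (U+{ord(as_cp1252):04X})")

    # The five code points left unassigned in Windows-1252:
    for byte in (0x81, 0x8D, 0x8F, 0x90, 0x9D):
        try:
            bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            print(f"0x{byte:02X}: unassigned in cp1252")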
Microsoft abandoned its own development of Unix to develop DOS and extend it with Windows, in parallel with its work with IBM, which had wanted DOS to be a very lightweight version of CP/M, but without a scheduler, in order to run software on personal computers that could be used in small organisations that could not buy its mainframes, but had to prepare documents and data that could be reused on IBM mainframes...

2015-06-16 19:02 GMT+02:00 Marcel Schneider :
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From richard.wordingham at ntlworld.com Tue Jun 16 14:31:12 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 16 Jun 2015 20:31:12 +0100 Subject: Another take on the English apostrophe in Unicode In-Reply-To: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> References: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> Message-ID: <20150616203112.59e02f27@JRWUBU2>

On Mon, 15 Jun 2015 08:40:57 +0200 (CEST) Marcel Schneider wrote:

> ...while in the meantime, in obliging anticipation, the world's biggest software company stays inviting us to feel free to customise our keyboard with a free tool for free download at
> http://www.microsoft.com/en-us/download/details.aspx?id=22339

I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and its output there is appropriate for Windows 7, generating multiple versions of the DLL and its installer.

Richard.

From petercon at microsoft.com Tue Jun 16 17:53:12 2015 From: petercon at microsoft.com (Peter Constable) Date: Tue, 16 Jun 2015 22:53:12 +0000 Subject: Accessing the WG2 document register In-Reply-To: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID:

There are changes in processes, but nothing that I would consider new _discrimination_. Also, Mr. Pandey's positions have always been and continue to be very well represented in ISO/IEC JTC1/SC2/WG2.

Again, if you are not yourself engaging in ISO processes or working with your country's national standards body in connection with ISO processes, then you are not in a good position to be critiquing ISO processes.

Peter

From: Marcel Schneider [mailto:charupdate at orange.fr] Sent: Tuesday, June 16, 2015 10:05 AM To: Peter Constable Cc: Unicode Mailing List Subject: RE: Accessing the WG2 document register Importance: High

[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From eik at iki.fi Wed Jun 17 02:23:16 2015 From: eik at iki.fi (Erkki I Kolehmainen) Date: Wed, 17 Jun 2015 10:23:16 +0300 Subject: Accessing the WG2 document register In-Reply-To: References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> Message-ID: <001001d0a8ce$7a950420$6fbf0c60$@fi>

I fully agree with Peter. I used to be heavily involved in SC2 and its working groups since the turn of the century, and I have been party to several proposals and other contributions. The activity of the National Bodies has decreased in the past few years (particularly that of the European countries), mostly because the bulk of the encoding work directly related to them is nearly complete, but SC2 still has an important role to play.

Although I'm strongly against undue bureaucracy, I also understand that the members of ISO, the National Bodies, want to ensure that technical proposals have been vetted prior to their submission, particularly if they end up being somehow attached to the National Body of the submitter's home country.

I highly appreciate the work of the conveners (Mike Ksar and more recently Michel Suignard), the project editors and the recording secretary (Uma Umamaheswaran). The liaison with Unicode has brought in a lot of technical expertise, which has been most beneficial over the years. WG2 hasn't had a general email discussion facility (although Michael Everson has privately maintained some lists for specific topics), for which purpose the Unicode list has been used.

Sincerely, Erkki

Lähettäjä: Unicode [mailto:unicode-bounces at unicode.org] Puolesta Peter Constable Lähetetty: 17. kesäkuuta 2015 01:53 Vastaanottaja: Marcel Schneider Kopio: Unicode Mailing List Aihe: RE: Accessing the WG2 document register

[...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:09:31 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:09:31 +0200 (CEST) Subject: Accessing the WG2 document register In-Reply-To: <001001d0a8ce$7a950420$6fbf0c60$@fi> References: <1630765980.21046.1434474283758.JavaMail.www@wwinf1n18> <001001d0a8ce$7a950420$6fbf0c60$@fi> Message-ID: <1672470426.12114.1434553771360.JavaMail.www@wwinf2229>

I thank you, Mr Suignard, Mr Constable and Mr Kolehmainen, for your kind replies, and I assure you that my blame targeted the overhead you refer to, since Mr Pandey pointed clearly at higher-level decisions, not at WG2. I'm now pretty sure that, thanks to your close relationships within the ISO working group, Mr Pandey will get access to the documents he wishes to consult in the document register, and will never be starved of the information he needs for his work and contributions.
I'm grateful for the time you took, especially Mr Suignard and Mr Kolehmainen, to write up this information for my (and the Mailing List subscribers') attention, which gives me some wholesome insight I would never have gained through the extremely sheltered and repellent ISO website. Accordingly, I am very sorry about the suspicions I uttered, notably in the two threads in which I've had the honor of taking part over the past few days.

About my taking part in this thread, I must confess that I had paid very little attention to the threads related to ISO work, partly because of my lack of interest in ISO topics. Now, however, I understand why this list contains threads specifically related to ISO SC2 WG2. It was not until the second time Mr Overington mailed in this thread "in reply" (as I imagined, because of the short time lapse) to the suggestions I had sent him to answer his requests about input and display facilities (A new take on the English apostrophe in Unicode) that I read his message thoroughly, which on Monday, June 15, was particularly touching and appealed to my emotions. Whence I got very angry at ISO, all the more as I recalled my past ideas.

I won't hide that I refrained from sending copies, in view of my recent brief mail contact with ISO, which had considerably enhanced my image of the Standards Body as a whole, by extrapolation. I'm glad again to have such good news, and I would share that I feel it's a pity that there is AFAIK no source where everybody could inform themselves, like a website. But now I can refer to this thread, if you agree, whenever the topic comes up somewhere.

I would like to ask all persons who were affected by my e-mails to excuse me.

Best regards,

Marcel Schneider
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:35:46 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:35:46 +0200 (CEST) Subject: Another take on the English apostrophe in Unicode In-Reply-To: <20150616203112.59e02f27@JRWUBU2> References: <973380398.1843.1434350457831.JavaMail.www@wwinf2229> <20150616203112.59e02f27@JRWUBU2> Message-ID: <711874393.12630.1434555346659.JavaMail.www@wwinf2229>

On Mon, Jun 16, 2015, "Richard Wordingham" wrote:

> I don't know if you have the wrong link for MSKLC, but that link claims it is only 'supported' up to Vista. That's not much of an invitation! I do know that MSKLC works on Windows 7, and its output there is appropriate for Windows 7, generating multiple versions of the DLL and its installer.

I'm sorry, I didn't think about that issue. The download link is not wrong; AFAIK it's the only available download page for the (most recent) 1.4 version. And this version works for Windows 8 too [and, I hope, for the coming Windows 10], as this thread on Microsoft Community shows:

http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb

Marcel
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 10:43:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 17:43:57 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: References: <1165856201.20980.1434474146145.JavaMail.www@wwinf1n18> Message-ID: <1507910160.12770.1434555837463.JavaMail.www@wwinf2229>

On Tue, Jun 16, 2015, Philippe Verdy wrote:

> When ISO 8859-1 was designed (in fact in an early version by Digital for its own variant of Unix), allowing a bijective compatibility with 8-bit EBCDIC and its C1 controls was still a priority.
> [...]

Thank you, Philippe, for the information. It was a very good idea to build a system without need of the C1 controls and to remap those two ranges to completing characters, which are indispensable, notably in French, and to start with the single quotes.

Marcel

> Message du 16/06/15 21:08
> De : "Philippe Verdy"
> A : "Marcel Schneider"
> Copie à : "Doug Ewell", "Unicode Mailing List"
> Objet : Re: Another take on the English Apostrophe in Unicode
> [...]
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From charupdate at orange.fr Wed Jun 17 11:18:32 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Wed, 17 Jun 2015 18:18:32 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode Message-ID: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229>

On Tue, Jun 16, Mark Davis wrote:

> And, Marcel, while you are at it, this is getting tiresome. Please find some other place to vent about events you know very little about; the internet is full of them.

Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Mailing List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this. Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b

Since your message could not reach me yesterday, I prepared two replies I wanted to send today, one to Doug and one to you. If you agree, I'll paste them both hereafter.

On Tue, Jun 16, 2015, Doug Ewell wrote:

> You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO.

You know I did, and if it were just for my own sake, I'd probably never have started mailing in this thread. A big part of the text to be processed for quotes originates from other people. So when I use U+02BC, I did good work (if I ever get quoted :)). An essential condition is that all text-handling software be updated to handle the letter apostrophe correctly. Without an official recommendation, this is not likely to be done. And this recommendation cannot usefully be issued unless Microsoft agrees. We remember that without Microsoft, the Unicode Consortium probably wouldn't have been founded, and character encoding wouldn't thrive as it does today.

On Mon, Jun 15, 2015, 20:14, Doug Ewell wrote:

> Perhaps a UTC member can confirm whether this is fact or speculation. Markus Kuhn's comment from 1999 about "couldn't Unicode follow Microsoft...?" doesn't prove that Unicode was in fact strong-armed by Microsoft.
I know that Markus Kuhn's concern was very valuable, and he did a great job showing how to eradicate the clumsy quote simulation that was current at the time, due to the lack of characters. You remember: they used accents as quotes, and at that stage the mix-up was between apostrophe and acute!

https://www.cl.cam.ac.uk/~mgk25/ucs/apostrophe.html

The curly glyph for 0x27 in old ASCII fonts and its reversed counterpart mapped to 0x60, which Mr Kuhn shows on this page together with how to replace them properly, are reminiscent of the U+201B/U+2019 quote pair, where the deprecated REVERSED SINGLE COMMA QUOTATION MARK was discussed on this list, the conclusion being:

On Thu, Jun 15, 2006, Andreas Prilop wrote: http://www.unicode.org/mail-arch/unicode-ml/y2006-m06/0265.html
> Actually, I have seen such quotation marks in English-language books printed in Britain and the USA. But, as I wrote, they are certainly not preferred. *If* you want such quotation marks, then please use U+201B for them!

At that time, the matter was correct rendering. Today, it is correct processing. Yes, fortunately U+02BC is *not deprecated* for the English apostrophe, and looking closer, IMO there is *no recommendation* for U+2019 either, just a stated preference. As I wrote earlier in this thread, Unicode logically and seemingly changed the preference against its will. Logically, because the first recommendation (like the whole Standard) was consciously designed, as Mr Davis reminded us the day before yesterday. Seemingly, because the U+0027 comment line in the Code Chart was changed from

> preferred character for apostrophe is 2019

to

> 2019 is preferred for apostrophe

between versions 3.0.0 and 4.0.0 (while the line "preferred characters in English for paired quotation marks are 2018 & 2019" remained unchanged; see the complete comparison at http://charupdate.info#ambiguation).

On Tue, Jun 16, 2015, Doug Ewell wrote:
> I do wish we could put an end to all the accusations of malfeasance.

Experience proves that often a lot of mails, e-mails, blog posts, forum posts, tweets and so on are needed to get things moving. The best way to get nothing done is to get everybody convinced it's all OK. That's what I sometimes feel reading this thread, or the one about ISO/IEC JTC1/SC2/WG2 that is ongoing in the meantime! And the only way to get something changed has always been to show that it's wrong. From there on, the next step would be to find out who is responsible.

About the apostrophe, we're all a bit responsible. Why hide that British English usage does not do much to disambiguate things, by preferring single quotes as the usual quotation marks, leading some authors to end up preferring chevrons even in English; see Chris Harvey (pleading for U+2019 as apostrophe) at http://www.languagegeek.com/typography/apostrophes.html#Anchor-Potentia-61409

But Microsoft is responsible, too. And Microsoft and we have the power to bring this to a solution: everybody on his own PC, and Microsoft together with Unicode and ISO at a global level. So let's tackle it.

On Mon, Jun 15, 2015 at 10:19 AM, Mark Davis wrote:

> In practice, whenever characters are essentially identical (and by that I mean that the overlap between the acceptable glyphs for each character is very high), people will inevitably mix up the characters on entry. So any processing that depends on that distinction is forced to correct the data anyway. And separating them causes even simple things like searching for a character on a page to get screwed up without having equivalence classes.
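[As a neutral illustration of the processing difference being argued over here (my own sketch, assuming Python; it is not a statement about any particular product): the two characters carry different General_Category values, so generic word-matching code treats "don't" differently depending on which one is used.]

    import re
    import unicodedata

    # U+02BC is a letter (Lm); U+2019 and U+0027 are punctuation.
    for ch in ("\u0027", "\u2019", "\u02BC"):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
              f"category {unicodedata.category(ch)}")

    # A plain "word" regex splits don't at a punctuation apostrophe
    # but keeps the word whole with the letter apostrophe.
    print(re.findall(r"\w+", "don\u2019t"))   # ['don', 't']
    print(re.findall(r"\w+", "don\u02BCt"))   # ['donʼt']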
Now I use U+02BC, I experience that in most applications, this is not yet a part of the equivalence class of apostrophe-single-quote, where only U+0027, U+2019 and U+2018 seem to be in. However, when at the occasion of the next software updates, U+02BC is added to this class, that wouldn?t always be enough for the software to work fine. Options should be added to disable these equivalences, like today case-sensitivity can already be enabled in most search dialogs. But without an official recommendation, all this will scarcely be done. Could Unicode please add again a recommendation for U+02BC at U+0027? You could for example recommend to prefer U+02BC for processing, and U+2019 for printing while waiting that fonts are updated. Or you could recommend U+02BC, and admit that U+2019 is used in legacy compatibility mode. The main reason for the status quo to be protected (as it seems to be), could however be the fear of image damages. Imagine people learning that there is a flaw in the apostrophe. It will be hard to explain why it was ambiguated and why we come up today with disambiguation; why there are new radio buttons for LETTER APOSTROPHE and PUNCTUATION APOSTROPHE (to give it a cool name; the former converts U+0027 always to U+02BC, the latter works as today...); how the nested quotes algorithm works (supposing that today, it isn?t still implemented); and why to hit the quotation mark two times when the ?other? quotation mark is wished. Quite a lot of job. There?s a nice workaround to input high quality text files. Turn off smart quotes, use U+0027 for apostrophe only, and type a left square bracket to open a quotation, a curly for a nested or alternate one. The brackets pairing algorithm will accurately close. Square brackets for output may be entered as or as two parentheses. Once finished, save that file at a secure place. Then open a copy and replace the apostrophes with U+02BC (or U+2019, depending on whether U+02BC is in the target font), then the four bracketing characters with whatever quotes you need, and finish with the definite square brackets. That should work on every text or wysiwyg editor. However, I believe we should start at another end, stopping to eat that insane stuff that is processed from insulted, tortured, poisoned, and slowly killed animals and brings us but acidosis, osteoporosis, and much more... but nothing good, nothing that were worth the confusion of ethical values on the pattern intiated by the Nazi government.? I know it?s off the topic, but it should bring us nearer to a helpful solution.Therefore I permit me to suggest to (re)watch ?Earthlings? and visit Gary Yourofsky?s website and Facebook profile. Once we?ve resolved the problems pointed out there?which at a personal level is very easy to perform?, I believe we shall stop redoing the errors of the past: http://adaptt.org http://www.facebook.com/therealgaryyourofsky http://youtube.com/GaryYourofskyAdaptt http://earthlings.com/ (also on YouTube). ? Best regards, Marcel -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Thu Jun 18 03:17:57 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Thu, 18 Jun 2015 10:17:57 +0200 (CEST) Subject: Another take on the English Apostrophe in Unicode In-Reply-To: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229> References: <1309569758.13472.1434557912246.JavaMail.www@wwinf2229> Message-ID: <984264692.4358.1434615477728.JavaMail.www@wwinf2229> Dear Mr Ewell, as I was very puzzled reading Mr Davis' last reply yesterday, I stood away from mailing to you separately as I'd the purpose to do. For the same reason, I forgot to remove an outdated period I'd never have written after reading Mr Kolehmainen's, Mr Suignard's and Mr Constable's e-mails I found yesterday. I beg everybody's pardon. On Wen, Jun 17, I?wrote: > Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. > The best way of getting nothing to be done is to get everybody convinced it?s all OK. That?s what I sometimes feel reading this thread, > or the one about ISO/IEC JTC1/SC2/WG2 that is on-going in the meantime! > And the only way to get something change has always been to show it?s wrong. > From there on, the next step would be to find out who is responsible. Please read instead: | Experience proves that often a lot of mails, e-mails, blog posts, fora posts, tweets and so on are needed to get things move. | The best way of getting nothing to be done is to get everybody convinced it?s all OK. That?s what I sometimes feel reading this thread. | And the only way to get something change has always been to show it?s wrong. | From there on, the next step would be to find out who is responsible. ? Best regards, Marcel S.? > Message du 17/06/15 18:29 > De : "Marcel Schneider" > A : "MarkDavis??" , "DougEwell" > Copie ? : "TedClancy" , "UnicodeMailingList" > Objet : Re: Another take on the English Apostrophe in Unicode > > > On Tue, Jun 16, Mark Davis ?? wrote: > And, Marcel, while you are at it, this is getting tiresome. > Please find some other place to vent about events you know very little about; the internet is full of them. Dear Mark, I understand (a little) that I'm tiresome. Please consider nevertheless that the Unicode Public Maliling List is AFAIK the only spot where people can communicate with Unicode decision makers. No other mailing list nor any forum on the internet can do this. Even Microsoft's Community forum can do nothing at Microsoft, forum volunteers told me. I posted there in French and in English. In French my most useful post seems to be at http://answers.microsoft.com/fr-fr/office/forum/office_2010-word/recherche-invers%C3%A9e-dans-les-listes/845a02fa-aa2d-4d81-a03e-12ecb7f2f46b Since your message could not reach me yesterday, I prepared two replies I wanted to send today. It was exactly one to Doug and one to you. If you agree, I'll paste them both hereafter. On Tue, Jun 16, 2015, Doug Ewell wrote: > You know what? If you want to use U+02BC as an English apostrophe, go ahead and use it. Nobody's stopping you really. Not Unicode, not Microsoft, not ISO. You know I did, and if it were just for my own?s sake, I?d probably never started mailing in this thread. A big part of text to be processed on quotes originates from other people. So when I?use U+02BC, I?did a good work (if I were quoted :)). A essential condition is that all text handling software is updated to handle correctly the letter apostrophe. Without an official recommendation, this is not likely to be done. 
Best regards, Marcel From roche+kml2 at exalead.com Thu Jun 18 02:54:03 2015 From: roche+kml2 at exalead.com (Xavier Roche) Date: Thu, 18 Jun 2015 09:54:03 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? Message-ID: <5582791B.3080405@exalead.com> Hi! There are some differences in character fallback substitutions introduced between version 24 (http://www.unicode.org/cldr/charts/24/supplemental/character_fallback_substitutions.html) and 25 (http://www.unicode.org/cldr/charts/25/supplemental/character_fallback_substitutions.html) ; for example, these two letters have been removed: 0153 ? LATIN SMALL LIGATURE OE Explicit 006F, 0065 oe LATIN SMALL LETTER O, LATIN SMALL LETTER E 0152 ? LATIN CAPITAL LIGATURE OE Explicit 004F, 0045 OE LATIN CAPITAL LETTER O, LATIN CAPITAL LETTER E However, they are still listed at: http://unicode.org/repos/cldr/trunk/common/supplemental/characters.xml OE oe I was wondering what was the rationale behind ? Could it be a bug ? Regards, Xavier From markus.icu at gmail.com Thu Jun 18 11:36:49 2015 From: markus.icu at gmail.com (Markus Scherer) Date: Thu, 18 Jun 2015 18:36:49 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: <5582791B.3080405@exalead.com> References: <5582791B.3080405@exalead.com> Message-ID: If the chart does not reflect the data, then please submit a bug ticket. http://unicode.org/cldr/trac/newticket The data is what counts. markus -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Thu Jun 18 11:39:11 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 18 Jun 2015 18:39:11 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: <5582791B.3080405@exalead.com> References: <5582791B.3080405@exalead.com> Message-ID: It sounds like a bug in the CLDR chart. Can you file a ticket at http://unicode.org/cldr/trac/newticket please? Mark *? Il meglio ? l?inimico del bene ?* On Thu, Jun 18, 2015 at 9:54 AM, Xavier Roche wrote: > Hi! > > There are some differences in character fallback substitutions introduced > between version 24 ( > http://www.unicode.org/cldr/charts/24/supplemental/character_fallback_substitutions.html) > and 25 ( > http://www.unicode.org/cldr/charts/25/supplemental/character_fallback_substitutions.html) > ; for example, these two letters have been removed: > > 0153 ? LATIN SMALL LIGATURE OE Explicit 006F, 0065 > oe LATIN SMALL LETTER O, LATIN SMALL LETTER E > 0152 ? LATIN CAPITAL LIGATURE OE Explicit 004F, > 0045 OE LATIN CAPITAL LETTER O, LATIN CAPITAL LETTER E > > However, they are still listed at: > http://unicode.org/repos/cldr/trunk/common/supplemental/characters.xml > > OE > oe > > I was wondering what was the rationale behind ? Could it be a bug ? > > > Regards, > Xavier > -------------- next part -------------- An HTML attachment was scrubbed... URL: From roche+kml2 at exalead.com Fri Jun 19 00:07:20 2015 From: roche+kml2 at exalead.com (Xavier Roche) Date: Fri, 19 Jun 2015 07:07:20 +0200 Subject: Possible issue with Character Fallback Substitutions between version 24 and 25 ? In-Reply-To: References: <5582791B.3080405@exalead.com> Message-ID: <5583A388.2020108@exalead.com> Le 18/06/2015 18:36, Markus Scherer a ?crit : > If the chart does not reflect the data, then please submit a bug ticket. 
> http://unicode.org/cldr/trac/newticket Thanks, done: http://unicode.org/cldr/trac/ticket/8662 Regards, Xavier From public at khwilliamson.com Fri Jun 19 15:29:06 2015 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 19 Jun 2015 14:29:06 -0600 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' Message-ID: <55847B92.3020201@khwilliamson.com> I haven't found any information on this. It can't just be a transliteration difference, because the number of code points is vastly different between them. Is it the case that the version 1 syllables is a failed abstraction that was replaced by the later versions? Thanks From public at khwilliamson.com Fri Jun 19 15:51:20 2015 From: public at khwilliamson.com (Karl Williamson) Date: Fri, 19 Jun 2015 14:51:20 -0600 Subject: Why aren't the emoji modifiers GCB=Extend? Message-ID: <558480C8.90707@khwilliamson.com> Someone writing code using Unicode 8 found that the FITZPATRICK modifiers are considered separate graphemes from what they modify. This is surprising, and seems contrary to not only the concept of a grapheme cluster, but the spirit of tr51 2.2.3 "A supported emoji modifier sequence should be treated as a single grapheme cluster for editing purposes" From kenwhistler at att.net Fri Jun 19 17:12:59 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 19 Jun 2015 15:12:59 -0700 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <55847B92.3020201@khwilliamson.com> References: <55847B92.3020201@khwilliamson.com> Message-ID: <558493EB.4000807@att.net> Karl, As usual, the situation is way more complicated that perhaps it has any business being! It isn't just Version 1 Hangul that have to be considered, but also Version 1.1 Hangul. Version 1.0 contained 2350 Hangul syllables, encoded in the range 3400..3D2D. Version 1.1 contained 6646 Hangul syllables, encoded in the range 3400..3D2D and a distinct new range 3D2E..4DFF. It thus added 4306 to what was in Version 1.0 already. Version 2.0 (and all subsequent versions) contained the 11172 Hangul syllables we now see, encoded in the range AC00..D7A3. Version 2.0 *deleted* all the Hangul syllables in the range 3400..4DFF. You also need to pay attention to the history of the encoding of jamo. Version 1.0 contained 94 "Hangul Elements", encoded in the range 3131..318E. Version 1.1 retained the same 94 "Hangul Letters" in the range 3131..318E. Version 1.1 added 240 conjoining jamo letters in the range 1100..11F9. Version 2.0 retained both of those sets. O.k., now what were those various chunks? The Unicode 1.0 set of 2350 was encoded for compatibility with KS C 5601-1987. They were given no formal decompositions (the concept didn't yet exist), but the implication in the standard was essentially that Hangul syllables could just be spelled out with jamo letter sequences. The details were an exercise for implementation, however, and were soon overtaken by events in the Unicode/10646 merger. The Unicode 1.1 set of 4306 additions came from the 10646 merger work, and comprised two actual subsets: Hangul Supplementary Syllables A (1930 modern syllables) from KS C 5659-1990. (See the Unicode 1.1 subrange: 3D2E..44BD.) Hangul Supplementary Syllables B (2376 old Korean syllables) from KS C 5657-1991. (See the Unicode 1.1 subrange: 44BE..4DFF.) *All* of the Unicode 1.1 Hangul syllables were given decompositions. 
(Although the formalization of Unicode normalization did not yet exist.) The decompositions can be see in UnicodeData-1.1.5.txt. Because the syllables were then encoded in three "alphabetical" extents, with a few stragglers tucked on, the decompositions were not algorithmically defined -- they were just enumerated in the data file. The decompositions involved the new set of conjoining jamo letters, rather than the older set, which were relegated to compatibility mapping status. The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C 5601-1992. That was an algorithmically designed replacement of the earlier sets from Korean standards -- designed to cover all modern syllables algorithmically, by putting all the combinations of initial, medial and final jamos in a defined alphabetical order, whether or not each syllable that resulted was actually attested in modern Korean use or not. There was an enormous hullabaloo at the time, of course, about the changes required to switch over from the old ranges to the new set. But the whole shebang was balloted as Amendment 5 to ISO/IEC 10646-1:1993, and when that ballot passed, Unicode adopted the change wholesale into the documentation and data files for Unicode 2.0, to stay in synch. But "The Korean Mess", as it was then known, led directly to the determination by both SC2 and the UTC that such re-encoding of already standardized and published characters was enormously damaging to both standards. It was also expensive to the early implementers: Oracle, for example, long maintained distinct database support for the Unicode 1.1 Korean, which was incompatible with the Unicode 2.0 Korean. In any case, if anybody has any lingering questions about why the following policy exists and is *strictly* enforced: http://www.unicode.org/policies/stability_policy.html#Encoding or why the applicable version for that stability policy is 2.0+, the answer is that it was a direct reaction to "The Korean Mess". --Ken On 6/19/2015 1:29 PM, Karl Williamson wrote: > I haven't found any information on this. It can't just be a > transliteration difference, because the number of code points is > vastly different between them. > > Is it the case that the version 1 syllables is a failed abstraction > that was replaced by the later versions? From kenwhistler at att.net Fri Jun 19 17:24:26 2015 From: kenwhistler at att.net (Ken Whistler) Date: Fri, 19 Jun 2015 15:24:26 -0700 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <558480C8.90707@khwilliamson.com> References: <558480C8.90707@khwilliamson.com> Message-ID: <5584969A.7030406@att.net> Karl, This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. That is also related to why they are treated as gc=Sk symbol modifiers, rather than as combining marks or format characters. If you *support* emoji modifier sequences, then yes, you should treat them as single grapheme clusters for editing -- but their behavior is more akin then to ligatures or conjuncts than to combining character sequences. You need additional, specific knowledge about these sequences -- it doesn't just fall out from a *default* implementation of UAX #29 rules for grapheme clusters. --Ken On 6/19/2015 1:51 PM, Karl Williamson wrote: > Someone writing code using Unicode 8 found that the FITZPATRICK > modifiers are considered separate graphemes from what they modify. 
> This is surprising, and seems contrary to not only the concept of a > grapheme cluster, but the spirit of tr51 2.2.3 "A supported emoji > modifier sequence should be treated as a single grapheme cluster for > editing purposes" > > > > From mark at macchiato.com Sat Jun 20 04:02:45 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 20 Jun 2015 11:02:45 +0200 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <5584969A.7030406@att.net> References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> Message-ID: On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler wrote: > This results from the fact that the fallback behavior for the modifiers is > simply as independent pictographic blorts, i.e. the color swatch images. > That is also related to why they are treated as gc=Sk symbol modifiers, > rather than as combining marks or format characters. > > If you *support* emoji modifier sequences, then yes, you should treat > them as single grapheme clusters for editing -- but their behavior is > more akin then to ligatures or conjuncts than to combining character > sequences. You need additional, specific > knowledge about these sequences -- it doesn't just fall out from a > *default* implementation of UAX #29 rules for grapheme clusters. > ?Looks like this would be a good FAQ addition...? Mark *? Il meglio ? l?inimico del bene ?* -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 20 04:32:47 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 20 Jun 2015 11:32:47 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= In-Reply-To: <5581DA60.6080003@unicode.org> References: <5581DA60.6080003@unicode.org> Message-ID: <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> This is intrinsicly the nicest version announcement of all the history of Unicode, because of the opportune use of the newly encoded emoji U+1F37E BOTTLE WITH POPPING CORK. Even if I wouldn?t drink what?s in, nor eat any more U+1F9C0 CHEESE WEDGE (you know I?ve become a vegan between my beta feedback* and now), I was already very pleased when Unicode started adding emojis, and I'm still more as emojis are now thriving and covering the complete cultural range. I?d the purpose not to mail to the List for a time. But I?ve got some other topics I need to discuss. And, first, it would be a pity if there were no follow-up in this 8.0.0 version announcement thread (even if it wasn't sent as a "new topic" to discuss). --- * On Fri Apr 24 12:51:50 CDT 2015, I wrote: http://www.unicode.org/review/pri297/feedback.html > U+1F9C0 CHEESE WEDGE and Translations of the Code Charts > Dear Unicode Consortium, I'm pleased to read the Feedback from Mr Lawson and would join my > congratulations to his' [apostrophe mistake; read: his].? > The Cheese Wedge symbol he underscores, recalls me the new sets have already been translated to French [...]. > > More precisely about the Cheese Wedge, I'm glad to see unbloody, no-slaughter > food is now strongly promoted and is given a fabulous opportunity of becoming > a wide-spread cultural phenomenon. [Alas! That turned out not to be so pleasing at all.] I know, the purpose of this Mailing List is encoding and implementation, not civilisation. That?s why I?ve made up a new keyboard layout for the United Kingdom. It should help British users to get readily fully processible Unicode text files. 
That means, quotation marks can be simply converted to US?usage by doing two research-and-replace-all. I?ve called this keyboard layout ?typographic?, because U+02BC MODIFIER LETTER APOSTROPHE is now inserted by default, while U+0027 for smart quotes (and names of archive files) is obtained with AltGr, a shift state that is already present in the shipped layout, and where now all comma (and angle) quotation marks for use in English and Welsh are equally found, along with em and en dashes. (As is well known, Welsh is the locale of the UK?extended keyboard layout shipped with Windows, which this driver is based on.) If a header or a readme can be provided with the input text, the use of U+02BC for apostrophe should be mentioned, until all software has been updated (by adding U+02BC to the equivalence class for U+0027), because as a collateral damage of legacy practice, searches for apostrophe-containing words are actually prevented from being successful when U+0027 is used in the search bar while the matching words present in the text are accurately spelled with U+02BC. For future readers: For more information about MODIFIER LETTER APOSTROPHE, please look up the thread ?A new take on the English apostrophe in Unicode?. This time, the layout is released for UK only, not for USA because disambiguating apostrophe and single closing-quote seems not to be worth-while in US English, where indeed single quotation marks must scarcely be in current use. The (again unlicensed) drivers (several architecture versions for all actual Windows versions) are for free download at: http://bit.ly/1K1XGBs For more information about keyboard drivers and the Microsoft Keyboard Layout Creator, please download your free copy of MSKLC at: http://www.microsoft.com/en-us/download/details.aspx?id=22339 and look up the Help. Compatibility extends to Windows?7?and?8: http://answers.microsoft.com/en-us/windows/forum/windows_8-winapps/msklc-microsoft-keyboard-layout-creator-for/a54a4db0-94c0-4f08-8909-37a7c5b758bb Best regards, Marcel Schneider > Message du 17/06/15 23:17 > De : announcements at unicode.org > A : announcements at unicode.org > Copie ? : > Objet : Announcing The Unicode? Standard, Version 8.0 > > > Version 8.0 of the Unicode Standard is now available. It includes 41 new emoji characters (including five modifiers for diversity), 5,771 new ideographs for Chinese, Japanese, and Korean, the new Georgian lari currency symbol, and 86 lowercase Cherokee syllables. It also adds letters to existing scripts to support Arwi (the Tamil language written in the Arabic script), the Ik language in Uganda, Kulango in the C?te d?Ivoire, and other languages of Africa. In total, this version adds 7,716 new characters and six new scripts. > The first version of Unicode Technical Report #51, Unicode Emoji is being released at the same time. That document describes the new emoji characters. It provides design guidelines and data for improving emoji interoperability across platforms, gives background information about emoji symbols, and describes how they are selected for inclusion in the Unicode Standard. The data is used to support emoji characters in implementations, specifying which symbols are commonly displayed as emoji, how the new skin-tone modifiers work, and how composite emoji can be formed with joiners. The Unicode website now supplies charts of emoji characters, showing vendor variations and providing other useful information. 
> The 41 new emoji in Unicode 8.0 include the following: > Diversity > five emoji modifiers > Faces and Hands > NERD FACE, FACE WITH ROLLING EYES, ROBOT FACE > Food-Related > HOT DOG, TACO, CHEESE WEDGE, POPCORN > Sports > CRICKET BAT AND BALL, VOLLEYBALL, BOW AND ARROW > Animals > UNICORN FACE, LION FACE, CRAB, SCORPION > Religious > MOSQUE, SYNAGOGUE, PRAYER BEADS > (For the full list, including images, see emoji additions for Unicode 8.0.) > Phones and computers often need operating system updates to support new emoji, which may take some time. It is also now clear which existing characters, such as the often requested SHOPPING BAGS, can be used as emoji. Once phones and computers support these characters, people will be able to see colorful images such as the BOTTLE WITH POPPING CORK above. > Three other important Unicode specifications are updated for Version 8.0: UTS #10, Unicode Collation Algorithm ? for sorting Unicode text UTS #39, Unicode Security Mechanisms ? for reducing Unicode spoofing UTS #46, Unicode IDNA Compatibility Processing ? for compatible processing of non-ASCII URLs > Some of the changes in Version 8.0 and associated Unicode technical standards may require modifications in implementations. For more information, see Unicode 8.0 Migration and the migration sections of UTS #10, UTS #39, and UTS #46. For full details on Version 8.0, see Unicode 8.0. > http://blog.unicode.org/2015/06/announcing-unicode-standard-version-80.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: champagne-bottle-vector2.jpg Type: image/jpeg Size: 20694 bytes Desc: not available URL: From wjgo_10009 at btinternet.com Sat Jun 20 05:44:58 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Sat, 20 Jun 2015 11:44:58 +0100 (BST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= In-Reply-To: <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> References: <5581DA60.6080003@unicode.org> <1319079448.4881.1434792767199.JavaMail.www@wwinf1n18> Message-ID: <12873937.14634.1434797098469.JavaMail.defaultUser@defaultHost> Marcel Schneider wrote: ... I?ve become a vegan ... I too am a vegan, in fact a gluten-avoiding vegan. Could there be emoji to signal those two diets in descriptions of food please? William Overington 20 June 2015 / -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Jun 20 10:06:09 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 20 Jun 2015 17:06:09 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0?= Message-ID: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> On Sat, Jun 20, 2015, William_J_G Overington wrote: > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? This would be very important, to get more people take the move. Today where everything is emoji-powered, Unicode should encode the sooner the better, some striking emojis carrying the message of veganism. Because today, AFAIK, there are only food-labels as the wavy-barred circled ear of wheat for gluten-free food, or something like a barred glass of milk for dairy-free food. There will be to fix a flaw on designations too, because dairy-free liquids and bifidus-fermented products may be referred to as for example soya-based dairy. 
The extremely precise non-vegan food-emojis that actually exist, need to be counter-balanced by an even greater variety of vegan emojis. Marcel Schneider > Message du 20/06/15 12:44 > De : "William_J_G Overington" > A : "Marcel Schneider" , unicode at unicode.org > Copie ? : > Objet : Re: Announcing The Unicode? Standard, Version 8.0 > > Marcel Schneider wrote: > > ... I?ve become a vegan ... > > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? > > William Overington > > 20 June 2015 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > / > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Sun Jun 21 12:15:10 2015 From: public at khwilliamson.com (Karl Williamson) Date: Sun, 21 Jun 2015 11:15:10 -0600 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> Message-ID: <5586F11E.9040208@khwilliamson.com> On 06/20/2015 03:02 AM, Mark Davis ?? wrote: > > On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler > wrote: > > This results from the fact that the fallback behavior for the > modifiers is > simply as independent pictographic blorts, i.e. the color swatch images. > That is also related to why they are treated as gc=Sk symbol modifiers, > rather than as combining marks or format characters. > > If you *support* emoji modifier sequences, then yes, you should treat > them as single grapheme clusters for editing -- but their behavior is > more akin then to ligatures or conjuncts than to combining character > sequences. You need additional, specific > knowledge about these sequences -- it doesn't just fall out from a > *default* implementation of UAX #29 rules for grapheme clusters. > > > ?Looks like this would be a good FAQ addition...? Yes please > > > > Mark > / > / > /? Il meglio ? l?inimico del bene ?/ > ////// From doug at ewellic.org Sun Jun 21 12:38:09 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 21 Jun 2015 11:38:09 -0600 Subject: International Register of Coded Character Sets Message-ID: Does anyone know what happened to the International Register of Coded Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is the repository for character sets registered for use with ISO 2022. The page was redirected to a general "we've reorganized our site" page a few weeks ago, and now the entire site seems to be down. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From eric.muller at efele.net Sun Jun 21 15:03:17 2015 From: eric.muller at efele.net (Eric Muller) Date: Sun, 21 Jun 2015 13:03:17 -0700 Subject: Help with African characters, please Message-ID: <55871885.7040406@efele.net> Can you help me identify the characters used in the Kulango, Bouna translation of the UDHR? The text is at . Look for article 14. What is the second letter of the word for "article" (after the N, looks like a greek nu), and what is the second letter of the first word (after the M, looks similar but different)? What is the letter that looks somewhat like an epsilon (but compare with the epsilon like in articles 13 and 15)? Thanks, Eric. 
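When the text in question is available as encoded characters rather than only as a rendered page or scan, a quick way to answer this kind of identification question is to dump the code points and names of the word in doubt. The following is a minimal sketch using Python's standard unicodedata module; the sample string is hypothetical and only stands in for the word being examined.

    import unicodedata

    def describe(text):
        # Print each code point with its Unicode name, so look-alike letters
        # (for example U+028B versus U+03BD) are easy to tell apart.
        for ch in text:
            name = unicodedata.name(ch, "<unnamed>")
            print(f"U+{ord(ch):04X}  {name}")

    describe("N\u028B")  # hypothetical sample word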
From everson at evertype.com Sun Jun 21 15:15:14 2015 From: everson at evertype.com (Michael Everson) Date: Sun, 21 Jun 2015 21:15:14 +0100 Subject: Help with African characters, please In-Reply-To: <55871885.7040406@efele.net> References: <55871885.7040406@efele.net> Message-ID: <13D59640-0FE6-4596-82FD-19F43C3C1943@evertype.com> On 21 Jun 2015, at 21:03, Eric Muller wrote: > > Can you help me identify the characters used in the Kulango, Bouna translation of the UDHR? I believe so. > The text is at . Look for article 14. > > What is the second letter of the word for "article" (after the N, looks like a greek nu), and what is the second letter of the first word (after the M, looks similar but different)? U+028B LATIN SMALL LETTER V WITH HOOK > What is the letter that looks somewhat like an epsilon (but compare with the epsilon like in articles 13 and 15)? U+025B LATIN SMALL LETTER OPEN E The variations you see are font variations only. Note that ?y??n w??? occurs in both styles. Michael Everson * http://www.evertype.com/ From frederic.grosshans at gmail.com Sun Jun 21 15:37:28 2015 From: frederic.grosshans at gmail.com (=?UTF-8?B?RnLDqWTDqXJpYyBHcm9zc2hhbnM=?=) Date: Sun, 21 Jun 2015 20:37:28 +0000 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: I don't know if it's what you're looking for but Google brought me to the following URL. https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf I managed to download the pdf without problems. I also successfully downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to check the URLs from the register. Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > Does anyone know what happened to the International Register of Coded > Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is > the repository for character sets registered for use with ISO 2022. > > The page was redirected to a general "we've reorganized our site" page a > few weeks ago, and now the entire site seems to be down. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From shervinafshar at gmail.com Sun Jun 21 16:19:59 2015 From: shervinafshar at gmail.com (Shervin Afshar) Date: Sun, 21 Jun 2015 14:19:59 -0700 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: There are fairly recent copies on Internet Archive as well: https://web.archive.org/web/20150318013320/http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? Shervin On Sun, Jun 21, 2015 at 1:37 PM, Fr?d?ric Grosshans < frederic.grosshans at gmail.com> wrote: > I don't know if it's what you're looking for but Google brought me to the > following URL. > https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf > I managed to download the pdf without problems. I also successfully > downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to > check the URLs from the register. > > Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > >> Does anyone know what happened to the International Register of Coded >> Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is >> the repository for character sets registered for use with ISO 2022. >> >> The page was redirected to a general "we've reorganized our site" page a >> few weeks ago, and now the entire site seems to be down. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO ???? 
>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From duerst at it.aoyama.ac.jp Sun Jun 21 23:09:17 2015 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 22 Jun 2015 13:09:17 +0900 Subject: International Register of Coded Character Sets In-Reply-To: References: Message-ID: <55878A6D.5040501@it.aoyama.ac.jp> On 2015/06/22 05:37, Fr?d?ric Grosshans wrote: > I don't know if it's what you're looking for but Google brought me to the > following URL. > https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf > I managed to download the pdf without problems. I also successfully > downloaded a standard ( http://www.itscj.ipsj.or.jp/iso-ir/169.pdf ) to > check the URLs from the register. I was able to access https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/, but that just says "page not found" in Japanese. Same for https://www.itscj.ipsj.or.jp/ISO-IR/, http://www.itscj.ipsj.or.jp/ISO-IR/, and http://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ (the http versions redirect to the https versions). I left a note on their contact page (https://www.itscj.ipsj.or.jp/contact/index.html), in Japanese. I'll tell you when I hear back from them. If I don't, I'll call them; I remember having done that a few years ago. Regards, Martin. > Le dim. 21 juin 2015 19:41, Doug Ewell a ?crit : > >> Does anyone know what happened to the International Register of Coded >> Character Sets page at http://kikaku.itscj.ipsj.or.jp/ISO-IR/ ? This is >> the repository for character sets registered for use with ISO 2022. >> >> The page was redirected to a general "we've reorganized our site" page a >> few weeks ago, and now the entire site seems to be down. >> >> -- >> Doug Ewell | http://ewellic.org | Thornton, CO ???? >> >> > From mark at macchiato.com Mon Jun 22 03:04:25 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 22 Jun 2015 10:04:25 +0200 Subject: Why aren't the emoji modifiers GCB=Extend? In-Reply-To: <5586F11E.9040208@khwilliamson.com> References: <558480C8.90707@khwilliamson.com> <5584969A.7030406@att.net> <5586F11E.9040208@khwilliamson.com> Message-ID: BTW, Karl, one of our TODOs is to look at the breaking behavior of the emoji sequences.... Mark *? Il meglio ? l?inimico del bene ?* On Sun, Jun 21, 2015 at 7:15 PM, Karl Williamson wrote: > On 06/20/2015 03:02 AM, Mark Davis [image: ?]? wrote: > >> >> On Sat, Jun 20, 2015 at 12:24 AM, Ken Whistler > > wrote: >> >> This results from the fact that the fallback behavior for the >> modifiers is >> simply as independent pictographic blorts, i.e. the color swatch >> images. >> That is also related to why they are treated as gc=Sk symbol >> modifiers, >> rather than as combining marks or format characters. >> >> If you *support* emoji modifier sequences, then yes, you should treat >> them as single grapheme clusters for editing -- but their behavior is >> more akin then to ligatures or conjuncts than to combining character >> sequences. You need additional, specific >> knowledge about these sequences -- it doesn't just fall out from a >> *default* implementation of UAX #29 rules for grapheme clusters. >> >> >> ?Looks like this would be a good FAQ addition...? >> > > Yes please > > >> >> >> Mark >> / >> / >> /? Il meglio ? l?inimico del bene ?/ >> ////// >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: From charupdate at orange.fr Mon Jun 22 10:09:23 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Mon, 22 Jun 2015 17:09:23 +0200 (CEST) Subject: =?UTF-8?Q?Re:_Vegan_and_gluten-avoiding_vegan_emojis_(was:_?= =?UTF-8?Q?Re:_Announcing_The_Unicode=C2=AE_Standard,_Version_8.0)?= In-Reply-To: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> References: <918442708.8056.1434812769021.JavaMail.www@wwinf1h34> Message-ID: <1081888372.11604.1434985763851.JavaMail.www@wwinf2229> On Sat, Jun 20, 2015, William_J_GOverington wrote: > I too am a vegan, in fact a gluten-avoiding vegan. > > Could there be emoji to signal those two diets in descriptions of food please? I replied: > This would be very important, to get more people take the move. Today where everything is emoji-powered, Unicode should encode the sooner the better, some striking emojis carrying the message of veganism. > > Because today, AFAIK, there are only food-labels as the wavy-barred circled ear of wheat for gluten-free food, or something like a barred glass of milk for dairy-free food. There will be to fix a flaw on designations too, because dairy-free liquids and bifidus-fermented products may be referred to as for example soya-based dairy. > > The extremely precise non-vegan food-emojis that actually exist, need to be counter-balanced by an even greater variety of vegan emojis. To greet Mr Overington?s idea I?replied on the spot, but a closer review of the U+1F300???U+1F5FF block reveals to me that the already huge number of vegan food emojis (overweighing today in a 2:1 ratio) could have triggered the demand for meat&cheese emojis. This could be the beginning of an emoji battle between vegan and non-vegan. For the vegan lifestyle on the whole, I think now about encoding some of the many already existing vegan food labels. For the gluten-avoiding vegan diet, the question could then be how to combine both this one and one of the circled and (swung-dash-)barred ears of wheat used in labelling. Perhaps this could be added above right. However, to promote diets and lifestyle, one would probably better prefer the circled GF logo because negation is perhaps not the best idea, it brings a connotation of starvation, while in truth, paradoxically, starvation is the counter-part of meat production when looking at the local populations expropriated of their farms by multinational companies or poisoned by pesticides in the neighborhood where food for our cattle is produced, as well as the end-point of meat&cheese&egg&company because of the serious disease, performance-breakdown, illness and finally prematured death they bring to people who eat them. Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From wjgo_10009 at btinternet.com Wed Jun 24 09:51:21 2015 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Wed, 24 Jun 2015 15:51:21 +0100 (BST) Subject: Summer 2015 Localizable Sentence Concept Assessment Experiment Message-ID: <16615654.42957.1435157481849.JavaMail.defaultUser@defaultHost> Summer 2015 Localizable Sentence Concept Assessment Experiment Please use the Base Character followed by Tags concept to express two localizable sentences so as to facilitate transmission and reception of a message through the language barrier. However, only plane 0 Private Use Area characters are used for base character and tags. 
This is so as to use only Private Use Area characters because the Base Character followed by Tags concept applied to localizable sentences has not at this time been officially accepted, in fact at this time not having been put forward formally for consideration regarding official acceptance either. Also, an all plane 0 initial concept proving may possibly be somewhat easier in practice than a plane 15 concept proving. U+EFFF EXPERIMENTAL LOCALIZABLE SENTENCE BASE CHARACTER U+EE20 .. U+EE7E EXPERIMENTAL TAG CHARACTERS The experimental tag characters are the same meanings as, respectively, the tag characters U+E0020 .. U+E007E of regular Unicode. The experiment needs to provide for at least the following. ---- Enter each sentence from a menu where the sentence is listed in English. Selecting from the menu to cause the Private Use Area codes for the sentence to be included in a message, with the English text not appearing in the message. Transmitting and receiving the message. Decoding the message to produce the message displayed localized into Swedish. ---- The sentences are as follows, shown in English, then the sequence of code point descriptions, then shown in Swedish. ---- Good day. U+EFFF U+EE31 U+EE30 U+EE30 U+EE30 U+EE31 God dag! ---- Best regards, U+EFFF U+EE31 U+EE30 U+EE30 U+EE31 U+EE34 V?nliga h?lsningar, ---- The translations are from the following post by Magnus Bodin. http://www.unicode.org/mail-arch/unicode-ml/y2009-m04/0231.html ---- Just in case the accented characters are displayed wrongly in either the mailing list email or in the archive, please know that there are only two accented characters and that the two accented characters are both the same and are as follows. U+00E4 LATIN SMALL LETTER A WITH DIAERESIS The character is listed in the following document. http://www.unicode.org/charts/PDF/U0080.pdf ---- Glyphs for the two localizable sentences are not necessary for this experiment, but should they be of interest and useful, please find attached an image of the two glyphs, the less complex one, at the left, being for Good day. ---- The following post is mentioned in case it is helpful. http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0196.html ---- As it happens I do not personally at present have the knowledge, skills and facilities to carry out the experiment and prove the concept myself. Alas, there is no prize for participating, yet it is not a competition either. Participation could however potentially have far reaching beneficial advantages for the future of communication through the language barrier. William Overington 24 June 2015 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: glyphs.png Type: image/png Size: 7935 bytes Desc: not available URL: From petercon at microsoft.com Wed Jun 24 10:57:22 2015 From: petercon at microsoft.com (Peter Constable) Date: Wed, 24 Jun 2015 15:57:22 +0000 Subject: moratorium on repeated discussion of rejected topics Message-ID: Dear Sarasvati: There is a new thread on the topic of using characters to give abstract representation of semantic propositions that can be rendered as sentences in various languages - so called "localizable sentences". This idea has been brought up repeatedly over several years now and has gained no traction as having potential for a Unicode encoding proposal. 
Having this topic continually re-opened is tiresome; it's a form of spam on this list, degrading the experience for all who come to the list to discuss reasonable proposals or to get help with real usage scenarios. I wonder if you might want to consider putting a moratorium on further discussion of this topic. Peter -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Wed Jun 24 11:16:49 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 24 Jun 2015 18:16:49 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: On Wed, Jun 24, 2015 at 5:57 PM, Peter Constable wrote: > There is a new thread on the topic of using characters to give abstract > representation of semantic propositions that can be rendered as sentences > in various languages - so called "localizable sentences". This idea has > been brought up repeatedly over several years now and has gained no > traction as having potential for a Unicode encoding proposal. Having > this topic continually re-opened is tiresome; it's a form of spam on this > list, degrading the experience for all who come to the list to discuss > reasonable proposals or to get help with real usage scenarios. I wonder if > you might want to consider putting a moratorium on further discussion of > this topic. > I strongly agree. It is simply a waste of time. (Even though I have blacklisted Overington on my email, I still get other people responding to him.) Mark -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Wed Jun 24 11:30:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 24 Jun 2015 18:30:22 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: I agree, but this thread just restarted because the very active encoding of emojis creates such an opportunity to encode some ideas/words with symbols (though these symbols are just symbols: they have no grammar and do not attempt to represent full text, they are just pictorial substitutes for what they represent directly). Emojis are a sort of reintroduction of ideograms (but without simplifying them with counted strokes, or reducing them to something drawn with a brush and a single ink, or reducing them to single syllables as in Chinese: emojis are true ideograms, just like prehistoric inscriptions, and contain a lot of pictorial art and offer wide-open creativity, much more than conventional glyphs for letters or syllables). The other difference is that emojis are actively supported by vendors and by many users in the world, profiting from the fact that some instant messaging protocols allowed inserting small bitmap icons. Vendors then wanted to support these on larger ranges of devices as well, using different resolutions (or an absence of colors, something rare now). For some applications like SMS and Twitter, using icons was too costly, so they wanted a more compact representation (one that did not require shifting to costly MMS or posting URLs hosted on random hosts, with their security and privacy problems). It's natural that emojis came first from Asia (hence their name), where the creation of sinograms is still very active, but with glyphs that are difficult to interpret by most readers. They wanted more attractive ideograms that everybody could read, notably on the social media where they are targeting the masses that don't want to learn a new language.
2015-06-24 17:57 GMT+02:00 Peter Constable : > Dear Sarasvati: > > > > There is a new thread on the topic of using characters to give abstract > representation of semantic propositions that can be rendered as sentences > in various languages ? so called ?localizable sentences?. This idea has > been brought up repeatedly over several years now and has gained no > traction as having potential for a Unicode encoding proposal. To having > this topic continually re-opened is tiresome; it?s a form of spam on this > list, degrading the experience for all who come to the list to discuss > reasonable proposals or to get help with real usage scenarios. I wonder if > you might want to consider putting a moratorium on further discussion of > this topic. > > > > > > > > Peter > > > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From root at unicode.org Wed Jun 24 11:37:04 2015 From: root at unicode.org (Sarasvati) Date: Wed, 24 Jun 2015 11:37:04 -0500 Subject: moratorium on repeated discussion of rejected topics Message-ID: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> By popular and repeated request, a moratorium is hereby declared on discussion of so-called "localizable sentences". Please do not respond any further on that topic. If you have additional comments, you are welcome to e-mail privately. Your, -- Sarasvati From mpsuzuki at hiroshima-u.ac.jp Wed Jun 24 11:38:27 2015 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Thu, 25 Jun 2015 01:38:27 +0900 Subject: ["Unicode"] Re: moratorium on repeated discussion of rejected topics In-Reply-To: References: Message-ID: <558ADD03.8000803@hiroshima-u.ac.jp> > They wanted more attractive > ideograms that everybody could read, notably on the social medias where > they are targetting the mass that don't wnat to learn a new language. Who they are? Regards, mpsuzuki Philippe Verdy wrote: > I agree, but this thread just restarted because the very active encoding of > emojis creates such opporutnity to encode some ideas/words with symbols > (though these symbols are just symbols and have no grammar and do not > attempt to represent full text, they are just pictural substitutes for what > they represent directly). > > Emojis are sort of reintroducting of ideograms (but not simplifying them > with counted strokes or reducing them to be dran with a brish and single > ink or reducing them to single syllables as in Chinese: emojis are true > ideograms, just like prehistoric inscriptions, and contain a lot of > pictural art and offer a wide-open creativity, much more than conventional > glyphs for letters or syllables). > > The other iddiference is that emojis are actively supported by vendors and > by many users in the world, profiting the fact that some instant messaging > protocols allowed inserting small bitmap icons. Vendors wanted then to > support these also on larger ranges of devices using different resolutions > (or absence of colors, something rare now). For some applications like SMS > and Twitter, using icons was too costly they wanted a more compact > representation (that did not require shifting to costly MMS or posting URLs > hosted on random hosts, with security and privacy problems). > > It's natural that emojis came first from Asia (hence their name), where the > creation of sinograms is still very active, but with glyphs that are > difficult to interpret by most readers. 
They wanted more attractive > ideograms that everybody could read, notably on the social medias where > they are targetting the mass that don't wnat to learn a new language. > > 2015-06-24 17:57 GMT+02:00 Peter Constable : > >> Dear Sarasvati: >> >> >> >> There is a new thread on the topic of using characters to give abstract >> representation of semantic propositions that can be rendered as sentences >> in various languages ? so called ?localizable sentences?. This idea has >> been brought up repeatedly over several years now and has gained no >> traction as having potential for a Unicode encoding proposal. To having >> this topic continually re-opened is tiresome; it?s a form of spam on this >> list, degrading the experience for all who come to the list to discuss >> reasonable proposals or to get help with real usage scenarios. I wonder if >> you might want to consider putting a moratorium on further discussion of >> this topic. >> >> >> >> >> >> >> >> Peter >> >> >> >> >> >> >> > From verdy_p at wanadoo.fr Wed Jun 24 12:09:05 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 24 Jun 2015 19:09:05 +0200 Subject: moratorium on repeated discussion of rejected topics In-Reply-To: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> References: <201506241637.t5OGb4Zf008948@sarasvati.unicode.org> Message-ID: I have NEVER actively supported the "localizable sentences". Only one user wanted to discuss it here and I gave him my same opinion repeatedly. In fact you may even have also used my own opinion as one (among others) wanting to stop discussing this topic. But if you want my opinion, there's also really too much discussions about emojis. They are however encoded due to popular demand and very active and demonstrated usage (in conformance with the encoding policy), unlike what William Overrington posts here instead of an appropriate online community for people really interested in developing his project. I have always seen the posts by William Overrington on this list being a real form of "free" advertizing (trying to advertize his own web site). William should better find a Usenet group, or create his own Yahoo group, or social Facebook/Twitter group, or Github group, or similar and advertize it there. 2015-06-24 18:37 GMT+02:00 Sarasvati : > By popular and repeated request, a moratorium is hereby declared > on discussion of so-called "localizable sentences". > > Please do not respond any further on that topic. If you have > additional comments, you are welcome to e-mail privately. > > Your, > -- Sarasvati > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Wed Jun 24 12:47:42 2015 From: doug at ewellic.org (Doug Ewell) Date: Wed, 24 Jun 2015 10:47:42 -0700 Subject: Old Hungarian font Message-ID: <20150624104742.665a7a7059d7ee80bb4d670165c8327d.2e00205992.wbe@email03.secureserver.net> Now that Old Hungarian is encoded in Unicode, is anyone aware of a font (freely available or not) that supports it, or of plans by anyone to develop one? I'm not looking for a font that maps OH to the ASCII range, such as the original Csenge. I've already tried the major search engines and the well-known font pages, such as Alan Wood and SIL and Wazu Japan. Please send a link only if you've already confirmed there is an OH font there. Thanks, -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
From public at khwilliamson.com Wed Jun 24 15:03:09 2015 From: public at khwilliamson.com (Karl Williamson) Date: Wed, 24 Jun 2015 14:03:09 -0600 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <558493EB.4000807@att.net> References: <55847B92.3020201@khwilliamson.com> <558493EB.4000807@att.net> Message-ID: <558B0CFD.2030008@khwilliamson.com> On 06/19/2015 04:12 PM, Ken Whistler wrote: > Karl, > > As usual, the situation is way more complicated that perhaps it has any > business > being! > > It isn't just Version 1 Hangul that have to be considered, but also > Version 1.1 Hangul. > > Version 1.0 contained 2350 Hangul syllables, encoded in the range > 3400..3D2D. > > Version 1.1 contained 6646 Hangul syllables, encoded in the range > 3400..3D2D > and a distinct new range 3D2E..4DFF. It thus added 4306 to what was in > Version 1.0 already. > > Version 2.0 (and all subsequent versions) contained the 11172 Hangul > syllables we now see, encoded in the range AC00..D7A3. Version 2.0 > *deleted* all the Hangul syllables in the range 3400..4DFF. > > You also need to pay attention to the history of the encoding of jamo. > > Version 1.0 contained 94 "Hangul Elements", encoded in the range > 3131..318E. > > Version 1.1 retained the same 94 "Hangul Letters" in the range 3131..318E. > Version 1.1 added 240 conjoining jamo letters in the range 1100..11F9. > > Version 2.0 retained both of those sets. > > O.k., now what were those various chunks? > > The Unicode 1.0 set of 2350 was encoded for compatibility with KS C > 5601-1987. > They were given no formal decompositions (the concept didn't yet exist), > but > the implication in the standard was essentially that Hangul syllables could > just be spelled out with jamo letter sequences. The details were an > exercise > for implementation, however, and were soon overtaken by events in > the Unicode/10646 merger. > > The Unicode 1.1 set of 4306 additions came from the 10646 merger work, > and comprised two actual subsets: > > Hangul Supplementary Syllables A (1930 modern syllables) from KS C > 5659-1990. > (See the Unicode 1.1 subrange: 3D2E..44BD.) > > Hangul Supplementary Syllables B (2376 old Korean syllables) from KS C > 5657-1991. > (See the Unicode 1.1 subrange: 44BE..4DFF.) > > *All* of the Unicode 1.1 Hangul syllables were given decompositions. > (Although the formalization of Unicode normalization did not yet exist.) > The decompositions can be see in UnicodeData-1.1.5.txt. Because the > syllables were then encoded in three "alphabetical" extents, with a few > stragglers tucked > on, the decompositions were not algorithmically defined -- they were just > enumerated in the data file. The decompositions involved the new set of > conjoining jamo letters, rather than the older set, which were relegated > to compatibility mapping status. > > The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C > 5601-1992. > That was an algorithmically designed replacement of the earlier sets from > Korean standards -- designed to cover all modern syllables algorithmically, > by putting all the combinations of initial, medial and final jamos in a > defined > alphabetical order, whether or not each syllable that resulted was actually > attested in modern Korean use or not. 
Does this mean the original 2 standards (KS C 5601-1987 and KS C 5657-1991) fell into disuse (or perhaps never were actually used) so there was no need to map the new code points to them (hence no round-trip defined)? > > There was an enormous hullabaloo at the time, of course, about the changes > required to switch over from the old ranges to the new set. But the whole > shebang was balloted as Amendment 5 to ISO/IEC 10646-1:1993, and when > that ballot passed, Unicode adopted the change wholesale into the > documentation and data files for Unicode 2.0, to stay in synch. > > But "The Korean Mess", as it was then known, led directly to the > determination > by both SC2 and the UTC that such re-encoding of already standardized > and published characters was enormously damaging to both standards. > It was also expensive to the early implementers: Oracle, for example, long > maintained distinct database support for the Unicode 1.1 Korean, which was > incompatible with the Unicode 2.0 Korean. > > In any case, if anybody has any lingering questions about why the following > policy exists and is *strictly* enforced: > > http://www.unicode.org/policies/stability_policy.html#Encoding > > or why the applicable version for that stability policy is 2.0+, the > answer is > that it was a direct reaction to "The Korean Mess". > > --Ken > > On 6/19/2015 1:29 PM, Karl Williamson wrote: >> I haven't found any information on this. It can't just be a >> transliteration difference, because the number of code points is >> vastly different between them. >> >> Is it the case that the version 1 syllables is a failed abstraction >> that was replaced by the later versions? > > From kenwhistler at att.net Wed Jun 24 15:20:20 2015 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 24 Jun 2015 13:20:20 -0700 Subject: trying to understand the relationship between the Version 1 Hangul syllables and the later versions' In-Reply-To: <558B0CFD.2030008@khwilliamson.com> References: <55847B92.3020201@khwilliamson.com> <558493EB.4000807@att.net> <558B0CFD.2030008@khwilliamson.com> Message-ID: <558B1104.8010604@att.net> No, there were in fact round-trip mappings defined (and used) at the time. See, e.g.: http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/OLD5601.TXT which shows the Unicode 1.1 <--> KS C 5601-1987 mappings for the old range of Unicode 1.1 Hangul syllables 3400..3D2D. http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT shows the updated mappings for the complete johab set for Unicode 2.0 to an EUC encoding of KS C 5601-1992. I'm not sure about the details of the implementation of KS C 5657-1991. Somebody more familiar with the churn in Korean standards from the early 1990's might know, however. --Ken On 6/24/2015 1:03 PM, Karl Williamson wrote: > On 06/19/2015 04:12 PM, Ken Whistler wrote: >> >> >> The Unicode 2.0 set of 11,172 was known as the "Johab" set from KS C >> 5601-1992. >> That was an algorithmically designed replacement of the earlier sets >> from >> Korean standards -- designed to cover all modern syllables >> algorithmically, >> by putting all the combinations of initial, medial and final jamos in a >> defined >> alphabetical order, whether or not each syllable that resulted was >> actually >> attested in modern Korean use or not. > > Does this mean the original 2 standards (KS C 5601-1987 and KS C > 5657-1991) fell into disuse (or perhaps never were actually used) so > there was no need to map the new code points to them (hence no > round-trip defined)? 
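The arithmetic behind the "algorithmically designed" Johab arrangement described above fits in a few lines, which is why no per-syllable mapping table is needed for the 11,172 code points. A minimal C sketch of the composition formula from the standard follows; the constant and function names here are only illustrative:

    #include <stdio.h>

    /* Composition of a modern Hangul syllable from conjoining-jamo indices,
       as used for the Unicode 2.0 "Johab" block U+AC00..U+D7A3.
       L, V, T are the leading-consonant, vowel and trailing-consonant
       indices; T == 0 means "no trailing consonant". */
    enum { SBase = 0xAC00, VCount = 21, TCount = 28 };

    static unsigned int compose_syllable(unsigned int L, unsigned int V, unsigned int T)
    {
        return SBase + (L * VCount + V) * TCount + T;
    }

    int main(void)
    {
        /* L=18 (HIEUH), V=0 (A), T=4 (NIEUN) gives U+D55C, the syllable HAN */
        printf("U+%04X\n", compose_syllable(18, 0, 4));
        return 0;
    }

Decomposition is just the inverse arithmetic (divide and take remainders by TCount and VCount), which is what distinguishes this set from the enumerated decompositions of the Unicode 1.1 ranges.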
From charupdate at orange.fr Fri Jun 26 05:48:39 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Fri, 26 Jun 2015 12:48:39 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
Message-ID: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>

I've got a problem with the word joiner and would ask anybody if things could be changed please. After two examples, I'll outline the issue.

To do traditional French typography on the PC, a justifying no-break space is needed along with the colon, because this punctuation must be placed in the middle between the word it belongs to and the following word. According to the Standard, page 799 (§ 23.2), such a space is obtained by bracketing a white space with word joiners: U+2060 U+0020 U+2060. To make this colon readily available on the keyboard, I should therefore program the sequence:

{VK_OEM_2 /*T34 B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

Still in French, the letter apostrophe, when used as the current apostrophe, prevents the following word from being identified as a word because of the missing word boundary and, subsequently, prevents the autoexpand from working. This can be fixed by adding a word joiner after the apostrophe, thanks to an autocorrect entry that replaces U+02BC, inserted by default in typographic mode, with U+02BC U+2060. (About why to use U+02BC, even in French, please refer to the preceding thread "A new take on the English Apostrophe in Unicode". I'll just add now that without disambiguating apostrophes and close-quotes, any search for quotations, e.g. to mark them up, using the generic character * bracketed like '*', must fail because results are cut at the next apostrophe instead of extending to the closing quote.)

However, despite the word joiner having been encoded and recommended since version 3.2 of the Standard, it is still not implemented on Windows 7. Therefore I must use the traditional zero width no-break space U+FEFF instead.

In TUS, sections 23.2 (page 799) and 23.8 (pages 821 sqq), we are taught that for the semantics of word joining, U+2060 is strongly preferred, but U+FEFF must still be supported for backward compatibility. As well, it results from § 23.8 that in careful text processing, U+FEFF occurs only at the very beginning of text files when used as a byte order mark (page 822), while applications where Unicode has been carefully implemented are expected to always mention the charset and the transformation format the files are written in, and don't need U+FEFF as a BOM. Therefore, it seems that U+FEFF can still be used as a ZWNBSP in *new* text files, despite its use being strongly discouraged and U+2060 being preferred.

Supposing that Microsoft chose not to implement U+2060 WJ because quitting the usage of U+FEFF ZWNBSP appeared needless and would have brought much trouble for no use (or at least, not much), please permit me to ask whether Unicode couldn't follow Microsoft once again and remove the recommendation of U+2060. Most people just *can't* use this character, and keyboard implementations *must* avoid it.

Best regards,

Marcel Schneider

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From tomasek at etf.cuni.cz Fri Jun 26 06:02:43 2015 From: tomasek at etf.cuni.cz (Petr Tomasek) Date: Fri, 26 Jun 2015 13:02:43 +0200 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> Message-ID: <20150626110243.GB18139@ebed.etf.cuni.cz> On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. Therefore you should complain by Microsoft, not here. > Supposing that Microsoft choose not to implement U+2060?WJ Then you should probably choose another operating system which does... Petr Tomasek From charupdate at orange.fr Fri Jun 26 06:16:16 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 26 Jun 2015 13:16:16 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150626110243.GB18139@ebed.etf.cuni.cz> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: <988090788.6459.1435317376570.JavaMail.www@wwinf2229> On Fri, Jun 26, Petr Tomasek wrote: > Therefore you should complain by Microsoft, not here. U+FEFF works fine for me, no complaint from me now except about recommendations... > Then you should probably choose another operating system which does... You know, the issue is about keyboard layouts, not about me. Thanks for your advice... Regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.muller at efele.net Fri Jun 26 09:10:35 2015 From: eric.muller at efele.net (Eric Muller) Date: Fri, 26 Jun 2015 07:10:35 -0700 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> Message-ID: <558D5D5B.506@efele.net> An HTML attachment was scrubbed... URL: From charupdate at orange.fr Fri Jun 26 09:44:44 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Fri, 26 Jun 2015 16:44:44 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <558D5D5B.506@efele.net> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <558D5D5B.506@efele.net> Message-ID: <1683796058.19352.1435329884593.JavaMail.www@wwinf1m18> On Fri, Jun 26, 2015, Eric Muller wrote: > On 6/26/2015 3:48 AM, Marcel Schneider wrote: >> To do traditional French typography on the PC, > or anywhere You want to say, on any computer. >> a justifying no-break space is needed along with the colon, because this punctuation must be placed in the middle between the word it belongs to and the following word. > Actually, it's non-justifying and it's thin. U+202F ??? NARROW NO-BREAK SPACE is your friend. U+202F is a very good friend of mine, and it's a part of ready sequences with all spaced French punctuations (;:?!??) I program for the keyboard driver, as well as with U+00A0 for use with monospaced fonts (or following user preferences, since word processors got habits with NBSP). That are things everybody knows. Right now, I'm talking about *traditional French typography* on a computer. And I'm talking about the *colon*. As you can read in old style manuals and as I know from more recent sources and from authoritative examples, things must work just as I wrote a couple of hours ago. 
Love it or hate it, you should provide the facility.

Thank you for the advice.

Regards,

Marcel Schneider

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From richard.wordingham at ntlworld.com Fri Jun 26 13:28:34 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 26 Jun 2015 19:28:34 +0100
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
Message-ID: <20150626192834.701021ff@JRWUBU2>

On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:

> To do traditional French typography on the PC, a justifying no-break
> space is needed along with the colon, because this punctuation must
> be placed in the middle between the word it belongs to and the
> following word. According to the Standard, page 799 (§ 23.2), such a
> space is obtained by bracketing a white space with word joiners:
> U+2060 U+0020 U+2060. To make this colon readily available on
> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

For readability, I strongly recommend 0x0020 over ' ' in this context.

What is the behavioural difference between and U+00A0? However, if you reread the section, you will see that the sequence they have in mind is .

> Still in French, the letter apostrophe, when used as current
> apostrophe, prevents the following word from being identified as a
> word because of the missing word boundary and, subsequently, prevents
> the autoexpand from working. This can be fixed by adding a word
> joiner after the apostrophe, thanks to an autocorrect entry that
> replaces U+02BC inserted by default in typographic mode, with U+02BC
> U+2060.

No, this doesn't work. While the primary purpose of U+2060 is to prevent line breaks, it is also used to overrule word boundary detectors in scriptio continua. (It works quite well for spell-checking Thai in LibreOffice). Its name implies to me that it is intended to prevent a word boundary being deduced, through the strong correlation between word boundaries and line break opportunities. There doesn't seem to be a code for 'zero-width word boundary at which lines should not normally be broken'.

Richard.

From verdy_p at wanadoo.fr Fri Jun 26 15:16:48 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 26 Jun 2015 22:16:48 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1474556907.20113.1435331416802.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <179395084.6523.1435317700283.JavaMail.www@wwinf2229> <1474556907.20113.1435331416802.JavaMail.www@wwinf1m18>
Message-ID: 

When I replied I was using a smartphone (the PC was busy with an update). "trait DD jonction" should have been "trait de jonction" (joining stroke): in Arabic and Devanagari, where the letters are joined, these joining strokes are needed, and they are also needed for cursive Latin writing. Only the Arabic joining stroke is encoded in Unicode (for compatibility with old coded character sets, but Arabic rendering engines use the mapping of this character and the metrics of the associated glyph in the fonts to position the joining stroke correctly, juxtaposing it, partially overlapping it, or truncating it).

The Imprimerie nationale does not even itself respect this pseudo-rule allowing thin spaces to be justified.
Newspapers, magazines and printed editions use fixed-width thin spaces. On the other hand, microjustification (letter-spacing in CSS) applies once the normal justification of the justifiable spaces has reached a maximum (it can also be used negatively when the maximum is exceeded but could be avoided by slightly tightening the tracking between all the characters; in practice this does not go below -0.15 em, otherwise undesirable collisions occur). It makes it possible to distribute the remaining width evenly between the characters (including the justifiable spaces that have already been enlarged to their maximum without micro-justification).

In the days of lead type, the thin space was either integrated into the punctuation characters or was a piece of lead type like the others: you started by setting everything that fits on a line between the composing rules, then you inserted the wedge rule into the spaces between words and pressed vertically. If the wedges were already pushed to their maximum, they were replaced by fixed-width space characters, and the justification was redone by using the wedge rule to insert wedges between all the characters of the line, including the fixed-width space characters. Microjustification was then in effect. For negative microjustification, the characters were in fact replaced by narrower characters, or by characters whose internal set width was specially reduced, to the point that if the letters had been juxtaposed they would have ended up joined by the inking on the paper.

Press and book publishers in France all use fixed-width thin spaces in their composition engines (even the Imprimerie nationale, which does not do without standard software either and also works for other publishers; when it has to reproduce works, it has to respect their form). The Imprimerie nationale is not free of typographical errors in its own editions either, and in fact it does not always follow its own "rules", which are only suggestions. There have always been other publishers just as attached to typography and its tradition. But a "fine" (thin space) that would be justifiable does not correspond to the tradition at all. I know what I am talking about, having worked with almost all of the press and book publishers in France (and also, in part, in other European countries, in North America and in the Middle East) and most advertising agencies, including the professional trade press (that only leaves the small advertising-communication publishers who merely reproduce and distribute the proofs requested by their clients, large and small, mostly for leaflets and advertising flyers, or business forms, or individuals for their invitation cards, or restaurants for their menus: there it is the client who decides what they want, even if it is ugly or contrary to certain "official" usages... The press is free and can do without the official rules, and even the administrations each do what they want).

So there is no need for a "zero-width (dis)joiner" in this case to be inserted in addition to a narrow no-break space and a colon, or worse, to appear twice.
The U+202F thin space is sufficient even for the case of microjustification (letter-spacing in CSS).

On 26 June 2015 at 17:10, Marcel Schneider wrote:

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From charupdate at orange.fr Sat Jun 27 10:48:41 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 27 Jun 2015 17:48:41 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
Message-ID: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>

On Fri, Jun 26, Richard Wordingham wrote:

> On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote:
>> To do traditional French typography on the PC, a justifying no-break
>> space is needed along with the colon, because this punctuation must
>> be placed in the middle between the word it belongs to and the
>> following word. According to the Standard, page 799 (§ 23.2), such a
>> space is obtained by bracketing a white space with word joiners:
>> U+2060 U+0020 U+2060. To make this colon readily available on
>> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
>> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }

> For readability, I strongly recommend 0x0020 over ' ' in this context.

I pasted the line from the C source, where all ASCII characters, including 0x20, are written in clear. To ensure readability, I inserted a line break before this line. This line break must have been deleted. I don't write 0x0020 in C when it's not necessary. However, I take notice of your recommendation.

> What is the behavioural difference between and U+00A0?

The difference appears in word processing, where justification works with U+0020, while all other spaces, including U+00A0, are not justified.

> However, if you reread the section, you will see that the sequence they have in mind is .

The section I cited reads as follows: "The word joiner can be used to prevent line breaking with other characters that do not have nonbreaking variants, such as U+2009 thin space or U+2015 horizontal bar, by bracketing the character." I don't believe that U+2009 is a specific target character rather than a mere example. IMHO you can bracket with U+2060s whatever character you need.

>> Still in French, the letter apostrophe, when used as current
>> apostrophe, prevents the following word from being identified as a
>> word because of the missing word boundary and, subsequently, prevents
>> the autoexpand from working. This can be fixed by adding a word
>> joiner after the apostrophe, thanks to an autocorrect entry that
>> replaces U+02BC inserted by default in typographic mode, with U+02BC
>> U+2060.

> No, this doesn't work. While the primary purpose of U+2060 is to prevent line breaks, it is also used to overrule word boundary detectors in scriptio continua. (It works quite well for spell-checking Thai in LibreOffice). It's name implies to me that it is intended to prevent a word boundary being deduced, through the strong correlation between word boundaries and line break opportunities. There doesn't seem to be a code for 'zero-width word boundary at which lines should not normally be broken'.

Well, I extrapolated from U+FEFF, which works fine for me, even in this particular context. The fact that U+2060 does not work is another reason not to use it, and all the more I agree with Microsoft, which did not implement U+2060 in Windows 7. Do you have any news about whether U+2060 is a part of at least one font on Windows 8?

Marcel Schneider
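For anyone wanting to experiment with the two sequences being compared in this thread, here is a minimal C sketch, not an actual Windows keyboard-driver table, with illustrative array names only: one colon bracketed with U+2060 WORD JOINER and one with the legacy U+FEFF ZWNBSP as a fallback.

    #include <stdio.h>
    #include <wchar.h>

    /* The "justifying no-break space + colon" sequence discussed above,
       once bracketed with U+2060 WORD JOINER and once with the legacy
       U+FEFF ZERO WIDTH NO-BREAK SPACE as a fallback. */
    static const wchar_t colon_wj[]   = { 0x2060, 0x0020, 0x2060, L':', 0 };
    static const wchar_t colon_feff[] = { 0xFEFF, 0x0020, 0xFEFF, L':', 0 };

    int main(void)
    {
        wprintf(L"WJ-bracketed colon:     %u code units\n", (unsigned)wcslen(colon_wj));
        wprintf(L"ZWNBSP-bracketed colon: %u code units\n", (unsigned)wcslen(colon_feff));
        return 0;
    }

Whether a renderer actually treats the bracketed space as non-breaking and still justifying is up to its line-breaking and justification implementation, which is precisely what the thread is arguing about.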
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 12:33:44 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 19:33:44 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
Message-ID: 

2015-06-27 17:48 GMT+02:00 Marcel Schneider :

> On Fri, Jun 26, Richard Wordingham wrote:
>
> > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider <charupdate at orange.fr> wrote:
> >> To do traditional French typography on the PC, a justifying no-break
> >> space is needed along with the colon, because this punctuation must
> >> be placed in the middle between the word it belongs to and the
> >> following word. According to the Standard, page 799 (§ 23.2), such a
> >> space is obtained by bracketing a white space with word joiners:
> >> U+2060 U+0020 U+2060. To make this colon readily available on
> >> keyboard, I should therefore program the sequence: {VK_OEM_2 /*T34
> >> B09*/ ,3 ,0x2060 ,' ' ,0x2060 ,':' ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE ,NONE }
>
> > For readability, I strongly recommend 0x0020 over ' ' in this context.
>
> I pasted the line from the C source, where all ASCII characters, including
> 0x20, are written in clear. To ensure readibility, I inserted a line break
> before this line. This line break must have been deleted. I don't write
> 0x0020 in C when it's not necessary. However I take notice of your
> recommendation.
>
> > What is the behavioural difference between and U+00A0?
>
> The difference appears in word processing, where justification works with
> U+0020, while all other spaces, including U+00A0, are not justified.

This is untrue: U+00A0 is not a fixed-width space, and it remains justifiable as well. However, it has a default width (when not justified) which is too large for the "fine" we want (its width and justifiability are exactly like the regular SPACE U+0020, the only difference being that lines normally do not break before or after it).

That's why there's NNBSP U+202F, whose default width when not justified is narrower (about one half of the regular 0.5 em SPACE, or one third in English typography, i.e. between 1/6 and 1/4 em: the NNBSP glyph should be sized by default to about 1/5 em (0.2 em) to work with both conventions, unless the language can be determined).

The "hair space" is even thinner (about 0.1 em, or nearly 1 px in CSS sizes with the default font size used in HTML of 13 pt at 96 logical dpi; on HiDPI displays or in zoomed-in modes working at higher physical resolutions, the 96 logical dpi of CSS map to a dppx equal to several times the logical dpi, and there will be more than one physical pixel: the CSS pixel unit is a logical one, and the 13 pt default font maps to about 17.3 logical pixels, so the hair space is then about 1/17 of the em square width, but it is generally rounded up to 1/10 em; the exact metric depends on several rendering factors and visual hints, but it should be the minimum distance that separates two dots and keeps them visibly and contrastingly separated, without blurring; it is about the same distance that separates the dot of an "i" from its vertical stem).

NNBSP, however, will normally not be justified: its width remains constant while NBSP will be expanded like other spaces.
This also makes NBSP not suitable for French punctuations and group separators, as it could be really too large (just like the regular SPACE). As a group separator, some argue that this space should be replaced by the punctuation space (to match the width of the comma used as the decimal separator in French). But traditionally, the metal typographers just did not make this discrimination, which was only introduced on computers due to legacy software.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From nslater at tumbolia.org Sat Jun 27 12:26:22 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Sat, 27 Jun 2015 17:26:22 +0000
Subject: Adding RAINBOW FLAG to Unicode
Message-ID: 

Hello!

It is Pride Month and the US just legalised queer marriage in every state. No better time to start a conversation about including the internationally recognised rainbow flag in Unicode!

Here's some background reading on the flag itself:

https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement)

Here's Bustle on the inclusion of the rainbow flag:

> Nearly 40 years after it was first flown, the rainbow flag remains a powerful and potent symbol of not only current gay rights struggles, but the history of gay rights in America. So why isn't it available as an emoji? The flag is in the public domain, so it certainly isn't being held up by copyright issues. And the current range of rainbow-related emoji show that the technology to jam all those colors distinctly into a very tiny space is available. Numerous national flags have been emojified. And given that the flag has recently been added to the Museum of Modern Art's design collection, everyone is in agreement about its ongoing cultural significance. So what gives?

http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our

This article also includes an example (via screenshot) of how many people "make do" without the rainbow flag. Typically, they use U+1F308 RAINBOW. This can be seen by searching on Twitter (or any other social media platform) for that character.

Indeed, GitHub uses RAINBOW for this:

http://i.imgur.com/KaKQzIC.png

Facebook did the same sort of thing, as seen here:

http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/

They also did this:

http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/

These emojis are *derivative* of the rainbow flag, or include characters displaying the rainbow flag.

While it can be argued that the RAINBOW emoji itself is usable as a stand-in (as above), it usually requires some sort of additional context to work. There is a clear need for a rainbow flag that unambiguously symbolises queer pride.

This is already going on, with some platforms choosing to use a custom emoji shim where no Unicode code-point exists.

This is Twitter's rainbow flag:

https://twitter.com/ericajoy/status/614822988609794048

Screenshot: http://i.imgur.com/1kewdN1.png

Slack has one too:

https://twitter.com/SlackHQ/status/602779337784430592

Screenshot: http://i.imgur.com/8cOK8MH.png

Reddit also offers one:

http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/

Screenshot: http://i.imgur.com/p6YDRkF.png

In all three examples, the symbol is being used in running text.

I found this:

> [...] the UTC does not wish to entertain further proposals for encoding of symbol characters for flags, whether national, state, regional, international, or otherwise. References to UTC Minutes: [134-C2], January 28, 2013.
http://www.unicode.org/alloc/nonapprovals.html

I looked up the minutes, but could not find a more detailed explanation. My guess is that these concerns related to geopolitical issues. Hopefully the same rationale does not apply to the rainbow flag.

Looking at:

http://unicode.org/reports/tr51/#Selection_Factors

Here's a quick list of summary answers:

a. Compatibility: yes. There are existing platform-specific rainbow flag emojis, as demonstrated above. To build a Twitter or Slack client that replicated the native functionality, you would have to use an image instead of a Unicode code point.

b. Expected usage level: the rainbow emoji is listed at #168 on emojitracker.com, and as demonstrated, the rainbow flag has been in wide use since the 1970s.

c. Image distinctiveness: the rainbow flag is visually distinct.

d. Disparity: the rainbow flag is a missing flag.

e. Frequently requested: unsure. I could organise a petition if this would help to sway the decision.

f. Generality: the rainbow flag is not overly specific. Indeed it is the most general of all the pride flags.

g. Open-ended: the rainbow flag is open ended, being the most general of all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols page, but there are many more in the wild.)

h. Representable already: a rainbow can be represented, but it is ambiguous. The RAINBOW emoji cannot be combined with anything pictorial that makes the meaning clear. Context is required, such as pairing it with the word "pride".

i. Logos, Brands, UI icons, signage, specific people, deities: the image is suitable for encoding as a character.

What is the best thing for me to do next?

My proposal is that we add RAINBOW FLAG to Unicode, and that we use the "six-color version popular since 1979".

I only found one official proposal for a single emoji:

http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf

I couldn't find any templates for proposals, though I did look through a number of different examples.

I noticed that a number of them include the ISO/IEC form at the end. Can someone explain that to me? Does it make sense to submit a proposal to the UTC without one of these?

I also notice that it looks like I have to provide (or find a person to provide) a font for the character. Is there any guidance on that? I am happy to pay someone to prepare such a thing for me.

Thank you in advance for your help.

Noah Slater

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 13:49:39 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 20:49:39 +0200
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To: 
References: 
Message-ID: 

2015-06-27 19:26 GMT+02:00 Noah Slater :

> c. Image distinctiveness: the rainbow flag is visually distinct.

Not so distinct from several other former rainbow flags used in South America.

In fact the number of colours in the rainbow varies culturally depending on countries, even when it is intended to refer to the LGBT communities. The exact list of colors is not really fixed, and the number of bands varies between 6 and 7: in the US, it generally has 6 bands.
But in France it frequently has 7 bands, adding fuschia/magenta after violet, because traditionally rainbows are described and drawn in France with 7 colors; the exact tints also vary, notably the lightness of blue and green which may be lighter as lime and skyblue or royal blue, violet becoming sometimes dark blue, and the last one magenta/fuschia becoming sometimes rose, the initial red being also frequently darker than in US where it is in fact more orange than red, and where US orange is nearly gold. The presence also of an cyan/aquamarine band between blue and green bands is also common (with a reduced contrast between the yellow and green, using a lighter shade of green), or simply the light cyan/aqua, or sky blue, replaces the darker blue band. In fact as long as it locally unambiguously represents a rainbow, it is accurate (there's no legal authority defining or restricting its definition, this is not a national emblem anywhere, except for the Jewish community in Russia where the rainbox is a large horizontal one with thin bands over a white flag). The dimensions/proportions are also not fixed: flags are just scaled to fit well with other flags or symbols. In many countries and events, the rainbow flag is displayed along with other national or regional flags. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sat Jun 27 14:06:05 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:06:05 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Nothing really needs to be added to Unicode; vendors could already use: ????? U+1F3F3, U+200D, U+1F308 WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW credit to Shervin for the idea Mark *? Il meglio ? l?inimico del bene ?* On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > Hello! > > It is Pride Month and the US just legalised queer marriage in every state. > No better time to start a conversation about including the internationally > recognised rainbow flag in Unicode! > > Here?s some background reading on the flag itself: > > *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) > * > > Here's Bustle on the inclusion of the rainbow flag: > > > Nearly 40 years after it was first flown, the rainbow flag remains a > powerful and potent symbol of not only current gay rights struggles, but > the history of gay rights in America. So why isn?t it available as an > emoji? The flag is in the public domain, so it certainly isn?t being held > up by copyright issues. And the current range of rainbow-related emoji show > that the technology to jam all those colors distinctly into a very tiny > space is available. Numerous national flags have been emojified. And given > that the flag has recently been added to the Museum of Modern Art?s design > collection, everyone is in agreement about its ongoing cultural > significance. So what gives? > > > http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our > > This article also includes an example (via screenshot) of how many people > ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. > This can be seen by searching on Twitter (or any other social media > platform) for that character. 
> > Indeed, GitHub uses RAINBOW for this: > > http://i.imgur.com/KaKQzIC.png > > Facebook did the same sort of thing, as seen here: > > http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ > > They also did this: > > > http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ > > These emojis are *derivative* of the rainbow flag, or include characters > displaying the rainbow flag. > > While it can be argued that the RAINBOW emoji itself is usable as a > stand-in (as above), it usually requires some sort of additional context to > work. There is a clear need for a rainbow flag that unambiguously > symbolises queer pride. > > This is already going on, with some platforms choosing to use a custom > emoji shim where no Unicode code-point exists. > > This is Twitter?s rainbow flag: > > https://twitter.com/ericajoy/status/614822988609794048 > > Screenshot: http://i.imgur.com/1kewdN1.png > > Slack has one too: > > https://twitter.com/SlackHQ/status/602779337784430592 > > Screenshot: http://i.imgur.com/8cOK8MH.png > > Reddit also offers one: > > http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ > > Screenshot: http://i.imgur.com/p6YDRkF.png > > In all three examples, the symbol is being used in running text. > > I found this: > > > [...] the UTC does not wish to entertain further proposals for encoding > of symbol characters for flags, whether national, state, regional, > international, or otherwise. References to UTC Minutes: [134-C2], January > 28, 2013. > > http://www.unicode.org/alloc/nonapprovals.html > > I looked up the minutes, but could not find a more detailed explanation. > My guess is that these concerns related to geopolitical issues. Hopefully > the same rationale does not apply to the rainbow flag. > > Looking at: > > http://unicode.org/reports/tr51/#Selection_Factors > > Here's a quick list of summary answers: > > a. Compatibility: yes. There are existing platform-specific rainbow flag > emojis, as demonstrated above. To build a Twitter or Slack client that > replicated the native functionality, you would have to use an image instead > of a Unicode code point. > > b. Expected usage level: the rainbow emoji is listed at #168 on > emojitracker.com, and as demonstrated, the rainbow flag has been in wide > use since the 1970s. > > c. Image distinctiveness: the rainbow flag is visually distinct. > > d. Disparity: the rainbow flag is a missing flag. > > e. Frequently requested: unsure. I could organise a petition if this would > help to sway the decision. > > f. Generality: the rainbow flag is not overly specific. Indeed it is the > most general of all the pride flags. > > g. Open-ended: the rainbow flag is open ended, being the most general of > all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols > page, but there are many more in the wild.) > > h. Representable already: a rainbow can be represented, but it is > ambiguous. The RAINBOW emoji cannot be combined with anything pictorial > that makes the meaning clear. Context is required, such as paring it with > the word "pride". > > i. Logos, Brands, UI icons, signage, specific people, deities: the image > is suitable for for encoding as a character. > > What is the best thing for me to do next? > > My proposal is that we add RAINBOW FLAG to Unicode, and that we use the > ?six-color version popular since 1979?. 
> > I only found one official proposal for a single emoji: > > http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf > > I couldn?t find any templates for proposals, though I did look through a > number of different examples. > > I noticed that a number of them include the ISO/IEC form at the end. Can > someone explain that to me? Does it make sense to submit a proposal to the > UTC without one of these? > > I also notice that it looks like I have to provide (or find a person to > provide) a font for the character. Is there any guidance on that? I am > happy to pay someone to prepare such a thing for me. > > Thank you in advance for your help. > > Noah Slater > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 14:06:48 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 20:06:48 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, 27 Jun 2015 at 19:49 Philippe Verdy wrote: > 2015-06-27 19:26 GMT+02:00 Noah Slater : > >> c. Image distinctiveness: the rainbow flag is visually distinct. >> > > Not so distinct from several other former rainbow flags used in South > America. > > In fact the number of colours in the rainbow varies culturally depending > on countries, even when it is intended to refer to the LGBT communities. > The exact list of colors is not really fixed > Thanks for the info! As I read it, item (c) of the Selection Factors annex is about whether it is possible to have a "clearly recognisable" image. It does not appear to be talking about whether there is a single visual representation. In fact, on the strength of that, I strike the "six-color version popular since 1979" part of my proposal. Instead, I'd suggest that how implementors represent the rainbow flag is up to them. As you point out, there may be multiple valid ways of representing this single concept. Should it be entered as RAINBOW FLAG, in a generic sense, with the intention that it could be used for many different things, perhaps with an comment about it being a pride flag or an LGBT flag? Or should it be entered as PRIDE FLAG, with it's use as a rainbow flag as noted as a comment? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 14:12:22 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 20:12:22 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Mark, are there any other instances of a ZERO WIDTH JOINER being used in this way? (i.e. Outside of its intended use with Arabic and Indic scripts, etc.) Please excuse my ignorance. On 27 June 2015 at 20:06, Mark Davis ?? wrote: > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! 
>> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. >> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. 
Expected usage level: the rainbow emoji is listed at #168 on
>> emojitracker.com, and as demonstrated, the rainbow flag has been in wide
>> use since the 1970s.
>>
>> c. Image distinctiveness: the rainbow flag is visually distinct.
>>
>> d. Disparity: the rainbow flag is a missing flag.
>>
>> e. Frequently requested: unsure. I could organise a petition if this
>> would help to sway the decision.
>>
>> f. Generality: the rainbow flag is not overly specific. Indeed it is the
>> most general of all the pride flags.
>>
>> g. Open-ended: the rainbow flag is open ended, being the most general of
>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols
>> page, but there are many more in the wild.)
>>
>> h. Representable already: a rainbow can be represented, but it is
>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial
>> that makes the meaning clear. Context is required, such as paring it with
>> the word "pride".
>>
>> i. Logos, Brands, UI icons, signage, specific people, deities: the image
>> is suitable for for encoding as a character.
>>
>> What is the best thing for me to do next?
>>
>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the
>> "six-color version popular since 1979".
>>
>> I only found one official proposal for a single emoji:
>>
>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf
>>
>> I couldn't find any templates for proposals, though I did look through a
>> number of different examples.
>>
>> I noticed that a number of them include the ISO/IEC form at the end. Can
>> someone explain that to me? Does it make sense to submit a proposal to the
>> UTC without one of these?
>>
>> I also notice that it looks like I have to provide (or find a person to
>> provide) a font for the character. Is there any guidance on that? I am
>> happy to pay someone to prepare such a thing for me.
>>
>> Thank you in advance for your help.
>>
>> Noah Slater
>>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: emoji_u1f308.png
Type: image/png
Size: 3284 bytes
Desc: not available
URL: 

From verdy_p at wanadoo.fr Sat Jun 27 14:29:32 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 27 Jun 2015 21:29:32 +0200
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To: 
References: 
Message-ID: 

A zero-width joiner between two spacing symbols does not mean that they should overlap completely, even if it allows some limited form of ligature (but mostly for true letters or letter-like symbols, such as between a long dash and an arrow head to connect them together into a long arrow...). Your idea would mean that the joiner changes the width of the rainbow to zero, using in fact a negative placement to overlap the flag, and then clipping the rainbow exactly to its dimensions.

Also, the rainbow symbol alone, U+1F308, is more like the one in the sky: it is circular, and has a central uncolored area. But the flag is meant to be fully covered (unlike the flag of the Jewish Autonomous Oblast in Russia) and should use parallel horizontal bands.

If it is encoded, the flag will certainly become part of the emoji set (it certainly has support for it in instant messaging; soon many apps for mobile phones will feature it in the US, Google will include it as well for Android and Hangouts applications, Apple for iOS, and various IRC tools).
Mobile phone providers will include it even if on such LGBT topic the Japanese manufacturers were more "discrete" (there's still a social taboo even if there's some level of acceptation). It is already sent via MMS only as bitmap icons, but users will want to pay less to send them using SMS, or to send them in Twitter. 2015-06-27 21:06 GMT+02:00 Mark Davis ?? : > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! >> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. >> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] 
the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. Expected usage level: the rainbow emoji is listed at #168 on >> emojitracker.com, and as demonstrated, the rainbow flag has been in wide >> use since the 1970s. >> >> c. Image distinctiveness: the rainbow flag is visually distinct. >> >> d. Disparity: the rainbow flag is a missing flag. >> >> e. Frequently requested: unsure. I could organise a petition if this >> would help to sway the decision. >> >> f. Generality: the rainbow flag is not overly specific. Indeed it is the >> most general of all the pride flags. >> >> g. Open-ended: the rainbow flag is open ended, being the most general of >> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >> page, but there are many more in the wild.) >> >> h. Representable already: a rainbow can be represented, but it is >> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >> that makes the meaning clear. Context is required, such as paring it with >> the word "pride". >> >> i. Logos, Brands, UI icons, signage, specific people, deities: the image >> is suitable for for encoding as a character. >> >> What is the best thing for me to do next? >> >> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >> ?six-color version popular since 1979?. >> >> I only found one official proposal for a single emoji: >> >> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >> >> I couldn?t find any templates for proposals, though I did look through a >> number of different examples. >> >> I noticed that a number of them include the ISO/IEC form at the end. Can >> someone explain that to me? Does it make sense to submit a proposal to the >> UTC without one of these? >> >> I also notice that it looks like I have to provide (or find a person to >> provide) a font for the character. Is there any guidance on that? I am >> happy to pay someone to prepare such a thing for me. >> >> Thank you in advance for your help. >> >> Noah Slater >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From mark at macchiato.com Sat Jun 27 14:31:23 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:31:23 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Take a look at http://unicode.org/reports/tr51/ for details. Mark *? Il meglio ? l?inimico del bene ?* On Sat, Jun 27, 2015 at 9:12 PM, Noah Slater wrote: > Mark, are there any other instances of a ZERO WIDTH JOINER being used in > this way? 
(i.e. Outside of its intended use with Arabic and Indic scripts, > etc.) Please excuse my ignorance. > > On 27 June 2015 at 20:06, Mark Davis [image: ?]? > wrote: > >> Nothing really needs to be added to Unicode; vendors could already use: >> >> ???[image: ??] >> U+1F3F3, U+200D, U+1F308 >> WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW >> >> credit to Shervin for the idea >> >> >> >> Mark >> >> *? Il meglio ? l?inimico del bene ?* >> >> On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater >> wrote: >> >>> Hello! >>> >>> It is Pride Month and the US just legalised queer marriage in every >>> state. No better time to start a conversation about including the >>> internationally recognised rainbow flag in Unicode! >>> >>> Here?s some background reading on the flag itself: >>> >>> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >>> * >>> >>> Here's Bustle on the inclusion of the rainbow flag: >>> >>> > Nearly 40 years after it was first flown, the rainbow flag remains a >>> powerful and potent symbol of not only current gay rights struggles, but >>> the history of gay rights in America. So why isn?t it available as an >>> emoji? The flag is in the public domain, so it certainly isn?t being held >>> up by copyright issues. And the current range of rainbow-related emoji show >>> that the technology to jam all those colors distinctly into a very tiny >>> space is available. Numerous national flags have been emojified. And given >>> that the flag has recently been added to the Museum of Modern Art?s design >>> collection, everyone is in agreement about its ongoing cultural >>> significance. So what gives? >>> >>> >>> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >>> >>> This article also includes an example (via screenshot) of how many >>> people ?make do? without the rainbow flag. Typically, they use U+1F308 >>> RAINBOW. This can be seen by searching on Twitter (or any other social >>> media platform) for that character. >>> >>> Indeed, GitHub uses RAINBOW for this: >>> >>> http://i.imgur.com/KaKQzIC.png >>> >>> Facebook did the same sort of thing, as seen here: >>> >>> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >>> >>> They also did this: >>> >>> >>> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >>> >>> These emojis are *derivative* of the rainbow flag, or include characters >>> displaying the rainbow flag. >>> >>> While it can be argued that the RAINBOW emoji itself is usable as a >>> stand-in (as above), it usually requires some sort of additional context to >>> work. There is a clear need for a rainbow flag that unambiguously >>> symbolises queer pride. >>> >>> This is already going on, with some platforms choosing to use a custom >>> emoji shim where no Unicode code-point exists. >>> >>> This is Twitter?s rainbow flag: >>> >>> https://twitter.com/ericajoy/status/614822988609794048 >>> >>> Screenshot: http://i.imgur.com/1kewdN1.png >>> >>> Slack has one too: >>> >>> https://twitter.com/SlackHQ/status/602779337784430592 >>> >>> Screenshot: http://i.imgur.com/8cOK8MH.png >>> >>> Reddit also offers one: >>> >>> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >>> >>> Screenshot: http://i.imgur.com/p6YDRkF.png >>> >>> In all three examples, the symbol is being used in running text. >>> >>> I found this: >>> >>> > [...] 
the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, regional, >>> international, or otherwise. References to UTC Minutes: [134-C2], January >>> 28, 2013. >>> >>> http://www.unicode.org/alloc/nonapprovals.html >>> >>> I looked up the minutes, but could not find a more detailed explanation. >>> My guess is that these concerns related to geopolitical issues. Hopefully >>> the same rationale does not apply to the rainbow flag. >>> >>> Looking at: >>> >>> http://unicode.org/reports/tr51/#Selection_Factors >>> >>> Here's a quick list of summary answers: >>> >>> a. Compatibility: yes. There are existing platform-specific rainbow flag >>> emojis, as demonstrated above. To build a Twitter or Slack client that >>> replicated the native functionality, you would have to use an image instead >>> of a Unicode code point. >>> >>> b. Expected usage level: the rainbow emoji is listed at #168 on >>> emojitracker.com, and as demonstrated, the rainbow flag has been in >>> wide use since the 1970s. >>> >>> c. Image distinctiveness: the rainbow flag is visually distinct. >>> >>> d. Disparity: the rainbow flag is a missing flag. >>> >>> e. Frequently requested: unsure. I could organise a petition if this >>> would help to sway the decision. >>> >>> f. Generality: the rainbow flag is not overly specific. Indeed it is the >>> most general of all the pride flags. >>> >>> g. Open-ended: the rainbow flag is open ended, being the most general of >>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >>> page, but there are many more in the wild.) >>> >>> h. Representable already: a rainbow can be represented, but it is >>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >>> that makes the meaning clear. Context is required, such as paring it with >>> the word "pride". >>> >>> i. Logos, Brands, UI icons, signage, specific people, deities: the image >>> is suitable for for encoding as a character. >>> >>> What is the best thing for me to do next? >>> >>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >>> ?six-color version popular since 1979?. >>> >>> I only found one official proposal for a single emoji: >>> >>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >>> >>> I couldn?t find any templates for proposals, though I did look through a >>> number of different examples. >>> >>> I noticed that a number of them include the ISO/IEC form at the end. Can >>> someone explain that to me? Does it make sense to submit a proposal to the >>> UTC without one of these? >>> >>> I also notice that it looks like I have to provide (or find a person to >>> provide) a font for the character. Is there any guidance on that? I am >>> happy to pay someone to prepare such a thing for me. >>> >>> Thank you in advance for your help. >>> >>> Noah Slater >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From mark at macchiato.com Sat Jun 27 14:36:52 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sat, 27 Jun 2015 21:36:52 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, Jun 27, 2015 at 9:29 PM, Philippe Verdy wrote: > A zero-width joiner between two spacing symbols does not mean that they > should overlap completely, even if it allows some limited form of ligature > (but mostly for true letters or letter-like symbols, such as between a long > dash and an arrow head to connect them together in a long arrow...) > Your idea would mean that the joiner changes the width of the rainbow to > zero, using in fact a negative placement to overlap the flag, and then > cutting the rainbow exactly to its dimensions. > The use of joiner with emoji can be rather different. See http://unicode.org/reports/tr51/ for details. > > Also the rainbow symbol alone U+1F308 is more like the one in the sky: it > is circular, and has a central uncolored area. > But the flag is meant to be fully covered (not like the flag of the Jewish Autonomous > Oblast in Russia) and should be using parallel horizontal bands. > Vendors have a fair degree of latitude as far as shapes, and the resulting glyph can be shown with a shape similar to the national flags, and with horizontal bands. > If it is encoded, the flag will certainly become part of the emoji set (it > certainly has support for it in instant messaging; soon many apps for > mobile phones will feature it in the US, Google will include it as well for > Android and Hangouts applications, Apple for iOS, and various IRC tools.) > > Mobile phone providers will include it even if on such LGBT topic the > Japanese manufacturers were more "discrete" (there's still a social taboo > even if there's some level of acceptation). It is already sent via MMS only > as bitmap icons, but users will want to pay less to send them using SMS, or > to send them in Twitter. > Mark *« Il meglio è l’inimico del bene »* -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Jun 27 15:14:33 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 27 Jun 2015 22:14:33 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: This UTR just addresses the case of a combining coloring symbol for faces, and those color symbols were designed since the beginning to be combined as much as possible (and not meant to be used in isolation); this is not the case of the rainbow symbol, which is much more figurative. Why would associating a flag and a rainbow this way mean that the flag will just be recolored (but the rainbow form itself is completely lost)? Couldn't this be to display a flying flag over a sky with a rainbow? Compare this to the association of the sun and the rainbow symbols, or the cloud and a rainbow (and compare to the sun or moon and a cloud associated the same way, or the association of two clouds: none of them will overlap completely). Imagine the use in a weather application: I don't see why the rainbow would disappear when the flying flag is just there to mean the windy condition, and the rainbow is meant for variable weather mixing rainy and sunny periods. Your proposed use of ZWJ to create a complete overlap of one symbol into another is unexpected. ZWJ + symbol does not transform that symbol into an "emoji modifier" (this is not anywhere in UTR #51).
It may just create a small partial overlap of one symbol into the other, but each one is still clearly identifiable separately. The examples shown are for grouping multiple persons in Annex E but each person is still separately visible and recognizable as such even if they are combined in the same final glyph. Annexe E even requires some specific orders (e.g. for families: the man can only come before a woman, and is then necessarily visible to the left side of the icon, i.e. to the right of the woman; children are necessarily after and below adults...). 2015-06-27 21:31 GMT+02:00 Mark Davis ?? : > Take a look at http://unicode.org/reports/tr51/ for details. > >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From olopierpa at gmail.com Sat Jun 27 16:23:53 2015 From: olopierpa at gmail.com (Pierpaolo Bernardi) Date: Sat, 27 Jun 2015 23:23:53 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy wrote: > > Why would associating a flag and a rainbow this way means the flag will > just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? > Compare this to the association of the sun and the rainbow symbols, or the > cloud and a rainbow (and compare to the sun or moon and a cloud associated > the same way, or the association of two clouds: none of them will overlap > completely). > > Imagine the use in a weather application, I don't wee why the rainbox > would disappear when the flying flag is just there to mean the windy > condition, and the rainbox meant for variable weather mixing rainy and > sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into > another is unexpected. > A ZWJ does not cause two random characters to overlap. It creates a ligature, and the ligature can be rendered in any way the font designers prefer. If there's a need for this character, font designers could agree to render this ligature in the desired way. In case there's the need, the Unicode Consortium could hint at the intended meaning of this ligature, I think? -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sat Jun 27 16:28:10 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 21:28:10 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: I think it's a bit of a stretch to propose that a rainbow flag is a "white flag" and "rainbow" ligature. That's certainly well beyond any understanding I have of what a ligature is, from a typographical perspective. On Sat, 27 Jun 2015 at 22:23 Pierpaolo Bernardi wrote: > On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy > wrote: > >> >> Why would associating a flag and a rainbow this way means the flag will >> just be recolored (but the rainbox form itself is completely lost)? >> Couldn't this be to display a flying flag over a sky with a rainbow? >> Compare this to the association of the sun and the rainbow symbols, or the >> cloud and a rainbow (and compare to the sun or moon and a cloud associated >> the same way, or the association of two clouds: none of them will overlap >> completely). >> >> Imagine the use in a weather application, I don't wee why the rainbox >> would disappear when the flying flag is just there to mean the windy >> condition, and the rainbox meant for variable weather mixing rainy and >> sunny periods. 
>> >> Your proposed use of ZWJ to create a complete overlap of one symbol into >> another is unexpected. >> > > A ZWJ does not cause two random characters to overlap. It creates a > ligature, and the ligature can be rendered in any way the font designers > prefer. If there's a need for this character, font designers could agree > to render this ligature in the desired way. > > In case there's the need, the Unicode Consortium could hint at the > intended meaning of this ligature, I think? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Sat Jun 27 16:46:07 2015 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Sun, 28 Jun 2015 01:46:07 +0400 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: U+1F3F3, U+200D, U+2620 WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES Wanna this one, too :) Konstantin 2015-06-27 23:06 GMT+04:00 Mark Davis ?? : > Nothing really needs to be added to Unicode; vendors could already use: > > ???[image: ??] > U+1F3F3, U+200D, U+1F308 > WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW > > credit to Shervin for the idea > > > > Mark > > *? Il meglio ? l?inimico del bene ?* > > On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater wrote: > >> Hello! >> >> It is Pride Month and the US just legalised queer marriage in every >> state. No better time to start a conversation about including the >> internationally recognised rainbow flag in Unicode! >> >> Here?s some background reading on the flag itself: >> >> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >> * >> >> Here's Bustle on the inclusion of the rainbow flag: >> >> > Nearly 40 years after it was first flown, the rainbow flag remains a >> powerful and potent symbol of not only current gay rights struggles, but >> the history of gay rights in America. So why isn?t it available as an >> emoji? The flag is in the public domain, so it certainly isn?t being held >> up by copyright issues. And the current range of rainbow-related emoji show >> that the technology to jam all those colors distinctly into a very tiny >> space is available. Numerous national flags have been emojified. And given >> that the flag has recently been added to the Museum of Modern Art?s design >> collection, everyone is in agreement about its ongoing cultural >> significance. So what gives? >> >> >> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >> >> This article also includes an example (via screenshot) of how many people >> ?make do? without the rainbow flag. Typically, they use U+1F308 RAINBOW. >> This can be seen by searching on Twitter (or any other social media >> platform) for that character. >> >> Indeed, GitHub uses RAINBOW for this: >> >> http://i.imgur.com/KaKQzIC.png >> >> Facebook did the same sort of thing, as seen here: >> >> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >> >> They also did this: >> >> >> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >> >> These emojis are *derivative* of the rainbow flag, or include characters >> displaying the rainbow flag. >> >> While it can be argued that the RAINBOW emoji itself is usable as a >> stand-in (as above), it usually requires some sort of additional context to >> work. There is a clear need for a rainbow flag that unambiguously >> symbolises queer pride. 
>> >> This is already going on, with some platforms choosing to use a custom >> emoji shim where no Unicode code-point exists. >> >> This is Twitter?s rainbow flag: >> >> https://twitter.com/ericajoy/status/614822988609794048 >> >> Screenshot: http://i.imgur.com/1kewdN1.png >> >> Slack has one too: >> >> https://twitter.com/SlackHQ/status/602779337784430592 >> >> Screenshot: http://i.imgur.com/8cOK8MH.png >> >> Reddit also offers one: >> >> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >> >> Screenshot: http://i.imgur.com/p6YDRkF.png >> >> In all three examples, the symbol is being used in running text. >> >> I found this: >> >> > [...] the UTC does not wish to entertain further proposals for encoding >> of symbol characters for flags, whether national, state, regional, >> international, or otherwise. References to UTC Minutes: [134-C2], January >> 28, 2013. >> >> http://www.unicode.org/alloc/nonapprovals.html >> >> I looked up the minutes, but could not find a more detailed explanation. >> My guess is that these concerns related to geopolitical issues. Hopefully >> the same rationale does not apply to the rainbow flag. >> >> Looking at: >> >> http://unicode.org/reports/tr51/#Selection_Factors >> >> Here's a quick list of summary answers: >> >> a. Compatibility: yes. There are existing platform-specific rainbow flag >> emojis, as demonstrated above. To build a Twitter or Slack client that >> replicated the native functionality, you would have to use an image instead >> of a Unicode code point. >> >> b. Expected usage level: the rainbow emoji is listed at #168 on >> emojitracker.com, and as demonstrated, the rainbow flag has been in wide >> use since the 1970s. >> >> c. Image distinctiveness: the rainbow flag is visually distinct. >> >> d. Disparity: the rainbow flag is a missing flag. >> >> e. Frequently requested: unsure. I could organise a petition if this >> would help to sway the decision. >> >> f. Generality: the rainbow flag is not overly specific. Indeed it is the >> most general of all the pride flags. >> >> g. Open-ended: the rainbow flag is open ended, being the most general of >> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >> page, but there are many more in the wild.) >> >> h. Representable already: a rainbow can be represented, but it is >> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >> that makes the meaning clear. Context is required, such as paring it with >> the word "pride". >> >> i. Logos, Brands, UI icons, signage, specific people, deities: the image >> is suitable for for encoding as a character. >> >> What is the best thing for me to do next? >> >> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >> ?six-color version popular since 1979?. >> >> I only found one official proposal for a single emoji: >> >> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >> >> I couldn?t find any templates for proposals, though I did look through a >> number of different examples. >> >> I noticed that a number of them include the ISO/IEC form at the end. Can >> someone explain that to me? Does it make sense to submit a proposal to the >> UTC without one of these? >> >> I also notice that it looks like I have to provide (or find a person to >> provide) a font for the character. Is there any guidance on that? I am >> happy to pay someone to prepare such a thing for me. >> >> Thank you in advance for your help. 
>> >> Noah Slater >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From pedberg at apple.com Sat Jun 27 16:48:22 2015 From: pedberg at apple.com (Peter Edberg) Date: Sat, 27 Jun 2015 14:48:22 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Philippe and others, You are missing the relevant parts of UTR #51. See: ? http://www.unicode.org/reports/tr51/#Multi_Person_Groupings ? http://www.unicode.org/reports/tr51/#ZWJ_Sequences This type of behavior with ZWJ for emoji is already in use. - Peter E > On Jun 27, 2015, at 1:14 PM, Philippe Verdy wrote: > > This UTR just addresses the case of a combining coloring symbol for faces and those color symbols were designed since the begining to be combined as much as possible (and not meant to be used in isolation), this is not the case of the rainbow symbol which is much more figurative). > > Why would associating a flag and a rainbow this way means the flag will just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? Compare this to the association of the sun and the rainbow symbols, or the cloud and a rainbow (and compare to the sun or moon and a cloud associated the same way, or the association of two clouds: none of them will overlap completely). > > Imagine the use in a weather application, I don't wee why the rainbox would disappear when the flying flag is just there to mean the windy condition, and the rainbox meant for variable weather mixing rainy and sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into another is unexpected. > > ZWJ+symbol does not transfor that symbol into a "emoi modifier" (this is not anywhere in UTF51). It may just create a small partial overlap of one symbol into the other, but each one is still clearly identifiable separately. The examples shown are for grouping multiple persons in Annex E but each person is still separately visible and recognizable as such even if they are combined in the same final glyph. Annexe E even requires some specific orders (e.g. for families: the man can only come before a woman, and is then necessarily visible to the left side of the icon, i.e. to the right of the woman; children are necessarily after and below adults...). > > > 2015-06-27 21:31 GMT+02:00 Mark Davis ?? >: > Take a look at http://unicode.org/reports/tr51/ for details. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ritt.ks at gmail.com Sat Jun 27 16:48:13 2015 From: ritt.ks at gmail.com (Konstantin Ritt) Date: Sun, 28 Jun 2015 01:48:13 +0400 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Actually, U+1F3F4, U+200D, U+2620 WAVING BLACK FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES Konstantin 2015-06-28 1:46 GMT+04:00 Konstantin Ritt : > U+1F3F3, U+200D, U+2620 > WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES > > Wanna this one, too :) > > > Konstantin > > 2015-06-27 23:06 GMT+04:00 Mark Davis [image: ?]? : > >> Nothing really needs to be added to Unicode; vendors could already use: >> >> ???[image: ??] >> U+1F3F3, U+200D, U+1F308 >> WAVING WHITE FLAG, ZERO WIDTH JOINER, RAINBOW >> >> credit to Shervin for the idea >> >> >> >> Mark >> >> *? Il meglio ? 
l?inimico del bene ?* >> >> On Sat, Jun 27, 2015 at 7:26 PM, Noah Slater >> wrote: >> >>> Hello! >>> >>> It is Pride Month and the US just legalised queer marriage in every >>> state. No better time to start a conversation about including the >>> internationally recognised rainbow flag in Unicode! >>> >>> Here?s some background reading on the flag itself: >>> >>> *https://en.wikipedia.org/wiki/Rainbow_flag_(LGBT_movement) >>> * >>> >>> Here's Bustle on the inclusion of the rainbow flag: >>> >>> > Nearly 40 years after it was first flown, the rainbow flag remains a >>> powerful and potent symbol of not only current gay rights struggles, but >>> the history of gay rights in America. So why isn?t it available as an >>> emoji? The flag is in the public domain, so it certainly isn?t being held >>> up by copyright issues. And the current range of rainbow-related emoji show >>> that the technology to jam all those colors distinctly into a very tiny >>> space is available. Numerous national flags have been emojified. And given >>> that the flag has recently been added to the Museum of Modern Art?s design >>> collection, everyone is in agreement about its ongoing cultural >>> significance. So what gives? >>> >>> >>> http://www.bustle.com/articles/93227-wheres-the-rainbow-pride-flag-emoji-why-the-iconic-gay-rights-symbol-should-be-on-our >>> >>> This article also includes an example (via screenshot) of how many >>> people ?make do? without the rainbow flag. Typically, they use U+1F308 >>> RAINBOW. This can be seen by searching on Twitter (or any other social >>> media platform) for that character. >>> >>> Indeed, GitHub uses RAINBOW for this: >>> >>> http://i.imgur.com/KaKQzIC.png >>> >>> Facebook did the same sort of thing, as seen here: >>> >>> http://mashable.com/2013/06/27/facebook-rainbow-pride-emoji-doma/ >>> >>> They also did this: >>> >>> >>> http://www.newnownext.com/facebook-adds-lgbt-emojis-for-pride-month/06/2014/ >>> >>> These emojis are *derivative* of the rainbow flag, or include characters >>> displaying the rainbow flag. >>> >>> While it can be argued that the RAINBOW emoji itself is usable as a >>> stand-in (as above), it usually requires some sort of additional context to >>> work. There is a clear need for a rainbow flag that unambiguously >>> symbolises queer pride. >>> >>> This is already going on, with some platforms choosing to use a custom >>> emoji shim where no Unicode code-point exists. >>> >>> This is Twitter?s rainbow flag: >>> >>> https://twitter.com/ericajoy/status/614822988609794048 >>> >>> Screenshot: http://i.imgur.com/1kewdN1.png >>> >>> Slack has one too: >>> >>> https://twitter.com/SlackHQ/status/602779337784430592 >>> >>> Screenshot: http://i.imgur.com/8cOK8MH.png >>> >>> Reddit also offers one: >>> >>> http://www.reddit.com/r/bisexual/comments/2lc2rc/can_you_see_the_emoji/ >>> >>> Screenshot: http://i.imgur.com/p6YDRkF.png >>> >>> In all three examples, the symbol is being used in running text. >>> >>> I found this: >>> >>> > [...] the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, regional, >>> international, or otherwise. References to UTC Minutes: [134-C2], January >>> 28, 2013. >>> >>> http://www.unicode.org/alloc/nonapprovals.html >>> >>> I looked up the minutes, but could not find a more detailed explanation. >>> My guess is that these concerns related to geopolitical issues. Hopefully >>> the same rationale does not apply to the rainbow flag. 
>>> >>> Looking at: >>> >>> http://unicode.org/reports/tr51/#Selection_Factors >>> >>> Here's a quick list of summary answers: >>> >>> a. Compatibility: yes. There are existing platform-specific rainbow flag >>> emojis, as demonstrated above. To build a Twitter or Slack client that >>> replicated the native functionality, you would have to use an image instead >>> of a Unicode code point. >>> >>> b. Expected usage level: the rainbow emoji is listed at #168 on >>> emojitracker.com, and as demonstrated, the rainbow flag has been in >>> wide use since the 1970s. >>> >>> c. Image distinctiveness: the rainbow flag is visually distinct. >>> >>> d. Disparity: the rainbow flag is a missing flag. >>> >>> e. Frequently requested: unsure. I could organise a petition if this >>> would help to sway the decision. >>> >>> f. Generality: the rainbow flag is not overly specific. Indeed it is the >>> most general of all the pride flags. >>> >>> g. Open-ended: the rainbow flag is open ended, being the most general of >>> all the pride flags. (Wikipedia lists 18 pride flags on the LGBT symbols >>> page, but there are many more in the wild.) >>> >>> h. Representable already: a rainbow can be represented, but it is >>> ambiguous. The RAINBOW emoji cannot be combined with anything pictorial >>> that makes the meaning clear. Context is required, such as paring it with >>> the word "pride". >>> >>> i. Logos, Brands, UI icons, signage, specific people, deities: the image >>> is suitable for for encoding as a character. >>> >>> What is the best thing for me to do next? >>> >>> My proposal is that we add RAINBOW FLAG to Unicode, and that we use the >>> ?six-color version popular since 1979?. >>> >>> I only found one official proposal for a single emoji: >>> >>> http://www.unicode.org/L2/L2014/14298-whisky-emoji.pdf >>> >>> I couldn?t find any templates for proposals, though I did look through a >>> number of different examples. >>> >>> I noticed that a number of them include the ISO/IEC form at the end. Can >>> someone explain that to me? Does it make sense to submit a proposal to the >>> UTC without one of these? >>> >>> I also notice that it looks like I have to provide (or find a person to >>> provide) a font for the character. Is there any guidance on that? I am >>> happy to pay someone to prepare such a thing for me. >>> >>> Thank you in advance for your help. >>> >>> Noah Slater >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From everson at evertype.com Sat Jun 27 16:56:53 2015 From: everson at evertype.com (Michael Everson) Date: Sat, 27 Jun 2015 22:56:53 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On 27 Jun 2015, at 22:46, Konstantin Ritt wrote: > > U+1F3F3, U+200D, U+2620 > WAVING WHITE FLAG, ZERO WIDTH JOINER, SKULL AND CROSSBONES And thus the slippery slope is well and truly discovered. Gosh, I wish we could add capital equivalents to all (or most of) the un-cased lower-case letters we?ve got for Latin. That at least would be practical. 
Michael Everson * http://www.evertype.com/ From verdy_p at wanadoo.fr Sat Jun 27 17:15:13 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 00:15:13 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Me too. Not because the semantics of the flag would be lost, but because here the relation with the rainbow is much less evident, as the flag does not mean the meteorological object or the interaction of solar light with the atmosphere, but only a few of its colors, ordered not just like what happens in a rainbow but also as in an optical prism; here, though, the rainbow is disposed in clearly contrasting bands (something that never happens in true rainbows). We are too far from the ligature, as we don't see that as a flag *and* a rainbow; the subject is in fact unbreakable. If it had to be broken, we would also need to add the semantics for the horizontal contrasting stripes (completely missing in the rainbow symbol), and something to mean that we don't want to include the arch form, or any sun ray, or cloud possibly raining, or the earth ground that the rainbow is cutting. In fact the form of the rainbow is not the form of the Earth, but the intersection of a cone centered on the observer's eye, which is near the ground because the direct sunlight behind you is almost parallel and focused at infinite distance: the rainbow is in fact a circle at a well defined distance, but part of it is masked by the ground which is nearer to the observer (and at this shorter observable distance, the angle of light on the cone intersecting there is not correct to see the rainbow light effect); however a small part of the arc falls in front of the ground on the horizon, if your horizon is far enough (only the bottom part of the circle is masked). If you observe the rainbow directly from the ground level, you'll see only a half-circle, but if you climb a few meters up on a ladder, you can see the full circle with the correct opening angle in the air, provided that the sun is not too high in the sky. You cannot observe any rainbow when the sun is at the zenith because the circle of the rainbow is fully below the ground level, so the best and largest rainbows are observed in early mornings or late evenings. 2015-06-27 23:28 GMT+02:00 Noah Slater : > I think it's a bit of a stretch to propose that a rainbow flag is a "white > flag" and "rainbow" ligature. That's certainly well beyond any > understanding I have of what a ligature is, from a typographical > perspective. > > On Sat, 27 Jun 2015 at 22:23 Pierpaolo Bernardi > wrote: > >> On Sat, Jun 27, 2015 at 10:14 PM, Philippe Verdy >> wrote: >> >>> >>> Why would associating a flag and a rainbow this way mean that the flag will >>> just be recolored (but the rainbow form itself is completely lost)? >>> Couldn't this be to display a flying flag over a sky with a rainbow? >>> Compare this to the association of the sun and the rainbow symbols, or the >>> cloud and a rainbow (and compare to the sun or moon and a cloud associated >>> the same way, or the association of two clouds: none of them will overlap >>> completely). >>> >>> Imagine the use in a weather application: I don't see why the rainbow >>> would disappear when the flying flag is just there to mean the windy >>> condition, and the rainbow is meant for variable weather mixing rainy and >>> sunny periods. >>> >>> Your proposed use of ZWJ to create a complete overlap of one symbol into >>> another is unexpected. >>> >> >> A ZWJ does not cause two random characters to overlap. It creates a >> ligature, and the ligature can be rendered in any way the font designers >> prefer. If there's a need for this character, font designers could agree >> to render this ligature in the desired way. >> >> In case there's the need, the Unicode Consortium could hint at the >> intended meaning of this ligature, I think? >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Sat Jun 27 17:40:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 00:40:32 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: No, I had read it; the persons are still clearly separate. The "rainbow" on the flag is not in fact a rainbow, only its colours. The groups of persons are showing persons themselves, side by side, not one into the other one or one indirectly drawn on the face of another one. This new proposal of use of ZWJ is *definitely NOT in use*; it assumes a strong alteration of semantics. What is represented is NOT a flag and a rainbow side by side. The physical natural phenomenon (and its real 3D cone shape) is NOT represented at all on the flag, but the flag also adds parallel stripes (not encoded by the rainbow symbol itself and not by the white flag symbol alone: if you need to use a country flag to have these bands, you'll add a country-specific semantic that is not part of the international flag). The case is different from the black flag with skull and crossbones: what is represented is effectively a realistic skull and crossbones, not just the color or impression left by these bones. The nearest equivalent you can see is with the Fitzpatrick modifiers for skin colours (which do NOT use any ZWJ, because the Fitzpatrick modifiers are intended to be modifiers and have no shape semantics by themselves). The only interpretation of FLAG + ZWJ + RAINBOW is two separate objects side by side (or one partly covering the other one), like in the Family examples. There's already a strong resistance for just embedding letters on a flag: country flags had then to be encoded differently, and letters enclosed in other shapes such as boxes (similar to the flag) are using combining boxes; we would need a combining flag character to do that, but before this happens we need a way to create "cartouches" for hieroglyphs or sinograms or even Latin letters. This did not occur, and instead emojis are encoding these enclosed letters distinctly, without using any sequence (with combining characters or with joiners). For the same reason, overstriking combining characters are best avoided for letters (this causes interpretation problems). You can expect interpretation problems if you intend to use ZWJ to create a ligature that completely drops the essential shape of the rainbow to keep only its colors in a tiny part of it. By evidence this flag is NOT a ligature. Or otherwise, country flags are ALL ligatures (even if they don't represent the two letters with which they were internally encoded, they don't contain these letters and don't have the semantics of these letters; all that is meant is an association with a country name, and then with its current colors). If we only wanted to include the semantics of the colour, then we would not even need Fitzpatrick modifiers; we would have used ZWJ with white or black filled shapes (boxes, discs, independently of their size and shapes...). ZWJ is NOT a semantics killer.
2015-06-27 23:48 GMT+02:00 Peter Edberg : > Philippe and others, > You are missing the relevant parts of UTR #51. See: > ? http://www.unicode.org/reports/tr51/#Multi_Person_Groupings > ? http://www.unicode.org/reports/tr51/#ZWJ_Sequences > > This type of behavior with ZWJ for emoji is *already in use.* > > - Peter E > > > > On Jun 27, 2015, at 1:14 PM, Philippe Verdy wrote: > > This UTR just addresses the case of a combining coloring symbol for faces > and those color symbols were designed since the begining to be combined as > much as possible (and not meant to be used in isolation), this is not the > case of the rainbow symbol which is much more figurative). > > Why would associating a flag and a rainbow this way means the flag will > just be recolored (but the rainbox form itself is completely lost)? > Couldn't this be to display a flying flag over a sky with a rainbow? > Compare this to the association of the sun and the rainbow symbols, or the > cloud and a rainbow (and compare to the sun or moon and a cloud associated > the same way, or the association of two clouds: none of them will overlap > completely). > > Imagine the use in a weather application, I don't wee why the rainbox > would disappear when the flying flag is just there to mean the windy > condition, and the rainbox meant for variable weather mixing rainy and > sunny periods. > > Your proposed use of ZWJ to create a complete overlap of one symbol into > another is unexpected. > > ZWJ+symbol does not transfor that symbol into a "emoi modifier" (this is > not anywhere in UTF51). It may just create a small partial overlap of one > symbol into the other, but each one is still clearly identifiable > separately. The examples shown are for grouping multiple persons in Annex E > but each person is still separately visible and recognizable as such even > if they are combined in the same final glyph. Annexe E even requires some > specific orders (e.g. for families: the man can only come before a woman, > and is then necessarily visible to the left side of the icon, i.e. to the > right of the woman; children are necessarily after and below adults...). > > > 2015-06-27 21:31 GMT+02:00 Mark Davis [image: ?]? : > >> Take a look at http://unicode.org/reports/tr51/ for details. >> >>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u2615.png Type: image/png Size: 2776 bytes Desc: not available URL: From nslater at tumbolia.org Sat Jun 27 17:51:27 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sat, 27 Jun 2015 22:51:27 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Thanks to Philippe for the addition of technical arguments in favour of a new code point. This is... a little beyond me. (Though fascinating reading!) I would particularly like to add me +1 to the knock-on effect this would have on downstream vendors. (I would like to see this become a standard emoji, and I'd like us to take whatever action increases the chances of that.) I did just want to respond to the "slippery slope" comment. Firstly to note that this is the name of a logical fallacy :) and that it is a fallacy because it presume that people are unable to make reasonable judgements calls. As it happens, the Consortium appears to have mechanisms in place precisely to handle this sort of thing. When I mentioned my email to a queer friend, they asked if I might propose other pride flags (as there are many). 
As I pointed out to them, I would be happy to do so, should I be able to justify their inclusion in accordance with Annex C. (As it stands, I am not sure any of them receive wide enough applicable use for that, though perhaps there is evidence to the contrary) -------------- next part -------------- An HTML attachment was scrubbed... URL: From petercon at microsoft.com Sat Jun 27 18:37:59 2015 From: petercon at microsoft.com (Peter Constable) Date: Sat, 27 Jun 2015 23:37:59 +0000 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150626110243.GB18139@ebed.etf.cuni.cz> References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. Peter -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Petr Tomasek Sent: Friday, June 26, 2015 4:48 PM To: Marcel Schneider Cc: Unicode Mailing List Subject: Re: WORD JOINER vs ZWNBSP On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. Therefore you should complain by Microsoft, not here. > Supposing that Microsoft choose not to implement U+2060?WJ Then you should probably choose another operating system which does... Petr Tomasek From doug at ewellic.org Sat Jun 27 19:33:51 2015 From: doug at ewellic.org (Doug Ewell) Date: Sat, 27 Jun 2015 18:33:51 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: Noah Slater wrote: > I found this: > >> [...] the UTC does not wish to entertain further proposals for >> encoding of symbol characters for flags, whether national, state, >> regional, international, or otherwise. References to UTC Minutes: >> [134-C2], January 28, 2013. > > http://www.unicode.org/alloc/nonapprovals.html I think the phrase "or otherwise" above might have been intended to mean "or otherwise." > I looked up the minutes, but could not find a more detailed > explanation. My guess is that these concerns related to geopolitical > issues. Hopefully the same rationale does not apply to the rainbow > flag. My guess is that one reason certain rejected requests are added to the Archive of Notices of Non-Approval is so that the UTC doesn't have to haul out their original explanation or re-argue the same points when the same request, or a similar one, is made again. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From nslater at tumbolia.org Sat Jun 27 20:35:32 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 02:35:32 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: On 28 June 2015 at 01:33, Doug Ewell wrote: > > I think the phrase "or otherwise" above might have been intended to mean > "or otherwise." > Perhaps. I'm hoping not. I think there is a strong case for the inclusion of the symbol given that Twitter (one of the largest electronic communication platforms, and archived by the US Library of Congress) is using a non-Unicode rainbow flag in running text. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From verdy_p at wanadoo.fr Sat Jun 27 22:46:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 05:46:22 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: 2015-06-28 2:33 GMT+02:00 Doug Ewell : > Noah Slater wrote: > > I found this: >> >> [...] the UTC does not wish to entertain further proposals for >>> encoding of symbol characters for flags, whether national, state, >>> regional, international, or otherwise. References to UTC Minutes: >>> [134-C2], January 28, 2013. >>> >> >> http://www.unicode.org/alloc/nonapprovals.html >> > > I think the phrase "or otherwise" above might have been intended to mean > "or otherwise." But this statement of early 2013 was contradicted by the addition of hundreds of new national flags (only because a few national flags were part of some Japanese emojis sets, and it was not admissible to have just a handlful of countries with flags but not all others). > I looked up the minutes, but could not find a more detailed >> explanation. My guess is that these concerns related to geopolitical >> issues. Hopefully the same rationale does not apply to the rainbow >> flag. >> > > My guess is that one reason certain rejected requests are added to the > Archive of Notices of Non-Approval is so that the UTC doesn't have to haul > out their original explanation or re-argue the same points when the same > request, or a similar one, is made again. > As soon as Unicode accepted the Japanese emojis sets promoted by its local telcos, including the few national flags the argument was dead. In fact there are also lot of redundant emojis from these sets that were accepted or were just minor variants of other existing Dings already encoded. Now we see an explosion of emojis, but less efforts for historic scripts found in our museums and libraries. The reason being that popular demand won (e.g. look at the Japanese-specific symbol for newbie: a yellow & blue open book: for most others looking at the symbol it will look just like a bicolor tick vertical arrow and will wonder why it is restricted to those colors which are not even part of the name; others will wonder why they can't just have a neutral symbol for an open book, when we have an open envelope, or why there's no incription on this book, i.e. just 2 blank pages or covers without any title). Many emojis are in fact either very centered to Japanese or US culture, including in their descriptions (this is notable on topics about cooking, beverages, animals, buildings, road signals, vehicles, equipements not much used in other places, imaginary characters/creatures...). The historic origin of cultures is almost ignored around the Mediterrean Sea between Europe, Western Asia and Africa, even if these topics are also existing everywhere else and probably more universal (but just less used). -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun Jun 28 02:43:14 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Sun, 28 Jun 2015 15:43:14 +0800 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: 2015?6?28? ??11:49? "Philippe Verdy" ??? > > 2015-06-28 2:33 GMT+02:00 Doug Ewell : >> >> Noah Slater wrote: >> >>> I found this: >>> >>>> [...] the UTC does not wish to entertain further proposals for >>>> encoding of symbol characters for flags, whether national, state, >>>> regional, international, or otherwise. References to UTC Minutes: >>>> [134-C2], January 28, 2013. 
>>> >>> >>> http://www.unicode.org/alloc/nonapprovals.html >> >> >> I think the phrase "or otherwise" above might have been intended to mean "or otherwise." > > > But this statement of early 2013 was contradicted by the addition of hundreds of new national flags (only because a few national flags were part of some Japanese emojis sets, and it was not admissible to have just a handlful of countries with flags but not all others). > Wouldn't the existence of Regional Indicator Symbols(=those flag symbols) themselves avoided the need of adding new regional/national/international flags already? and the 2013 addition do not add flag themselves to the unicode, just some special form of letters that can be used to form flags. >>> >>> I looked up the minutes, but could not find a more detailed >>> explanation. My guess is that these concerns related to geopolitical >>> issues. Hopefully the same rationale does not apply to the rainbow >>> flag. >> >> >> My guess is that one reason certain rejected requests are added to the Archive of Notices of Non-Approval is so that the UTC doesn't have to haul out their original explanation or re-argue the same points when the same request, or a similar one, is made again. > > > As soon as Unicode accepted the Japanese emojis sets promoted by its local telcos, including the few national flags the argument was dead. In fact there are also lot of redundant emojis from these sets that were accepted or were just minor variants of other existing Dings already encoded. Now we see an explosion of emojis, but less efforts for historic scripts found in our museums and libraries. > > The reason being that popular demand won (e.g. look at the Japanese-specific symbol for newbie: a yellow & blue open book: for most others looking at the symbol it will look just like a bicolor tick vertical arrow and will wonder why it is restricted to those colors which are not even part of the name; others will wonder why they can't just have a neutral symbol for an open book, when we have an open envelope, or why there's no incription on this book, i.e. just 2 blank pages or covers without any title). > > Many emojis are in fact either very centered to Japanese or US culture, including in their descriptions (this is notable on topics about cooking, beverages, animals, buildings, road signals, vehicles, equipements not much used in other places, imaginary characters/creatures...). The historic origin of cultures is almost ignored around the Mediterrean Sea between Europe, Western Asia and Africa, even if these topics are also existing everywhere else and probably more universal (but just less used). -------------- next part -------------- An HTML attachment was scrubbed... URL: From costello at mitre.org Sun Jun 28 07:31:51 2015 From: costello at mitre.org (Costello, Roger L.) Date: Sun, 28 Jun 2015 12:31:51 +0000 Subject: Applying Postel's Law to XML, from a Unicode perspective? Message-ID: Hi Folks, Postel's Law says: Be liberal in what you accept, and conservative in what you send. How might Postel's Law be applied to web services that receive XML and sends out XML? Here's one idea: a web service is willing to receive UTF-8 XML documents containing a pseudo-BOM; the web service sends out UTF-8 XML documents without the pseudo-BOM. Can you think of Unicode errors in inbound XML documents that a web service might be willing to accept? 
/Roger From daniel.buenzli at erratique.ch Sun Jun 28 08:25:24 2015 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Sun, 28 Jun 2015 14:25:24 +0100 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: On Sunday, 28 June 2015 at 13:31, Costello, Roger L. wrote: > Can you think of Unicode errors in inbound XML documents that a web service might be willing to accept? It depends a bit on your use case and setting (e.g. on the web, security may need to be taken into account), but one thing that could be done is to not have hard failures on character stream decoding errors but simply notify the user of the problem and continue by replacing the offending bytes by the Unicode replacement character U+FFFD until you manage to resynchronize the UTF-{8,16} byte stream and see if you manage to still get the parsing done. In practice such semi-broken XML documents can be produced by the export procedures of legacy software which fail to correctly encode some of the more special characters they have in another legacy encoding. It's better to eventually correct these documents and as such this should not be done *silently*, but it's nicer to the user if your import procedures are "best-effort" and can recover from these kinds of error conditions. Best, Daniel From verdy_p at wanadoo.fr Sun Jun 28 08:26:22 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 15:26:22 +0200 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: For XML there's in fact no problem at all: XML (but also JSON) requires for its validity a single root element. If there's a BOM followed by another element, it is not a conforming XML document if that BOM is interpreted as part of a text element. If there's a BOM followed by an XML declaration, it cannot be a text element (the XML declaration must come before any other element). The only possibility of ambiguity is an XML document that consists only of a single text element (possibly embedding comments) and no other element and no XML declaration. Such a document is purely plain text in fact (with the only exception of the predefined named or numeric character entities starting with "&" and terminated by ";"). In summary, there's no problem at all for XML (or JSON, or other text-encoded syntaxes including JavaScript, where a leading ZWNBSP cannot be valid in its syntax). The theoretical ambiguity only exists with (unstructured) plain text (which has no defined syntax to restrict its validity), and for that, plain texts should include a MIME document type in their transport headers to define the behavior of the BOM. And if possible, if there's a leading ZWNBSP starting this text, it should be doubled to make sure it will be interpreted correctly, as part of the transport layer. But in practice, unstructured plain text documents never need to start with ZWNBSP (the only exception being short individual plain text database fields, which are still rarely needed without a container; this includes CSV files, where text fields should be surrounded by quotation marks, or which start with a leading row defining names of columns that never needs any leading ZWNBSP).
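To make the import side of this concrete, here is a minimal Python sketch of the lenient decoding Daniel describes above, combined with tolerating the pseudo-BOM from Roger's original question: malformed bytes become U+FFFD and the user is warned, while output stays conservative (well-formed UTF-8, no BOM). The helper name and the sample bytes are made up for illustration; a real service would also report the error positions so the source document can eventually be corrected rather than silently patched.

import codecs

def lenient_decode(data):
    # Hypothetical helper: drop a leading UTF-8 pseudo-BOM if present, then
    # decode with U+FFFD substitution instead of failing hard.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    text = data.decode("utf-8", errors="replace")
    return text, "\uFFFD" in text

# Hypothetical inbound payload: pseudo-BOM plus XML containing one stray
# non-UTF-8 byte (0xE9, a bare latin-1 "e with acute").
inbound = codecs.BOM_UTF8 + b"<note>caf\xe9</note>"
text, had_errors = lenient_decode(inbound)
if had_errors:
    print("warning: malformed UTF-8 was replaced with U+FFFD; ask the sender to fix the export")
outbound = text.encode("utf-8")  # conservative output: clean UTF-8, no BOM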
Being liberal does not really introduce a security issue, including for digitally signed texts (signed plain texts also have other requirements related to the interpretation of line breaks and whitespace: the simple fix is to start the text with an empty line, and line breaks and whitespace are collapsed to a single space prior to computing the digital signature (hash / digest)). 2015-06-28 14:31 GMT+02:00 Costello, Roger L. : > Hi Folks, > > Postel's Law says: > > Be liberal in what you accept, and > conservative in what you send. > > How might Postel's Law be applied to web services that receive XML and > send out XML? > > Here's one idea: a web service is willing to receive UTF-8 XML documents > containing a pseudo-BOM; the web service sends out UTF-8 XML documents > without the pseudo-BOM. > > Can you think of Unicode errors in inbound XML documents that a web > service might be willing to accept? > > /Roger > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Sun Jun 28 08:49:45 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Sun, 28 Jun 2015 15:49:45 +0200 Subject: Applying Postel's Law to XML, from a Unicode perspective? In-Reply-To: References: Message-ID: On Sun, Jun 28, 2015 at 2:31 PM, Costello, Roger L. wrote: > How might Postel's Law be applied to web services that receive XML and > send out XML? > > Here's one idea: a web service is willing to receive UTF-8 XML documents > containing a pseudo-BOM; the web service sends out UTF-8 XML documents > without the pseudo-BOM. > > Can you think of Unicode errors in inbound XML documents that a web > service might be willing to accept? > Your question is not at all in the scope of Unicode. It is an XML issue, so should be directed to the W3C, not here. -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Jun 28 12:43:38 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 11:43:38 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: <1F5DD19ABA0F4CC6A1A43ADB184EA572@DougEwell> gfb hjjhjh wrote: > Wouldn't the existence of Regional Indicator Symbols (i.e. those flag > symbols) already have avoided the need to add new regional/national/ > international flags? And the 2013 addition did not add flags > themselves to Unicode, just a special form of letters that can > be used to form flags. And in fact, the Regional Indicator Symbols were added in Unicode 6.0 (October 2010), more than a year before the proposal to encode US FLAG as a unitary character was even written. And the non-approval text in 2013 specifically mentioned the RIS as one of the reasons for rejecting the unitary character. There's no contradiction. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From eric.muller at efele.net Sun Jun 28 13:28:12 2015 From: eric.muller at efele.net (Eric Muller) Date: Sun, 28 Jun 2015 11:28:12 -0700 Subject: UDHR in Unicode: 400 translations in text form! Message-ID: <55903CBC.9050900@efele.net> I am pleased to announce that the UDHR in Unicode project (http://unicode.org/udhr) has reached a notable milestone: we now have 400 translations of the Universal Declaration of Human Rights in text form. The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu de Silva and Sascha Brawer. Many thanks to them and to all the contributors.
There is still plenty of work: most translations would benefit from a review, and there are 55 translations for which we have PDFs or images, but not yet the text form (look for stage 2 translations). The site has also been revamped a bit, with a more functional map, and a more functional table of the translations. The mapping to ISO 639-3 and BCP 47 have been updated to take into account the evolution of those standards. Again, thanks to all the contributors, past, present and future, Eric. PS: I believe I have taken care of all the backlog of contributions and comments. If I missed something, sorry, and please ping me again. From doug at ewellic.org Sun Jun 28 13:59:27 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 12:59:27 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: More: >> [...] the UTC does not wish to entertain further proposals for >> encoding of symbol characters for flags, whether national, state, >> regional, international, or otherwise. References to UTC Minutes: >> [134-C2], January 28, 2013. This is also why U+1F3C1 CHEQUERED FLAG doesn't set a precedent for encoding additional flags as single characters: it was also introduced in Unicode 6.0, more than two years earlier. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From verdy_p at wanadoo.fr Sun Jun 28 14:20:32 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 28 Jun 2015 21:20:32 +0200 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: Note: The marker icons showing languages in the Leaflet component (over the OSM map) are not working (broken links) GET http://www.unicode.org/udhr/cdn/cdnjs.cloudflare.com/ajax/libs/leaflet/0.7.3/images/marker-icon.png : HTTP error 404 (Not Found) Also the locations assigned of some international languages is strange: Esperanto is mapped in France at the location where we would expect Picard [pcd], Picard is located in a location just near the border of Belgium, where this is actually the local "ch'ti" variant, spoken in the French Flanders aound Lille. Standard French is located not in Paris but near Orleans where we would expect the Orleanais regional variant of French. In fact the nearer location for French gives us Interlingua instead, whose usage in France is much more rare than in other countries (may be it was created there and there's still some local associations promoting it from there). I was expecting to find Interlingua somewhere between South America and Asia. But in fact I would have placed those international languages somewhere in the middle of an ocean, just aligned vertically in a list along a meridian (across the Atlantic or Pacific for example) ---- Some languages do have an ISO 639-3 code. E.g. - Tetum, official in Timor-Leste, is currently "coded" as "010" (mapped to "und" in ISO 639-3), it should be "tet". - Forro (Saotomense) is a Portuguese-based creole in Sao Tome, currently "coded" as "007" (mapped to "und"), it should use "cri". - Kimbundu should also use "kmb" and not "009" - Umbundo (Umbundu) should also use "umb" and not "011" 2015-06-28 20:28 GMT+02:00 Eric Muller : > I am pleased to announce that the UDHR in Unicode project ( > http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu > de Silva and Sascha Brawer. 
Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ken.shirriff at gmail.com Sun Jun 28 14:30:07 2015 From: ken.shirriff at gmail.com (Ken Shirriff) Date: Sun, 28 Jun 2015 12:30:07 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: I don't mean to be critical, but I find the UDHR page is really hard to use. Observed behavior: I click on the map. I get circles. I click on a circle and get more circles. I click again and get more circles. Keep clicking and I get weird image icons with letters. I click on one and I get a popup with mysterious sh X C T H OHCHR. I click on a language name and get a description of the language. I hit back and need to go through the entire circle thing again. I click on sh and get "Status: no known problems". I hit back and go through the circle thing again. I click on X and get an XML file. I do the back and circle thing again. I click C and get a list of Unicode characters. I click on the list of tables and get the same thing, except without the multiple layers of circles. After several minutes of clicking, I haven't seen any translations. Expected behavior: I click on the map and see a translation of the UDHR into an interesting language with a cool font. Ken On Sun, Jun 28, 2015 at 11:28 AM, Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project ( > http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu > de Silva and Sascha Brawer. Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Sun Jun 28 14:51:22 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 19:51:22 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: Sorry to be a pain. I mentioned I looked up the minutes and couldn't find anything apropos. 
Could someone explain the rational behind 134-C2 and how it might apply to the rainbow flag proposal ? On Sun, 28 Jun 2015 at 20:04 Doug Ewell wrote: > More: > > >> [...] the UTC does not wish to entertain further proposals for > >> encoding of symbol characters for flags, whether national, state, > >> regional, international, or otherwise. References to UTC Minutes: > >> [134-C2], January 28, 2013. > > This is also why U+1F3C1 CHEQUERED FLAG doesn't set a precedent for > encoding additional flags as single characters: it was also introduced > in Unicode 6.0, more than two years earlier. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Sun Jun 28 15:16:29 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Mon, 29 Jun 2015 04:16:29 +0800 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> Message-ID: 2015?5?30? ??5:19? "Andrew West" wrote? > > On 30 May 2015 at 02:50, Ken Whistler wrote: > > > > 1. I have seen a chinese character ??? from a Vietnamese dictionary NHAT > > DUNG THUONG DAM DICTIONARY > > > > Extension F is harder to track down, because it has not yet been > > approved by the UTC, and comes in two pieces, with different > > progression so far in the ISO committee. Perhaps somebody on this list > > who has better access to the relevant documents can let you > > know whether ??? can be found in those sets. > > It's not in my lists of F1 and F2 characters. oh and by the way, could you (or someone else) please help look for the character ??? also? Just seen a Chinese Wikipedia article introducing an ethnic group with the character as partvof its name https://zh.m.wikipedia.org/wiki/(??)?? but without a proper character for so. The article sourced a CCTV program for ots origin. And there seem to be a dozen more wikipedia article that contain unencoded han characters, as listed in https://zh.wikipedia.org/wiki/Category:?????????? -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Sun Jun 28 15:23:33 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 14:23:33 -0600 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: Message-ID: <84968C090B5F47409EF2006CF5309985@DougEwell> Noah Slater wrote: > Sorry to be a pain. I mentioned I looked up the minutes and couldn't > find anything apropos. > > Could someone explain the rational behind 134-C2 and how it might > apply to the rainbow flag proposal ? The following is informal and dilettante, since only a UTC officer can give a formal rationale for what happened in this 2013 meeting. According to the minutes, consensus decision 134-C2, by itself, says only: "Consensus: The Unicode Technical Committee does not approve encoding a United States flag symbol." That refers only to the one symbol proposed in L2/12-094. But the same discussion also led to an action item, 134-A5: "Action Item for Ken Whistler: Add the United States Flag symbol to notices of non-approval." And that notice says, in full (not elided): "Disposition: The UTC rejected the proposal. The mapping to an existing emoji symbol for the US flag is already possible by using pairs of regional indicator symbols. 
Additionally, the domain of flags is generally not amenable to representation by encoded characters, and the UTC does not wish to entertain further proposals for encoding of symbol characters for flags, whether national, state, regional, international, or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." The last clause is the relevant one here: "whether national, state, regional, international, or otherwise." The words "or otherwise" could be interpreted as saying that no *specific* flag of any kind will be encoded in the future as a single character, partly because the domain of flags is so open-ended. That would include flags associated with or representing specific groups of individuals or social causes. Now, we know that this is all flexible and subject to momentary change. Trying to predict what will and will not be considered "in scope" is more difficult today than ever. Perhaps your best bet is simply to write and submit a proposal, and see what happens. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From nslater at tumbolia.org Sun Jun 28 17:36:19 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 22:36:19 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <84968C090B5F47409EF2006CF5309985@DougEwell> References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: Thanks for summarising that in an email, Doug. I really wish they'd provided a justification for this statement! :) I guess that this is the right list for a UTC officer to give some sort of feedback. On Sun, 28 Jun 2015 at 21:23 Doug Ewell wrote: > Noah Slater wrote: > > > Sorry to be a pain. I mentioned I looked up the minutes and couldn't > > find anything apropos. > > > > Could someone explain the rational behind 134-C2 and how it might > > apply to the rainbow flag proposal ? > > The following is informal and dilettante, since only a UTC officer can > give a formal rationale for what happened in this 2013 meeting. > > According to the minutes, consensus decision 134-C2, by itself, says > only: "Consensus: The Unicode Technical Committee does not approve > encoding a United States flag symbol." That refers only to the one > symbol proposed in L2/12-094. > > But the same discussion also led to an action item, 134-A5: "Action Item > for Ken Whistler: Add the United States Flag symbol to notices of > non-approval." > > And that notice says, in full (not elided): > > "Disposition: The UTC rejected the proposal. The mapping to an existing > emoji symbol for the US flag is already possible by using pairs of > regional indicator symbols. Additionally, the domain of flags is > generally not amenable to representation by encoded characters, and the > UTC does not wish to entertain further proposals for encoding of symbol > characters for flags, whether national, state, regional, international, > or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." > > The last clause is the relevant one here: "whether national, state, > regional, international, or otherwise." The words "or otherwise" could > be interpreted as saying that no *specific* flag of any kind will be > encoded in the future as a single character, partly because the domain > of flags is so open-ended. That would include flags associated with or > representing specific groups of individuals or social causes. > > Now, we know that this is all flexible and subject to momentary change. > Trying to predict what will and will not be considered "in scope" is > more difficult today than ever. 
Perhaps your best bet is simply to write > and submit a proposal, and see what happens. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From steve at swales.us Sun Jun 28 17:02:21 2015 From: steve at swales.us (Steve Swales) Date: Sun, 28 Jun 2015 15:02:21 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <84968C090B5F47409EF2006CF5309985@DougEwell> References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. Sent from my iPhone From nslater at tumbolia.org Sun Jun 28 18:14:13 2015 From: nslater at tumbolia.org (Noah Slater) Date: Sun, 28 Jun 2015 23:14:13 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: QR for... ? Queer Rainbow? :) On Sun, 28 Jun 2015 at 23:52 Steve Swales wrote: > Another way the Pride Flag might be mapped into Unicode without adding > code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding > to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for > example, might be an appropriate choice. > > > Sent from my iPhone > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From everson at evertype.com Sun Jun 28 18:20:18 2015 From: everson at evertype.com (Michael Everson) Date: Mon, 29 Jun 2015 00:20:18 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: On 28 Jun 2015, at 23:02, Steve Swales wrote: > > Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. It would be poor standardization to do this. Nothing would prevent the 3166 MA from assigning any unassigned code. Michael Everson * http://www.evertype.com/ From doug at ewellic.org Sun Jun 28 18:53:47 2015 From: doug at ewellic.org (Doug Ewell) Date: Sun, 28 Jun 2015 17:53:47 -0600 Subject: Adding RAINBOW FLAG to Unicode Message-ID: Michael Everson wrote: > On 28 Jun 2015, at 23:02, Steve Swales wrote: > >> Another way the Pride Flag might be mapped into Unicode without >> adding code points would be to use a REGIONAL INDICATOR SYMBOL pair >> corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 >> + U+1F1F7, for example, might be an appropriate choice. > > It would be poor standardization to do this. Nothing would prevent the > 3166 MA from assigning any unassigned code. QM through QZ (among others) are user-assigned code elements. But I'm not sure whether the RIS are defined to use them that way. At the least, it would probably call for some sort of private agreement, similar to using the Unicode PUAs. 
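For reference, the Regional Indicator Symbol mechanism discussed throughout this thread is purely generative: a two-letter code is mapped onto two symbols in the U+1F1E6..U+1F1FF range, and whether a given pair is shown as a flag is entirely up to the implementation. A small Python sketch, given here only as an illustration of the mapping:

    REGIONAL_INDICATOR_A = 0x1F1E6  # U+1F1E6 REGIONAL INDICATOR SYMBOL LETTER A

    def regional_indicator_pair(alpha2: str) -> str:
        # Map a two-letter code (e.g. ISO 3166-1 alpha-2) to a pair of
        # Regional Indicator Symbols by offsetting from LETTER A.
        if len(alpha2) != 2 or not (alpha2.isascii() and alpha2.isalpha()):
            raise ValueError("expected a two-letter code")
        return "".join(chr(REGIONAL_INDICATOR_A + ord(c) - ord("A")) for c in alpha2.upper())

    print(regional_indicator_pair("US"))  # U+1F1FA U+1F1F8, shown as the US flag where supported
    print(regional_indicator_pair("QR"))  # two valid symbols, but no flag is defined for this pair

This is why a user-assigned pair such as "QR" would only work by private agreement: the characters themselves are well formed, but nothing in the standard ties them to any particular image.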
In general, Michael's right; assuming that it's OK to use an "unallocated" 3166-1 sequence would be like assuming it's OK to use, say, U+0530 for a privately defined character, just because it's currently unassigned. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? From steve at swales.us Sun Jun 28 18:56:42 2015 From: steve at swales.us (Steve Swales) Date: Sun, 28 Jun 2015 16:56:42 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <904C4C87-E2B4-4227-870F-04DD6935FC6B@swales.us> Message-ID: <925CE970-D1EF-42D4-8666-A4E5D3285196@swales.us> QR is actually in the so called "user-assigned" area, so unlikely it will be officially assigned, but also hard to standardize as anything in particular. -steve Sent from my iPhone > On Jun 28, 2015, at 4:20 PM, Michael Everson wrote: > >> On 28 Jun 2015, at 23:02, Steve Swales wrote: >> >> Another way the Pride Flag might be mapped into Unicode without adding code points would be to use a REGIONAL INDICATOR SYMBOL pair corresponding to an unallocated ISO3166-1 alpha-2 sequence. U+1F1F6 + U+1F1F7, for example, might be an appropriate choice. > > It would be poor standardization to do this. Nothing would prevent the 3166 MA from assigning any unassigned code. > > Michael Everson * http://www.evertype.com/ > > From leob at mailcom.com Mon Jun 29 00:24:33 2015 From: leob at mailcom.com (Leo Broukhis) Date: Sun, 28 Jun 2015 22:24:33 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <55903CBC.9050900@efele.net> References: <55903CBC.9050900@efele.net> Message-ID: Ukrainian is in Estonia, Estonian is in the Baltic sea. On Sun, Jun 28, 2015 at 11:28 AM, Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project > (http://unicode.org/udhr) has reached a notable milestone: we now have 400 > translations of the Universal Declaration of Human Rights in text form. > > The latest translation is in Sinhala, thanks to Keshan Sodimana, Pasundu de > Silva and Sascha Brawer. Many thanks to them and to all the contributors. > > There is still plenty of work: most translations would benefit from a > review, and there are 55 translations for which we have PDFs or images, but > not yet the text form (look for stage 2 translations). > > The site has also been revamped a bit, with a more functional map, and a > more functional table of the translations. The mapping to ISO 639-3 and BCP > 47 have been updated to take into account the evolution of those standards. > > Again, thanks to all the contributors, past, present and future, > > Eric. > > PS: I believe I have taken care of all the backlog of contributions and > comments. If I missed something, sorry, and please ping me again. From andrewcwest at gmail.com Mon Jun 29 03:33:14 2015 From: andrewcwest at gmail.com (Andrew West) Date: Mon, 29 Jun 2015 09:33:14 +0100 Subject: Some questions about Unicode's CJK Unified Ideograph In-Reply-To: References: <55691764.4030802@att.net> Message-ID: On 28 June 2015 at 21:16, gfb hjjhjh wrote: > > oh and by the way, could you (or someone else) please help look for the > character ??? also? Not in the pipeline as far as I can see. > Just seen a Chinese Wikipedia article introducing an > ethnic group with the character as partvof its name > https://zh.m.wikipedia.org/wiki/(??)?? but without a proper character for > so. The article sourced a CCTV program for ots origin. ... which calls them "??", and so is not evidence for the existence of the character "??" 
(I don't doubt that the character exists, but neither the Wikipedia article nor the CCTV web page are sufficient evidence for it). > And there seem to be a dozen more wikipedia article that contain unencoded > han characters, as listed in > https://zh.wikipedia.org/wiki/Category:?????????? There are some 60 unencoded CJK characters in use on Wikimedia projects (see https://commons.wikimedia.org/wiki/Category:Chinese_characters_not_in_Unicode), which I include in my "BabelStone Han PUA" font (see U+F2D6..U+F2EF, U+F2FD..U+F2FF, U+F3E0, U+F4C0..U+F4E1 listed at http://www.babelstone.co.uk/Fonts/PUA.html). The problem with most of these characters is that Wikipedia is not a suitable source for encoding, and evidence for use of these characters in printed sources needs to be presented to the UTC and IRG for them to have any chance of being encoded. For an example of what you should do to get these characters encoded see the latest revision of Ming Fan's "Proposal to add 94 Chinese characters to UAX #45" (http://www.unicode.org/L2/L2015/15098r3-chinese.pdf). Andrew From eric.muller at efele.net Mon Jun 29 08:26:30 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:26:30 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914786.5070805@efele.net> On 6/28/2015 10:24 PM, Leo Broukhis wrote: > Ukrainian is in Estonia, Estonian is in the Baltic sea. I took the locations from glottolog.org. The first error is mine, I mistyped a value. The second error comes from Glottolog, I corrected and reported to them. Will appear in the next update. Thanks, Eric. From eric.muller at efele.net Mon Jun 29 08:49:22 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:49:22 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914CE2.8040700@efele.net> On 6/28/2015 12:20 PM, Philippe Verdy wrote: > Note: The marker icons showing languages in the Leaflet component > (over the OSM map) are not working (broken links) Fixed, I believe. > Also the locations assigned of some international languages is strange: > > Esperanto ... Picard ... Standard French These locations for those come from http://glottolog.org. Unless those locations are obviously wrong, I'd prefer to keep them aligned. > But in fact I would have placed those international languages > somewhere in the middle of an ocean, just aligned vertically in a list > along a meridian (across the Atlantic or Pacific for example) A few are already in Antarctica. I'll move Esperanto and Interlingua there. > > Some languages do have an ISO 639-3 code. E.g. > - Tetum, official in Timor-Leste, is currently "coded" as "010" > (mapped to "und" in ISO 639-3), it should be "tet". In general, identification of the language of the translations is not trivial. I have learned to not trust just the names provided with the translations. For this one, there is another translation, [tet], which most likely is tet/Tetun. [010] looks like a fairly different language and it is not clear to me that it is Tetun. I'd rather have some informed recommendation before assigning a language to [010]. It does not help that the source site does not seem accessible right now. > - Forro (Saotomense) is a Portuguese-based creole in Sao Tome, > currently "coded" as "007" (mapped to "und"), it should use "cri". 
The OHCHR site warns: "not to confuse Crioulo Santomense with Santomense (a variety and dialect of Portuguese in S?o Tom? and Pr?ncipe)" Again, I'd prefer some informed recommendation. > - Kimbundu should also use "kmb" and not "009" > - Umbundo (Umbundu) should also use "umb" and not "011" According to the Ethnologue, both Kimbundu and Umbundu are used both as language names and as family names. Given that I don't really trust the sources of those names, I'd prefer some informed recommendation. Thanks, Eric. From eric.muller at efele.net Mon Jun 29 08:58:10 2015 From: eric.muller at efele.net (Eric Muller) Date: Mon, 29 Jun 2015 06:58:10 -0700 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: References: <55903CBC.9050900@efele.net> Message-ID: <55914EF2.9090607@efele.net> On 6/28/2015 12:30 PM, Ken Shirriff wrote: > I don't mean to be critical, but I find the UDHR page is really hard > to use. > > Thanks for the observations. I'll try to find a better organization. Eric. From kenwhistler at att.net Mon Jun 29 09:50:20 2015 From: kenwhistler at att.net (Ken Whistler) Date: Mon, 29 Jun 2015 07:50:20 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> Message-ID: <55915B2C.3060809@att.net> Noah, Additional information you should have is that the UTC is about to publish a new Public Review Issue on the topic of an extended mechanism for the representation of more flag emoji with sequences of tag characters. (Note: *not* representation as encoded single character symbols.) That PRI, when it is available (should be quite soon -- early this week), will be explicitly addressing concerns about state, regional, and international flags. I don't think it will explicitly address "or otherwise", but additional flag emoji that don't happen to be covered by the regional and sub-regional tag mechanisms in the PRI would certainly be in scope for discussion and feedback on the PRI. Other short notes on comments in this long thread: 1. The claim that Twitter is including a RAINBOW FLAG would be taken into consideration by the Emoji Subcommittee. Compatibility with existing systems in wide use is a strong factor in favor of additions: http://www.unicode.org/reports/tr51/#Selection_Factors_Compatibility 2. But on the other hand the offhand note: "When I mentioned my email to a queer friend, they asked if I might propose other pride flags (*as there are many*)." (emphasis added) illustrates the fundamental problem here. There is no effective end to the "or otherwise" case for flags as symbols, and that is why they are "generally not amenable to representation by encoded characters". Any simple image search for "pride flag" or "pride flag list" illustrates the problem amply: https://s-media-cache-ak0.pinimg.com/236x/69/83/f3/6983f3b9a4f68468bb101383006aa565.jpg https://s-media-cache-ak0.pinimg.com/236x/61/88/95/618895059533cb5b52c55cecd641881d.jpg That is not the realm of *characters* -- it is the realm of graphic design of flags, emblems, and frankly, at this point, heraldry. ;-) So, to sum up, I suggest that this thread about the RAINBOW FLAG be directed to the soon-to-be-posted Public Review Issue about extending the generative mechanisms for representing emoji symbols for flags, but that that feedback carefully consider how such an addition would coexist with other mechanisms for extensions of flag representation *and* how it could be reasonably limited to one instead of 28 (... or 500) more flags. --Ken P.S. 
While I do think there might be a strong case made for the RAINBOW FLAG to be added to the list of emoji flags representable by *some* kind of extension mechanism in Unicode, there really, really is no end to the "or otherwise" case. I happen to live in the city of Oakland, California. Try an image search on "Oakland flag". You start with a more-or-less official City flag, which kind of fits in the city as sub-region of region paradigm, and which can be spotted flying at the Oakland City Hall, but this quickly tails off into a gazillion variants, and various flags as sports memorabilia. I'm quite certain that an Oakland A's flag emoji would be locally quite popular if it were available on people's phones, for example. On 6/28/2015 3:36 PM, Noah Slater wrote: > > I really wish they'd provided a justification for this statement! :) I > guess that this is the right list for a UTC officer to give some sort > of feedback. > > On Sun, 28 Jun 2015 at 21:23 Doug Ewell > wrote: > > > Additionally, the domain of flags is > generally not amenable to representation by encoded characters, > and the > UTC does not wish to entertain further proposals for encoding of > symbol > characters for flags, whether national, state, regional, > international, > or otherwise. References to UTC Minutes: [134-C2], January 28, 2013." > > The last clause is the relevant one here: "whether national, state, > regional, international, or otherwise." The words "or otherwise" could > be interpreted as saying that no *specific* flag of any kind will be > encoded in the future as a single character, partly because the domain > of flags is so open-ended. That would include flags associated with or > representing specific groups of individuals or social causes. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From doug at ewellic.org Mon Jun 29 09:58:00 2015 From: doug at ewellic.org (Doug Ewell) Date: Mon, 29 Jun 2015 07:58:00 -0700 Subject: UDHR in Unicode: 400 translations in text form! Message-ID: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> Eric Muller wrote: > I am pleased to announce that the UDHR in Unicode project > (http://unicode.org/udhr) has reached a notable milestone: we now have > 400 translations of the Universal Declaration of Human Rights in text > form. I'd like to congratulate Eric and his contributors for this achievement. It's a large and complex project, at least 9 years in the making. I use this data (with attribution) in my BCP 47 language-tagging application, to display Article I of the UDHR as sample text in the language denoted by a user-created tag. The extensive language coverage of the UDHR data and its correlation to BCP 47 tags via the XML index are especially helpful. With the complexity of this project, including trying to associate constructed languages with geographical locations and relying on third-party data that may be conflicting or simply wrong, there's bound to be room for improvement. I'm sure the issues reported on this list will be ironed out over time, and in the meantime I hope Eric's announcement encourages even more contributors and translations as well as bug reports. -- Doug Ewell | http://ewellic.org | Thornton, CO ???? 
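One simple way to pick sample text for a user-created tag is longest-prefix matching over the BCP 47 subtags; the sketch below is purely illustrative and assumes a plain dictionary keyed by lowercased tags, not the actual structure of Doug's application or of the UDHR XML index:

    def lookup_sample(tag, samples):
        # Walk from the full tag down to its primary language subtag,
        # e.g. "pt-BR" falls back to "pt" if no regional entry exists.
        subtags = tag.lower().split("-")
        while subtags:
            key = "-".join(subtags)
            if key in samples:
                return samples[key]
            subtags.pop()
        return None

    samples = {"en": "All human beings are born free and equal in dignity and rights."}
    print(lookup_sample("en-GB", samples))  # falls back to the "en" entry

BCP 47 tags are case-insensitive, which is why both the lookup key and the dictionary keys are lowercased here.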
From rscook at wenlin.com Mon Jun 29 10:57:07 2015 From: rscook at wenlin.com (Richard Cook) Date: Mon, 29 Jun 2015 08:57:07 -0700 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <55915B2C.3060809@att.net> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: Ken, I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells me so ... T C UTF-8 Codepoint : Name : Annotations 1 ?? C2_A0 1F308 RAINBOW ... but could ?? also be a 'rainbow (flag)'? -Richard [? iMM (iPhone Mangled Message)] -------------- next part -------------- An HTML attachment was scrubbed... URL: From mark at macchiato.com Mon Jun 29 11:38:00 2015 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 29 Jun 2015 18:38:00 +0200 Subject: UDHR in Unicode: 400 translations in text form! In-Reply-To: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> References: <20150629075800.665a7a7059d7ee80bb4d670165c8327d.69e168f721.wbe@email03.secureserver.net> Message-ID: Absolutely; this takes a lot of work, and Eric has done a stellar job of managing the details. (I'm sure he also appreciates any and all of the feedback on items to fix!) Mark *? Il meglio ? l?inimico del bene ?* On Mon, Jun 29, 2015 at 4:58 PM, Doug Ewell wrote: > Eric Muller wrote: > > > I am pleased to announce that the UDHR in Unicode project > > (http://unicode.org/udhr) has reached a notable milestone: we now have > > 400 translations of the Universal Declaration of Human Rights in text > > form. > > I'd like to congratulate Eric and his contributors for this achievement. > It's a large and complex project, at least 9 years in the making. > > I use this data (with attribution) in my BCP 47 language-tagging > application, to display Article I of the UDHR as sample text in the > language denoted by a user-created tag. The extensive language coverage > of the UDHR data and its correlation to BCP 47 tags via the XML index > are especially helpful. > > With the complexity of this project, including trying to associate > constructed languages with geographical locations and relying on > third-party data that may be conflicting or simply wrong, there's bound > to be room for improvement. I'm sure the issues reported on this list > will be ironed out over time, and in the meantime I hope Eric's > announcement encourages even more contributors and translations as well > as bug reports. > > -- > Doug Ewell | http://ewellic.org | Thornton, CO ???? > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From nslater at tumbolia.org Mon Jun 29 12:06:42 2015 From: nslater at tumbolia.org (Noah Slater) Date: Mon, 29 Jun 2015 17:06:42 +0000 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: <55915B2C.3060809@att.net> References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: Thanks for the reply, Ken! Comments inline. On Mon, 29 Jun 2015 at 15:50 Ken Whistler wrote: > There is no effective end to the "or otherwise" case for flags as symbols, > and that is why they are "generally not amenable to representation by > encoded characters". > Well. Arguably, Unicode represents food, and there is no effective end to the "or otherwise" case for food items either. (As I'm sure you're all aware of given the popularity of requests in this category.) 
As mentioned earlier in the thread, it seems to me that the Consortium has a rigorous (and notoriously hard to satisfy) process for guarding against such things. The rainbow flag is ubiquitous, so much so that it's even become a compat issue with existing communications platforms. The same is most likely not true for the less common flags. It seems to me that the correct thing to do here is to apply the existing process to this proposal (and any subsequent ones, should they occur). I similarly doubt that there is a particularly strong case for the Oakland flag, in accordance with Annex C. That is not the realm of *characters* -- it is the realm of graphic design > of > flags, emblems, and frankly, at this point, heraldry. ;-) > Well, you could say the same about all the emojis. Emojis blur the line between characters (in a typographical sense) and iconography. Again, I would simply point out that Annex C seems to be designed to handle exactly this domain of concerns. > So, to sum up, I suggest that this thread about the RAINBOW FLAG be > directed to the soon-to-be-posted Public Review Issue about extending > the generative mechanisms for representing emoji symbols for flags > How do we/I do that? I will restate that I think that if a RAINBOW FLAG emoji is added to Unicode, I expect wide use. And I am concerned that an alternate proposal would run the risk of not seeing wide use. (Though I have no actual experience here that informs that. I welcome feedback on the topic.) To reply to Richard: I mention this in my first email :) > While it can be argued that the RAINBOW emoji itself is usable as a stand-in (as above), it usually requires some sort of additional context to work. There is a clear need for a rainbow flag that unambiguously symbolises queer pride. -------------- next part -------------- An HTML attachment was scrubbed... URL: From c933103 at gmail.com Mon Jun 29 14:04:11 2015 From: c933103 at gmail.com (gfb hjjhjh) Date: Tue, 30 Jun 2015 03:04:11 +0800 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: 2015?6?30? ??1:13? "Noah Slater" wrote? > > Thanks for the reply, Ken! Comments inline. > > On Mon, 29 Jun 2015 at 15:50 Ken Whistler wrote: >> >> There is no effective end to the "or otherwise" case for flags as symbols, and that is why they are "generally not amenable to representation by encoded characters". > > > Well. Arguably, Unicode represents food, and there is no effective end to the "or otherwise" case for food items either. (As I'm sure you're all aware of given the popularity of requests in this category.) > > As mentioned earlier in the thread, it seems to me that the Consortium has a rigorous (and notoriously hard to satisfy) process for guarding against such things. The rainbow flag is ubiquitous, so much so that it's even become a compat issue with existing communications platforms. The same is most likely not true for the less common flags. > > It seems to me that the correct thing to do here is to apply the existing process to this proposal (and any subsequent ones, should they occur). I similarly doubt that there is a particularly strong case for the Oakland flag, in accordance with Annex C. > As an outsider, In my opinion, it is very common for people to write sentences like "?? Really sorry!" or "?? let's meet there tomorroe" or "The ?? 
is tasty" even before unicode's introduction of these characters, but I can't think of different usecases that the rainbow flag would be used in this way. >> That is not the realm of *characters* -- it is the realm of graphic design of >> flags, emblems, and frankly, at this point, heraldry. ;-) > > > Well, you could say the same about all the emojis. Emojis blur the line between characters (in a typographical sense) and iconography. Again, I would simply point out that Annex C seems to be designed to handle exactly this domain of concerns. > As i typed above. >> >> So, to sum up, I suggest that this thread about the RAINBOW FLAG be >> directed to the soon-to-be-posted Public Review Issue about extending >> the generative mechanisms for representing emoji symbols for flags > > > How do we/I do that? > > I will restate that I think that if a RAINBOW FLAG emoji is added to Unicode, I expect wide use. And I am concerned that an alternate proposal would run the risk of not seeing wide use. (Though I have no actual experience here that informs that. I welcome feedback on the topic.) > As long as an incorporated solution is made like how those US or UK flag currently is presented in unicode emoji, I don't think different mechanism would matter too much as you see people using them. -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Mon Jun 29 14:14:42 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Mon, 29 Jun 2015 21:14:42 +0200 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: The way I see U+1F308 drawn in my browser (using the image linked from Google in the HTML below), is that it represents a rainbow sat on two clouds (not evident at small sizes to se that these are clouds as they just look like blue open curves) This is also strange because rainbows are normally not *above* clouds, but below them (or partly within them near their surface, is they are not too dense) Anyway these clouds and the sky around it are certainly not wanted on the flag itself, but are appropriate for the meteoric object in the sky. 2015-06-29 17:57 GMT+02:00 Richard Cook : > Ken, > > I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells > me so ... > > TCUTF-8Codepoint : Name : Annotations1[image: ??]C2_A01F308 RAINBOW > > > > ... but could [image: ??] also be a 'rainbow (flag)'? > > -Richard > > > [? iMM (iPhone Mangled Message)] > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f308.png Type: image/png Size: 3284 bytes Desc: not available URL: From nslater at tumbolia.org Mon Jun 29 19:53:30 2015 From: nslater at tumbolia.org (Noah Slater) Date: Tue, 30 Jun 2015 01:53:30 +0100 Subject: Adding RAINBOW FLAG to Unicode In-Reply-To: References: <84968C090B5F47409EF2006CF5309985@DougEwell> <55915B2C.3060809@att.net> Message-ID: On 29 June 2015 at 20:04, gfb hjjhjh wrote: > > As an outsider, In my opinion, it is very common for people to write > sentences like "[image: ??] Really sorry!" or "[image: ??] let's meet > there tomorroe" or "The [image: ??] is tasty" even before unicode's > introduction of these characters, but I can't think of different usecases > that the rainbow flag would be used in this way. 
> Do you mean, people were using these emojis on a platform that supported them before Unicode standardised them? To that I would respond that you only need to search Twitter for people using the #pride hashtag to see how it's used there. Unfortunately, as Slack is a private communication platform, it is hard to get usage examples. All we can state for sure is that people are using the rainbow flag in running text like any other emoji [image: ??] > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f354.png Type: image/png Size: 3280 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f44d.png Type: image/png Size: 2557 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f44c.png Type: image/png Size: 2525 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: emoji_u1f647.png Type: image/png Size: 1684 bytes Desc: not available URL: From richard.wordingham at ntlworld.com Tue Jun 30 01:47:46 2015 From: richard.wordingham at ntlworld.com (Richard Wordingham) Date: Tue, 30 Jun 2015 07:47:46 +0100 Subject: WORD JOINER vs ZWNBSP In-Reply-To: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> Message-ID: <20150630074746.79ff7cf7@JRWUBU2> On Sat, 27 Jun 2015 17:48:41 +0200 (CEST) Marcel Schneider wrote: > On Fri, Jun 26, Richard Wordingham wrote: > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote: >>> Still in French, the letter apostrophe, when used as current >>> apostrophe, prevents the following word from being identified as a >>> word because of the missing word boundary and, subsequently, >>> prevents the autoexpand from working. This can be fixed by adding >>> a word joiner after the apostrophe, thanks to an autocorrect entry >>> that replaces U+02BC inserted by default in typographic mode, with >>> U+02BC U+2060. >> No, this doesn't work. While the primary purpose of U+2060 is to >> prevent line breaks, it is also used to overrule word boundary >> detectors in scriptio continua. (It works quite well for >> spell-checking Thai in LibreOffice). It's name implies to me that it >> is intended to prevent a word boundary being deduced, through the >> strong correlation between word boundaries and line break opportunities. >> There doesn't seem to be a code for 'zero-width word boundary at >> which lines should not normally be broken'. > Well, I extrapolated from U+FEFF, which works fine for me, even in > this particular context. Does the tool misinterpret U+FEFF between Thai characters as a word boundary? Incidentally, which tool are you talking of? Richard. From charupdate at orange.fr Tue Jun 30 04:02:18 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 30 Jun 2015 11:02:18 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229> <20150626110243.GB18139@ebed.etf.cuni.cz> Message-ID: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18> On Sun, Jun 28, 2015, Peter Constable wrote: > Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. On my netbook, which is running Windows 7 Starter, U+2060 is not a part of any of the shipped fonts. 
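For context on the font question in this subthread: U+2060 and U+FEFF are format controls and default-ignorable code points, so rendering systems are expected to make them invisible rather than require a glyph, and glyph coverage in shipped fonts is not by itself a test of support. A quick check with Python's standard unicodedata module shows the relevant properties:

    import unicodedata

    # U+2060 and U+FEFF report General_Category=Cf (format controls), so no
    # visible glyph is needed; U+00A0 and U+202F are real spaces (Zs) that do
    # need font support.
    for cp in (0x2060, 0xFEFF, 0x00A0, 0x202F):
        ch = chr(cp)
        print(f"U+{cp:04X} {unicodedata.name(ch)} -> {unicodedata.category(ch)}")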
Arial Unicode MS?does not contain U+2060 because this is posterior to Unicode 2.0 (WJ has been encoded in 3.2), and unfortunately Arial Unicode MS despite of being one of the completest fonts worlwide, seems not to have been updated since its release based on Unicode 2.0. Consolas which is so complete it contains even U+202F while lt's a monospaced font, and which does contain also U+02BC MODIFIER LETTER APOSTROHPE, has no U+2060, but well U+FEFF. Knowing that whenever an unavailable character occurs, Windows searches for it in all fonts that are installed on the machine, I deduced that U+2060 is not a part of Windows 7 Starter and, by legitimate extrapolation, of Windows 7 on the whole. In any case, when your computer is a netbook, you couldn't choose to get another Windows version since Windows Starter is designed for netbooks. I know that other operating systems, say other Windows versions, are shipped with a significantly bigger number of fonts, but I won't program a keyboard layout which cannot work on every machine running any Windows version from 7 upwards. I guess I've been suspected still to blame Microsoft even when there's no reason, so I underscore that I do not have the least need of U+2060, because for word processing purposes, U+FEFF works very well for me and surely for everybody who is using Windows. I?add this precision because outside I use another OS which seemingly does not support U+02BC, by not having this character in any current font. Additionally I?mention that I've read that my netbook does not run well under the other OS, so I've little temptation of using other than Windows. There might be however a reason to prefer U+2060 over U+FEFF, which I?cannot test. The issue is the following: Further tests showed that U+FEFF is an unstable character, even more unstable than U+00A0 which at least is replaced with something (U+0020) when formatted text is converted to plain text, while U+FEFF simply disappears. This phenomenon is observed as well inside a word processor as between this and a text editor (whether the file format be UTF-8 or Unicode). If you wish to reproduce the tests, you may need the information that I used Microsoft Word Starter 2010 and Windows NotePad. Indeed I believe that we are in front of a widespread general misfunctioning. U+00A0 is currently used in French as a punctuation space (by that I mean, current word processors add U+00A0 before?????!?;?:?and?after??. [I know that the Unicode Punctuation Space is U+2008, that this is not designed for use with French punctuations, that U+202F is preferred with punctuations, that U+202F is not present in all fonts, therefore word processors cannot insert it by default, therefore U+00A0 stays in use and readers are accustomed to it.] When such text files with plenty of U+00A0, turning around between processes, end up to be converted to plain text, they become unusable. I mean that before using them, all instances where U+00A0 had be replaced with U+0020, must be corrected, whether by replacing U+0020 with the preferred U+202F, or with U+00A0 again (e.g. inside of names). Well, U+FEFF is roughly the same thing, it must be readded, which may prove much harder to achieve. In my tests, even if not recognized, U+2060 proved to be stable, but I?wonder what would be its fate if the system knew i'is "just" a word joiner. Regards, Marcel Schneider ? > Message du 28/06/15 01:38 > De : "Peter Constable" > A : "Petr Tomasek" , "Marcel Schneider" > Copie ? 
: "Unicode Mailing List" > Objet : RE: WORD JOINER vs ZWNBSP > > Marcel: Can you please clarify in what way Windows 7 is not supporting U+2060. > > > Peter > > -----Original Message----- > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Petr Tomasek > Sent: Friday, June 26, 2015 4:48 PM > To: Marcel Schneider > Cc: Unicode Mailing List > Subject: Re: WORD JOINER vs ZWNBSP > > On Fri, Jun 26, 2015 at 12:48:39PM +0200, Marcel Schneider wrote: > > > > However, despite of the word joiner having been encoded and recommended since version?3.2 of the Standard, it is still not implemented on Windows?7. Therefore I must use the traditional zero width no-break space U+FEFF instead. > > Therefore you should complain by Microsoft, not here. > > > Supposing that Microsoft choose not to implement U+2060?WJ > > Then you should probably choose another operating system which does... > > Petr Tomasek [If you read this, please refer to my reply at: http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0216.html ? -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Tue Jun 30 04:25:43 2015 From: charupdate at orange.fr (Marcel Schneider) Date: Tue, 30 Jun 2015 11:25:43 +0200 (CEST) Subject: WORD JOINER vs ZWNBSP In-Reply-To: <20150630074746.79ff7cf7@JRWUBU2> References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10> <20150630074746.79ff7cf7@JRWUBU2> Message-ID: <1430770470.10024.1435656344025.JavaMail.www@wwinf1m18> On Mon, Jun 30, 2015, Richard Wordingham wrote: > On Sat, 27 Jun 2015 17:48:41 +0200 (CEST) > Marcel Schneider wrote: > > > On Fri, Jun 26, Richard Wordingham wrote: > > > On Fri, 26 Jun 2015 12:48:39 +0200 (CEST) Marcel Schneider wrote: > > >>> Still in French, the letter apostrophe, when used as current > >>> apostrophe, prevents the following word from being identified as a > >>> word because of the missing word boundary and, subsequently, > >>> prevents the autoexpand from working. This can be fixed by adding > >>> a word joiner after the apostrophe, thanks to an autocorrect entry > >>> that replaces U+02BC inserted by default in typographic mode, with > >>> U+02BC U+2060. > > >> No, this doesn't work. While the primary purpose of U+2060 is to > >> prevent line breaks, it is also used to overrule word boundary > >> detectors in scriptio continua. (It works quite well for > >> spell-checking Thai in LibreOffice). It's name implies to me that it > >> is intended to prevent a word boundary being deduced, through the > >> strong correlation between word boundaries and line break opportunities. > >> There doesn't seem to be a code for 'zero-width word boundary at > >> which lines should not normally be broken'. > > > Well, I extrapolated from U+FEFF, which works fine for me, even in > > this particular context. > > Does the tool misinterpret U+FEFF between Thai characters as a word > boundary? Incidentally, which tool are you talking of? I tested on Microsoft Word 2010 Starter running on Windows 7 Starter, on a netbook. This software being based on the full versions, the interpretation of U+FEFF must be the standard behavior. I?tested in Latin script. You may wish to redo the tests, so please open a new document, input two words, replace the blank with whatever character the word boundaries behavior is to be checked of, and search for one of the two words with the 'whole word' option enabled. 
If the result is none, the test character indicates the absence of word boundaries; if there is a result, the test character indicates the presence of word boundaries. > >> No, this doesn't work. Right. The letter apostrophe cannot trigger the autocorrect for itself. I must keep U+0027 in the forefront, and get it replaced with U+02BC U+FEFF to keep the autocorrect/autoexpand working for what follows. Or even better, with U+FEFF U+02BC U+FEFF to clarify word boundaries. When there is no autoexpand, we?ll input the apostrophe as U+0027 and the single quotes as U+2018, U+2019, then replace all U+0027 with U+02BC. In the Windows Notepad that works, because the close-quote is presumably not in the equivalence class for the straight apostrophe, so it replaces the U+0027s with U+02BC and lets the U+2019s alone. Given the instability of U+FEFF but also of U+00A0, as I wrote to Peter Constable a few moments ago, it seems as if we were unfortunately reaching the limits of text encoding. The purpose of the encoding design was, if I?m well informed, to get readible text files, and to allow users to mark them up for local printing or PDF conversion. Other usages must have been let out of scope, because today, you cannot exchange and process plain text files as one may wish. As soon as you must use plain text as a raw material for publishing, as you must convert British English quotation marks to US?English quotation marks, as you must do searches including single quotes, as you must input text (especially with leading apostrophes) on keyboards with legacy drivers, and perhaps a few things more, there seems to be no other solution than to use workarounds, hand-process, look up and correct or convert the instances one by one. The nice thing about this is that you become a craftsman again, that you get in touch with text, and you may feel like a linotypist or a lead typesetter who takes care of every detail. As a result, the professions of corrector, typesetter, typographer shall not disappear (as it was feared), and good craftmanship will stay thriving. Another side effect is that the need of hand-processing text files lowers the appeal of copying other peoples? work. It?s even harder when copying text from a PDF file. Sometimes you get whole paragraphs in ready-to-use plain text (let aside the NBSPs), and sometimes (e.g. from TUS) it?s all in small pieces and you need to delete a lot of undue line breaks, as well as to text-transform the character identifiers because their uppercasing was just small caps formatting. Finally you may prefer to provide links to the content, but unfortunately there seems to be no way to copy bookmarks?so that you need to browse the contents and be likely to learn much more by the way. If all this was the goal, let?s say it loud. Then this was a good idea. Very good. Regards, Marcel Schneider -------------- next part -------------- An HTML attachment was scrubbed... 
From charupdate at orange.fr  Tue Jun 30 04:39:26 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Tue, 30 Jun 2015 11:39:26 +0200 (CEST)
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
	<20150626110243.GB18139@ebed.etf.cuni.cz>
	<2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
Message-ID: <1517349131.10525.1435657166603.JavaMail.www@wwinf1m18>

A quarter of an hour ago I wrote:

> I add this precision because outside I use another OS which seemingly does
> not support U+02BC, by not having this character in any current font.

Sorry, U+02BC is in the fonts, but not in the Special Characters dialog I
opened to look it up.

Marcel Schneider

From gwalla at gmail.com  Tue Jun 30 11:11:06 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 30 Jun 2015 09:11:06 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On Mon, Jun 29, 2015 at 8:57 AM, Richard Cook wrote:

> Ken,
>
> I know that U+1F308 is RAINBOW ... because my nameslist lookup tool tells
> me so ...
>
>     TC    UTF-8    Codepoint : Name : Annotations
>     🌈    ...      1F308 RAINBOW
>
> ... but could 🌈 also be a 'rainbow (flag)'?
>
> -Richard
>
> [? iMM (iPhone Mangled Message)]

I don't think display of U+1F308 as a rainbow flag would be expected
behavior. It risks turning a text like "It's a beautiful day! 🌈" into a
political statement.

From rscook at wenlin.com  Tue Jun 30 11:42:32 2015
From: rscook at wenlin.com (Richard Cook)
Date: Tue, 30 Jun 2015 09:42:32 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

> On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
>
> I don't think display of U+1F308 as a rainbow flag would be expected
> behavior. It risks turning a text like "It's a beautiful day! 🌈" into a
> political statement.

Garth,

Any statement can be a political statement, in the right context. But I
think the main point of my earlier comment was that the specific glyph for
U+1F308 might be indistinguishable from a flag. For example, this is the
glyph in iOS 8:

[attached image: image1.PNG]

Not a cloud in the sky.

?

From nslater at tumbolia.org  Tue Jun 30 12:35:48 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Tue, 30 Jun 2015 17:35:48 +0000
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

That same glyph turns up as this, for me: http://i.imgur.com/3XQ96SA.png

Which is part of the problem. "Rainbow" could mean anything. That the Apple
version happens to look a bit like a flag (a weird square flag that looks
almost nothing like the queer pride flag) is largely immaterial.
On Tue, 30 Jun 2015 at 17:47 Richard Cook wrote:

> > On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
> >
> > I don't think display of U+1F308 as a rainbow flag would be expected
> > behavior. It risks turning a text like "It's a beautiful day! 🌈" into
> > a political statement.
>
> Garth,
>
> Any statement can be a political statement, in the right context. But I
> think the main point of my earlier comment was that the specific glyph
> for U+1F308 might be indistinguishable from a flag. For example, this is
> the glyph in iOS 8:
>
> [image: image1.PNG]
>
> Not a cloud in the sky.
>
> ?

From gwalla at gmail.com  Tue Jun 30 13:38:10 2015
From: gwalla at gmail.com (Garth Wallace)
Date: Tue, 30 Jun 2015 11:38:10 -0700
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On Tue, Jun 30, 2015 at 9:42 AM, Richard Cook wrote:

> > On Jun 30, 2015, at 9:11 AM, Garth Wallace wrote:
> >
> > I don't think display of U+1F308 as a rainbow flag would be expected
> > behavior. It risks turning a text like "It's a beautiful day! 🌈" into
> > a political statement.
>
> Garth,
>
> Any statement can be a political statement, in the right context. But I
> think the main point of my earlier comment was that the specific glyph
> for U+1F308 might be indistinguishable from a flag. For example, this is
> the glyph in iOS 8:

Any statement can be political in the right context, sure, but having a
political message added to your own statements without your knowledge is
usually not appreciated.

> [image: image1.PNG]
>
> Not a cloud in the sky.

It also doesn't look like any version of the gay pride flag that I've seen.

From khaledhosny at eglug.org  Tue Jun 30 14:41:31 2015
From: khaledhosny at eglug.org (Khaled Hosny)
Date: Tue, 30 Jun 2015 21:41:31 +0200
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
References: <552516479.6107.1435315719474.JavaMail.www@wwinf2229>
	<20150626110243.GB18139@ebed.etf.cuni.cz>
	<2104451852.9023.1435654939028.JavaMail.www@wwinf1m18>
Message-ID: <20150630194129.GA16879@khaled-laptop>

On Tue, Jun 30, 2015 at 11:02:18AM +0200, Marcel Schneider wrote:
> On Sun, Jun 28, 2015, Peter Constable wrote:
>
> > Marcel: Can you please clarify in what way Windows 7 is not supporting
> > U+2060.
>
> On my netbook, which is running Windows 7 Starter, U+2060 is not a
> part of any of the shipped fonts.

It is a control character, it does not need to have a glyph in the font to
be properly supported.
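Khaled's point can be checked against the character properties themselves: WORD JOINER and ZWNBSP are format characters (General_Category Cf), which a layout engine is expected to render as invisible and zero-width whether or not the selected font has a glyph for them, whereas NO-BREAK SPACE is an ordinary spacing character that does need one. A small sketch with Python's standard unicodedata module, purely as an illustration of the properties involved, not of any particular renderer:

    import unicodedata

    for cp in (0x2060, 0xFEFF, 0x00A0):
        ch = chr(cp)
        print("U+%04X  %-26s  category=%s"
              % (cp, unicodedata.name(ch), unicodedata.category(ch)))
    # U+2060  WORD JOINER                 category=Cf
    # U+FEFF  ZERO WIDTH NO-BREAK SPACE   category=Cf
    # U+00A0  NO-BREAK SPACE              category=Zs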
From c933103 at gmail.com  Tue Jun 30 15:04:48 2015
From: c933103 at gmail.com (gfb hjjhjh)
Date: Wed, 1 Jul 2015 04:04:48 +0800
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

On 30 Jun 2015 at 8:53 pm, "Noah Slater" wrote:
>
> On 29 June 2015 at 20:04, gfb hjjhjh wrote:
>>
>> As an outsider, in my opinion, it is very common for people to write
>> sentences like "[emoji] Really sorry!" or "[emoji] let's meet there
>> tomorrow" or "The [emoji] is tasty" even before Unicode's introduction
>> of these characters, but I can't think of use cases where the rainbow
>> flag would be used in this way.
>
> Do you mean, people were using these emojis on a platform that supported
> them before Unicode standardised them?
>
> To that I would respond that you only need to search Twitter for people
> using the #pride hashtag to see how it's used there. Unfortunately, as
> Slack is a private communication platform, it is hard to get usage
> examples. All we can state for sure is that people are using the rainbow
> flag in running text like any other emoji.

Can you attach some screenshots (probably with names removed and no
sensitive/private info) as examples?

From nslater at tumbolia.org  Tue Jun 30 16:18:54 2015
From: nslater at tumbolia.org (Noah Slater)
Date: Tue, 30 Jun 2015 21:18:54 +0000
Subject: Adding RAINBOW FLAG to Unicode
In-Reply-To:
References: <84968C090B5F47409EF2006CF5309985@DougEwell>
	<55915B2C.3060809@att.net>
Message-ID:

I already did, in my original mail!

On Tue, 30 Jun 2015 at 21:12 gfb hjjhjh wrote:

> On 30 Jun 2015 at 8:53 pm, "Noah Slater" wrote:
> >
> > On 29 June 2015 at 20:04, gfb hjjhjh wrote:
> >>
> >> As an outsider, in my opinion, it is very common for people to write
> >> sentences like "[emoji] Really sorry!" or "[emoji] let's meet there
> >> tomorrow" or "The [emoji] is tasty" even before Unicode's introduction
> >> of these characters, but I can't think of use cases where the rainbow
> >> flag would be used in this way.
> >
> > Do you mean, people were using these emojis on a platform that
> > supported them before Unicode standardised them?
> >
> > To that I would respond that you only need to search Twitter for people
> > using the #pride hashtag to see how it's used there. Unfortunately, as
> > Slack is a private communication platform, it is hard to get usage
> > examples. All we can state for sure is that people are using the
> > rainbow flag in running text like any other emoji.
>
> Can you attach some screenshots (probably with names removed and no
> sensitive/private info) as examples?

From doug at ewellic.org  Tue Jun 30 16:28:26 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 30 Jun 2015 14:28:26 -0700
Subject: WORD JOINER vs ZWNBSP
Message-ID: <20150630142826.665a7a7059d7ee80bb4d670165c8327d.c8a619afc7.wbe@email03.secureserver.net>

Khaled Hosny wrote:

>> On my netbook, which is running Windows 7 Starter, U+2060 is not a
>> part of any of the shipped fonts.
>
> It is a control character, it does not need to have a glyph in the
> font to be properly supported.

The problem is the word "supported." Marcel is seeing a visible glyph (a
.notdef box) for what is supposed to be an invisible, zero-width character,
and that is leading him to conclude that Windows doesn't "support" this
character.

On my Win 7 machine at work, when I enter the string "one\u2060two" (a
WORD JOINER between the two words) and click on either word, both words
are selected. That is exactly what I would expect WJ to do. This works on
the built-in Notepad as well as Notepad++ and BabelPad (but not on
GoDaddy's Web-based email client).
But out of more than 500 fonts on that machine, the only stock Microsoft
fonts that show WJ with zero width, instead of a .notdef glyph, are
Javanese Text, Myanmar Text, and Segoe UI Symbol. So while it's inaccurate
to extrapolate this to "Microsoft doesn't support WJ," the font support is
definitely lacking.

The bit about characters being converted to other characters, of course,
has nothing to do with Windows and everything to do with particular
applications.

--
Doug Ewell | http://ewellic.org | Thornton, CO

From richard.wordingham at ntlworld.com  Tue Jun 30 16:33:05 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Tue, 30 Jun 2015 22:33:05 +0100
Subject: WORD JOINER vs ZWNBSP
In-Reply-To: <1430770470.10024.1435656344025.JavaMail.www@wwinf1m18>
References: <1851400009.9981.1435420121813.JavaMail.www@wwinf1d10>
	<20150630074746.79ff7cf7@JRWUBU2>
	<1430770470.10024.1435656344025.JavaMail.www@wwinf1m18>
Message-ID: <20150630223305.67b8da0f@JRWUBU2>

On Tue, 30 Jun 2015 11:25:43 +0200 (CEST) Marcel Schneider wrote:

> At some time in June 2015, Richard Wordingham wrote:

> I tested on Microsoft Word 2010 Starter running on Windows 7 Starter,
> on a netbook. Since this software is based on the full versions, its
> interpretation of U+FEFF should be the standard behavior. I tested in
> Latin script. You may wish to redo the tests: open a new document,
> input two words, replace the space between them with whatever character
> is to be checked for word-boundary behavior, and search for one of the
> two words with the 'whole word' option enabled. If the search finds
> nothing, the test character does not produce a word boundary; if it
> finds the word, the test character does produce one.

I did my own tests in Word 2010 with Windows 7. Although U+FEFF and U+2060
displayed differently when I enabled the display of 'non-printing'
characters (spaces, inactive soft hyphens, non-breaking hyphens, paragraph
ends etc.), they behaved the same when embedded in French l'eau and Thai ??
- they changed each word into two words, as detected by Ctrl+right-arrow.
However, this is wrong.

>> No, this doesn't work.

Clarification: It doesn't work in correct software. Correct software would
have treated the modified words as single words.

Richard.
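For comparison with the Word behaviour Richard describes, a UAX #29 word segmenter is expected to ignore both characters (their Word_Break property value is Format) and keep the modified words whole. Below is a sketch using the PyICU bindings to ICU's BreakIterator; PyICU availability and the exact wrapper surface are assumptions here, so treat this as an outline of the check rather than tested code:

    from icu import BreakIterator, Locale  # PyICU

    def word_segments(text):
        # UAX #29 word segmentation as implemented by ICU.
        bi = BreakIterator.createWordInstance(Locale("en_US"))
        bi.setText(text)
        bounds = [0]
        while True:
            b = bi.following(bounds[-1])  # next boundary after the last one
            if b == -1:                   # BreakIterator.DONE
                break
            bounds.append(b)
        return [text[i:j] for i, j in zip(bounds, bounds[1:])]

    print(word_segments("one two"))       # ['one', ' ', 'two']
    print(word_segments("one\u2060two"))  # expected: ['one\u2060two'] - no break at WJ
    print(word_segments("one\ufefftwo"))  # expected: ['one\ufefftwo'] - no break at ZWNBSP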
From doug at ewellic.org  Tue Jun 30 16:57:19 2015
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 30 Jun 2015 14:57:19 -0700
Subject: Representing Additional Types of Flags
Message-ID: <20150630145719.665a7a7059d7ee80bb4d670165c8327d.06f042790e.wbe@email03.secureserver.net>

Re-posting my comments and questions on this PRI to the list. I've already
submitted them as formal feedback.

.

I support this proposal. I have the following questions:

1. The existing RIS-based flag mechanism is based on ISO 3166-1 (TUS 7.0,
§22.10). In this proposal, "valid" tag sequences would instead be
determined by CLDR data and LDML specification. Is there any precedent for
CLDR to define the validity of Unicode character sequences?

2. What is the policy on generating flag tags with deprecated
unicode_region_subtag or unicode_subdivision_subtag values, such as
"[flag]UK"? How "discouraged" would such a tag be? Should tools allow users
to create such a tag?

3. The subdivisions.xml file contains a "subtype" hierarchy, reflecting the
"parent subdivision" relationship in ISO 3166-2. So region 'FR' contains
subdivision 'J' (Île-de-France), which itself contains subdivision '75'
(Paris). Is there any significance to the "subtype" hierarchy as far as
flag tags are concerned, or are "[flag]FRJ" and "[flag]FR75" equally valid?

4. The entry for "001" in subdivisions.xml contains each of the two-letter
codes for regions (countries) that have their own subdivisions. This is
less than the set of all regions; for example, Anguilla (AI) does not have
ISO 3166-2 subdivisions and so is not listed. This implies that a tag like
"[flag]001US" is valid (and equivalent to "US" spelled with RIS, which is
preferred) but "[flag]001AI" is not valid. Is this intended? If not, can it
be clarified?

5. Will any preliminary examples of CLDR 4-character subdivision codes be
made available before any such codes are actually assigned?

.

The PRI #299 mechanism is clearly and intentionally oriented toward
representing flags of well-defined geopolitical entities. Any proposal to
extend the mechanism to cover the many other types of flags -- for
historical regions, NGOs, maritime, sports, or social or political causes
-- must be systematic and well-planned, not ad-hoc or haphazard, to assure
interoperability and extensibility.

The documentation for the PRI #299 mechanism should state clearly that
(e.g.) the Confederate battle flag, the Olympic flag, the Esperanto flag,
the LGBT rainbow flag, and the naval flags used to spell out "ENGLAND
EXPECTS" can be represented only via a proper extension to the mechanism,
not by ad-hoc means such as the use of unassigned or private-use
combinations. This is at least as important as ensuring the stable coding
of geopolitical flags.

--
Doug Ewell | http://ewellic.org | Thornton, CO
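The two mechanisms Doug compares can be spelled out concretely. Below is a short, purely illustrative Python sketch: the first function builds an existing regional-indicator flag, the second builds a tag-character sequence of the kind PRI #299 proposes, with the lowercased code spelled in TAG characters and terminated by CANCEL TAG. Doug writes the base abstractly as "[flag]"; the sketch uses U+1F3F4 WAVING BLACK FLAG, the base that published emoji tag sequences eventually settled on, but that particular choice is incidental here, and code validity would in any case be governed by CLDR data as the PRI describes:

    # Regional indicator flags (TUS 7.0, section 22.10): each ASCII letter of a
    # two-letter region code maps onto U+1F1E6..U+1F1FF.
    def ris_flag(region):
        return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in region.upper())

    # Tag-sequence flags: a base character, then TAG characters (U+E0020..U+E007E,
    # i.e. ASCII + 0xE0000) spelling the lowercased code, then U+E007F CANCEL TAG.
    def tag_flag(code, base="\U0001F3F4"):
        return base + "".join(chr(0xE0000 + ord(c)) for c in code.lower()) + "\U000E007F"

    print(" ".join("U+%04X" % ord(c) for c in ris_flag("US")))
    # U+1F1FA U+1F1F8  (the existing RIS flag for the United States)
    print(" ".join("U+%04X" % ord(c) for c in tag_flag("gbsct")))
    # U+1F3F4 U+E0067 U+E0062 U+E0073 U+E0063 U+E0074 U+E007F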