From js_choi at icloud.com Tue Nov 3 12:59:40 2015
From: js_choi at icloud.com ("J. S. Choi")
Date: Tue, 03 Nov 2015 12:59:40 -0600
Subject: On emoji and the two rightwards black arrows
In-Reply-To:
References:
Message-ID: <119D9C7A-D475-4BA1-BCBC-871AC66649AE@icloud.com>

Thanks for the reply!

> IMHO, all mappings from other encodings are just best efforts but not normative. In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today…
> …these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).

If the compatibility mappings are not normative or guaranteed to be stable, then that would weaken one of the two objections to the changes proposed in my questions 1 and 2. The compatibility-mapping and IDNA issues are merely supplemental to my main questions, though.

> Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.

Perhaps this is true, but regardless of whether the disunification in 2014 (of the Zapf Dingbat U+27A1 from the DPRK/Wingding arrows U+2B05–U+2B07) was justified, or whether the creation in 2014 of U+2B95 was justified, they happened nonetheless; the opportunity to object to them seems to have already passed.
U+2B95 now exists, and it exists with the express purpose of completing U+2B05–U+2B07, based on Michel Suignard's new representative glyphs and Mark Davis' comments from earlier this year. However, U+2B95's current absence from UTR #51 and emoji-data.txt, and its lack of text/emoji standardized variation sequences, are perhaps inconsistent with that purpose.

The three questions remain:

1. Should U+2B95 be added to the set of emoji characters as given by UTR #51 and emoji-data.txt, in order to complete the harmonization with U+2B05–U+2B07 from 2014?

2. If question 1's answer is yes, then should U+2B95 be given text/emoji standardized variation sequences, just as U+2B05–U+2B07 already have?

3. Regardless of the answers to the above, should notes clarifying the differences in intended usage between U+2B95 (the right black arrow completing U+2B05–U+2B07) and U+27A1 (the Zapf Dingbat) be added to their entries in the Standard's code charts? This might clear up a lot of confusion from users and font creators, and would only make clearer what has already been made explicit by 7.0's glyph changes.

I'm also uncertain as to the way I'd even initiate a formal process on this. This isn't even a proposal for a new character; it's a proposal for the inclusion of an already added character and for the addition of clarifying information in the code charts. The forms at http://www.unicode.org/L2/summary.html wouldn't seem to fit this kind of change.

J. S. Choi

> On Oct 30, 2015, at 7:19 PM, Philippe Verdy wrote:
>
> IMHO, all mappings from other encodings are just best efforts but not normative.
> In many cases, those mappings are ambiguous, including for some legacy encodings that have been widely used for many decades and are still used today (such as CP437):
>
> The reason for that is that the old registrations for legacy 8-bit charsets only showed charts of glyphs with approximate glyphs (often of poor quality, with low-resolution rendering on printed paper and various polluting dots of ink, later scanned at poor resolution), but no actual properties (and often without even listing any name for them). And for a long time those charts were interpreted differently by different vendors (such as printer or screen manufacturers, at a time when dot-matrix printers or displays had poor resolution), and sometimes with glyphs changing slightly between device models or versions from the same vendor.
>
> So characters in those mapping tables were widely used to mean different variants of characters that are now distinguished in the UCS (e.g. in CP437, the symbol that looks either like a big epsilon or like an "is member of" math symbol; the mappings to the UCS for other symbols that look like Greek letters in CP437 charsets and similar are not really set in stone, and it is not even clear whether they will map to UCS symbols or to UCS Greek letters; the same applies to various geometric symbols, including arrows and bullets).
>
> Those mappings are just there to help convert some old documents to the UCS, but the choice is sometimes questionable and some corrections may need to be made to select another character, depending on the context of use. Unfortunately, the existing mappings only document mappings of legacy code positions to a single suggested code point, and not their other possible alternatives.
>
> Then we fall into the category of characters that are easily confusable: maybe these mapping tables do not need to be changed, but used together with the data files related to confusable characters (the list was initiated during the development of IDNA).
> There are other data available (visible in Unicode charts) that also indicate a few related/similar characters, but these are mostly notes that are not set in stone, and this data is difficult to use.
>
> So in summary, those mapping tables are just suggestions, and implementers may still map legacy encodings to different subsets of the UCS. But we should be concerned about the conversion in the other direction, from the UCS to legacy mappings: all candidate UCS code points should be reverse-mapped to the same legacy code position (as much as possible). Those mapping tables are therefore not part of the stable standard and there's no stability policy about them (IMHO, such a policy should not be adopted). They are just contributions to help the transition to the UCS, and they are also subject to updates when needed if better mappings are developed later, and some applications or vendors will still develop their own preferences.
>
> If you consider the two UCS characters in question, my opinion is that they are basically the same and mappings from Zapf Dingbats or DPRK or Wingdings/Webdings are just kept for historical reasons, but are not necessarily the best ones. And I would see no violation of the standard if a font was made that mapped both UCS characters to exactly the same glyph, using metrics that create a coherent set of black arrows using either the DPRK metrics for all 4 arrows, or the Zapf Dingbats metrics for all 4 arrows. Their disunification is not really justified, except to work with applications or documents that used fonts not mapping all of them but made to work only with DPRK-encoded documents, or with Dingbats-encoded documents: the disunification is based only on those specific old (defective) fonts, and modern fonts should not be defective and should map all of these characters as if they were aliased, without any need to distinguish them visually.
>
> But because they are not canonically equivalent, these characters should be explicitly listed in the list of confusables (which version will be preferred, and which versions will be aliased to the preferred form, for applications like IDNA, is a question to develop as this is a possible security concern if some of these characters are allowed in identifiers intended to be secured).
>
> 2015-10-30 19:51 GMT+01:00 J.S. Choi:
>> # On emoji and the two rightwards black arrows
>>
>> (…) The post is about two encoded characters: U+27A1 Black Rightwards Arrow and U+2B95 Rightwards Black Arrow.
>>
>> (…)
>> In any case, I might make a formal proposal in the future, but I first want to determine here how probable it is that such a proposal would be discussed. What would you say the answers to those three questions are?

From markus.icu at gmail.com Tue Nov 3 15:35:54 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Tue, 3 Nov 2015 13:35:54 -0800
Subject: Emoji data in UCD xml ?
In-Reply-To:
References: <2121253DDDDD4862B07E9F7B762FA59A@erratique.ch> <563245DA.2060207@att.net>
Message-ID:

About http://www.unicode.org/L2/L2015/15299-ucd-emoji-props.pdf
which has

Emoji_Presentation (EP)
  • Non_Emoji (NE)
  • Default_Text (DT)
  • Default_Emoji (DE)
  • NA

Why do we need both Non_Emoji and NA? Can't Non_Emoji be the default for all code points that are not mentioned in the data?

markus

From mark at macchiato.com Tue Nov 3 16:34:34 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Tue, 3 Nov 2015 14:34:34 -0800
Subject: Emoji data in UCD xml ?
In-Reply-To:
References: <2121253DDDDD4862B07E9F7B762FA59A@erratique.ch> <563245DA.2060207@att.net>
Message-ID:

We have revised this completely; see the R2 version.
Mark

From public at khwilliamson.com Thu Nov 5 09:57:16 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 5 Nov 2015 08:57:16 -0700
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20150327180725.GA9968@math.berkeley.edu>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu>
Message-ID: <563B7C5C.4000209@khwilliamson.com>

Hi,

Several of us are wondering about the reason for reserving bits for the extended UTF-8 in perl5. I'm asking you because you are the apparent author of the commits that did this.

To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes the length of the sequence of bytes that comprise a single character to be 13 bytes. This allows code points up to 2**72 - 1 to be represented. If the length had instead been 12 bytes, code points up to 2**66 - 1 could be represented, which is enough to represent any code point possible in a 64-bit word.

The comments indicate that these extra bits are "reserved". So we're wondering what potential use you had thought of for these bits.
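The figures above can be checked with a short calculation. This is only a minimal sketch under one assumption that matches the numbers quoted in the message: the 0xFF start byte carries no payload itself, and every following byte is a continuation byte (10xxxxxx) carrying 6 bits. The function names are made up for illustration and are not taken from the perl5 sources.

```python
def payload_bits(total_bytes: int) -> int:
    """Payload bits in a sequence of `total_bytes` bytes.

    Assumption: the first byte (0xFF) is a pure marker, and each of the
    remaining bytes is a continuation byte holding 6 bits of payload.
    """
    return 6 * (total_bytes - 1)


def max_code_point(total_bytes: int) -> int:
    """Largest value representable with that many payload bits."""
    return 2 ** payload_bits(total_bytes) - 1


if __name__ == "__main__":
    for n in (11, 12, 13):
        covers_64 = max_code_point(n) >= 2 ** 64 - 1
        print(f"{n} bytes: {payload_bits(n)} bits, covers 64-bit word: {covers_64}")
```

Under that assumption, 12 bytes is the shortest length that covers a full 64-bit word, and a 13th byte adds 6 further bits beyond the 66 that 12 bytes already provide.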
Thanks

Karl Williamson

From unicode at mxmerz.de Thu Nov 5 11:10:45 2015
From: unicode at mxmerz.de (Maximilian Merz)
Date: Thu, 5 Nov 2015 18:10:45 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To:
References:
Message-ID:

Hello,

I did not receive any feedback on my last email, but chose to finalize my proposal anyway – you can download the PDF (673 KB) here: [1].

I would appreciate feedback of any kind.

Best regards,

Max Merz

PS: Is a “computerized font (True Type or PostScript)” also necessary for emoji characters or do SVG/PDF/PNG images suffice here?

[1]: http://mxmerz.de/unicode/Face_with_One_Eyebrow_Raised.pdf

> On 27.10.2015, at 22:03, Max Merz wrote:
>
> Hello,
>
> I would like to submit a proposal to encode an emoji depicting a “face with one eyebrow raised”, as to indicate scepticism, surprise, concern, disagreement.
>
> The “Submitting Character Proposals” page on unicode.org recommends to discuss preliminary proposals on this mailing list – I am currently working on my proposal, but I would appreciate general feedback about whether this idea is doomed from the start, has already been discussed, comes at a bad time, etc.
>
> Best regards,
>
> Max Merz

From verdy_p at wanadoo.fr Thu Nov 5 11:25:05 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Nov 2015 18:25:05 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <563B7C5C.4000209@khwilliamson.com>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID:

It won't represent any valid Unicode code point (no standard scalar value defined), so if you use those leading bytes, don't pretend it is "UTF-8" (not even "modified UTF-8", which is the variant created in Java for its internal serialization of unrestricted 16-bit strings, including lone surrogates, and which also differs in representing U+0000 as <0xC0,0x80> instead of <0x00> as in standard UTF-8).

You'll have to create your own charset identifier (e.g. "perl5-UTF-8-extended" or some name derived from your Perl5 library) and say it is not for use in interchange of standard text.

The extra code points you'll get are then necessarily for private use (but still not part of the standard PUA set), and have absolutely no defined properties from the standard. They should not be used to represent any Unicode character or character sequence. In any API taking some text input, those code points will never be decoded and will behave on input like encoding errors.
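To make the distinction above between standard UTF-8 and Java's "modified UTF-8" concrete, here is a minimal sketch. It encodes a single 16-bit code unit in the modified form (U+0000 as C0 80, surrogates encoded individually as three bytes) and then asks Python's strict UTF-8 decoder whether the result is acceptable. The helper names are hypothetical, and the encoder covers only single 16-bit units, not whole strings.

```python
def encode_modified_utf8_unit(unit: int) -> bytes:
    """Encode one 16-bit code unit as Java-style modified UTF-8 (sketch)."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("expected a 16-bit code unit")
    if unit == 0:
        # Modified UTF-8 never emits a 0x00 byte: U+0000 becomes C0 80.
        return b"\xc0\x80"
    if unit < 0x80:
        return bytes([unit])
    if unit < 0x800:
        return bytes([0xC0 | (unit >> 6), 0x80 | (unit & 0x3F)])
    # Includes lone surrogates (0xD800..0xDFFF), which standard UTF-8 forbids.
    return bytes([
        0xE0 | (unit >> 12),
        0x80 | ((unit >> 6) & 0x3F),
        0x80 | (unit & 0x3F),
    ])


def strict_utf8_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for unit in (0x0000, 0x0041, 0x20AC, 0xD800):
        encoded = encode_modified_utf8_unit(unit)
        print(f"U+{unit:04X} -> {encoded.hex(' ')} "
              f"(standard UTF-8: {strict_utf8_accepts(encoded)})")
```

A strict decoder rejects both the C0 80 form of U+0000 and the three-byte surrogate form, which is why such data needs its own charset label rather than "UTF-8".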
But these extra code points could be used to represent something else (such as a unique object identifier for internal use in your application, virtual object pointers, shared memory block handles, file/pipe/stream I/O handles, service/API handles, user ids, security tokens, 64-bit content hashes plus some binary flags, placeholders/references for members in an external unencoded collection or for URIs, internal glyph ids when converting text for rendering with one or more fonts, or some internal serialization of geometric shapes/colors/styles/visual effects...).

In the standard UTF-8 those extra byte values are not "reserved" but permanently assigned to be "invalid", and there are no valid encoded sequences as long as 12 or 13 bytes (0xFF was reserved only in the old RFC version of UTF-8 when it allowed code points up to 31 bits, but even this RFC is obsolete and should no longer be used and it has never been approved by Unicode).
From markus.icu at gmail.com Thu Nov 5 12:15:28 2015
From: markus.icu at gmail.com (Markus Scherer)
Date: Thu, 5 Nov 2015 10:15:28 -0800
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To:
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID:

On Thu, Nov 5, 2015 at 9:25 AM, Philippe Verdy wrote:

> (0xFF was reserved only in the old RFC version of UTF-8 when it allowed
> code points up to 31 bits, but even this RFC is obsolete and should no
> longer be used and it has never been approved by Unicode).

No, even in the original UTF-8 definition, "The octet values FE and FF never appear."
https://tools.ietf.org/html/rfc2279
The highest lead byte was 0xFD.

(For the "really original" version see http://www.unicode.org/L2/Historical/wg20-n193-fss-utf.pdf)

In the current definition, "The octet values C0, C1, F5 to FF never appear."
https://tools.ietf.org/html/rfc3629 = https://tools.ietf.org/html/std63

markus
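The two quoted sentences can be turned into a small lookup. The sketch below, with illustrative names of my own choosing, records which octet values "never appear" under each RFC, classifies a byte under the current definition, and cross-checks the RFC 3629 set against Python's strict UTF-8 decoder.

```python
# Octet values that never appear in well-formed data, per the sentences
# quoted from each RFC.
NEVER_RFC2279 = {0xFE, 0xFF}
NEVER_RFC3629 = {0xC0, 0xC1} | set(range(0xF5, 0x100))


def classify_byte(b: int) -> str:
    """Role of a single byte value under RFC 3629."""
    if not 0 <= b <= 0xFF:
        raise ValueError("expected a byte value 0..255")
    if b in NEVER_RFC3629:
        return "never appears"
    if b <= 0x7F:
        return "ASCII"
    if b <= 0xBF:
        return "continuation"
    if b <= 0xDF:
        return "lead of 2-byte sequence"   # 0xC2..0xDF
    if b <= 0xEF:
        return "lead of 3-byte sequence"   # 0xE0..0xEF
    return "lead of 4-byte sequence"       # 0xF0..0xF4


def strict_utf8_accepts(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    for b in (0x41, 0x80, 0xC0, 0xC2, 0xF4, 0xF5, 0xFD, 0xFF):
        print(f"0x{b:02X}: {classify_byte(b)}")
```

Note that 0xFD, the highest lead byte of the older definition, falls in the "never appears" set of the current one, and that 0xFF is excluded by both.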
From richard.wordingham at ntlworld.com Thu Nov 5 13:19:10 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Thu, 5 Nov 2015 19:19:10 +0000
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To:
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID: <20151105191910.317175e6@JRWUBU2>

On Thu, 5 Nov 2015 18:25:05 +0100
Philippe Verdy wrote:

> But these extra code points could be used to represent something else
> such as unique object identifier for internal use in your
> application, or virtual object pointers, or shared memory block
> handles, file/pipe/stream I/O handles, service/API handles, user ids,
> security tokens, 64-bit content hashes plus some binary flags,
> placeholders/references for members in an external unencoded
> collection or for URIs, or internal glyph ids when converting text
> for rendering with one or more fonts, or some internal serialization
> of geometric shapes/colors/styles/visual effects...)

No-one's claiming it is for a Unicode Transformation Format (UTF).

A possibly relevant example of something else is a non-precomposed grapheme cluster, as in Perl6's NFG. (This isn't a PUA encoding, as the precomposed characters are created on the fly.)

Richard.

From mark at macchiato.com Thu Nov 5 14:25:46 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 5 Nov 2015 12:25:46 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To:
References:
Message-ID:

The unicode at unicode.org mailing list isn't the right place for submitting proposals; see the top of http://www.unicode.org/emoji/selection.html#submission under "submit as per Document Submission Details."
As for the images, that's also discussed there; they should be PNGs of the specified format.

(And by the way, a very nicely documented proposal!)

Mark

From doug at ewellic.org Thu Nov 5 14:41:42 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 05 Nov 2015 13:41:42 -0700
Subject: Question about Perl5 extended UTF-8 design
Message-ID: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>

Richard Wordingham wrote:

> No-one's claiming it is for a Unicode Transformation Format (UTF).

Then they ought not to call it "UTF-8" or "extended" or "modified" UTF-8, or anything of the sort, even if the bit-shifting algorithm is based on UTF-8.

"UTF-8 encoding form" is defined as a mapping of Unicode scalar values -- not arbitrary integers -- onto byte sequences.
[D92]

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From doug at ewellic.org Thu Nov 5 14:47:12 2015
From: doug at ewellic.org (Doug Ewell)
Date: Thu, 05 Nov 2015 13:47:12 -0700
Subject: Emoji Proposal: Face With One Eyebrow Raised
Message-ID: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>

Mark Davis wrote:

> The unicode_at_unicode.org mailing list isn't the right place for
> submitting proposals; see the top of
> http://www.unicode.org/emoji/selection.html#submission under "submit
> as per Document Submission Details."

To be fair, Max did cite his reason for doing so:

> The “Submitting Character Proposals” page on unicode.org recommends
> to discuss preliminary proposals on this mailing list

That page says:

"Experience has shown that it is often helpful to discuss preliminary proposals before submitting a detailed proposal. One open forum for such discussion is the Unicode mail list. (See Public Email Distribution Lists for subscription instructions.) Sponsors are urged to send a message of inquiry or a preliminary proposal there before formal submission. Many problems and questions can be dealt with there, minimizing the severity of later revisions."

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From steve at swales.us Thu Nov 5 14:54:22 2015
From: steve at swales.us (Steve Swales)
Date: Thu, 5 Nov 2015 12:54:22 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
Message-ID: <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>

Idly wondering if we should have an EMOJI_VARIANT_VULCAN variant selector as well.
-steve

From steve at swales.us Thu Nov 5 15:05:07 2015
From: steve at swales.us (Steve Swales)
Date: Thu, 5 Nov 2015 13:05:07 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net> <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us>
Message-ID: <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>

Or perhaps a slightly greenish skin-tone. This would be useful for depicting dark-net hackers and such as well.

-steve

From verdy_p at wanadoo.fr Thu Nov 5 15:55:03 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 5 Nov 2015 22:55:03 +0100
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net> <17943E11-32D6-4F67-BFE2-35689EEBE63B@swales.us> <084272C4-AA9A-43A0-9040-D40F734E34AD@swales.us>
Message-ID:

And blue? For Martians or Schtroumpfs (the original French name of Peyo's comics characters; their name varies across languages: los Pitufos, Smurflars, die Schlümpfe, el Barrufets, the Smurfs)... However, there are also black and green Schtroumpfs.

From mark at macchiato.com Thu Nov 5 16:11:16 2015
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 5 Nov 2015 14:11:16 -0800
Subject: Emoji Proposal: Face With One Eyebrow Raised
In-Reply-To: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
References: <20151105134712.665a7a7059d7ee80bb4d670165c8327d.74507f936f.wbe@email03.secureserver.net>
Message-ID:

While it is always good to get feedback, I think the advice on that page is outdated. In practice, most proposals to the UTC are not floated on the public discussion list.
One certainly can float it, but it shouldn't be "urged". We also should make clear that the discussions on this list are purely personal opinions, and predominantly from people who are not actually involved in the encoding process.

Mark
From nospam-abuse at ilyaz.org Thu Nov 5 16:11:37 2015
From: nospam-abuse at ilyaz.org (Ilya Zakharevich)
Date: Thu, 5 Nov 2015 14:11:37 -0800
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <563B7C5C.4000209@khwilliamson.com>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com>
Message-ID: <20151105221137.GA5796@math.berkeley.edu>

On Thu, Nov 05, 2015 at 08:57:16AM -0700, Karl Williamson wrote:
> Several of us are wondering about the reason for reserving bits for
> the extended UTF-8 in perl5. I'm asking you because you are the
> apparent author of the commits that did this.

To start, the INTERNAL REPRESENTATION of Perl's strings is the “utf8” format (not “UTF-8”, “extended” or not). [I see that this misprint caused a lot of stir here!] However, outside of a few contexts, this internal representation should not be visible. (However, some of these contexts are close to the default, like read/write in Unicode mode, with -C switch.)

Perl's string is just a sequence of Perl's unsigned integers. [Depending on the build, this may be, currently, 32-bit or 64-bit.] By convention, the “meaning” of small integers coincides with what Unicode says.

> To refresh your memory, in perl5 UTF-8, a start byte of 0xFF causes
> the length of the sequence of bytes that comprise a single character
> to be 13 bytes. This allows code points up to 2**72 - 1 to be
> represented. If the length had been instead 12 bytes, code points up
> to 2**66 - 1 could be represented, which is enough to represent any
> code point possible in a 64-bit word.
>
> The comments indicate that these extra bits are "reserved". So
> we're wondering what potential use you had thought of for these
> bits.

First of all, “reserved” means that they have no meaning. Right?
Second, there are 2 ways in which one may need this INTERNAL format to be extended:

  • 128-bit architectures may be at hand (sooner or later).
  • One may need to allow “objects” to be embedded into Perl strings.

With embedded objects, one must know how to kill them when the string (or its part) is removed. So, while a pointer can fit into a Perl integer, one needs to specify what to do: call DESTROY, or free(), or a user-defined function.

This gives 5 possibilities (3 extra bits) which may be needed with “slots” in Perl strings.

  • Integer (≤64 bits)
  • Integer (≥65 bits)
  • Pointer to a Perl object
  • Pointer to a malloc()ed memory
  • Pointer to a struct which knows how to destroy itself.

      struct self_destroy {
          void *content;
          /* callback that releases this slot */
          void (*destroy)(struct self_destroy *);
      };

Why may one need objects embedded into strings? I explained it in http://ilyaz.org/interview (look for “Emacs” near the middle).

Hope this helps,
Ilya

From verdy_p at wanadoo.fr Thu Nov 5 19:00:54 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 6 Nov 2015 02:00:54 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105221137.GA5796@math.berkeley.edu>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com> <20151105221137.GA5796@math.berkeley.edu>
Message-ID:

2015-11-05 23:11 GMT+01:00 Ilya Zakharevich wrote:

> • 128-bit architectures may be at hand (sooner or later).

This is speculation about something that is still not envisioned: a global worldwide working space where users and applications would interoperate transparently in a giant virtualized environment. However, this virtualized environment will be supported by 64-bit OSes that will never need native support for more than 64-bit pointers.
Those 128-bit entities needed for addressing will not be used to work on units of data but to address some small selection of remote entities. Software that required completely parsing chunks of memory data larger than a 64-bit space would be extremely inefficient; instead this data will be internally structured/paged, and only virtually mapped to some 128-bit global reference (such as GUIDs/UUIDs), only to select smaller chunks within the structure. In most cases those chunks will remain in a 32-bit space: even in today's 64-bit OSes, the largest pages are 20-bit wide, but typically 9-bit wide (512-byte sectors) to 12-bit wide (standard VMM and I/O page sizes, networking MTUs), or about 16-bit wide (such as the transmission window for TCP).

This will not evolve significantly before a major evolution in the worldwide Internet backbones requiring more than about 1 Gigabit/s (a speed not even needed for 4K HD video, but needed only in massive computing grids, still built with a complex mesh of much slower data links). With 64-bit we already reach the physical limits of networking links, and higher speeds using large buses are only for extremely local links whose lengths are well below a few millimeters, within chips themselves.

128-bit is possible, however, but not for the working spaces (or document sizes): it is very unlikely that the ANSI C/C++ "size_t" type will be more than 64-bit (except for a few experiments which will fail to be more efficient).

What is more realistic is that internal buses and caches will be 128 bits or even larger (this is already true for GPU memory), only to support more parallelism or massive parallelism (typically by using vectored instructions working on sets of smaller values). And some data need 128-bit values for their numerical ranges (ALUs in CPU/GPU/APU are already 128-bit, as well as common floating point types) where extra precision is necessary.
I doubt we'll ever see any true native 128-bit architecture at any time in our remaining lifetime. We are still very far from the limit of the 64-bit architecture and it won't happen before the next century (if the current sequential binary model for computing is still used at that time; maybe computing will use predictive technologies returning only heuristic results with a very high probability of giving a good solution to the problems we'll need to solve extremely rapidly, and those solutions will then be validated using today's binary logic with 64-bit computing).

Even in the case where a global 128-bit networking space appeared, users would never be exposed to all that. Most of this content will be inaccessible to them (restricted by security or privacy concerns) and simply unmanageable by them: no one on earth is able to have any idea of what 2^64 bits of global data represents, and no one will ever need it in their whole life. That amount of data will only be partly implemented by large organisations trying to build a giant cloud and wishing to interoperate by coordinating their addressing spaces (for that we now have IPv6).

So your "sooner or later" is very optimistic. IMHO we'll stay with 64-bit architectures for a very long time, up to the time when our sequential computing model will be deprecated and the concept of native integer sizes will be obsoleted and replaced by other kinds of computing "units" (notably parallel vectors, distributed computing, and heuristic computing, or maybe optical computing based on Fourier transforms on analog signals, or quantum computing, where our simple notion of "integers" or even "bits" will not even be placeable into individual physically placed units; their persistence will not even be localized, and there will be redundant/fault-tolerant placements). In fact our computing limits will no longer be in terms of storage space, but in terms of access time, distance and predictability of results.
The next technologies for faster computing will certainly be predictive/probabilistic rather than deterministic (as with today's Turing/Von Neumann machines). "Algorithms" for working with them will be completely different. Fuzzy logic will be everywhere and we'll need binary logic even less, except for small problems. We'll have to live with the possibility of errors, but anyway we already have to live with them even with our binary logic (due to human bugs, hardware faults, accidents, and so on...).

In most problems we don't even need 100% proven solutions (e.g. viewing a high-quality video, we already accept the possibility of some "quirks" occurring, and we already accept some minor alteration of the exact pixel colors in which we can't even note any visible difference from the original; another example is what we call a "scientific proof", which is in fact only a solution with the highest probability of being correct in almost all known contexts, because we can never reproduce exactly the same experimental environment: even a basic binary NAND gate cannot be guaranteed 100% to always return a "0" state after a defined delay when its inputs are all "1"). We can certainly produce results with the same (or better) probability of giving the expected result using fuzzy logic (or quantum logic) rather than existing binary logic, and certainly with smaller computing delays (and better throughput and better fault tolerance, including after hardware faults or damage, and even with better security).
From otto.stolz at uni-konstanz.de Fri Nov 6 05:48:10 2015
From: otto.stolz at uni-konstanz.de (Otto Stolz)
Date: Fri, 6 Nov 2015 12:48:10 +0100
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105221137.GA5796@math.berkeley.edu>
References: <55136F5F.6080808@kli.org> <551373BB.70200@kli.org> <23498913.48204.1427383089966.JavaMail.defaultUser@defaultHost> <1244808.30647.1427461209141.JavaMail.defaultUser@defaultHost> <20150327180725.GA9968@math.berkeley.edu> <563B7C5C.4000209@khwilliamson.com> <20151105221137.GA5796@math.berkeley.edu>
Message-ID: <563C937A.6020806@uni-konstanz.de>

On 05.11.2015 at 23:11, Ilya Zakharevich wrote:
> First of all, "reserved" means that they have no meaning. Right?

Almost. "Reserved" means that they currently have no meaning but may be assigned a meaning later; hence you ought not use them lest your programs, or data, be invalidated by later amendments of the pertinent specification.

In contrast, "invalid", or "ill-formed" (Unicode term), means that the particular bit pattern may never be used in a sequence that purports to represent Unicode characters. In practice, that means that no program is allowed to send those ill-formed patterns in Unicode-based data exchange, and every program should refuse to accept those ill-formed patterns in Unicode-based data exchange.

What a program does internally is at the discretion (or should I say: "whim"?) of its author, of course - as long as the overall effect of the program complies with the standard.
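The "refuse to accept" side of this distinction can be observed with any strict UTF-8 decoder. A minimal sketch, assuming Python's built-in "utf-8" codec as the strict decoder; the helper name is_well_formed_utf8 is hypothetical, and the 13-byte sample is merely shaped like the 0xFF-led form described earlier in this thread, not taken from Perl output.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
    except UnicodeDecodeError:
        return False
    return True


# The byte 0xFF never occurs in well-formed UTF-8, so a sequence that
# starts with it must be rejected in interchange.
ff_led_sequence = bytes([0xFF]) + bytes([0x80]) * 12

# A UTF-8-style encoding of the surrogate code point U+D800 is ill-formed too.
encoded_surrogate = b"\xed\xa0\x80"

if __name__ == "__main__":
    print(is_well_formed_utf8("é".encode("utf-8")))
    print(is_well_formed_utf8(ff_led_sequence))
    print(is_well_formed_utf8(encoded_surrogate))
```

A program may still use such byte patterns internally, as described above; the check only matters at the point where data is exchanged as UTF-8.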
Best wishes,
Otto Stolz

From richard.wordingham at ntlworld.com Fri Nov 6 14:32:20 2015
From: richard.wordingham at ntlworld.com (Richard Wordingham)
Date: Fri, 6 Nov 2015 20:32:20 +0000
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>
References: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net>
Message-ID: <20151106203220.2b2fd15c@JRWUBU2>

On Thu, 05 Nov 2015 13:41:42 -0700 "Doug Ewell" wrote:
> Richard Wordingham wrote:
>
> > No-one's claiming it is for a Unicode Transformation Format (UTF).
>
> Then they ought not to call it "UTF-8" or "extended" or "modified"
> UTF-8, or anything of the sort, even if the bit-shifting algorithm is
> based on UTF-8.

> "UTF-8 encoding form" is defined as a mapping of Unicode scalar values
> -- not arbitrary integers -- onto byte sequences. [D92]

If it extends the mapping of Unicode scalar values *into* byte sequences, then it's an extension. A non-trivial extension of a mapping of scalar values has to have a larger domain.

I'm assuming that 'UTF-8' and 'UTF' are not registered trademarks.

Richard.

From public at khwilliamson.com Fri Nov 6 22:50:04 2015
From: public at khwilliamson.com (Karl Williamson)
Date: Fri, 6 Nov 2015 21:50:04 -0700
Subject: Question about Perl5 extended UTF-8 design
In-Reply-To: <20151106203220.2b2fd15c@JRWUBU2>
References: <20151105134142.665a7a7059d7ee80bb4d670165c8327d.39cf275f13.wbe@email03.secureserver.net> <20151106203220.2b2fd15c@JRWUBU2>
Message-ID: <563D82FC.5060509@khwilliamson.com>

I have no idea how my original message ended up being marked to send to this list. I'm sorry. It was meant to be a personal message for someone who I believe was involved in the original design.

From karl-pentzlin at acssoft.de Sat Nov 7 04:38:39 2015
From: karl-pentzlin at acssoft.de (Karl Pentzlin)
Date: Sat, 7 Nov 2015 11:38:39 +0100
Subject: Finnish emoji
Message-ID: <1802721603.20151107113839@acssoft.de>

Just FYI (without any claim of relevance by myself), this site "produced by the [Finnish] Ministry for Foreign Affairs, Department for Communication" about an "own set of country themed emoji":
http://finland.fi/life-society/the-headbanger-throws-his-phone-away-and-goes-to-sauna/

- Karl Pentzlin

From wjgo_10009 at btinternet.com Sat Nov 7 09:00:41 2015
From: wjgo_10009 at btinternet.com (William_J_G Overington)
Date: Sat, 7 Nov 2015 15:00:41 +0000 (GMT)
Subject: Finnish emoji (offlist)
In-Reply-To: <1802721603.20151107113839@acssoft.de>
References: <1802721603.20151107113839@acssoft.de>
Message-ID: <20180342.28942.1446908441289.JavaMail.defaultUser@defaultHost>

Hi

Thank you for sharing the link. This is an interesting development.
Best regards,
William Overington
7 November 2015

From peroyomaslists at gmail.com Mon Nov 9 13:32:15 2015
From: peroyomaslists at gmail.com (Andrés Sanhueza)
Date: Mon, 9 Nov 2015 16:32:15 -0300
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
Message-ID:

Hello. I was looking for info in Spanish about some rare punctuation symbols and found one in some XIX century Spanish books (via Google Books) that I haven't seen referenced anywhere. It was called "millar", which translates roughly as "thousand". It seems that it had at least four glyph variants, yet the quality of the scans makes it a bit difficult to reproduce them exactly.

[image: Imágenes integradas 1]

A sample from "Manual del cajista" by José María Palacios (1845). It says (poorly translated):

> The millar ([symbol]) (or millaron, as it is commonly called) is the
> abbreviation for the zeros, when one types an amount in thousands: so,
> with a single numeral and one of these signs it can be read as thousands.

The description is not very clear, but I understand that the sign is an abbreviation of the three zeros that come in one thousand. So, instead of writing 40.000, one can write 40[symbol].

In the text the sign is given the look of a turned C with a lightning bolt in it, but I may be wrong.

[image: Imágenes integradas 2]

Another sample from "Gramática castellana fundada sobre principios filosóficos" by Francesc Pons i Argentó (1850), with a more straightforward description.
> Among counters the same name is given to each of these signs [symbol1],
> [symbol2], [symbol3] to denote thousand. So 20[symbol1] is read twenty
> thousand, 30[symbol2], thirty thousand, 40[symbol3], forty thousand.

Now there are three glyph variants. One is a stand-alone turned C. Another is a turned C with two bars as an overlay. The third looks like two f's turned 180°, or two j's with a small bar.

Another sample from "Manual de la tipografia española, ó sea, El Arte de la imprenta" by Antoni Serra i Oliveres (1852).

[image: Imágenes integradas 4]

In this one, the millar looks like a straight C with two overlay bars. The other symbols mentioned look like familiar ones (the "sueldos" (salaries) one looks like a small s in superscript; I guess it is just an abbreviation. I'm a bit confused by the letters with diacritics, but they don't seem to be anything unknown).

Does anyone have more insight about this?

From ken.shirriff at gmail.com Mon Nov 9 15:32:17 2015
From: ken.shirriff at gmail.com (Ken Shirriff)
Date: Mon, 9 Nov 2015 13:32:17 -0800
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To:
References:
Message-ID:

I took a quick look to see if I could find any other examples, which probably confuses things more. Take a look at this book, which describes millar symbols: 20? and 40JJ (approximately) for 20 thousand and 40 thousand.
https://books.google.com/books?id=FBEMAQAAIAAJ&dq=%22denotar%20el%20millar%22&pg=PA161#v=onepage&q=%22denotar%20el%20millar%22&f=false

Another document says the calderón (i.e. pilcrow ¶) can be used for thousands.
https://books.google.com/books?id=MtxGAAAAIAAJ&dq=%22denotar%20el%20millar%22&pg=PA214#v=onepage&q=%22denotar%20el%20millar%22&f=false

Ken

From charupdate at orange.fr Sun Nov 15 07:58:55 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 15 Nov 2015 14:58:55 +0100 (CET)
Subject: Latin glottal stop in ID in NWT, Canada
Message-ID: <1203727072.8694.1447595935864.JavaMail.www@wwinf1p10>

Dear Leo,

Thank you for your kind reply. I hope it will meet anticipated expectations.

It's however shocking that when traditional languages require supplemental means for a performative orthography, this is referred to as "proliferating arbitrary characters defined as "latin letter" in Unicode".

I feel with your concern about people's inertia and their attitude of being eager to cut short, root out, and throw away nature's beauties. Applied to aboriginal languages and to proper names, this practice isn't like pruning, it's like spraying herbicide, and even worse: upon flowers.
One often forgets that the ancient Romans themselves added "new characters" in order to put themselves into a position to efficiently spell foreign names. We find these additional letters at the very end of the "Roman alphabet", which turns out to be already a kind of "extended Latin." Today, Arabic and IPA obviously take over the role that Greek or Phoenician played at the time.

The idea that everything must be spelt in US-ASCII, or that everything must be written in Latin-1 or in CP-1252, or that everything must at least be encoded on one single byte, couldn't arise before the computer age. Today our mission, if we accept it, is to help Unicode extend the invitation to make smarter use of the tool.

IMHO, "ease of data interchange" is meant to be ensured by using UTF-8. And even in plain ASCII, non-ASCII characters can be represented as HTML entities. The problem clearly is not interchange; it's storage and local processing, thus an issue of software and related hardware. Here are the means to implement respectfulness towards *all* individuals who aim at respecting their language, their traditions, and the values of faithfulness, democracy, and humanity.

To win the current war, we had best first stop withholding support from our aboriginal and other official languages. French too must stop being threatened in Canada. Unity in diversity is part of our strength.

I believe that once this thread has been launched on the Unicode Mailing List, that is to be added. Today is likely to be the right time.

Best regards,
Marcel

On Thu, 29 Oct 2015 10:20:35 -0700, Leo Broukhis wrote:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m10/0225.html
[Link provided instead of quotation in conformance to List policies.]
From charupdate at orange.fr Mon Nov 16 06:38:48 2015
From: charupdate at orange.fr (Marcel Schneider)
Date: Mon, 16 Nov 2015 13:38:48 +0100 (CET)
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To:
References:
Message-ID: <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>

On Mon, 9 Nov 2015 16:32:15 -0300, Andrés Sanhueza wrote:

> Hello. I was looking for info in Spanish about some rare punctuation symbols and found one in some XIX century Spanish books (via Google Books) that I haven't seen referenced anywhere. It was called "millar", which translates roughly as "thousand". It seems that it had at least four glyph variants, yet the quality of the scans makes it a bit difficult to reproduce them exactly.

> A sample from "Manual del cajista" by José María Palacios (1845).

> In the text the sign is given the look of a turned C with a lightning bolt in it

Upscaled, it looks like a reversed C with two little overlaid solidi. I can't address the challenge of representing it in Unicode. As an approximation, one might suggest a turned C with (one) small solidus overlay:

U+0186 LATIN CAPITAL LETTER OPEN O, U+0337 COMBINING SHORT SOLIDUS OVERLAY.

Reversed C is available in lowercase only (U+2184 LATIN SMALL LETTER REVERSED C).

> Another sample from "Gramática castellana fundada sobre principios filosóficos" by Francesc Pons i Argentó (1850)

> Now there's three glyphs variants. One is an stand-alone turned C. Other is a turned C with two bars as an overlay. The other looks like two f's turned 180°, or two j's with an small bar.

In digital typography, these turned characters could IMO be raised onto the baseline, as is current practice in Unicode.

The second is in fact a turned Colon sign. This can be represented fairly well (on the condition that overlay combining diacritics are properly implemented):

U+0186 LATIN CAPITAL LETTER OPEN O, U+20E6 COMBINING DOUBLE VERTICAL STROKE OVERLAY

The third looks like a turned small ligature ff.
I see no other way than using two turned f's (possibly with reduced letter spacing):

U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE, U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE

> Another sample from "Manual de la tipografia española, ó sea, El Arte de la imprenta" by Antoni Serra i Oliveres (1852).

> In this one, the millar looks like an straight C with two overlay bars.

This being now U+20A1 COLON SIGN, use as a thousand sign would be biased.

On Mon, 9 Nov 2015 13:32:17 -0800, Ken Shirriff wrote:

> Take a look at this book, which describes millar symbols: 20? and 40JJ (approximately)

U+0254 LATIN SMALL LETTER OPEN O as a thousands sign is straightforward, especially with Elzevirian digits as quoted from "Critica de lenguaje" by Féliz Ramos i Duarte (1896). When uppercase is preferred for the double J, I suppose that's to have it dotless. This is available in lowercase: U+0237 LATIN SMALL LETTER DOTLESS J, U+0237 LATIN SMALL LETTER DOTLESS J.

I'm not sure whether I've answered what Andrés really intended to learn by launching the thread. In any case I took it as a touchstone for Unicode completeness.

Best regards,

Marcel

From verdy_p at wanadoo.fr Mon Nov 16 10:51:11 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 16 Nov 2015 17:51:11 +0100
Subject: Rare "Thousand sign" (or "Millar") in XIX century Spaniard books
In-Reply-To: <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>
References: <563810292.11167.1447677528891.JavaMail.www@wwinf1c25>
Message-ID:

On 16 Nov. 2015 13:56, "Marcel Schneider" wrote:
> The third looks like a turned small ligature ff. I see no other way than using two turned f's (eventually with reduced letter spacing):
>
> U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE, U+025F LATIN SMALL LETTER DOTLESS J WITH STROKE

If this is a ligature of two letters they really should be joined with ZWJ...
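The candidate sequences suggested in this thread, including the ZWJ-joined pair, can be written out and inspected with Python's standard unicodedata module. This is only a sketch for checking code points and character names; the dictionary keys are informal labels made up here, and whether a font actually renders the overlays or the joined pair as intended is a separate question.

```python
import unicodedata

# Informal labels (hypothetical) mapped to the sequences named in the thread.
CANDIDATES = {
    "turned C with one short solidus": "\u0186\u0337",
    "turned C with double vertical stroke": "\u0186\u20E6",
    "two turned f's": "\u025F\u025F",
    "two turned f's joined with ZWJ": "\u025F\u200D\u025F",
    "two dotless j's": "\u0237\u0237",
}


def describe(text):
    """List each character of text as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


if __name__ == "__main__":
    for label, sequence in CANDIDATES.items():
        print(label)
        for line in describe(sequence):
            print("   ", line)
```

Listing the names this way makes it easy to confirm that a sequence typed into a document really contains the intended base letter and combining overlay, rather than a look-alike.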
Otherwise the two letters will not have any significance and won't be associated with the millar; they'll just read as two strange letters which are not even an abbreviation of the word millar. Such a hint is needed in this case, semantically, even if the ligature will not necessarily be rendered.

From jknappen at web.de Thu Nov 26 02:10:36 2015
From: jknappen at web.de ("Jörg Knappen")
Date: Thu, 26 Nov 2015 09:10:36 +0100
Subject: Aw: New Character Property for Prepended Concatenation Marks
In-Reply-To: <56563758.1040906@unicode.org>
References: <56563758.1040906@unicode.org>
Message-ID:

An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Thu Nov 26 04:41:47 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 11:41:47 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To:
References: <56563758.1040906@unicode.org>
Message-ID:

The root sign is much more complex than just prepending specific sequences of characters (in a limited set): when it embeds some "text", it can do so recursively, and unless you use additional parentheses for the linear presentation, it highly depends on the 2D layout of its operand (additionally it could itself be prefixed by a superscripted radix value). Leave it alone: the 2D layout (even in the linear presentation using parentheses where needed) will be mapped using an additional mathematical presentation layer and notation. For basic plain text, the root sign will just stay alone without using any complex layout, and its operand will simply follow it (using parentheses where needed) without specific rendering.
----

However the proposal for these prepended concatenation marks does not give any hint about how to compute the extent of the following clusters above/over/below/around which they will apply (do they extend over only letters/digits, but not whitespace or punctuation signs, including abbreviation marks?).

For me this kind of visual interaction should be more explicitly delimited using special marks (working like invisible parentheses): the absence of these special marks immediately after the prepended concatenation mark should mean that the mark will not extend beyond the next (non-whitespace) cluster.

So:

- [...] will display the isolated number sign WITHOUT extending to the following space and digit

- [...] will apply the number sign ONLY to the first digit

- <[...] TWO, END OF SEQUENCE> will apply the number sign to the two digits

- <[...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits and the separating full stop

- <[...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two digits and the separating space

- <[...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the first digit only before the newline control; the second digit will appear on the next line outside the number sign complex cluster, and the second control will be ignored (or would display with a "visible control glyph").

Without the [...] and [...] special controls, it will be necessary anyway to define specific enumerations of characters that can be part of the sequence to which the prepended mark will apply.

Another complication: when such prepended sequences are recognized, there are specific tunings to apply in line-breaking algorithms.

Word-breaking algorithms may not need specific changes if the enumerations of characters that can be part of the prepended sequence cannot contain any word-breaking character. That's why I suggested that, by default, such enumerations should include only letters and digits but not whitespace (and probably not punctuation signs such as the dot), plus their additional combining marks.
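The default rule suggested above (letters and digits, plus their combining marks, but no whitespace or punctuation) can be sketched in a few lines. This is only an illustration of that suggestion, not an algorithm taken from the proposal or from any Unicode specification: the function name subtended_extent is hypothetical, and general categories from Python's unicodedata module stand in for the "enumerations" mentioned.

```python
import unicodedata


def subtended_extent(text, start):
    """Return the index just past the run that a prepended mark at
    text[start] would cover under the suggested default rule:
    letters (L*) and digits (N*), plus combining marks (M*) that
    follow such a base; anything else ends the run."""
    i = start + 1
    seen_base = False
    while i < len(text):
        major = unicodedata.category(text[i])[0]
        if major in ("L", "N"):
            seen_base = True
        elif major == "M" and seen_base:
            pass  # combining mark attached to a preceding letter or digit
        else:
            break  # whitespace, punctuation, controls, ...
        i += 1
    return i


if __name__ == "__main__":
    number_sign = "\u0600"           # ARABIC NUMBER SIGN
    one, two = "\u0661", "\u0662"    # ARABIC-INDIC DIGIT ONE / TWO
    for sample in (number_sign + one + two,
                   number_sign + one + "." + two,
                   number_sign + " " + one):
        covered = subtended_extent(sample, 0) - 1
        print(covered, "character(s) covered")
```

Under this rule a full stop or a space ends the run, so covering "1.2" or "1 2" as a whole would need either explicit delimiters or a wider enumeration, which is the trade-off discussed above.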
- For Arabic U+0600, U+0601 and U+0605 (TUS-9.2, page 374), the enumeration is supposed to contain only Arabic-Indic or extended Arabic-Indic digits, but I wonder if it should not include number separators as well, or even Arabic-European digits.
- Same remark for the Kaithi number sign U+110BD.
- For Syriac U+070F (TUS-9.3, pages 390-391), the enumeration is not so obvious (all Syriac "letter-numbers"?)

There are also similar characters in other scripts not listed: one example is the Cyrillic hundred-thousands/millions marks U+0488..U+0489, which possibly enclose more than one digit (currently encoded as combining marks applicable to only one digit?); another is the Egyptian Hieroglyph honorific "Cartouche" which encloses the name of a king; other examples are possible as well in other Asian scripts for honorific marks.

The system using explicitly delimited sequences would work as well with the Latin script for some honorific "decorators" which are not just ligatures, e.g. for the name of God or Jesus Christ (which may also be themselves abbreviated), including for Quranic transcriptions.

-- Philippe.

2015-11-26 9:10 GMT+01:00 "Jörg Knappen" :

> I wonder how this concept relates to mathematical notation, especially the
> root sign.
>
> --Jörg Knappen
>
> *Sent:* Wednesday, 25 November 2015 at 23:34
> *From:* announcements at unicode.org
> *To:* announcements at unicode.org
> *Subject:* New Character Property for Prepended Concatenation Marks
>
> The Unicode Technical Committee is seeking feedback on a proposal to
> define a new character property for the class of *prepended concatenation
> marks*, also referred to as *prefixed format control characters* or, more
> generically, as subtending marks. Characters in that class include U+0600
> ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH.
The new property, named > Prepended_Concatenation_Mark and targeted for Unicode 9.0, would provide a > mechanism to handle subtending marks collectively via properties rather > than by hardcoded enumeration. A detailed description of the issue and how > to provide feedback are given in Public Review Issue #310 > . > > http://blog.unicode.org/2015/11/new-character-property-for-prepended.html > > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 1743 bytes Desc: not available URL: From asmus-inc at ix.netcom.com Thu Nov 26 04:50:51 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 26 Nov 2015 02:50:51 -0800 Subject: Aw: New Character Property for Prepended Concatenation Marks In-Reply-To: References: <56563758.1040906@unicode.org> Message-ID: <5656E40B.2050905@ix.netcom.com> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 1743 bytes Desc: not available URL: From asmus-inc at ix.netcom.com Thu Nov 26 04:56:44 2015 From: asmus-inc at ix.netcom.com (Asmus Freytag (t)) Date: Thu, 26 Nov 2015 02:56:44 -0800 Subject: New Character Property for Prepended Concatenation Marks In-Reply-To: References: <56563758.1040906@unicode.org> Message-ID: <5656E56C.3090205@ix.netcom.com> An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Nov 26 05:08:43 2015 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 26 Nov 2015 12:08:43 +0100 Subject: New Character Property for Prepended Concatenation Marks In-Reply-To: References: <56563758.1040906@unicode.org> Message-ID: The related definition for extended grapheme clusters says: ( CRLF | *Prepend* *( RI-sequence | Hangul-Syllable | !Control ) ( Grapheme_Extend | *SpacingMark* )* | . 
) However I do not understand why it may include only one Hangul-Syllable when applying prepended concatenation marks. And if the definition excludes whitespaces, nothing prevents it to extend to arbitrary sequences of letters/digits/symbols/punctuations (this could span very long sequences of sinograms, or other letters from scripts that do not use whitespaces as word separators. Even in the Latin script it would extend to the punctuation signs that may follow any word, or to an entire mathematical formula such as "1+2*3" but not "sin x"... 2015-11-26 11:41 GMT+01:00 Philippe Verdy : > The root sign is much more complex than just prepending specific sequences > of characters (in a limited set): when it embeds some "text", it can it it > recursively and unless you use additional parentheses for the linear > presentation, it highly depends on the 2D layout of its operand > (additionally it could be prefixed itself by a superscripted radix value). > Leave it alone: the 2D layout (even in the linear presentation using > parentheses where needed) will be mapped using an additional mathematical > presentaiton layer and notation. > For the basic plain-text, the root sign will just stay alone without using > any complex layout, and its operand will simply follow it (using > parentheses where needed) without specific rendering. > > ---- > > However the proposal for these prepended concatenation marks does not give > any hint about how to compute the extent of the following clusters > above/over/below/around which they will apply (do they extend over only > letters/digits, but not whitespaces or punctuation signs including > abbreviation marks? > > For me this kind of visual interaction should be more explicitly delimited > using special marks (working like invisible parentheses) : the absence of > these special marks immediately after the prepended concatenation mark > should mean that they will not extend after the next (non-whitespace) > cluster. 
>
>
> So:
>
> - [...] will display the isolated number sign WITHOUT extending to the
>   following space and digit
>
> - [...] will apply the number sign ONLY to the first digit
>
> - [...] TWO, END OF SEQUENCE> will apply the number sign to the two digits
>
> - [...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two
>   digits and the separating full stop
>
> - [...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the two
>   digits and the separating space
>
> - [...] DIGIT TWO, END OF SEQUENCE> will apply the number sign to the first
>   digit only, before the newline control; the second digit will appear on
>   the next line outside the number sign complex cluster, and the second
>   control will be ignored (or would display with a "visible control
>   glyph").
>
> Without the [...] and [...] special controls, it will be necessary anyway
> to define specific enumerations of characters that can be part of the
> sequence on which the prepended mark will apply.
>
> Another complication: when such prepended sequences are recognized, there
> are specific tunings to apply in line-breaking algorithms.
>
> Word-breaking algorithms may not need specific changes if the enumerations
> of characters that can be part of the prepended sequence cannot contain any
> word-breaking character. That's why I suggested that, by default, such
> enumerations should include only letters and digits but not whitespace (and
> probably not punctuation signs such as the dot), plus their additional
> combining marks.
>
> - For Arabic U+0600, U+0601 and U+0605 (TUS-9.2, page 374), the
>   enumeration is supposed to contain only Arabic-Indic or extended
>   Arabic-Indic digits, but I wonder if it should not include number
>   separators as well, or even Arabic-European digits.
> - Same remark for the Kaithi number sign U+110BD.
> - For Syriac U+070F (TUS-9.3, pages 390-391), the enumeration is not so
>   obvious (all Syriac "letter-numbers"?)
>
> There are also similar characters in other scripts not listed: one example
> is the Cyrillic hundred-thousands/millions marks U+0488..U+0489, which
> possibly enclose more than one digit (currently encoded as combining marks
> applicable to only one digit?); another is the Egyptian Hieroglyph
> honorific "Cartouche", which encloses the name of a king; other examples
> are possible as well in other Asian scripts for honorific marks.
>
> The system using explicitly delimited sequences would work as well with
> the Latin script for some honorific "decorators" which are not just
> ligatures, e.g. for the name of God or Jesus Christ (which may also be
> themselves abbreviated), including for Quranic transcriptions.
>
> -- Philippe.
>
>
> 2015-11-26 9:10 GMT+01:00 "Jörg Knappen" :
>
>> I wonder how this concept relates to mathematical notation, especially
>> the root sign.
>>
>> --Jörg Knappen
>>
>> *Sent:* Wednesday, 25 November 2015 at 23:34
>> *From:* announcements at unicode.org
>> *To:* announcements at unicode.org
>> *Subject:* New Character Property for Prepended Concatenation Marks
>>
>> The Unicode Technical Committee is seeking feedback on a proposal to
>> define a new character property for the class of *prepended
>> concatenation marks*, also referred to as *prefixed format control
>> characters* or, more generically, as subtending marks. Characters in
>> that class include U+0600 ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH.
>> The new property, named Prepended_Concatenation_Mark and targeted for
>> Unicode 9.0, would provide a mechanism to handle subtending marks
>> collectively via properties rather than by hardcoded enumeration. A
>> detailed description of the issue and how to provide feedback are given
>> in Public Review Issue #310.
>>
>> http://blog.unicode.org/2015/11/new-character-property-for-prepended.html

From asmus-inc at ix.netcom.com Thu Nov 26 05:38:13 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 03:38:13 -0800
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To:
References: <56563758.1040906@unicode.org>
Message-ID: <5656EF25.9080903@ix.netcom.com>

An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Thu Nov 26 06:29:41 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 13:29:41 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To: <5656EF25.9080903@ix.netcom.com>
References: <56563758.1040906@unicode.org> <5656EF25.9080903@ix.netcom.com>
Message-ID:

2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) :

> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>
> The related definition for extended grapheme clusters says:
>
> ( CRLF
> | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>   ( Grapheme_Extend | SpacingMark )*
> | . )
>
> However, I do not understand why it may include only one Hangul-Syllable
> when applying prepended concatenation marks. And if the definition excludes
> whitespace, nothing prevents it from extending to arbitrary sequences of
> letters, digits, symbols or punctuation (this could span very long
> sequences of sinograms, or other letters from scripts that do not use
> whitespace as a word separator). Even in the Latin script it would extend
> to the punctuation signs that may follow any word, or to an entire
> mathematical formula such as "1+2*3" but not "sin x"...
>
>
> White space is clearly NOT part of a grapheme cluster, so I don't see what
> your issue is?
>

No, whitespace is a grapheme cluster on its own, matching (.)

The issue is the overlong extended grapheme cluster after any Prepend,
which occurs because of ( Grapheme_Extend | SpacingMark )*
But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore
the rare RI-sequences, which are still short) and will not match the
sequences of digits or letters intended by the prepended concatenation
marks, but only one.

> BTW, if after careful analysis you think there is a mistake, you should
> probably raise a bug on this.
>

For now the proposal only speaks about listing the prepended characters by
enumeration with a newly defined property; it still does not address which
sequences of graphemes they apply over. As these sequences are specific to
each prepended character, I don't see how the new property will help if we
need to specialize each one of these characters: we still need a custom
algorithm (possibly tailored by locale) for breaking clusters using them.

With the definition given above, the extended grapheme clusters will break
after each letter/digit/punctuation and

[...] will still break into [...] separated from [...]

The proposed new property does not change this: how can we really extend
the sequence of digits so that the number sign will span all of them? Use
CGJ or explicit sequence delimiters?
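
The break behaviour described in the message above can be sketched in a few lines of Python. This is a rough illustration, not the UAX #29 algorithm: it assumes that the prepended concatenation marks named in the thread carry Grapheme_Cluster_Break=Prepend (the thread does not confirm this for any Unicode version), it approximates Grapheme_Extend and SpacingMark by general category, and it leaves out RI-sequences and Hangul syllables. The names `PREPEND` and `clusters` are made up for this sketch.

```python
import unicodedata

# Assumption for illustration only: these marks are treated as
# Grapheme_Cluster_Break=Prepend.
PREPEND = {0x0600, 0x0601, 0x0605, 0x06DD, 0x070F, 0x110BD}
JOINERS = {0x200C, 0x200D}  # ZWNJ and ZWJ behave as Extend, not Control


def is_prepend(ch):
    return ord(ch) in PREPEND


def is_control(ch):
    # Rough stand-in for the Control class (CR and LF fall in here too).
    if is_prepend(ch) or ord(ch) in JOINERS:
        return False
    return unicodedata.category(ch) in {"Cc", "Cf", "Zl", "Zp"}


def is_extend(ch):
    # Rough stand-in for ( Grapheme_Extend | SpacingMark ).
    return ord(ch) in JOINERS or unicodedata.category(ch) in {"Mn", "Me", "Mc"}


def clusters(text):
    """Split text following
    ( CRLF | Prepend* !Control ( Grapheme_Extend | SpacingMark )* | . )."""
    out = []
    i, n = 0, len(text)
    while i < n:
        start = i
        if text.startswith("\r\n", i):
            i += 2
        else:
            j = i
            while j < n and is_prepend(text[j]):
                j += 1
            if j < n and not is_control(text[j]):
                j += 1  # exactly one base character
                while j < n and is_extend(text[j]):
                    j += 1
                i = j
            else:
                i += 1  # the "." branch: a single code point
        out.append(text[start:i])
    return out
```

Under these assumptions, U+0600 followed by "12 3" is split into the number sign plus the first digit, then "2", the space, and "3" as separate clusters: the mark attaches to one base character only, which is the limitation the message points out.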

From asmus-inc at ix.netcom.com Thu Nov 26 06:58:44 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Thu, 26 Nov 2015 04:58:44 -0800
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To:
References: <56563758.1040906@unicode.org> <5656EF25.9080903@ix.netcom.com>
Message-ID: <56570204.8090209@ix.netcom.com>

On 11/26/2015 4:29 AM, Philippe Verdy wrote:
> 2015-11-26 12:38 GMT+01:00 Asmus Freytag (t):
>
>     On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>>     The related definition for extended grapheme clusters says:
>>
>>     ( CRLF
>>     | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>>       ( Grapheme_Extend | SpacingMark )*
>>     | . )
>>
>>     However, I do not understand why it may include only one
>>     Hangul-Syllable when applying prepended concatenation marks. And
>>     if the definition excludes whitespace, nothing prevents it from
>>     extending to arbitrary sequences of letters, digits, symbols or
>>     punctuation (this could span very long sequences of sinograms, or
>>     other letters from scripts that do not use whitespace as a word
>>     separator). Even in the Latin script it would extend to the
>>     punctuation signs that may follow any word, or to an entire
>>     mathematical formula such as "1+2*3" but not "sin x"...
>
>     White space is clearly NOT part of a grapheme cluster, so I don't see
>     what your issue is?
>
>
> No, whitespace is a grapheme cluster on its own, matching (.)
>
> The issue is the overlong extended grapheme cluster after any Prepend,
> which occurs because of ( Grapheme_Extend | SpacingMark )*
> But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we
> ignore the rare RI-sequences, which are still short) and will
> not match the sequences of digits or letters intended by the prepended
> concatenation marks, but only one.

Prepend in front of an RI-Sequence is really a "defective" cluster in terms
of the Arabic number sign's definition.

So, one thing the grapheme cluster specification should be clear about is
that it does not describe the breaks in formatting runs needed to implement
these characters.

Also, for editing (a common use of grapheme clusters), running these
together with any following characters is not very useful, in my opinion.

So, perhaps much of the "Prepend" is a bug after all?

>
>     BTW, if after careful analysis you think there is a mistake, you
>     should probably raise a bug on this.
>
>
> For now the proposal only speaks about listing the prepended
> characters by enumeration with a newly defined property; it still does
> not address which sequences of graphemes they apply over. As
> these sequences are specific to each prepended character, I don't see
> how the new property will help if we need to specialize each one of
> these characters: we still need a custom algorithm (possibly tailored by
> locale) for breaking clusters using them.

Correct. I wouldn't call that an "algorithm" -- it's the formatting behavior
for that code point (some of them are similar; as I said, I see three
patterns: following digit, digit run and word run).

>
> With the definition given above, the extended grapheme clusters will
> break after each letter/digit/punctuation and
>
> [...] will still break into [...] separated from [...]
> The proposed new property does not change this: how can we really
> extend the sequence of digits so that the number sign will span all of
> them? Use CGJ or explicit sequence delimiters?
>

correct, gives an incorrect specification - we need an actual specification
for the format runs.

A./
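
The three patterns mentioned in the reply above (following digit, digit run, word run) could be expressed as a per-character extent rule. The sketch below is only an illustration under stated assumptions: the `Extent` enum, the `run_end` function, and the use of Python's `str.isdigit` and `str.isalnum` plus combining-mark categories as stand-ins for "digit" and "word" are all hypothetical, and the thread does not say which mark follows which pattern.

```python
import unicodedata
from enum import Enum


class Extent(Enum):
    """Hypothetical names for the three patterns listed in the reply."""
    FOLLOWING_DIGIT = "following digit"
    DIGIT_RUN = "digit run"
    WORD_RUN = "word run"


def _is_word_char(ch):
    # Letters, digits, and combining marks; whitespace and punctuation
    # end the run. This is an assumption, not a rule from the thread.
    return ch.isalnum() or unicodedata.category(ch).startswith("M")


def run_end(text, start, extent):
    """Return the index one past the last character covered by a mark
    whose operand begins at text[start]."""
    if extent is Extent.FOLLOWING_DIGIT:
        covered = start < len(text) and text[start].isdigit()
        return start + 1 if covered else start
    accepts = str.isdigit if extent is Extent.DIGIT_RUN else _is_word_char
    end = start
    while end < len(text) and accepts(text[end]):
        end += 1
    return end
```

For a string consisting of U+0600, three Arabic-Indic digits, a space, and one more digit, the digit-run rule covers the three digits and stops at the space, while the following-digit rule covers only the first digit. Such a rule describes a formatting run and is separate from grapheme cluster boundaries, which is the distinction the reply draws.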

From verdy_p at wanadoo.fr Thu Nov 26 07:04:27 2015
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 26 Nov 2015 14:04:27 +0100
Subject: New Character Property for Prepended Concatenation Marks
In-Reply-To:
References: <56563758.1040906@unicode.org> <5656EF25.9080903@ix.netcom.com>
Message-ID:

Also, for Kaithi (TUS-15.2 pages 570-571) I note this paragraph:

    The character U+110BD KAITHI NUMBER SIGN is a format control character
    that interacts with digits, occurring either above or below a digit. The
    position of the Kaithi number sign indicates its usage: when the mark
    occurs above a digit, it indicates a number in an itemized list, similar
    to U+2116 NUMERO SIGN. If it occurs below a digit, it indicates a
    numerical reference. Like U+0600 ARABIC NUMBER SIGN and the other Arabic
    signs that span numbers (see Section 9.2, Arabic), the Kaithi number
    sign precedes the numbers they graphically interact with, rather than
    following them, as would combining characters. The U+110BC KAITHI
    ENUMERATION SIGN is the spacing version of the Kaithi number sign, and
    is used for inline usage.

However, there's absolutely no indication of how to disambiguate the two
usages and presentations if these are unified within the same U+110BD
character. In both cases it will be encoded before the Kaithi digits.

Note that U+110BC is a separate standalone usage (as a symbol without any
number) which is a priori much more limited.

Possibly something was forgotten there:

- add an additional (joiner) control between it and the digits for the
  numeric reference (e.g. note calls), and none for itemized lists
  (including when numbering section headings)?
- or encode a separate character for its usage in numeric references (below
  numbers)

In the Latin script, both usages are generally distinguished but no
specific mark is used (with the exception of the legacy Numero symbol), and
there's no need to tweak the default presentation of clusters:

- the "numero" symbol or abbreviation (N or n + superscript o) is used for
  references, or the number itself is put in superscript or between
  [brackets],
- but for itemized lists, the indicator is typically a suffix after the
  number (e.g. a dot or hyphen punctuation before the item itself, or
  within the item itself a superscript "o" or "a", or a superscripted final
  abbreviation, such as "e", "er" in French, "st", "nd", "rd" in English...)

2015-11-26 13:29 GMT+01:00 Philippe Verdy :

> 2015-11-26 12:38 GMT+01:00 Asmus Freytag (t) :
>
>> On 11/26/2015 3:08 AM, Philippe Verdy wrote:
>>
>> The related definition for extended grapheme clusters says:
>>
>> ( CRLF
>> | Prepend* ( RI-sequence | Hangul-Syllable | !Control )
>>   ( Grapheme_Extend | SpacingMark )*
>> | . )
>>
>> However, I do not understand why it may include only one Hangul-Syllable
>> when applying prepended concatenation marks. And if the definition
>> excludes whitespace, nothing prevents it from extending to arbitrary
>> sequences of letters, digits, symbols or punctuation (this could span
>> very long sequences of sinograms, or other letters from scripts that do
>> not use whitespace as a word separator). Even in the Latin script it
>> would extend to the punctuation signs that may follow any word, or to an
>> entire mathematical formula such as "1+2*3" but not "sin x"...
>>
>>
>> White space is clearly NOT part of a grapheme cluster, so I don't see
>> what your issue is?
>>
>
> No, whitespace is a grapheme cluster on its own, matching (.)
>
> The issue is the overlong extended grapheme cluster after any Prepend,
> which occurs because of ( Grapheme_Extend | SpacingMark )*
> But ( RI-sequence | Hangul-Syllable | !Control ) is bounded (if we ignore
> the rare RI-sequences, which are still short) and will not match
> the sequences of digits or letters intended by the prepended concatenation
> marks, but only one.
>
>
>> BTW, if after careful analysis you think there is a mistake, you should
>> probably raise a bug on this.
>>
>
> For now the proposal only speaks about listing the prepended characters
> by enumeration with a newly defined property; it still does not address
> which sequences of graphemes they apply over. As these sequences
> are specific to each prepended character, I don't see how the new property
> will help if we need to specialize each one of these characters: we still
> need a custom algorithm (possibly tailored by locale) for breaking clusters
> using them.
>
> With the definition given above, the extended grapheme clusters will break
> after each letter/digit/punctuation and
>
> [...] will still break into [...] separated from [...]
> The proposed new property does not change this: how can we really extend
> the sequence of digits so that the number sign will span all of them? Use
> CGJ or explicit sequence delimiters?
>
>

From plug.gulp at gmail.com Fri Nov 27 13:55:55 2015
From: plug.gulp at gmail.com (Plug Gulp)
Date: Fri, 27 Nov 2015 19:55:55 +0000
Subject: ZWJ, ZWNJ and Markup languages.
Message-ID:

Hi,

The Unicode standard 8.0 states in chapter 23, section titled "Cursive
Connection and Ligatures" (printed page #814, PDF page #850) that:

"The zero width joiner and non-joiner characters are designed for use
in plain text; they should not be used where higher-level ligation and
cursive control is available. (See Unicode Technical Report #20,
“Unicode in XML and Other Markup Languages,”
for more information.)"

I went through TR#20 and did not find any mention that ZWJ and ZWNJ
are not suitable for use with markup languages. On the contrary, ZWJ
and ZWNJ are listed in TR#20 under section 4 titled "Format Characters
Suitable for Use with Markup".

So are ZWJ and ZWNJ characters suitable for use with markup languages
such as HTML and XML?

Thanks and kind regards,

~Plug

From duerst at it.aoyama.ac.jp Fri Nov 27 19:42:15 2015
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Sat, 28 Nov 2015 10:42:15 +0900
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To:
References:
Message-ID: <56590677.1020204@it.aoyama.ac.jp>

On 2015/11/28 04:55, Plug Gulp wrote:

> The Unicode standard 8.0 states in chapter 23, section titled "Cursive
> Connection and Ligatures" (printed page #814, PDF page #850) that:
>
> "The zero width joiner and non-joiner characters are designed for use
> in plain text; they should not be used where higher-level ligation and
> cursive control is available. (See Unicode Technical Report #20,
> “Unicode in XML and Other Markup Languages,” for more information.)"
>
> I went through TR#20 and did not find any mention that ZWJ and ZWNJ
> are not suitable for use with markup languages. On the contrary, ZWJ
> and ZWNJ are listed in TR#20 under section 4 titled "Format Characters
> Suitable for Use with Markup".
>
> So are ZWJ and ZWNJ characters suitable for use with markup languages
> such as HTML and XML?

They are indeed suitable for use with markup languages. They are so
suitable that they are already provided as entities in RFC 2070, which
is now historic, and from there on through HTML 4.0 and onwards. Please
see http://tools.ietf.org/html/rfc2070#section-4.2.

I'm not sure why Unicode 8.0 has the text it has; at the least, this
should be toned down somewhat to say "they may be replaced by
higher-level ligation and cursive control mechanisms if available".
Thanks for finding this!

The main reason for this is that these characters apply at a single
point; creating markup such as [...] and [...] would not give any
advantages over &zwj;/&zwnj;.

Markup is at its best when it can be applied to nested spans of text. It
is not inconceivable that something like [...] ... [...] could
occasionally be useful, but I have difficulty imagining a use case off
the top of my head.

I'll file a bug report with the content of this email.

Regards, Martin.

From asmus-inc at ix.netcom.com Fri Nov 27 20:49:31 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 27 Nov 2015 18:49:31 -0800
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To:
References:
Message-ID: <5659163B.1080708@ix.netcom.com>

An HTML attachment was scrubbed...

From asmus-inc at ix.netcom.com Fri Nov 27 22:14:40 2015
From: asmus-inc at ix.netcom.com (Asmus Freytag (t))
Date: Fri, 27 Nov 2015 20:14:40 -0800
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <56590677.1020204@it.aoyama.ac.jp>
References: <56590677.1020204@it.aoyama.ac.jp>
Message-ID: <56592A30.4070606@ix.netcom.com>

An HTML attachment was scrubbed...

From plug.gulp at gmail.com Sun Nov 29 20:58:18 2015
From: plug.gulp at gmail.com (Plug Gulp)
Date: Mon, 30 Nov 2015 02:58:18 +0000
Subject: ZWJ, ZWNJ and Markup languages.
In-Reply-To: <56590677.1020204@it.aoyama.ac.jp>
References: <56590677.1020204@it.aoyama.ac.jp>
Message-ID:

On Sat, Nov 28, 2015 at 1:42 AM, Martin J. Dürst wrote:
>
> They are indeed suitable for use with markup languages. They are so suitable
> that they are already provided as entities in RFC 2070, which is now
> historic, and from there on through HTML 4.0 and onwards. Please see
> http://tools.ietf.org/html/rfc2070#section-4.2.
>

Thank you Martin for the information! Yes, I now see that it is indeed
specified in the HTML spec here:

http://www.w3.org/TR/html4/sgml/entities.html#h-24.4

Thanks once again for the help!

Kind regards,

~Plug

> I'm not sure why Unicode 8.0 has the text it has; at the least, this should
> be toned down somewhat to say "they may be replaced by higher-level ligation
> and cursive control mechanisms if available".
> Thanks for finding this!
>
> The main reason for this is that these characters apply at a single point;
> creating markup such as [...] and [...] would not give any advantages
> over &zwj;/&zwnj;.
>
> Markup is at its best when it can be applied to nested spans of text. It is
> not inconceivable that something like [...] ... [...]
> could occasionally be useful, but I have
> difficulty imagining a use case off the top of my head.
>
> I'll file a bug report with the content of this email.
>
> Regards, Martin.
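
The point made in the thread, that ZWJ and ZWNJ are available in HTML as named entities, can be checked with Python's standard `html` module. This is a small sketch rather than anything proposed by the posters; the helper names `to_markup` and `from_markup` and the sample word are made up for illustration.

```python
import html
from html.entities import name2codepoint

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

# Code points behind the two HTML 4 entity names mentioned in the thread.
ENTITY_CODEPOINTS = {name: name2codepoint[name] for name in ("zwnj", "zwj")}


def to_markup(text):
    """Escape text for HTML and spell the joiners as named entities,
    so that they stay visible in the markup source."""
    escaped = html.escape(text, quote=False)
    return escaped.replace(ZWNJ, "&zwnj;").replace(ZWJ, "&zwj;")


def from_markup(markup):
    """Decode character references back to plain text."""
    return html.unescape(markup)
```

For example, `to_markup("auf" + ZWNJ + "lage")` yields `auf&zwnj;lage`, and `from_markup` restores the original string, so the joiner survives a round trip through markup as a single-point character reference.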