From doug at ewellic.org Fri Jun 3 16:40:35 2016
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 03 Jun 2016 14:40:35 -0700
Subject: Encoding the Mayan Script:
Message-ID: <20160603144035.665a7a7059d7ee80bb4d670165c8327d.79252965a8.wbe@email03.godaddy.com>

http://blog.unicode.org/2016/06/encoding-mayan-script-your-adopt.html

This is great news. Congratulations to both UTC and the sponsors for helping to fund this worthwhile encoding effort.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From steffen at sdaoden.eu Sat Jun 4 06:08:31 2016
From: steffen at sdaoden.eu (Steffen Nurpmeso)
Date: Sat, 04 Jun 2016 13:08:31 +0200
Subject: Encoding the Mayan Script:
In-Reply-To: <20160603144035.665a7a7059d7ee80bb4d670165c8327d.79252965a8.wbe@email03.godaddy.com>
References: <20160603144035.665a7a7059d7ee80bb4d670165c8327d.79252965a8.wbe@email03.godaddy.com>
Message-ID: <20160604110831.g5-Yi-glJ%steffen@sdaoden.eu>

 |http://blog.unicode.org/2016/06/encoding-mayan-script-your-adopt.html
 |
 |This is great news. Congratulations to both UTC and the sponsors for
 |helping to fund this worthwhile encoding effort.

I concur with all my heart!

Are usch? ocher z?ch war?l K'itsch?' ub'?'.
Good luck!

Are uz?choschik wa'e: k'? kaz'ininoq, k'? katschamamoq, kaz'inonik, k'? kasilanik, k'? kalolonik, kat?lona putsch upa k?ch.
May the force be with you!

--steffen

From mathias at qiwi.be Mon Jun 6 02:58:37 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Mon, 6 Jun 2016 10:58:37 +0300
Subject: UAX44: loose matching of symbolic values and the `is` prefix
Message-ID:

http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:

> For loose matching of symbolic values, an initial prefix string "is" is ignored. […] Ignoring any initial "is" on a symbolic value during loose matching is likely to produce the best results in application areas such as regex.
> Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.

UAX44 provides the reason for the existence of this "feature":

> The reason for this is that APIs returning property values are often named using the convention of prefixing "is" (or "Is" or "Is_", and so forth) to a property value.

That seems like a rather weak argument. Specifically applying this to UTS18 (Unicode regular expressions):

> "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"

If there is already a way to match all symbols in the Greek script (not counting the use of aliases and other loose matching requirements), i.e. `Script=Greek`, what good does it do to add support for yet another one?

Looking at implementations in the wild, Steven Levithan found (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062) that some regex flavors use `Is` for scripts, some for blocks, some for scripts and blocks, some for neither. Since some script and block names collide, this causes problems, especially when porting regexes across flavors.

The `is` prefix doesn't provide any functionality that would otherwise be unavailable. It doesn't add any value, yet causes incompatibility, author confusion, and it increases implementation complexity. UAX 44 includes two entire paragraphs pointing out that last part:

> Removal of an initial "is" string for a loose matching comparison only needs to be done once for a symbolic value, and need not be tested recursively. There are no property aliases or property value aliases of the form "isisisisistooconvoluted" defined just to test implementation edge cases.
>
> Existing and future property aliases and property value aliases are guaranteed to be unique within their relevant namespaces, even if an initial prefix string "is" is ignored. The existing cases of note for aliases that do start with "is" are: dt=Iso (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value alias does not cause any problem, because there is no contrasting value alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is the entire property value alias, and is not a prefix. There is no null value for the Line_Break property for it to contrast with, but implementations of loose matching should be careful of this edge case, so that "lb=IS" is not misinterpreted as matching a null value.

Backwards compatibility seems to be the only good reason to continue supporting the `is` prefix *for existing implementations*, such as the one in Perl. But why is it still a requirement for new engines to support it as part of UAX44-LM3?

I'd like to propose changing UAX44-LM3 to make supporting the `is` prefix optional for new implementations.

From sisrivas at blueyonder.co.uk Mon Jun 6 04:11:15 2016
From: sisrivas at blueyonder.co.uk (srivas sinnathurai)
Date: Mon, 6 Jun 2016 10:11:15 +0100 (BST)
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To:
References:
Message-ID: <398429781.2108169.1465204275981.JavaMail.open-xchange@oxbe16.tb.ukmail.iss.as9143.net>

Thanks Ashley.

> On 06 June 2016 at 08:58 Mathias Bynens wrote:
>
>
> http://unicode.org/reports/tr44/#UAX44-LM3 mentions the `is` prefix:
>
> > For loose matching of symbolic values, an initial prefix string "is" is
> > ignored. […] Ignoring any initial "is" on a symbolic value during loose
> > matching is likely to produce the best results in application areas such
> > as regex. Removal of an initial "is" string for a loose matching
> > comparison only needs to be done once for a symbolic value, and need not
> > be tested recursively.
> > There are no property aliases or property value
> > aliases of the form "isisisisistooconvoluted" defined just to test
> > implementation edge cases.
>
> UAX44 provides the reason for the existence of this "feature":
>
> > The reason for this is that APIs returning property values are often
> > named using the convention of prefixing "is" (or "Is" or "Is_", and so
> > forth) to a property value.
>
> That seems like a rather weak argument. Specifically applying this to
> UTS18 (Unicode regular expressions):
>
> > "Script=Greek" is equivalent to "Script=isGreek" or "Script=Is_Greek"
>
> If there is already a way to match all symbols in the Greek script (not
> counting the use of aliases and other loose matching requirements), i.e.
> `Script=Greek`, what good does it do to add support for yet another one?
>
> Looking at implementations in the wild, Steven Levithan found
> (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
> that some regex flavors use `Is` for scripts, some for blocks, some for
> scripts and blocks, some for neither. Since some script and block names
> collide, this causes problems, especially when porting regexes across flavors.
>
> The `is` prefix doesn't provide any functionality that would otherwise be
> unavailable. It doesn't add any value, yet causes incompatibility, author
> confusion, and it increases implementation complexity. UAX 44 includes two
> entire paragraphs pointing out that last part:
>
> > Removal of an initial "is" string for a loose matching comparison only
> > needs to be done once for a symbolic value, and need not be tested
> > recursively. There are no property aliases or property value aliases of
> > the form "isisisisistooconvoluted" defined just to test implementation
> > edge cases.
> >
> > Existing and future property aliases and property value aliases are
> > guaranteed to be unique within their relevant namespaces, even if an
> > initial prefix string "is" is ignored.
> > The existing cases of note for
> > aliases that do start with "is" are: dt=Iso
> > (Decomposition_Type=Isolated) and lb=IS. The Decomposition_Type value
> > alias does not cause any problem, because there is no contrasting value
> > alias dt=o (Decomposition_Type=olated). For lb=IS, note that the "IS" is
> > the entire property value alias, and is not a prefix. There is no null
> > value for the Line_Break property for it to contrast with, but
> > implementations of loose matching should be careful of this edge case,
> > so that "lb=IS" is not misinterpreted as matching a null value.
>
>
> Backwards compatibility seems to be the only good reason to continue
> supporting the `is` prefix *for existing implementations*, such as the one in
> Perl. But why is it still a requirement for new engines to support it as part
> of UAX44-LM3?
>
> I'd like to propose changing UAX44-LM3 to make supporting the `is` prefix
> optional for new implementations.

From kenwhistler at att.net Mon Jun 6 10:04:36 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Mon, 6 Jun 2016 08:04:36 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To:
References:
Message-ID:

On 6/6/2016 12:58 AM, Mathias Bynens wrote:
> Backwards compatibility seems to be the only good reason to continue supporting the `is` prefix *for existing implementations*, such as the one in Perl. But why is it still a requirement for new engines to support it as part of UAX44-LM3?
>
> I'd like to propose changing UAX44-LM3 to make supporting the `is` prefix optional for new implementations.
>

I think the target of concern here is wrong. UAX #44 doesn't *require* any regex engine to include this "is prefix" handling.
What UAX #44 does is recommend that all property and property value aliases be correctly recognized, and then specifies a clear statement (in UAX44-LM3) of the loose matching rule for recognizing the various forms of those aliases that could be considered equivalent. I don't think messing with that rule statement (which has been in place since 2010) would be helpful.

The target instead should be in UTS #18, which happily, has a proposed update available for comment right now:

http://www.unicode.org/review/pri325/

The relevant point is:

http://www.unicode.org/reports/tr18/tr18-18.html#RL1.2

That is the conformance part that requires that conformant Unicode regex implementations "must follow the Matching rules from [UAX44]". If you are seeking indulgences for new engine implementations, that seems like the correct point to be adding clarifications and exceptions. Note that the following text in that section already includes wording about exceptions and compatibility issues. There is also a following section specifically about regex for the Script and Script Extensions properties that seems like it would be the appropriate place to talk about the Greek/IsGreek issue as pertains to regex support.

I would suggest you make specific suggestions about the text of UTS #18 as part of the ongoing public review for the proposed update of that specification.

--Ken

From mathias at qiwi.be Mon Jun 6 10:25:12 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Mon, 6 Jun 2016 18:25:12 +0300
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To:
References:
Message-ID: <9A8F2EF5-5716-442A-8DB9-A50E6C69E98D@qiwi.be>

> On 6 Jun 2016, at 18:04, Ken Whistler wrote:
>
> UAX #44 doesn't *require* any regex engine to include this "is prefix" handling.

Are you referring to the fact that the first paragraph on http://unicode.org/reports/tr44/#Matching_Rules uses "strongly recommended" and "should" instead of "required" and "must"?

> What UAX #44 does is recommend that all property and property value aliases be correctly recognized, and then specifies a clear statement (in UAX44-LM3) of the loose matching rule for recognizing the various forms of those aliases that could be considered equivalent. I don't think messing with that rule statement (which has been in place since 2010) would be helpful.

Why not? What I had in mind was adding a small sentence like:

> For compatibility reasons, implementations may optionally support any initial prefix string "is".

This wouldn't be a breaking change in any way, and it would enable new implementations that aim to follow UAX44 to do so without having to support `is`, and it would solve the problem everywhere the matching rules get applied rather than just for regular expressions.

> I think the target of concern here is wrong.

Not sure I agree. It seems to me the `is` prefix is problematic (for the same reasons) wherever it's used, whether that's in regular expressions or not.

> The target instead should be in UTS #18, which happily, has a proposed update available for comment right now:
>
> http://www.unicode.org/review/pri325/
>
> The relevant point is:
>
> http://www.unicode.org/reports/tr18/tr18-18.html#RL1.2
>
> That is the conformance part that requires that conformant Unicode regex implementations "must follow the Matching rules from [UAX44]".

Thanks for the pointer! I will submit my feedback there as well. It seems more awkward / difficult to add an exception there rather than just slightly tweaking the UAX44-LM3 text as suggested above, though.
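
For illustration only, here is a minimal sketch of what the loose matching rule discussed above could look like with the "is" prefix made optional. It assumes the rule is "ignore case, whitespace, underscores, hyphens, and one initial 'is'", as quoted from UAX44-LM3 in this thread. The names `loose_key`, `loose_match` and the `strip_is_prefix` flag are made up for this sketch; they do not come from UAX #44, ICU, Perl or any other implementation.

```python
import re


def loose_key(value: str, strip_is_prefix: bool = True) -> str:
    """Reduce a symbolic value to a comparison key (hypothetical helper).

    Case, whitespace, underscores and hyphens are ignored. If
    strip_is_prefix is true, one initial "is" is removed, but only when
    something is left afterwards, so that lb=IS does not collapse to an
    empty value (the edge case the quoted UAX #44 text warns about).
    """
    key = re.sub(r"[\s_\-]+", "", value).lower()
    if strip_is_prefix and key.startswith("is") and len(key) > 2:
        key = key[2:]  # removed once only, never recursively
    return key


def loose_match(a: str, b: str, strip_is_prefix: bool = True) -> bool:
    """Compare two symbolic values under the sketched loose matching rule."""
    return loose_key(a, strip_is_prefix) == loose_key(b, strip_is_prefix)


if __name__ == "__main__":
    print(loose_match("Greek", "Is_Greek"))                         # prefix ignored
    print(loose_match("Greek", "Is_Greek", strip_is_prefix=False))  # prefix significant
    print(loose_key("IS"))                                          # whole value, kept
```

With `strip_is_prefix=False` the sketch behaves like an engine that opts out of the prefix handling, which is the choice the proposed wording would leave to new implementations.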

From doug at ewellic.org Mon Jun 6 10:32:19 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 06 Jun 2016 08:32:19 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
Message-ID: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>

Mathias Bynens wrote:

> Looking at implementations in the wild, Steven Levithan found
> (https://github.com/mathiasbynens/es-unicode-regexp-proposal/issues/2#issuecomment-143288062)
> that some regex flavors use `Is` for scripts, some for blocks, some
> for scripts and blocks, some for neither. Since some script and block
> names collide, this causes problems, especially when porting regexes
> across flavors.

Are script names and block names expected to share a common namespace? If they don't, then there is no collision.

LM3 says to ignore initial (and non-final) "is" for all property aliases and property value aliases, not just Script and Block values. There will be a lot of "collisions" if you take all of those into consideration.

> The `is` prefix doesn't provide any functionality that would otherwise
> be unavailable. It doesn't add any value, yet causes incompatibility,
> author confusion, and it increases implementation complexity.

I don't see any evidence that it adds no value. Support for existing implementations is value.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From mathias at qiwi.be Mon Jun 6 10:40:45 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Mon, 6 Jun 2016 18:40:45 +0300
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>
References: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>
Message-ID: <65A2B7E5-B4D5-4605-B828-AC41BE7F540B@qiwi.be>

>
>> The `is` prefix doesn't provide any functionality that would otherwise
>> be unavailable.
>> It doesn't add any value, yet causes incompatibility,
>> author confusion, and it increases implementation complexity.
>
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

It adds no value because it doesn't enable any new functionality.

I agree support for existing implementations would have some value, but given that existing implementations disagree on the properties for which they support `is` that is not going to happen anyway. It's impossible to be compatible with all those different implementations at the same time.

From markus.icu at gmail.com Mon Jun 6 11:09:11 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Mon, 6 Jun 2016 09:09:11 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To: <65A2B7E5-B4D5-4605-B828-AC41BE7F540B@qiwi.be>
References: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com> <65A2B7E5-B4D5-4605-B828-AC41BE7F540B@qiwi.be>
Message-ID:

Interesting discussion!

ICU does not support "is" nor "in" prefixes. I wasn't even aware that UAX #44 loose matching prescribes "is".

ICU just implements what Property[Value]Aliases.txt say:

# Loose matching should be applied to all property names and property values, with
# the exception of String Property values. With loose matching of property names and
# values, the case distinctions, whitespace, hyphens, and '_' are ignored.

The prefixes seem gratuitous and confusing. For example, if I read UAX44-LM3 right, it would allow [:isscript=isgreek:].

We do support just [:Greek:] for scripts and [:L:] for general categories.

I would rather not add support for the prefixes in ICU.

markus

From asmusf at ix.netcom.com Mon Jun 6 11:48:27 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Mon, 6 Jun 2016 09:48:27 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To:
References: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com> <65A2B7E5-B4D5-4605-B828-AC41BE7F540B@qiwi.be>
Message-ID: <8bb32d92-dc1e-c54a-8fb0-b25fe40f05fe@ix.netcom.com>

An HTML attachment was scrubbed...

From patch.nova at gmail.com Mon Jun 6 16:39:05 2016
From: patch.nova at gmail.com (Nova Patch)
Date: Mon, 6 Jun 2016 17:39:05 -0400
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>
References: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>
Message-ID:

On Monday, 6 June 2016, Doug Ewell wrote:
>
> Mathias Bynens wrote:
>
> > The `is` prefix doesn't provide any functionality that would otherwise
> > be unavailable. It doesn't add any value, yet causes incompatibility,
> > author confusion, and it increases implementation complexity.
>
> I don't see any evidence that it adds no value. Support for existing
> implementations is value.

Markus has now confirmed that ICU doesn't support this syntax and I can confirm that even Perl, which probably supports the most different ways to write the same regex, doesn't support any form of the `is` prefix for property values when the property name is provided.

$ perl -Mutf8 -E 'say "α" =~ /\p{Script=Greek}/'
1
$ perl -Mutf8 -E 'say "α" =~ /\p{Script=IsGreek}/'
Can't find Unicode property definition "Script=IsGreek" at -e line 1.
$ perl -Mutf8 -E 'say "α" =~ /\p{Script=Is_Greek}/'
Can't find Unicode property definition "Script=Is_Greek" at -e line 1.

Although Perl does optionally support the `is` prefix for property names and standalone property values:

$ perl -Mutf8 -E 'say "α" =~ /\p{IsScript=Greek}/'
1
$ perl -Mutf8 -E 'say "α" =~ /\p{IsGreek}/'
1

However, this syntax is notoriously inconsistent among different regex engines. Perl's specific rules are documented in *perluniprops* (http://perldoc.perl.org/perluniprops.html) as \p{Is_*} (case- and underscore-insensitive) being a synonym for \p{*} which explains the above functionality.

Based on my past research for *Unicode Regular Expression Engines* at IUC38, I suspect that there might not be any regex engine that actually supports syntax like Script=IsGreek as described in UAX44-LM3! If anybody knows otherwise, I'd love to hear about it.

Nova

From oren.watson at gmail.com Mon Jun 6 16:48:40 2016
From: oren.watson at gmail.com (Oren Watson)
Date: Mon, 6 Jun 2016 17:48:40 -0400
Subject: 72 New Emoji Characters
In-Reply-To: <5755CCBA.2000701@unicode.org>
References: <5755CCBA.2000701@unicode.org>
Message-ID:

I see this in the list of new emoji:

GOAL NET - marksmanship, sport shooting, hunting

This is incorrect; a goal net would be for football or hockey, not marksmanship.

On Mon, Jun 6, 2016 at 3:19 PM, wrote:

> The 72 new emoji characters for Unicode 9.0 are now final, and listed in
> Emoji Recently Added. They include 7 faces, 7 people, 7 hand gestures,
> 14 plants/animals, 18 food emoji, 12 sports emoji, and a few others. The
> corresponding documentation in *UTR #51 Unicode Emoji, Version 3.0* has
> also been updated, with additional guidelines for implementers and the
> new versions of the emoji data files. These should appear on smart phones
> and other devices that support emoji once vendors have a chance to update
> them.
>
> Four of the new emoji are added to complete gender pairs. Work has already
> begun on the Version 4.0 of Unicode Emoji, with a focus on further
> enhancing gender representation, and targeted to appear in the near future.
>
> The new emoji characters will soon be available for adoption, helping
> support projects to improve language support.
>
> http://blog.unicode.org/2016/06/72-new-emoji-characters.html
>

From mathias at qiwi.be Mon Jun 6 22:11:49 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Tue, 7 Jun 2016 06:11:49 +0300
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To:
References: <20160606083219.665a7a7059d7ee80bb4d670165c8327d.db3c97e525.wbe@email03.godaddy.com>
Message-ID:

> On 7 Jun 2016, at 00:39, Nova Patch wrote:
>
> […] Based on my past research for Unicode Regular Expression Engines at IUC38, I suspect that there might not be any regex engine that actually supports syntax like Script=IsGreek as described in UAX44-LM3! If anybody knows otherwise, I'd love to hear about it.

This seems like a cut-and-dried case of reality not matching the specification, which is not helpful in any way. The sensible thing to do is to update the specification accordingly, as proposed.

From doug at ewellic.org Tue Jun 7 09:56:46 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 07 Jun 2016 07:56:46 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
Message-ID: <20160607075646.665a7a7059d7ee80bb4d670165c8327d.d8a17725c6.wbe@email03.godaddy.com>

Mathias Bynens replied to Nova Patch:

>> [...] Based on my past research for Unicode Regular Expression
>> Engines at IUC38, I suspect that there might not be any regex engine
>> that actually supports syntax like Script=IsGreek as described in
>> UAX44-LM3! If anybody knows otherwise, I'd love to hear about it.
>
> This seems like a cut-and-dried case of reality not matching the
> specification, which is not helpful in any way.
> The sensible thing to
> do is to update the specification accordingly, as proposed.

Rather than changing the spec based on anecdotal evidence, an even more sensible thing to do would be to make this a Public Review Issue:

"We're considering simplifying this matching rule and need to know if any implementers rely on the part we're planning to delete. Please send feedback by $date."

There must have been some basis for including the "is" case in the first place. It seems irresponsible to assume now that nobody anywhere needs it.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From mathias at qiwi.be Tue Jun 7 14:13:09 2016
From: mathias at qiwi.be (Mathias Bynens)
Date: Tue, 7 Jun 2016 22:13:09 +0300
Subject: UAX44: loose matching of symbolic values and the `is` prefix
In-Reply-To: <20160607075646.665a7a7059d7ee80bb4d670165c8327d.d8a17725c6.wbe@email03.godaddy.com>
References: <20160607075646.665a7a7059d7ee80bb4d670165c8327d.d8a17725c6.wbe@email03.godaddy.com>
Message-ID: <5194BF5D-4EDD-4D02-87AD-308362B8A800@qiwi.be>

> On 7 Jun 2016, at 17:56, Doug Ewell wrote:
>
> Rather than changing the spec based on anecdotal evidence, […]
>
> It seems irresponsible to assume now that nobody anywhere needs
> it.

What assumption are you talking about? Markus and Nova provided actual examples of implementations not following the spec, and so far no one has been able to provide even a single counter-example.

> There must have been some basis for including the "is" case in the first
> place.

Now *that* sounds like an assumption to me.

From doug at ewellic.org Tue Jun 7 14:51:57 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 07 Jun 2016 12:51:57 -0700
Subject: UAX44: loose matching of symbolic values and the `is` prefix
Message-ID: <20160607125156.665a7a7059d7ee80bb4d670165c8327d.b4f2f270ec.wbe@email03.godaddy.com>

Mathias Bynens wrote:

>> Rather than changing the spec based on anecdotal evidence, [...]
>>
>> It seems irresponsible to assume now that nobody anywhere needs
>> it.
>
> What assumption are you talking about? Markus and Nova provided actual
> examples of implementations not following the spec, and so far no one
> has been able to provide even a single counter-example.

I read the synopsis of Nova's IUC38 presentation, and it looks like he did some pretty thorough research into regex engines, so I take back the phrase "based on anecdotal evidence."

Changes to a Unicode specification that would have the effect of removing functionality normally trigger a public review. They help tease out the edge cases better than a mailing list discussion. The UTC has done well to make frequent use of this mechanism when potentially breaking changes are being considered.

>> There must have been some basis for including the "is" case in the
>> first place.
>
> Now *that* sounds like an assumption to me.

Do you suppose they just made it up out of whole cloth?

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From public at khwilliamson.com Tue Jun 7 15:48:10 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 7 Jun 2016 14:48:10 -0600
Subject: Adopting ZWJ
Message-ID: <5757330A.3000109@khwilliamson.com>

I heard that someone was considering adopting ZWJ. They seemed to think that non-printables are not adoptable. But I was unable to find a clear list of criteria. The page that allows one to adopt said that it wasn't available, but that page really doesn't make it clear how one can test for this without actually doing the adoption.
(Since it doesn't actually ask for your credit card number on the initial page, one can back out before the final commitment, but that's not a very friendly interface)

From public at khwilliamson.com Tue Jun 7 15:52:36 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 7 Jun 2016 14:52:36 -0600
Subject: Adopting ZWJ
In-Reply-To: <5757330A.3000109@khwilliamson.com>
References: <5757330A.3000109@khwilliamson.com>
Message-ID: <57573414.30301@khwilliamson.com>

On 06/07/2016 02:48 PM, Karl Williamson wrote:
> I heard that someone was considering adopting ZWJ. They seemed to think
> that non-printables are not adoptable. But I was unable to find a clear
> list of criteria. The page that allows one to adopt said that it wasn't
> available, but that page really doesn't make it clear how one can test
> for this without actually doing the adoption. (Since it doesn't
> actually ask for your credit card number on the initial page, one can
> back out before the final commitment, but that's not a very friendly
> interface)
>

After I wrote that, I found this that I previously overlooked

"You can't sponsor candidate characters (those not yet released in a version of Unicode, such as the Emoji Candidates), nor certain characters such as invisible ones."

But why this rule. Why should someone be forbidden to adopt ZWJ?

From charupdate at orange.fr Tue Jun 7 19:25:13 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 8 Jun 2016 02:25:13 +0200 (CEST)
Subject: Adopting ZWJ
In-Reply-To: <57573414.30301@khwilliamson.com>
References: <5757330A.3000109@khwilliamson.com> <57573414.30301@khwilliamson.com>
Message-ID: <903819309.27994.1465345513630.JavaMail.www@wwinf1c20>

On Tue, 7 Jun 2016 14:52:36 -0600, Karl Williamson wrote:

> On 06/07/2016 02:48 PM, Karl Williamson wrote:
> > I heard that someone was considering adopting ZWJ. They seemed to think
> > that non-printables are not adoptable. But I was unable to find a clear
> > list of criteria.
> > The page that allows one to adopt said that it wasn't
> > available, but that page really doesn't make it clear how one can test
> > for this without actually doing the adoption. (Since it doesn't
> > actually ask for your credit card number on the initial page, one can
> > back out before the final commitment, but that's not a very friendly
> > interface)
> >
>
> After I wrote that, I found this that I previously overlooked
>
> "You can't sponsor candidate characters (those not yet released in a
> version of Unicode, such as the Emoji Candidates), nor certain
> characters such as invisible ones."
>
> But why this rule. Why should someone be forbidden to adopt ZWJ?

Likewise I seriously considered adopting NNBSP, that is very important as a layout control, e.g. in the fr-FR locale, and is almost always stable in the applications, as opposed to NBSP. Indeed neither do I see any reason not to be able to adopt these characters, the less as there *is* a visible representation, displaying their abbreviation in a box.

However I was aware from the beginning that my desire was unconventional. At least it isn't the kind of ideal gift for your niece as referred to on http://www.unicode.org/consortium/adopt-a-character.html

From public at khwilliamson.com Tue Jun 7 21:39:07 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Tue, 7 Jun 2016 20:39:07 -0600
Subject: Adopting ZWJ
In-Reply-To: <903819309.27994.1465345513630.JavaMail.www@wwinf1c20>
References: <5757330A.3000109@khwilliamson.com> <57573414.30301@khwilliamson.com> <903819309.27994.1465345513630.JavaMail.www@wwinf1c20>
Message-ID: <5757854B.6030902@khwilliamson.com>

On 06/07/2016 06:25 PM, Marcel Schneider wrote:
> On Tue, 7 Jun 2016 14:52:36 -0600, Karl Williamson wrote:
>
>> On 06/07/2016 02:48 PM, Karl Williamson wrote:
>>> I heard that someone was considering adopting ZWJ. They seemed to think
>>> that non-printables are not adoptable. But I was unable to find a clear
>>> list of criteria.
>>> The page that allows one to adopt said that it wasn't
>>> available, but that page really doesn't make it clear how one can test
>>> for this without actually doing the adoption. (Since it doesn't
>>> actually ask for your credit card number on the initial page, one can
>>> back out before the final commitment, but that's not a very friendly
>>> interface)
>>>
>>
>> After I wrote that, I found this that I previously overlooked
>>
>> "You can't sponsor candidate characters (those not yet released in a
>> version of Unicode, such as the Emoji Candidates), nor certain
>> characters such as invisible ones."
>>
>> But why this rule. Why should someone be forbidden to adopt ZWJ?
>
> Likewise I seriously considered adopting NNBSP, that is very important
> as a layout control, e.g. in the fr-FR locale, and is almost always stable
> in the applications, as opposed to NBSP. Indeed neither do I see any
> reason not to be able to adopt these characters, the less as there *is*
> a visible representation, displaying their abbreviation in a box.
>
> However I was aware from the beginning that my desire was unconventional.
> At least it isn't the kind of ideal gift for your niece as referred to on
> http://www.unicode.org/consortium/adopt-a-character.html
>

Actually, someone suggested to me, only partially tongue-in-cheek that Unicode pitch to Sesame Street (https://en.wikipedia.org/wiki/Sesame_Street) that they adopt some letters, as the show often (used to anyway) say that this episode is brought to you by the letters Q and x (different letters sponsored different episodes). Or maybe the pitch could be to the uncles and aunts, "Now you can be like Sesame Street, and sponsor a letter."

From charupdate at orange.fr Wed Jun 8 00:03:14 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Wed, 8 Jun 2016 07:03:14 +0200 (CEST)
Subject: Adopting ZWJ
In-Reply-To: <5757854B.6030902@khwilliamson.com>
References: <5757330A.3000109@khwilliamson.com> <57573414.30301@khwilliamson.com> <903819309.27994.1465345513630.JavaMail.www@wwinf1c20> <5757854B.6030902@khwilliamson.com>
Message-ID: <1017162842.394.1465362194446.JavaMail.www@wwinf1j18>

On Tue, 7 Jun 2016 20:39:07 -0600, Karl Williamson wrote:

>On 06/07/2016 06:25 PM, Marcel Schneider wrote:
>> On Tue, 7 Jun 2016 14:52:36 -0600, Karl Williamson wrote:
>>
>>> On 06/07/2016 02:48 PM, Karl Williamson wrote:
>>>> I heard that someone was considering adopting ZWJ. They seemed to think
>>>> that non-printables are not adoptable. But I was unable to find a clear
>>>> list of criteria. The page that allows one to adopt said that it wasn't
>>>> available, but that page really doesn't make it clear how one can test
>>>> for this without actually doing the adoption. (Since it doesn't
>>>> actually ask for your credit card number on the initial page, one can
>>>> back out before the final commitment, but that's not a very friendly
>>>> interface)
>>>>
>>>
>>> After I wrote that, I found this that I previously overlooked
>>>
>>> "You can't sponsor candidate characters (those not yet released in a
>>> version of Unicode, such as the Emoji Candidates), nor certain
>>> characters such as invisible ones."
>>>
>>> But why this rule. Why should someone be forbidden to adopt ZWJ?
>>
>> Likewise I seriously considered adopting NNBSP, that is very important
>> as a layout control, e.g. in the fr-FR locale, and is almost always stable
>> in the applications, as opposed to NBSP. Indeed neither do I see any
>> reason not to be able to adopt these characters, the less as there *is*
>> a visible representation, displaying their abbreviation in a box.
> Actually, someone suggested to me, only partially tongue-in-cheek that
> Unicode pitch to Sesame Street
> (https://en.wikipedia.org/wiki/Sesame_Street) that they adopt some
> letters, as the show often (used to anyway) say that this episode is
> brought to you by the letters Q and x (different letters sponsored
> different episodes). Or maybe the pitch could be to the uncles and
> aunts, "Now you can be like Sesame Street, and sponsor a letter."

Sesame Street adopting all the letters its episodes are brought to you by
would be great, as would every child being given a character as a birthday
gift at least once in their life. I feel that this could be the way Unicode
ultimately becomes part of everybody's real world and gets a place in
people's hearts. I'd like to tell the uncles and aunts not to stick with
ASCII only, hoping that the young will then ask for a keyboard layout that
has all their characters on it, and make it!

From mark at macchiato.com Wed Jun 8 03:03:47 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 8 Jun 2016 10:03:47 +0200
Subject: Adopting ZWJ
In-Reply-To: <57573414.30301@khwilliamson.com>
References: <5757330A.3000109@khwilliamson.com> <57573414.30301@khwilliamson.com>
Message-ID:

We wanted to be a bit conservative regarding those characters, partly
because we are using a payment service that is fussy. We could test it out
again, but our first priority is getting U9.0 out the door!

Mark

On Tue, Jun 7, 2016 at 10:52 PM, Karl Williamson wrote:
> On 06/07/2016 02:48 PM, Karl Williamson wrote:
>
>> I heard that someone was considering adopting ZWJ. They seemed to think
>> that non-printables are not adoptable. But I was unable to find a clear
>> list of criteria.
The page that allows one to adopt said that it wasn't
>> available, but that page really doesn't make it clear how one can test
>> for this without actually doing the adoption. (Since it doesn't
>> actually ask for your credit card number on the initial page, one can
>> back out before the final commitment, but that's not a very friendly
>> interface)
>>
>>
> After I wrote that, I found this that I previously overlooked
>
> "You can't sponsor candidate characters (those not yet released in a
> version of Unicode, such as the Emoji Candidates), nor certain characters
> such as invisible ones."
>
> But why this rule. Why should someone be forbidden to adopt ZWJ?
>
>

From davidj_faulks at yahoo.ca Wed Jun 8 09:26:43 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Wed, 8 Jun 2016 14:26:43 +0000 (UTC)
Subject: No subject
References: <1529931306.384927.1465396003757.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1529931306.384927.1465396003757.JavaMail.yahoo@mail.yahoo.com>

Hello,

Just a question here.

The Zodiac sign Capricorn has an alternate Glyph/Symbol (see below):
http://www.capricornzodiacsign.net/capricornsymbol.htm

It is only vaguely similar to the glyph found in the Unicode charts and on astrological sites, and sometimes astrological software offers a choice between the two.

Since every font I have checked on my computer uses a glyph close to the one in the Unicode charts (if it has Zodiac symbols at all), I am thinking that it might be best to propose this as a separate character.

Is this a good idea?

Also, Zodiac signs right now have Emoji representations. Would I have to submit this as an Emoji rather than a symbol? Would I have to make up a coloured Emoji Glyph?

Thanks for any responses.
David Faulks

From gwalla at gmail.com Wed Jun 8 17:47:23 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 8 Jun 2016 15:47:23 -0700
Subject:
In-Reply-To: <1529931306.384927.1465396003757.JavaMail.yahoo@mail.yahoo.com>
References: <1529931306.384927.1465396003757.JavaMail.yahoo.ref@mail.yahoo.com> <1529931306.384927.1465396003757.JavaMail.yahoo@mail.yahoo.com>
Message-ID:

On Wed, Jun 8, 2016 at 7:26 AM, David Faulks wrote:
> Hello,
>
> Just a question here.
>
> The Zodiac sign Capricorn has an alternate Glyph/Symbol (see below):
> http://www.capricornzodiacsign.net/capricornsymbol.htm
>
> It is only vaguely similar to the glyph found in the Unicode charts and astrological sites, and sometimes astrological software offers a choice between the two.
>
> Since every font I have checked on my computer, uses a glyph close to the Unicode charts (if they have Zodiac symbols at all), I am thinking that it might be best to propose this as a separate character.
>
> Is this a good idea?

Is it ever used alongside the more common symbol, with some semantic distinction, or is it more of a stylistic choice?

> Also, Zodiac signs right now have Emoji representations. Would I have to submit this as an Emoji rather than a symbol? Would I have to make up a coloured Emoji Glyph?

I think the emoji representations of the standard zodiac symbols exist because a Japanese cell phone provider put zodiac symbols in their Shift-JIS emoji sets (since those symbols are not otherwise part of the Shift-JIS standard).

From gwalla at gmail.com Wed Jun 8 21:22:11 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 8 Jun 2016 19:22:11 -0700
Subject:
In-Reply-To: <1529931306.384927.1465396003757.JavaMail.yahoo@mail.yahoo.com>
References: <1529931306.384927.1465396003757.JavaMail.yahoo.ref@mail.yahoo.com> <1529931306.384927.1465396003757.JavaMail.yahoo@mail.yahoo.com>
Message-ID:

On Wed, Jun 8, 2016 at 7:26 AM, David Faulks wrote:
> Hello,
>
> Just a question here.
>
> The Zodiac sign Capricorn has an alternate Glyph/Symbol (see below):
> http://www.capricornzodiacsign.net/capricornsymbol.htm
>
> It is only vaguely similar to the glyph found in the Unicode charts and astrological sites, and sometimes astrological software offers a choice between the two.
>
> Since every font I have checked on my computer, uses a glyph close to the Unicode charts (if they have Zodiac symbols at all), I am thinking that it might be best to propose this as a separate character.
>
> Is this a good idea?

I just saw this alternate glyph pop up on a webpage, not as an image, so I checked through the fonts on my system. It's apparently used for U+2651 by the GNU FreeFont family, GNU Unifont, and Chrysanthi Unicode. Chrysanthi does some odd things in the Miscellaneous Symbols range but the others are pretty normal. It may just be a version of the standard symbol with the loop enlarged and the left-hand side reduced to a small wave or hook.

From davidj_faulks at yahoo.ca Thu Jun 9 08:34:01 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Thu, 9 Jun 2016 13:34:01 +0000 (UTC)
Subject: Capricorn
References: <1681983953.125182.1465479241953.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1681983953.125182.1465479241953.JavaMail.yahoo@mail.yahoo.com>

> On Wed, 6/8/16, Garth Wallace wrote:
>> On Wed, Jun 8, 2016 at 7:26 AM, David Faulks
>> wrote:
(cut text)
>> The Zodiac sign Capricorn has an alternate
>> Glyph/Symbol (see below):
(cut text)
>> Since every font I have checked on my computer,
>> uses a glyph close to the Unicode charts (if they
>> have Zodiac symbols at all), I am thinking that it might
>> be best to propose this as a separate character.
>>
>> Is this a good idea?

> I just saw this alternate glyph pop up on a webpage, not
> as an image, so I checked through the fonts on my
> system. It's apparently used for U+2651 by the GNU
> FreeFont family, GNU Unifont, and Chrysanthi Unicode.
> Chrysanthi does some odd things in the Miscellaneous
> Symbols range but the others are pretty normal. It may
> just be a version of the standard symbol with the loop
> enlarged and the left-hand side reduced to a small wave
> or hook.

Thank you for this information. Capricorn is pretty common, and while I feel that completely different symbols for the same thing should still be different characters, I was uncertain in this case. I might try to propose it as a standard variation, though.

David Faulks

From lang.support at gmail.com Thu Jun 9 20:39:47 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 10 Jun 2016 11:39:47 +1000
Subject: Mende Kikakui Number 10
Message-ID:

Hi,

Currently I am doing some work on the Mende Kikakui script, and I was wondering what the best way is to represent the number 10.

In the early proposals for the script there was a glyph and codepoint specifically for the number 10. When the model for Mende Kikakui numbers was changed before the code block was finalised, the number ten was removed. Using the existing digits and combining numbers we can produce 1-9 and 11 onwards, but we cannot produce the number 10.

The number ten uses the same glyph as syllable PU U+1E88E.

Should I use U+1E88E to represent both the number 10 and the syllable PU?

Andrew

--
Andrew Cunningham
lang.support at gmail.com

From lang.support at gmail.com Fri Jun 10 01:15:10 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 10 Jun 2016 16:15:10 +1000
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

OK, looking at the issue again, I guess the other alternative is to have a discontiguous set of numbers: represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU glyph.

And hope that font developers don't create a glyph based on the shapes of U+1E8C7 and U+1E8D1, but on PU instead.

Andrew

--
Andrew Cunningham
lang.support at gmail.com

From verdy_p at wanadoo.fr Fri Jun 10 01:52:59 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 10 Jun 2016 08:52:59 +0200
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

Given that there's no digit for zero, you need to append combining characters to digits 1-9 in order to multiply them by a base 10/100/1,000/10,000/100,000/1,000,000. The system is then additive. I don't know how zero is represented. Note that for base 10, when the first digit is 1 (i.e. for numbers 11-19), the combining character is not 1E8D1 (TENS) but 1E8D0 (TEENS), i.e. the slash-like glyph. But the description says that TEENS is only for numbers 11-19, not for number 10.

But I agree that there should be a reference in http://www.unicode.org/charts/PDF/U1E800.pdf, to the description in http://www.unicode.org/versions/Unicode8.0.0/ch19.pdf (section 19.8, pages 722-723) that would explain how to render 10 (add some rows in table 19-6 for the numbers 10/100/.../1,000,000).

This leaves a hole in the description. I'm not sure that the glyph for PU is exactly the glyph for 10.
Or what is the appropriate sequence: ONE+TENS (1E8C7,1E8D1) or ONE+TEENS (1E8C7,1E8D0)? The description is ambiguous, and probably both sequences should produce the equivalent glyph. However, the letter PU (when meaning number 10) looks more like the glyph produced by ONE+TENS (1E8C7,1E8D1).

Then how to represent zero? Probably by a syllable or word meaning "none" (I don't know which it is), or by using European or Arabic digits (as indicated in Chapter 19).

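The scheme described above (digits 1-9 followed by a combining multiplier, with TEENS reserved for 11-19) can be sketched in Python. This is only an illustration under stated assumptions: it uses the code points quoted in this thread (ONE = U+1E8C7, TEENS = U+1E8D0, TENS = U+1E8D1), assumes the digits TWO to NINE follow ONE contiguously, and treats 10 as ONE+TENS, which is the tentative reading discussed here and not something the standard states. The helper names `digit`, `encode` and `code_points` are made up for the example, and the sketch stops at 99.

```python
# Code points quoted in the thread.
DIGIT_ONE = 0x1E8C7              # digit ONE
COMBINING_TEENS = "\U0001E8D0"   # combining TEENS (11-19)
COMBINING_TENS = "\U0001E8D1"    # combining TENS (20-99, and tentatively 10)


def digit(n: int) -> str:
    """Return the digit character for 1..9.

    Assumption: TWO..NINE follow ONE contiguously in the block.
    """
    if not 1 <= n <= 9:
        raise ValueError("there is no zero digit; expected 1..9")
    return chr(DIGIT_ONE + n - 1)


def encode(n: int) -> str:
    """Encode 1..99 as a sequence of digits and combining numbers."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1..99")
    tens, units = divmod(n, 10)
    if tens == 0:
        return digit(units)
    if tens == 1 and units:
        # 11..19: the units digit carries TEENS (ONE+TEENS is 11).
        return digit(units) + COMBINING_TEENS
    # 10 and 20..99: tens digit + TENS, then an optional units digit.
    # Assumption: 10 is ONE+TENS, the thread's tentative reading only.
    return digit(tens) + COMBINING_TENS + (digit(units) if units else "")


def code_points(s: str) -> str:
    """Format a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


for n in (1, 10, 11, 19, 20, 21):
    print(n, code_points(encode(n)))
```

Under these assumptions 10 comes out as U+1E8C7 U+1E8D1 and 11 as U+1E8C7 U+1E8D0, so the two candidate sequences stay distinct in plain text and any shared glyph with PU would be a font matter.
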
From lang.support at gmail.com Fri Jun 10 02:00:30 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 10 Jun 2016 17:00:30 +1000
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

Hi Philippe,

ONE+TEENS (1E8C7,1E8D0) is definitely the number 11.

A.

From verdy_p at wanadoo.fr Fri Jun 10 03:54:07 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 10 Jun 2016 10:54:07 +0200
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

I do not contest that about number 11, and it was not the question!

The question was about number **10**:
* ONE+TENS or ONE+TEENS?

This is NOT specified clearly in TUS Chapter 19, which speaks about numbers 1-9, then 11-19 for TEENS, and TENS for numbers 20-99.

The question is the same for 110, 210, ..., 910:
* (ONE..NINE)+HUNDREDS+ONE+TENS or (ONE..NINE)+HUNDREDS+ONE+TEENS?

To me it seems that the answer to both questions will be ONE+TENS, not ONE+TEENS.

From lang.support at gmail.com Fri Jun 10 04:32:42 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 10 Jun 2016 19:32:42 +1000
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

I'd agree that it is likely ONE+TENS.

Looking at the original proposal and the articles on the number system, it was originally

1-9, 10, 11-19, 20-99, etc.

but became

1-9, 11-19, 20-99, etc.

during the deliberations on the model the numbers would follow.

A.

At least that's how I reconstruct it from the public documents I have seen.

--
Andrew Cunningham
lang.support at gmail.com

From lang.support at gmail.com Fri Jun 10 04:55:58 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 10 Jun 2016 19:55:58 +1000
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID:

The original proposals included a specific number 10 codepoint. I assume it was removed and that its representation was to be generated by use of the combining characters.

In the original proposal there was nothing corresponding to ONE+TENS; instead there was a distinct number TEN. The glyph for number 10 was identical to the glyph for syllable PU.

A.

--
Andrew Cunningham
lang.support at gmail.com

From frederic.grosshans at gmail.com Fri Jun 10 10:16:51 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Fri, 10 Jun 2016 17:16:51 +0200
Subject: Mende Kikakui Number 10
In-Reply-To:
References:
Message-ID: <575AD9E3.8060301@gmail.com>

If you look at the documents archived for 2012 (http://www.unicode.org/L2/L2013/13001-register-2012.htm), you will find, beyond the Mende proposal (http://www.unicode.org/L2/L2012/12023-n4167-mende.pdf), several documents by Deborah Anderson focused on the problem of the encoding model for Mende numbers.
(http://www.unicode.org/L2/L2012/12049-mende-model.pdf , http://www.unicode.org/L2/L2012/12265-mende-numbers.pdf ). They all discuss the problem posed by the representation of 10 in a model using combining characters, and the ambiguity of its representation.

Then there is a document (http://www.unicode.org/L2/L2012/12335-n4375-mende-adhoc.pdf) on the ad hoc meeting that decided the (different) encoding model which has been kept for Unicode. But neither this document nor the Unicode Standard explicitly says how to represent 10, or says that 10 has an inherent dot. The document explicitly says that "precomposed glyphs in smart fonts will give the best representation", so my reading is almost the same as yours:

On 10/06/2016 08:15, Andrew Cunningham wrote:
> Represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU
> glyph.

except that the vertical line of PU goes beyond its "bowl", which is not the case for the glyph for 10, which should look like the glyph for TENS with a dot above.

>
> And hope that font developers don't create a glyph based on shape of
> U+1E8C7 and U+1E8D1, but PU instead.

Once someone who was present at the ad hoc Mende meeting (some of them read this list) confirms (or corrects) this interpretation, I guess it will be time to add some clarification to the standard.

Frédéric

From verdy_p at wanadoo.fr Fri Jun 10 11:05:33 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 10 Jun 2016 18:05:33 +0200
Subject: Mende Kikakui Number 10
In-Reply-To: <575AD9E3.8060301@gmail.com>
References: <575AD9E3.8060301@gmail.com>
Message-ID:

I can reread the doc several times (I did not read it precisely before), and in fact Chapter 19 is not clear at all.

OK, ONE+TEENS represents 11, but ONE+TENS does not clearly represent 10, and the proposals do not exhibit 10 with the same glyph as PU (even if it is based on it; in fact the combining TENS is a small subscript glyph variant of letter/syllable PU intended to mark digits).

Using letter PU would discard the initial digit 1 and the subscript variant, making it confusable with a real letter/syllable PU.

The initial proposal for number 10 was a PU with a dot, i.e. instead of small-subscripting the PU glyph for TENS, the PU glyph is still used, but it is the initial digit one (normally a vertical stroke) which is superscripted as a smaller tick (and in my opinion this tick should join with the letter PU, just like the other digits+TENS are displayed by attaching the TENS subscript to the standard digit).

I've made some other searches, and digits+TENS are also rendered by combining two glyphs of equal vertical size stacked on top of each other (so the base digit becomes a superscript variant and the TENS is also a subscript, except that in this mode everything remains above the baseline, with no need for descenders); numbers are then rendered completely with sequences of combined digits all having the same vertical height, like other letters/syllables.

So I don't think that using letter PU can correctly represent the number 10. ONE+TENS is the way to go (it is then followed by the digit+TEENS sequences for 11 to 19, then by the digit+TENS sequences for 20, 21, and so on).

Now for fonts, the sequences with TENS and TEENS both require changing the shape and reducing the vertical size of the initial base digit. There's no complex change of shape for the combining mark itself: it stacks vertically, normally below the reduced initial digit. There's no case where both combining marks would be used together for some special meaning, and no evidence that these marks can be repeated: there can be only one combining TENS or one combining TEENS. Other diacritics, however, may be used if needed for additional notations outside the number itself (such as arrows or enclosing marks), and would be encoded after the TENS or TEENS.

But encoding a standalone digit 10 would have been better (and probably extending it to standalone versions for 11 and 12, for use with month numbers and hours on clocks, just like with Roman numerals).

It would be interesting to look at how traditional solar clocks or traditional calendars, or even "modern" mechanical clocks with displays in Kikakui Mende, show these common numbers 10, 11, 12 (maybe there are photos or facsimiles of artworks, or "real life" photos kept in some museum or library, or videos showing some religious celebrations or social events where these digits would have been displayed or taught).

2016-06-10 17:16 GMT+02:00 Frédéric Grosshans :

> If you look at the documents archived for 2012 (http://www.unicode.org/L2/L2013/13001-register-2012.htm), you will find, beyond the Mende proposal (http://www.unicode.org/L2/L2012/12023-n4167-mende.pdf), several documents by Deborah Anderson focused on the problem of the encoding model for Mende numbers (http://www.unicode.org/L2/L2012/12049-mende-model.pdf, http://www.unicode.org/L2/L2012/12265-mende-numbers.pdf). They all discuss the problem posed by the representation of 10 in a model using combining characters, and the ambiguity of its representation.
>
> Then there is a document (http://www.unicode.org/L2/L2012/12335-n4375-mende-adhoc.pdf) on the ad hoc meeting that decided the (different) encoding model which has been kept for Unicode. But neither this document nor the Unicode Standard explicitly says how to represent 10, or says that 10 has an inherent dot. The document explicitly says that “precomposed glyphs in smart fonts will give the best representation”, so my reading is almost the same as yours:
>
> On 10/06/2016 08:15, Andrew Cunningham wrote:
>
>> Represent 10 as U+1E8C7 U+1E8D1 and map it within the font to the PU glyph.
>
> except that the vertical line of PU goes beyond its “bowl”, which is not the case for the glyph for 10, which should look like the glyph for TENS, with a dot above.
>
>> And hope that font developers don't create a glyph based on shape of U+1E8C7 and U+1E8D1, but PU instead.
>
> Once someone present in the ad-hoc Mende meeting (some read this list) confirms (or corrects) this interpretation, I guess it will be time to add some clarification in the standard.
>
> Frédéric

From frederic.grosshans at gmail.com Fri Jun 10 11:43:21 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Fri, 10 Jun 2016 18:43:21 +0200
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com>
Message-ID: <575AEE29.90308@gmail.com>

On 10/06/2016 18:05, Philippe Verdy wrote:
>
> OK, represents 11, but is not clearly represents 10, and the proposals do not exhibit 10 with the same glyph as PU (even if it is based on it; in fact the combining TENS is a small subscript glyph variant of letter/syllable PU intended to mark digits).
>
> Using letter PU would discard the initial digit 1 and the subscript variant, making it confusable with a real letter/syllable PU.
>
> The initial proposal for letter 10 was a PU with a dot, i.e. instead of small-subscripting the PU glyph for TENS, the PU glyph is still used, but it is the initial digit one (normally a vertical stroke) which is superscripted as a smaller tick (and in my opinion this tick should join with the letter PU, just like the other digits+TENS are displayed by attaching the TENS subscript to the standard digit).
>

Reading the proposal again, there is a mention that the glyph for 10 (puu) may be related to the one for PU (see page 3). They look really similar and both have the same dot above, but the difference is the extent of the vertical line on the right side. The normal way to write 10 does NOT include a digit 1 (see the discussion at the end of p. 4, where it is explicitly stated), hence the confusion about the proper encoding of number 10.

[...]
>
> But encoding a standalone digit 10 would have been better

It has certainly been considered, and one can guess from the ad-hoc document that many solutions were evaluated and defended during this meeting, and that the final decision was a practical compromise. The problem with the standalone number 10 is that native users of the script see it as the same symbol as the TENS number, with an inherent dot which disappears when combined with something else.

> (and probably extending it to standalone versions for 11 and 12, for usage with month numbers and hours on clocks, just like with Roman digits).

No! Roman numerals were included for compatibility with East Asian standards. They are compatibility characters.

> It would be interesting to look at how traditional solar clocks or traditional calendars, or even "modern" mechanical clocks with displays in Kikakui Mende, show these common numbers 10, 11, 12 (maybe there are photos or facsimiles of artworks, or "real life" photos kept in some museum or library, or videos showing some religious celebrations or social events where these digits would have been displayed or taught).
>

It may be interesting, but no standardisation happens because we speculate that “maybe there are photos” showing these characters, which are presumably encodable as sequences!
Frédéric

From verdy_p at wanadoo.fr Fri Jun 10 13:51:02 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 10 Jun 2016 20:51:02 +0200
Subject: Mende Kikakui Number 10
In-Reply-To: <575AEE29.90308@gmail.com>
References: <575AD9E3.8060301@gmail.com> <575AEE29.90308@gmail.com>
Message-ID:

2016-06-10 18:43 GMT+02:00 Frédéric Grosshans :

> On 10/06/2016 18:05, Philippe Verdy wrote:
>
>> OK, represents 11, but is not clearly represents 10, and the proposals do not exhibit 10 with the same glyph as PU (even if it is based on it; in fact the combining TENS is a small subscript glyph variant of letter/syllable PU intended to mark digits).
>>
>> Using letter PU would discard the initial digit 1 and the subscript variant, making it confusable with a real letter/syllable PU.
>>
>> The initial proposal for letter 10 was a PU with a dot, i.e. instead of small-subscripting the PU glyph for TENS, the PU glyph is still used, but it is the initial digit one (normally a vertical stroke) which is superscripted as a smaller tick (and in my opinion this tick should join with the letter PU, just like the other digits+TENS are displayed by attaching the TENS subscript to the standard digit).
>
> Reading the proposal again, there is a mention that the glyph for 10 (puu) may be related to the one for PU (see page 3). They look really similar and both have the same dot above, but the difference is the extent of the vertical line on the right side. The normal way to write 10 does NOT include a digit 1 (see the discussion at the end of p. 4, where it is explicitly stated), hence the confusion about the proper encoding of number 10.
>
> [...]
>
>> But encoding a standalone digit 10 would have been better
>
> It has certainly been considered, and one can guess from the ad-hoc document that many solutions were evaluated and defended during this meeting, and that the final decision was a practical compromise.
> The problem with the standalone number 10 is that native users of the script see it as the same symbol as the TENS number, with an inherent dot which disappears when combined with something else.

So the error was to encode the TENS as a combining character instead of a standalone one, which would have just created a ligature when it follows a digit from 2 to 9.

If TENS had been a normal digit (non-combining), the sequence from 1 to 10 would be uninterrupted and each number encoded with just one character. Between 11 and 19, the TEENS would still be needed as a combining character after a digit 1-9 (it does not exist standalone, except after a SPACE or NBSP or a DOTTED CIRCLE to show it as a spacing character, like we do for usual diacritics).

Then for 20, 30, ... 90, we would have just encoded TWO..NINE, TEN (as a contextual ligature). No ligature would be needed for 10 (encoded only as the single character TEN).

But now that TEN is a combining character, we need to use NBSP,TEN... or ONE,TEN!

I'm not convinced, given what you say, that this insertion of ONE is conceptually correct if the language is perceived as not differentiating the character when it is used in isolation for 10 or in combination with TWO for 20.

Having to insert a ONE before TEN looks like a Unicode-specific quirk not matching the logical perception of the script.

From lang.support at gmail.com Fri Jun 10 16:51:27 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 11 Jun 2016 07:51:27 +1000
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com>
Message-ID:

Hi Philippe,

On Saturday, 11 June 2016, Philippe Verdy wrote:
> OK, represents 11, but is not clearly represents 10, and the proposals do not exhibit 10 with the same glyph as PU (even if it is based on it; in fact the combining TENS is a small subscript glyph variant of letter/syllable PU intended to mark digits).
>

Mende Kikakui script displays a high degree of glyph variation. Some variations are minor, some more substantive. The syllable PU can be found as it is in the charts; it can also be found looking like the number 10. Other variations are also observed.

The ideal situation would have been to encode the number 10. But in its absence, I guess ONE+TENS may be the approach, even though it seems less than ideal.

A.

--
Andrew Cunningham
lang.support at gmail.com

From doug at ewellic.org Fri Jun 10 16:59:45 2016
From: doug at ewellic.org (Doug Ewell)
Date: Fri, 10 Jun 2016 14:59:45 -0700
Subject: Mende Kikakui Number 10
Message-ID: <20160610145945.665a7a7059d7ee80bb4d670165c8327d.382bc4359d.wbe@email03.godaddy.com>

How does one represent the values 100 and 1000 in Mende Kikakui? Is it not with ONE + HUNDREDS and ONE + THOUSANDS respectively?

If so, then how is encoding 10 as ONE + TENS any different? Am I missing something?

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸

From kenwhistler at att.net Fri Jun 10 17:20:24 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 10 Jun 2016 15:20:24 -0700
Subject: Mende Kikakui Number 10
In-Reply-To: <20160610145945.665a7a7059d7ee80bb4d670165c8327d.382bc4359d.wbe@email03.godaddy.com>
References: <20160610145945.665a7a7059d7ee80bb4d670165c8327d.382bc4359d.wbe@email03.godaddy.com>
Message-ID:

On 6/10/2016 2:59 PM, Doug Ewell wrote:
> How does one represent the values 100 and 1000 in Mende Kikakui? Is it not with ONE + HUNDREDS and ONE + THOUSANDS respectively?
>
> If so, then how is encoding 10 as ONE + TENS any different? Am I missing something?
>

Nope, you got it right: 10 = <1E8C7, 1E8D1>. Put a ligature glyph for 10 (with the dot) in your Mende Kikakui font. Problem solved.
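
As a minimal sketch of the sequence given above (Python is used here only for illustration; the variable names are hypothetical, and the comments use the informal names ONE and TENS from this thread), 10 is stored as two code points, and showing them as one glyph is left to the font:

```python
import unicodedata

# Code points as given in this thread: 10 = <1E8C7, 1E8D1>.
ONE = "\U0001E8C7"   # digit ONE
TENS = "\U0001E8D1"  # combining number TENS

ten = ONE + TENS
code_points = [f"U+{ord(c):04X}" for c in ten]

# Two code points in the text; the single ligature glyph is the font's job.
print(len(ten), code_points)

# Properties as reported by the local Python's Unicode database. The result
# depends on the Unicode version it ships, so no expected output is claimed.
for c in ten:
    print(f"U+{ord(c):04X}", unicodedata.name(c, "<unknown>"),
          unicodedata.category(c), unicodedata.combining(c))
```

The last loop is also a quick way to check the general category and combining class that come up later in the thread, rather than guessing them.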
--Ken

From everson at evertype.com Fri Jun 10 17:23:20 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 10 Jun 2016 23:23:20 +0100
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com>
Message-ID: <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com>

We encoded MYANMAR LETTER WA and MYANMAR DIGIT ZERO separately because the latter is used in decimal arithmetic, which is essential and well supported by computers.

Mende Kikakui has no ZERO. This is a fault, and they would do well to devise one. An oval with a line through it like ? would do. But they don’t have this.

We have in the proposal an image of a tax document of some sort. This has not been transliterated and translated. It may or may not contain the number “10”.

MENDE KIKAKUI SYLLABLE PU is the appropriate character to use for a non-decimal 10. The dot or not-dot or the length of the bar is not relevant; I understand that both occur for both entities. Do we have other LETTER characters which are disunified from NUMBER (as opposed to DIGIT) characters? If so, then consistency might be a reason to disunify them.

From everson at evertype.com Fri Jun 10 17:25:46 2016
From: everson at evertype.com (Michael Everson)
Date: Fri, 10 Jun 2016 23:25:46 +0100
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <20160610145945.665a7a7059d7ee80bb4d670165c8327d.382bc4359d.wbe@email03.godaddy.com>
Message-ID: <23FCE93C-0B10-4469-9F80-A8C15121F1A9@evertype.com>

On 10 Jun 2016, at 23:20, Ken Whistler wrote:
>
> On 6/10/2016 2:59 PM, Doug Ewell wrote:
>> How does one represent the values 100 and 1000 in Mende Kikakui? Is it not with ONE + HUNDREDS and ONE + THOUSANDS respectively?
>>
>> If so, then how is encoding 10 as ONE + TENS any different? Am I missing something?
>
> Nope, you got it right: 10 = <1E8C7, 1E8D1>. Put a ligature glyph for 10 (with the dot) in your Mende Kikakui font. Problem solved.

If that’s better than just using SYLLABLE PU, OK, but please document that in the standard. It is a little non-intuitive. Well, at least a little non-obvious. :-)

Michael

From kenwhistler at att.net Fri Jun 10 17:34:15 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 10 Jun 2016 15:34:15 -0700
Subject: Mende Kikakui Number 10
In-Reply-To: <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com>
Message-ID: <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>

On 6/10/2016 3:23 PM, Michael Everson wrote:
> Mende Kikakui has no ZERO. This is a fault, and they would do well to devise one. An oval with a line through it like ? would do. But they don’t have this.

I concur with that. If the users of this system decide that they want to have a decimal radix system instead of the system documented with the combining marks for decimal ranks, then adding a zero at 1E8C6 would be feasible. That's why we left a gap at that point in the chart.

>
> MENDE KIKAKUI SYLLABLE PU is the appropriate character to use for a non-decimal 10. The dot or not-dot or the length of the bar is not relevant; I understand that both occur for both entities. Do we have other LETTER characters which are disunified from NUMBER (as opposed to DIGIT) characters? If so, then consistency might be a reason to disunify them.
>

I disagree about that. There is no reason to depart from the logic of the system for this one value. Add one ligature glyph to your font for the sequence for 10, and you're done.
--Ken

From lang.support at gmail.com Fri Jun 10 19:34:16 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 11 Jun 2016 10:34:16 +1000
Subject: Mende Kikakui Number 10
In-Reply-To: <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
Message-ID:

On Saturday, 11 June 2016, Ken Whistler wrote:
>
> I disagree about that. There is no reason to depart from the logic of the system for this one value. Add one ligature glyph to your font for the sequence for 10, and you're done.
>

There is the logic of how Kikakui numbers are encoded in Unicode and there is the internal logic of the numeral system itself. They are not necessarily the same.

There are too few descriptions of the system for me to be definitive .... but the number ten seems to hold a unique position within the numeral system.

A.

--
Andrew Cunningham
lang.support at gmail.com

From everson at evertype.com Fri Jun 10 19:47:11 2016
From: everson at evertype.com (Michael Everson)
Date: Sat, 11 Jun 2016 01:47:11 +0100
Subject: Mende Kikakui Number 10
In-Reply-To: <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
Message-ID:

On 10 Jun 2016, at 23:34, Ken Whistler wrote:

> On 6/10/2016 3:23 PM, Michael Everson wrote:
>> Mende Kikakui has no ZERO. This is a fault, and they would do well to devise one. An oval with a line through it like ? would do. But they don’t have this.
>
> I concur with that. If the users of this system decide that they want to have a decimal radix system instead of the system documented with the combining marks for decimal ranks, then adding a zero at 1E8C6 would be feasible. That's why we left a gap at that point in the chart.

Indeed!

>> MENDE KIKAKUI SYLLABLE PU is the appropriate character to use for a non-decimal 10. The dot or not-dot or the length of the bar is not relevant; I understand that both occur for both entities. Do we have other LETTER characters which are disunified from NUMBER (as opposed to DIGIT) characters? If so, then consistency might be a reason to disunify them.
>
> I disagree about that. There is no reason to depart from the logic of the system for this one value. Add one ligature glyph to your font for the sequence for 10, and you're done.

You’re right about that. I hadn’t considered the ligature being structurally appropriate for this usage. (It would have been more obvious if Andrew had given the character names alongside the code positions; I hadn’t looked it up yet.)

Michael

From kenwhistler at att.net Fri Jun 10 19:50:19 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 10 Jun 2016 17:50:19 -0700
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
Message-ID: <02b4e913-0253-b4d0-8eb8-3c0a520abf93@att.net>

On 6/10/2016 5:34 PM, Andrew Cunningham wrote:
> There are too few descriptions of the system for me to be definitive .... but the number ten seems to hold a unique position within the numeral system.

As does the number 10 in every decimal numeral system. ;-)

But that doesn't automatically require that it be *encoded* with a single character. After all, the number 10 in the European decimal numeral system is also represented with a character sequence: <0031, 0030>.
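
To make the comparison concrete, here is a rough sketch of how 1 to 99 would be spelled as character sequences under the reading discussed in this thread: digit + TENS for 10, 20, ... 90, digit + TEENS for 11 to 19, and a trailing units digit otherwise. The function names are hypothetical. The contiguous digit range U+1E8C7..U+1E8CF and the position of TEENS at U+1E8D0 are assumptions that are not stated in the thread and should be checked against UnicodeData.txt; only <1E8C7, 1E8D1> for 10 is confirmed above.

```python
DIGIT_ONE = 0x1E8C7   # ONE, per the thread; TWO..NINE assumed to follow contiguously
TEENS = "\U0001E8D0"  # assumed position of combining TEENS (not given in the thread)
TENS = "\U0001E8D1"   # combining TENS, per the thread


def digit(n: int) -> str:
    """Return the digit character for 1..9 (there is no zero)."""
    if not 1 <= n <= 9:
        raise ValueError("digits run from 1 to 9")
    return chr(DIGIT_ONE + n - 1)


def encode(n: int) -> str:
    """Spell 1..99 as a digit/combining-mark sequence (illustrative only)."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1..99")
    if n < 10:
        return digit(n)
    if n == 10:
        return digit(1) + TENS        # 10 = ONE + TENS
    if n < 20:
        return digit(n - 10) + TEENS  # 11 = ONE + TEENS, ... 19 = NINE + TEENS
    tens, units = divmod(n, 10)
    text = digit(tens) + TENS         # 20 = TWO + TENS, ...
    if units:
        text += digit(units)          # 21 = TWO + TENS, ONE (as read here)
    return text


for value in (1, 10, 11, 20, 21):
    print(value, " ".join(f"U+{ord(c):04X}" for c in encode(value)))
```

Under these assumptions 10 needs no special case in the data model beyond the leading ONE; the special treatment is confined to the glyph the font supplies for that sequence.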
--Ken

From lang.support at gmail.com Fri Jun 10 20:47:39 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 11 Jun 2016 11:47:39 +1000
Subject: Mende Kikakui Number 10
In-Reply-To: <02b4e913-0253-b4d0-8eb8-3c0a520abf93@att.net>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <02b4e913-0253-b4d0-8eb8-3c0a520abf93@att.net>
Message-ID:

I am not suggesting it needs to be encoded. And I did suggest that using the digit one and the symbol for tens was an option.

It can be done via a ligature. It would have to be a required ligature, since other ligature types may or may not be enabled in various contexts. And we don't want default substitution and mark positioning to generate a non-ligature equivalent.

And it will be interesting to see which rendering engines handle Kikakui.

A.

On Saturday, 11 June 2016, Ken Whistler wrote:
>
> On 6/10/2016 5:34 PM, Andrew Cunningham wrote:
>>
>> There are too few descriptions of the system for me to be definitive .... but the number ten seems to hold a unique position within the numeral system.
>
> As does the number 10 in every decimal numeral system. ;-)
>
> But that doesn't automatically require that it be *encoded* with a single character. After all, the number 10 in the European decimal numeral system is also represented with a character sequence: <0031, 0030>.
>
> --Ken
>

--
Andrew Cunningham
lang.support at gmail.com

From everson at evertype.com Fri Jun 10 21:29:51 2016
From: everson at evertype.com (Michael Everson)
Date: Sat, 11 Jun 2016 03:29:51 +0100
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <02b4e913-0253-b4d0-8eb8-3c0a520abf93@att.net>
Message-ID: <61B082D3-ECFE-4162-B90F-FB65EDAC5E5B@evertype.com>

On 11 Jun 2016, at 02:47, Andrew Cunningham wrote:

> It can be done via a ligature. It would have to be a required ligature, since other ligature types may or may not be enabled in various contexts. And we don't want default substitution and mark positioning to generate a non-ligature equivalent.

Aren’t all of the number combinations required ligatures?

Michael

From lang.support at gmail.com Fri Jun 10 22:25:49 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sat, 11 Jun 2016 13:25:49 +1000
Subject: Mende Kikakui Number 10
In-Reply-To: <61B082D3-ECFE-4162-B90F-FB65EDAC5E5B@evertype.com>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <02b4e913-0253-b4d0-8eb8-3c0a520abf93@att.net> <61B082D3-ECFE-4162-B90F-FB65EDAC5E5B@evertype.com>
Message-ID:

rlig is the quickest and easiest approach. But in theory it could be done in other, more complicated ways.

There are currently no OpenType implementations that I know of, and no known shapers. rlig hopefully works with general shapers. But what OT features will be expected by a script-specific shaper is still unknown.

On Saturday, 11 June 2016, Michael Everson wrote:
> On 11 Jun 2016, at 02:47, Andrew Cunningham wrote:
>
>> It can be done via a ligature. It would have to be a required ligature, since other ligature types may or may not be enabled in various contexts. And we don't want default substitution and mark positioning to generate a non-ligature equivalent.
>
> Aren’t all of the number combinations required ligatures?
>
> Michael
>

--
Andrew Cunningham
lang.support at gmail.com

From asmusf at ix.netcom.com Sat Jun 11 02:08:00 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sat, 11 Jun 2016 00:08:00 -0700
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net>
Message-ID: <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com>

An HTML attachment was scrubbed...

From verdy_p at wanadoo.fr Sat Jun 11 05:22:08 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 11 Jun 2016 12:22:08 +0200
Subject: Mende Kikakui Number 10
In-Reply-To: <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com>
Message-ID:

Exactly, Unicode should not create its own logic about scripts or numeral systems.

It all looks like the encoding of 10 as a pair (ONE + combining TENS) was a severe conceptual error that could have been avoided by NOT encoding "TENS" as combining but as a regular number/digit TEN usable in isolation, and forming a contextual ligature with a preceding digit from TWO to NINE.

The encoding of 10 as (ONE+TENS) superfluously needs an artificial leading ONE. This is purely a Unicode construction, foreign to the logic of the numeral system.

2016-06-11 9:08 GMT+02:00 Asmus Freytag (c) :

> On 6/10/2016 5:34 PM, Andrew Cunningham wrote:
>
> There is the logic of how Kikakui numbers are encoded in Unicode and there is the internal logic of the numeral system itself. They are not necessarily the same.
>
> This statement should be framed!
>
> A./
>

From verdy_p at wanadoo.fr Sat Jun 11 05:25:39 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sat, 11 Jun 2016 12:25:39 +0200
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com>
Message-ID:

Note that this is most probably also true for the encoding of 100 as ONE+HUNDREDS, when HUNDREDS should be a regular number usable in isolation without the leading ONE.

The same goes for THOUSANDS and similar, all encoded as combining characters; the name itself should not have taken the plural.

I just hope they have combining class 0. Then the error is the assigned general category C*, which should have been N*. Can we fix that so that isolated uses of TENS or HUNDREDS or others in the series will NOT require any artificial leading digit ONE?

2016-06-11 12:22 GMT+02:00 Philippe Verdy :

> Exactly, Unicode should not create its own logic about scripts or numeral systems.
>
> It all looks like the encoding of 10 as a pair (ONE + combining TENS) was a severe conceptual error that could have been avoided by NOT encoding "TENS" as combining but as a regular number/digit TEN usable in isolation, and forming a contextual ligature with a preceding digit from TWO to NINE.
>
> The encoding of 10 as (ONE+TENS) superfluously needs an artificial leading ONE. This is purely a Unicode construction, foreign to the logic of the numeral system.
>
>
> 2016-06-11 9:08 GMT+02:00 Asmus Freytag (c) :
>
>> On 6/10/2016 5:34 PM, Andrew Cunningham wrote:
>>
>> There is the logic of how Kikakui numbers are encoded in Unicode and there is the internal logic of the numeral system itself. They are not necessarily the same.
>>
>> This statement should be framed!
>>
>> A./
>>
>

From charupdate at orange.fr Sat Jun 11 17:12:57 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 12 Jun 2016 00:12:57 +0200 (CEST)
Subject: Mende Kikakui Number 10
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com>
Message-ID: <1551343798.17257.1465683177287.JavaMail.www@wwinf1p21>

On Sat, 11 Jun 2016 12:25:39 +0200, Philippe Verdy wrote:
>
> Exactly, Unicode should not create its own logic about scripts or numeral systems.
>
> It all looks like the encoding of 10 as a pair (ONE + combining TENS) was a severe conceptual error that could have been avoided by NOT encoding "TENS" as combining but as a regular number/digit TEN usable in isolation, and forming a contextual ligature with a preceding digit from TWO to NINE.
>
> The encoding of 10 as (ONE+TENS) superfluously needs an artificial leading ONE. This is purely a Unicode construction, foreign to the logic of the numeral system.
>

Seeing the discussion exhausted, I join my hope to Philippe Verdy’s, and reinforce it by quoting Asmus Freytag on backcompat vs enhancement, before bringing up another concern:

“If you add a feature to match behavior somewhere else, it rarely pays to make that perform "better", because it just means it's now different and no longer matches. The exception is a feature for which you can establish unambiguously that there is a metric of correctness or a widely (universally?) shared expectation by users as to the ideal behavior. In that case, being compatible with a broken feature (or a random implementation of one) may in fact be counter productive.”
http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0109.html

Being bound by stability guarantees, Unicode could eventually add a _new_

*1E8D7 MENDE KIKAKUI NUMBER TEN

Best wishes,

Marcel

From charupdate at orange.fr Sat Jun 11 17:20:12 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 12 Jun 2016 00:20:12 +0200 (CEST)
Subject: Latin Letters Capital and Small Theta
Message-ID: <235625534.17310.1465683612359.JavaMail.www@wwinf1p21>

People keep running into the recurrent idea that the Greek theta used to write the Rromani language in International Standard orthography, as well as a number of other languages, will be or ought to be encoded as a separate casing pair in Unicode.

LATIN CAPITAL LETTER THETA and LATIN SMALL LETTER THETA were part of Michael Everson’s 2012 proposal at http://www.unicode.org/L2/L2012/12138-n4262-unifon.pdf with the intended code points U+A7B0 and U+A7B1.

While some characters were retained, others were rejected, among which the Latin Theta pair, but no mention of this rejection is found in the Non-Approval Notices.

Two years later this proposal was supported by Denis Moyogo Jacquerye’s additional proposal at http://www.unicode.org/L2/L2014/14202-latin-theta-delta.pdf with a new rationale: the pair is required in the writing systems of several natural languages.

On the sole criterion of glyphic resemblance, there already exist two matching characters in Unicode:

03F4 GREEK CAPITAL THETA SYMBOL
03B8 GREEK SMALL LETTER THETA

Does the UTC consider it feasible to address the issue by implementing a tailored casing pair for the related locales, and adding an annotation somewhere for the information of font designers, or can people expect to see one day a successful proposal for LATIN CAPITAL LETTER THETA and LATIN SMALL LETTER THETA? To date, this is not found in the Pipeline. (Though experience has shown that a given character being rejected in one proposal is without prejudice to its being accepted as part of a later proposal.
That happened to the LATIN CAPITAL LETTER SMALL CAPITAL I, found already in Mr Everson’s 2012 proposal and now added to Unicode in 2016.)

The Greek Theta as an IPA character was incidentally already discussed in the following thread:

Unicode Mail List Archive: gamma as a phonetic symbol. (Sat Sep 27 2008 - 11:43:57 CDT). Retrieved June 10, 2016, from http://www.unicode.org/mail-arch/unicode-ml/y2008-m09/0072.html

According to Mr Everson in this thread, “Theta is perhaps the hardest to argue for” disunification: http://www.unicode.org/mail-arch/unicode-ml/y2008-m09/0076.html

Why that is so is not obvious to me, however, because the capital does not match the glyphic expectations for the Romani International Standard Latin script subset as referred to in https://en.wikipedia.org/wiki/Romani_alphabets#International_Standard and in more detail in https://fr.wikipedia.org/wiki/Th%C3%AAta_latin (available so far in French only, but one might wish to check the picture anyway).

Consequently, AFAIK, to date the Greek Capital Theta Symbol is preferred as the uppercase, not the Greek Capital Theta. Using the Symbol variant causes some oddities in data processing due to the lack of a round-trip casing relationship.

This adds to the overall problem of cross-script usage. Using several scripts to write one language contradicts one of the design principles of Unicode.

I note too that, in its International Standard Alphabet form, Romany is not supported by the blocks up to Latin Extended-A, contrary to what TUS 8.0 states on page 296. This brings up the need to underscore that Unicode added the H with háček (U+021E U+021F) for Finnish Romany in the Latin Extended-B block.

However U+03F4 ( ϴ
) GREEK CAPITAL THETA SYMBOL was among the subset of potentially obsolete characters found in the Archives of this List in the following e-mail: http://www.unicode.org/mail-arch/unicode-ml/y2009-m01/0558.html

Solving this issue now is important in that the French Standard Keyboard Layout will support the Rromani Standard Latin script (along with all European languages written in the Latin script).

This topic being about plain character encoding, I’ve finally decided to submit it for your kind advice.

Marcel

From doug at ewellic.org Sat Jun 11 19:20:12 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 11 Jun 2016 18:20:12 -0600
Subject: Latin Letters Capital and Small Theta
Message-ID: <6206A937BCA14C3C942388D7EFC9407A@DougEwell>

Marcel Schneider wrote:

> While some characters were retained, others were rejected, among which the Latin Theta pair, but no mention is found of this rejection in the Non-Approval Notices.

Lots of characters in proposals are rejected without rising to the level of explicit disapproval: "Look, we said NO, and don't ask us again." The Non-Approval Notices page starts with an extensive description of the difference.

At the same time, note that a few proposals, such as LATIN CAPITAL LETTER SHARP S, have risen phoenix-like from the ranks of non-approvaldom to become genuine encoded characters.

--
Doug Ewell | http://ewellic.org | Thornton, CO 🇺🇸
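
The missing round-trip casing relationship mentioned in Marcel's message can be checked with a few lines of Python. This is only a sketch using the default (untailored) case mappings that Python's built-in string methods apply; the variable names are illustrative:

```python
capital_symbol = "\u03F4"  # GREEK CAPITAL THETA SYMBOL
small = "\u03B8"           # GREEK SMALL LETTER THETA

lowered = capital_symbol.lower()  # default lowercase mapping of U+03F4
raised = small.upper()            # default uppercase mapping of U+03B8

print(f"lower(U+03F4) = U+{ord(lowered):04X}")  # U+03B8
print(f"upper(U+03B8) = U+{ord(raised):04X}")   # U+0398, not U+03F4
print("round trip:", raised == capital_symbol)  # False
```

Lowercasing the symbol gives the small theta, but uppercasing the small theta gives the ordinary capital theta, so text cased down and back up does not return to U+03F4 unless a tailoring is applied.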

From lang.support at gmail.com Sat Jun 11 22:25:17 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Sun, 12 Jun 2016 13:25:17 +1000
Subject: Mende Kikakui Number 10
In-Reply-To: <1551343798.17257.1465683177287.JavaMail.www@wwinf1p21>
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com> <1551343798.17257.1465683177287.JavaMail.www@wwinf1p21>
Message-ID:

Marcel, it isn't so much that the conversation was exhausted, rather that the original question has been sufficiently answered.

A.

On Sunday, 12 June 2016, Marcel Schneider wrote:

> On Sat, 11 Jun 2016 12:25:39 +0200, Philippe Verdy wrote:
>>
>> Exactly, Unicode should not create its own logic about scripts or numeral systems.
>>
>> All looks like the encoding of 10 as a pair (ONE+combining TENS) was a severe
>> conceptual error that could have been avoided by NOT encoding "TENS" as combining
>> but as a regular number/digit TEN usable in isolation, and forming a contextual
>> ligature with a previous digit from TWO to NINE.
>>
>> The encoding of 10 as (ONE+TENS) superfluously needs an artificial leading
>> ONE. This is purely a Unicode construction, foreign to the logic of the numeral
>> system.
>>
>
>
> Seeing the discussion exhausted, I join my hope to Philippe Verdy’s,
> and reinforce by quoting Asmus Freytag on backcompat vs enhancement,
> before bringing another concern:
>
> “If you add a feature to match behavior somewhere else,
> it rarely pays to make that perform "better", because
> it just means it's now different and no longer matches.
> The exception is a feature for which you can establish
> unambiguously that there is a metric of correctness or
> a widely (universally?) shared expectation by users
> as to the ideal behavior.
In that case, being compatible
> with a broken feature (or a random implementation of one)
> may in fact be counter productive.”
>
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m03/0109.html
>
> Being bound with stability guarantees, Unicode could eventually add a _new_
>
> *1E8D7 MENDE KIKAKUI NUMBER TEN
>
> Best wishes,
>
> Marcel
>
>

--
Andrew Cunningham
lang.support at gmail.com

From charupdate at orange.fr Sun Jun 12 05:34:59 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 12 Jun 2016 12:34:59 +0200 (CEST)
Subject: Scheduling Public Reviews (was: Re: Mende Kikakui Number 10)
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com> <1551343798.17257.1465683177287.JavaMail.www@wwinf1p21>
Message-ID: <1746124551.3342.1465727699179.JavaMail.www@wwinf1g04>

On Sun, 12 Jun 2016 13:25:17 +1000, Andrew Cunningham wrote:

> Marcel, it isn't so much that the conversation was exhausted, rather that
> the original question has been sufficiently answered.

I understand the difference now. Anyway, I didn’t consider the issue as settled.

Moreover, the Mende Kikakui number encoding flaw, as pointed out in the original thread, would IMHO have been far, far less likely to occur if the first public review were scheduled at a more useful stage than the beta one. I don’t know if people feel they are being taken seriously when their feedback is solicited while almost all striking parameters are already immutable. And I guess that some of those who are ready to spend a part of their time for Unicode’s and character encoding’s sake wouldn’t necessarily do so unless they are solicited and given the material in a handy format.

Perhaps a Public Alpha Review would prevent things from going wrong?
Beta will be reviewed again of course, but in a less frustrating manner.

Marcel

From charupdate at orange.fr Sun Jun 12 05:41:10 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sun, 12 Jun 2016 12:41:10 +0200 (CEST)
Subject: Latin Letters Capital and Small Theta
In-Reply-To: <6206A937BCA14C3C942388D7EFC9407A@DougEwell>
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell>
Message-ID: <748151443.3396.1465728070818.JavaMail.www@wwinf1g04>

On Sat, 11 Jun 2016 18:20:12 -0600, Doug Ewell wrote:

> Marcel Schneider wrote:
>
>> While some characters were retained, others were rejected, among which
>> the Latin Theta pair, but no mention is found of this rejection in the
>> Non-Approval Notices.
>
> Lots of characters in proposals are rejected without rising to the level
> of explicit disapproval: "Look, we said NO, and don't ask us again." The
> Non-Approval Notices page starts with an extensive description of the
> difference.
>
> At the same time, note that a few proposals, such as LATIN CAPITAL
> LETTER SHARP S, have risen phoenix-like from the ranks of
> non-approvaldom to become genuine encoded characters.

Thank you for these hints, which moreover remind me of the persistent case folding issue around the LATIN CAPITAL LETTER SHARP S, which does not round-trip either and needs to be tailored, because its small letter is to be kept stable. Seen from here, tailoring the Greek small theta to get it the Latin way around becomes quite obvious:

Without tailoring: ẞ → ß → SS    ϴ → θ → Θ
With tailoring:    ẞ → ß → ẞ     ϴ → θ → ϴ

Now I’m much more inclined to believe that theta-using Latin script writers are eventually better served with the UTC’s not retaining a separate Latin theta, because tailoring this custom Greek casing pair presumably makes for a more streamlined implementation and a more user-friendly result, given that the font issue is spared. All the more so as, for units like Ohm and prefixes like micro, Greek letters are preferred too, no matter what script they are used in.
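
[Editorial illustration, not part of the original message.] The casing behaviour described above can be checked against the default Unicode mappings. The following is a minimal sketch using Python's built-in `str.lower()` and `str.upper()`; the helper names `round_trips` and `tailored_upper` and the table `TAILORED_UPPER` are hypothetical and only illustrate what a tailoring might look like, not how any particular library implements one.

```python
# Default (untailored) Unicode case mappings as exposed by Python's str methods.
CAPITAL_SHARP_S = "\u1E9E"       # LATIN CAPITAL LETTER SHARP S
SMALL_SHARP_S = "\u00DF"         # LATIN SMALL LETTER SHARP S
CAPITAL_THETA_SYMBOL = "\u03F4"  # GREEK CAPITAL THETA SYMBOL
SMALL_THETA = "\u03B8"           # GREEK SMALL LETTER THETA
CAPITAL_THETA = "\u0398"         # GREEK CAPITAL LETTER THETA


def round_trips(upper: str) -> bool:
    """Return True if lowercasing and then uppercasing gives back `upper`."""
    return upper.lower().upper() == upper


# Hypothetical tailoring table: map the small letters back to the capitals
# that an orthography using these pairs would expect.
TAILORED_UPPER = {
    SMALL_SHARP_S: CAPITAL_SHARP_S,
    SMALL_THETA: CAPITAL_THETA_SYMBOL,
}


def tailored_upper(text: str) -> str:
    """Uppercase `text`, overriding the default mapping for tailored letters."""
    return "".join(TAILORED_UPPER.get(ch, ch.upper()) for ch in text)


if __name__ == "__main__":
    for ch in (CAPITAL_SHARP_S, CAPITAL_THETA_SYMBOL, CAPITAL_THETA):
        print(f"U+{ord(ch):04X}: default round trip = {round_trips(ch)}, "
              f"tailored = {tailored_upper(ch.lower()) == ch}")
```

Note that a single table like this cannot serve Greek text at the same time, since it would also turn an ordinary Greek small theta into the Symbol capital; that is why such an override would have to be a language-specific tailoring rather than a default.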

[BTW, on Sun, 12 Jun 2016 00:20:12 +0200 (CEST) I wrote:
> According to Mr Everson in this thread, “Theta is perhaps the
> hardest to argue for” disunification:
> http://www.unicode.org/mail-arch/unicode-ml/y2008-m09/0076.html
>
> Why so is, however, non-obvious to me […]

But it is, since the thread was about IPA, so uppercase was not discussed.]

The question still remains whether in practice this working model could be unanimously accepted, given that at least somebody prefers the regular Greek capital, presumably to get around the case folding issue.

Marcel

From daniel.buenzli at erratique.ch Sun Jun 12 08:26:30 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Sun, 12 Jun 2016 14:26:30 +0100
Subject: 9.0.0 segmentation and line breaks on the empty string
Message-ID: <46731C3271204F11AEE86273A1BC9922@erratique.ch>

Hello,

I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on the empty string while UAX14 still does report a hard line break on it. Is this intended? And what is the rationale behind these changes and non-changes?

While I think that the proposed UAX29 is a better one, these kinds of changes on special cases make it easy to break assumptions made by client code, so it would be better if these things do not change too often. Hence my request: shouldn't UAX14 also report no breaks on the empty string?

Best,

Daniel

From asmusf at ix.netcom.com Sun Jun 12 10:49:15 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sun, 12 Jun 2016 08:49:15 -0700
Subject: compatibility features (was: Mende Kikakui Number 10)
In-Reply-To:
References: <575AD9E3.8060301@gmail.com> <31AE4EE7-D2B3-492F-8CD6-E9333BCF1B5B@evertype.com> <9fa22860-7392-b9f5-52ce-7e5f16bd2942@att.net> <0ca74adc-d8b2-908c-c74f-2c10f2b8a354@ix.netcom.com> <1551343798.17257.1465683177287.JavaMail.www@wwinf1p21>
Message-ID: <098f04c6-a647-73e0-800d-74443b8905c5@ix.netcom.com>

On 6/11/2016 8:25 PM, Andrew Cunningham wrote:
> “If you add a [compatibility] feature to match behavior
> > [found] somewhere else [not in the Unicode standard],
> > it rarely pays to make that perform "better", because
> > it just means it's now different and no longer matches
> > [the behavior to which it was supposed to be compatible].
> >
> The exception is a feature for which you can establish
> > unambiguously that there is a metric of correctness or
> > a widely (universally?) shared expectation by users
> > as to the ideal behavior. In that case, being compatible
> > with a broken feature (or a random implementation of one)
> > may in fact be counter productive.”
> >

In the case of Mende Kikakui methods for encoding number 10, I don't see where the "compatibility" with an existing implementation of that number system comes into play.

My statement was a warning to not add features for the sake of "compatibility", but then to break that compatibility by making the feature "better" - i.e. different.

You can have one, but not the other. Either a new (better/correct) feature, or one that is compatible.

A./

From asmusf at ix.netcom.com Sun Jun 12 10:59:30 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sun, 12 Jun 2016 08:59:30 -0700
Subject: Latin Letters Capital and Small Theta
In-Reply-To: <748151443.3396.1465728070818.JavaMail.www@wwinf1g04>
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell> <748151443.3396.1465728070818.JavaMail.www@wwinf1g04>
Message-ID:

An HTML attachment was scrubbed...
URL:

From frederic.grosshans at gmail.com Mon Jun 13 07:41:18 2016
From: frederic.grosshans at gmail.com (Frédéric Grosshans)
Date: Mon, 13 Jun 2016 14:41:18 +0200
Subject: Latin Letters Capital and Small Theta
In-Reply-To: <6206A937BCA14C3C942388D7EFC9407A@DougEwell>
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell>
Message-ID: <575EA9EE.6080809@gmail.com>

On 12/06/2016 02:20, Doug Ewell wrote:
> Marcel Schneider wrote:
>
>> While some characters were retained, others were rejected, among which
>> the Latin Theta pair, but no mention is found of this rejection in the
>> Non-Approval Notices.
>
> Lots of characters in proposals are rejected without rising to the
> level of explicit disapproval: "Look, we said NO, and don't ask us
> again." The Non-Approval Notices page starts with an extensive
> description of the difference.
>
> At the same time, note that a few proposals, such as LATIN CAPITAL
> LETTER SHARP S, have risen phoenix-like from the ranks of
> non-approvaldom to become genuine encoded characters.

And, if I remember correctly, no proposal for the Latin letter theta has yet given examples of the current usage of theta in Latin orthography, like in Rromani (http://www.rromaniconnect.org/Rromanifonts.html, http://romani.humanities.manchester.ac.uk/whatis/status/codification.shtml ). I guess a proposal based on the Rromani orthography (and with input from the user community, of course!) would easily be accepted.

Cheers,

Frédéric

From mark at macchiato.com Mon Jun 13 08:04:10 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Mon, 13 Jun 2016 15:04:10 +0200
Subject: Latin Letters Capital and Small Theta
In-Reply-To:
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell> <748151443.3396.1465728070818.JavaMail.www@wwinf1g04>
Message-ID:

> such as URLs (domain names) there are restrictions that prevent script-mixing in a single label.

That is just a current implementation restriction, based on only using the Script property. Implementations upgraded to use Script_Extensions to test for multiple scripts in a string can handle multiple scripts for a character properly.

Mark

On Sun, Jun 12, 2016 at 5:59 PM, Asmus Freytag (c) wrote:

> Just a note: for any living(!) language, it is important that Unicode not
> mix scripts, but instead *disunify characters based on script*. The
> reason for that is that in important implementations, such as URLs (domain
> names) there are restrictions that prevent script-mixing in a single label.
>
> If an orthography were to use characters from more than one script, it will
> not be supported for things like domain names (at least in certain zones),
> because doing so introduces an element of risk to the domain name system.
>
> This consideration is less relevant to dead languages and notational
> systems, because supporting them in URLs has never been a priority and
> users can live with a "best case" or partial solution. Living languages,
> esp. ones that have (or are expected to achieve) significant literate
> populations, are a different matter.
>
> A./
>

From verdy_p at wanadoo.fr Mon Jun 13 12:52:14 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Mon, 13 Jun 2016 19:52:14 +0200
Subject: Latin Letters Capital and Small Theta
In-Reply-To: <575EA9EE.6080809@gmail.com>
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell> <575EA9EE.6080809@gmail.com>
Message-ID:

This is general. Characters may initially be encoded with a single case where the demonstrated use is only IPA usage (which is single-cased). To get dual-cased letters, we need to find examples of use in the orthography of a language where all other letters are dual-cased. Well, this was true for the German sharp S, but for a long time it was not demonstrated that the lowercase and uppercase were different.

With Rromany (which has multiple orthographies in multiple scripts), the problem is that there's no formal standard and the rromany communities around countries have adapted their orthography with usages found in other national languages. There's no real academy and in fact the language is very fragmented, and its tradition is in fact more oral than written. There are authors of written texts but each one has adopted a convention more or less based on the standard orthography of another language where they live. So there are variants of the orthography in multiple scripts, at least Latin, Cyrillic, Greek, Devanagari (probably also Arabic in North-Eastern India, Pakistan, Iran; maybe also Georgian: the rromany people are spread over a very large area from Southern Asia, Central Asia, Western Asia, to Europe and North Africa).

The orthographies are more or less adaptations of the phonetics of the oral tradition. For those authors that want to better represent the language phonetics it's natural that they'll want to borrow the IPA theta symbol when choosing the Latin script (and in the Greek-based orthography they'll correctly differentiate the Greek Tau and Theta letters for the same purpose).
I wonder which letters they choose to differentiate Tau and Theta in Cyrillic (there's a sizeable rromany community in Bulgaria, Macedonia, Serbia...). But in the Latin script, authors have also long used digraphs (T vs. TH) (just like other European languages, including English or French, even if French does not differentiate the phonetics and the H in TH is in fact completely mute!).

There are actually no stable transliterators because there are competing orthographies depending on authors, and no formal agreements between authors and no academic institution which is widely recognized (there are several local communities that may have authored some writing guides, but I don't think these are strong enough to be authoritative: the tradition is still strongly oral and what is important is not the way the language is written but how it is pronounced and sung: music and songs are an essential part of the rromany culture, and what unites them across countries, even if there are some religious splits).

It's normal for Unicode to accept the existence of Latin orthographies that will use the Theta letter as a normal dual-cased letter if we can demonstrate that authors need it and publications were easily made and relatively easy to find. Those publications are part of our world cultures and need to be preserved and correctly represented, even if we don't have any formal academy. It is even more important than encoding many new emojis for fun (that are recent inventions but don't have the same level of historic background).

Being able to write all languages, even if their historic tradition is oral, is an important and respectable goal, notably when these are living languages with a large speaking community. It's not something new: various native African languages have also adopted IPA symbols in their Latin orthography, and wanted to have dual case. So now we also have dual-cased Latin letters Alpha, Epsilon, Open O...
It does not matter if IPA only needs lowercase, but it has become a strong common base used for orthographies of languages with oral traditions, and it is natural for them to expand the IPA set with capital letters for the Latin script (and another proof that IPA is not a separate script but a subset of the Latin script).

2016-06-13 14:41 GMT+02:00 Frédéric Grosshans :

> On 12/06/2016 02:20, Doug Ewell wrote:
>
>> Marcel Schneider wrote:
>>
>> While some characters were retained, others were rejected, among which
>>> the Latin Theta pair, but no mention is found of this rejection in the
>>> Non-Approval Notices.
>>>
>>
>> Lots of characters in proposals are rejected without rising to the level
>> of explicit disapproval: "Look, we said NO, and don't ask us again." The
>> Non-Approval Notices page starts with an extensive description of the
>> difference.
>>
>> At the same time, note that a few proposals, such as LATIN CAPITAL LETTER
>> SHARP S, have risen phoenix-like from the ranks of non-approvaldom to
>> become genuine encoded characters.
>>
> And, if I remember correctly, no proposal for the Latin letter theta
> has yet given examples of the current usage of theta in Latin orthography,
> like in Rromani (http://www.rromaniconnect.org/Rromanifonts.html,
> http://romani.humanities.manchester.ac.uk/whatis/status/codification.shtml
> ). I guess a proposal based on the Rromani orthography (and with input from
> the user community, of course!) would easily be accepted.
>
> Cheers,
>
> Frédéric
>

From rick at unicode.org Wed Jun 15 21:34:10 2016
From: rick at unicode.org (Rick McGowan)
Date: Wed, 15 Jun 2016 19:34:10 -0700
Subject: Public review of draft repertoire for ISO/IEC 10646
Message-ID: <57621022.1070209@unicode.org>

The UTC would appreciate feedback on new repertoire that is currently under ballot for future additions to ISO/IEC 10646.
This includes repertoire that has already been reviewed and approved by the UTC, but which will not be published until next year, as part of Version 10.0 of the Unicode Standard.

This is your opportunity to review the planned new repertoire for possible problems, and to make any suggestions you might have about improvements for glyphs or character names.

See PRI #327 and PRI #328 for details on access to the draft repertoire documents for review, and for how to provide your feedback. The characters of interest -- the new repertoire under ballot -- are highlighted in yellow in the code charts in those documents. Glyph corrections or improvements in the charts are highlighted in a light blue.

Note that we already know about the mistaken glyph for the new character U+1D378 TALLY MARK FIVE, so you do not need to report that problem again! Note also that a few of the characters for review in PRI #328, including the 72 new emoji characters, have been accelerated for publication in Unicode 9.0. The UTC will not be able to respond to further feedback on those 9.0 characters, which are already frozen for publication.

From gwalla at gmail.com Thu Jun 16 02:09:03 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Thu, 16 Jun 2016 00:09:03 -0700
Subject: Public review of draft repertoire for ISO/IEC 10646
In-Reply-To: <57621022.1070209@unicode.org>
References: <57621022.1070209@unicode.org>
Message-ID:

I'm not sure if it merits formal feedback, but would it be a good idea to cross reference IDEOGRAPHIC TALLY MARK FIVE to CJK UNIFIED IDEOGRAPH-6B63? They are effectively visually identical (in fact I was under the impression they were the same thing).

On Wed, Jun 15, 2016 at 7:34 PM, Rick McGowan wrote:

> The UTC would appreciate feedback on new repertoire that is currently
> under ballot for future additions to ISO/IEC 10646.
This includes
> repertoire that has already been reviewed and approved by the UTC, but
> which will not be published until next year, as part of Version 10.0 of the
> Unicode Standard.
>
> This is your opportunity to review the planned new repertoire for possible
> problems, and to make any suggestions you might have about improvements for
> glyphs or character names.
>
> See PRI #327 and PRI #328
> for details on access to the
> draft repertoire documents for review, and for how to provide your
> feedback. The characters of interest -- the new repertoire under ballot --
> are highlighted in yellow in the code charts in those documents. Glyph
> corrections or improvements in the charts are highlighted in a light blue.
>
> Note that we already know about the mistaken glyph for the new character
> U+1D378 TALLY MARK FIVE, so you do not need to report that problem again!
> Note also that a few of the characters for review in PRI #328, including
> the 72 new emoji characters, have been accelerated for publication in
> Unicode 9.0. The UTC will not be able to respond to further feedback on
> those 9.0 characters, which are already frozen for publication.
>
>

From charupdate at orange.fr Thu Jun 16 05:42:47 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Thu, 16 Jun 2016 12:42:47 +0200 (CEST)
Subject: Latin Letters Capital and Small Theta
In-Reply-To:
References: <6206A937BCA14C3C942388D7EFC9407A@DougEwell> <575EA9EE.6080809@gmail.com>
Message-ID: <2064827394.7993.1466073767346.JavaMail.www@wwinf1j27>

On Sun, 12 Jun 2016 08:59:30 -0700, Asmus Freytag (c) wrote:

> Just a note: for any living(!) language, it is important that Unicode not mix
> scripts, but instead *disunify characters based on script.* The reason for that
> is that in important implementations, such as URLs (domain names) there are
> restrictions that prevent script-mixing in a single label.

On Mon, 13 Jun 2016 14:41:18 +0200, Frédéric Grosshans wrote:

> On 12/06/2016 02:20, Doug Ewell wrote:
>> Marcel Schneider wrote:
>>
>>> While some characters were retained, others were rejected, among which
>>> the Latin Theta pair, but no mention is found of this rejection in the
>>> Non-Approval Notices.
>>
>> Lots of characters in proposals are rejected without rising to the
>> level of explicit disapproval: "Look, we said NO, and don't ask us
>> again." The Non-Approval Notices page starts with an extensive
>> description of the difference.
>>
>> At the same time, note that a few proposals, such as LATIN CAPITAL
>> LETTER SHARP S, have risen phoenix-like from the ranks of
>> non-approvaldom to become genuine encoded characters.
> And, if I remember correctly, no proposal for the Latin letter theta
> has yet given examples of the current usage of theta in Latin
> orthography, like in Rromani
> (http://www.rromaniconnect.org/Rromanifonts.html,
> http://romani.humanities.manchester.ac.uk/whatis/status/codification.shtml
> ). I guess a proposal based on the Rromani orthography (and with input
> from the user community, of course!) would easily be accepted.

On Mon, 13 Jun 2016 19:52:14 +0200, Philippe Verdy wrote:

> With Rromany (which has multiple orthographies in multiple scripts), the
> problem is that there's no formal standard and the rromany communities
> around countries have adapted their orthography with usages found in other
> national languages. There's no real academy and in fact the language is
> very fragmented, and its tradition is in fact more oral than written. There are
> authors of written texts but each one has adopted a convention more or less
> based on the standard orthography of another language where they live. […]
> There are actually no stable transliterators because there are competing
> orthographies depending on authors, and no formal agreements between
> authors and no academic institution which is widely recognized […]
>
> It's normal for Unicode to accept the existence of Latin orthographies that
> will use the Theta letter as a normal dual-cased letter if we can
> demonstrate that authors need it and publications were easily made and
> relatively easy to find. Those publications are part of our world cultures
> and need to be preserved and correctly represented, even if we don't have
> any formal academy. It is even more important than encoding many new emojis
> for fun (that are recent inventions but don't have the same level of
> historic background).
>
> Being able to write all languages, even if their historic tradition is oral,
> is an important and respectable goal, notably when these are living
> languages with a large speaking community. It's not something new: various
> native African languages have also adopted IPA symbols in their Latin
> orthography, and wanted to have dual case. So now we also have dual-cased
> Latin letters Alpha, Epsilon, Open O... It does not matter if IPA only
> needs lowercase, but it has become a strong common base used for
> orthographies of languages with oral traditions, and it is natural for them to
> expand the IPA set with capital letters for the Latin script (and another
> proof that IPA is not a separate script but a subset of the Latin script).

Thanks to all who responded in this thread.

The challenge as I see it now is to spread the word and motivate persons who are in touch with the Rromani Standard Alphabet user communities, and are thus in a position to write up the proposal for Latin Letters Capital and Small Theta.

As for the subsequent font issue, I believe that it will be settled quickly by adding the new code points and duplicating the already existing glyphs of GREEK CAPITAL THETA SYMBOL and GREEK SMALL LETTER THETA.

Hopefully,

Marcel

From daniel.buenzli at erratique.ch Sun Jun 19 08:25:44 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Sun, 19 Jun 2016 14:25:44 +0100
Subject: 9.0.0 segmentation and line breaks on the empty string
In-Reply-To: <46731C3271204F11AEE86273A1BC9922@erratique.ch>
References: <46731C3271204F11AEE86273A1BC9922@erratique.ch>
Message-ID:

On Sunday, 12 June 2016 at 14:26, Daniel Bünzli wrote:
> Hello,
>
> I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on the empty string while UAX14 still does report a hard line break on it. Is this intended? And what is the rationale behind these changes and non-changes?
>
> While I think that the proposed UAX29 is a better one, these kinds of changes on special cases make it easy to break assumptions made by client code, so it would be better if these things do not change too often. Hence my request: shouldn't UAX14 also report no breaks on the empty string?

I realize we are out of the beta review time. But do people think it would be worth raising for 10.0.0?

Best,

Daniel

From public at khwilliamson.com Sun Jun 19 10:57:28 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Sun, 19 Jun 2016 09:57:28 -0600
Subject: 9.0.0 segmentation and line breaks on the empty string
In-Reply-To:
References: <46731C3271204F11AEE86273A1BC9922@erratique.ch>
Message-ID: <5766C0E8.3050700@khwilliamson.com>

On 06/19/2016 07:25 AM, Daniel Bünzli wrote:
> On Sunday, 12 June 2016 at 14:26, Daniel Bünzli wrote:
>> Hello,
>>
>> I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on the empty string while UAX14 still does report a hard line break on it. Is this intended? And what is the rationale behind these changes and non-changes?
>>
>> While I think that the proposed UAX29 is a better one, these kinds of changes on special cases make it easy to break assumptions made by client code, so it would be better if these things do not change too often.
Hence my request: shouldn't UAX14 also report no breaks on the empty string?
> I realize we are out of the beta review time. But do people think it would be worth raising for 10.0.0?
>
> Best,
>
> Daniel
>
>

Yes. Use http://www.unicode.org/reporting.html to make an error report.

I did this last year to report about the empty strings matching, and TR29 got changed for 9.0. (Perhaps others reported it too.) I was aware that the problem was also in TR14, but I don't remember now; I could very well have not included this in my submission. And the Unicode personnel are busy people, and like me, can overlook things, and fail to draw logical inferences that, in retrospect, appear to be obvious.

From daniel.buenzli at erratique.ch Sun Jun 19 11:34:08 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Sun, 19 Jun 2016 17:34:08 +0100
Subject: 9.0.0 segmentation and line breaks on the empty string
In-Reply-To: <5766C0E8.3050700@khwilliamson.com>
References: <46731C3271204F11AEE86273A1BC9922@erratique.ch> <5766C0E8.3050700@khwilliamson.com>
Message-ID:

On Sunday, 19 June 2016 at 16:57, Karl Williamson wrote:
> Yes. Use http://www.unicode.org/reporting.html to make an error report.

Thanks, did that.

Best,

Daniel

From andy.heninger at gmail.com Mon Jun 20 17:32:12 2016
From: andy.heninger at gmail.com (Andy Heninger)
Date: Mon, 20 Jun 2016 15:32:12 -0700
Subject: 9.0.0 segmentation and line breaks on the empty string
In-Reply-To:
References: <46731C3271204F11AEE86273A1BC9922@erratique.ch> <5766C0E8.3050700@khwilliamson.com>
Message-ID:

>
> I notice that in 9.0.0, UAX29 segmentations no longer report boundaries on
> the empty string while UAX14 still does

This is an interesting edge case.

My reading of UAX 14 is that an empty string would not produce a break. Both "sot" and "eot" would be true, so LB2, sot ×, would match and apply, and that would be the end of the story. LB3 would never be applied because LB2 would match first.
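
[Editorial illustration, not part of the original message.] The first-match-wins reading described above can be sketched as follows. This is a toy model in Python that covers only rules LB2 and LB3 of UAX 14; the names `Break` and `edge_rule` are hypothetical, positions are counted in code points as an assumption, and every other line breaking rule is left out.

```python
from enum import Enum


class Break(Enum):
    PROHIBITED = "×"   # no break allowed at this position
    MANDATORY = "!"    # hard break at this position
    UNDECIDED = "?"    # left to the remaining rules (LB4 onward), not modelled


def edge_rule(text: str, pos: int) -> Break:
    """Apply only LB2 and LB3, in that order, to boundary position `pos`.

    Position 0 is the start of text (sot); position len(text) is the end
    of text (eot). In the empty string both coincide at position 0.
    """
    if pos == 0:
        # LB2: sot ×  (never break at the start of text); checked first.
        return Break.PROHIBITED
    if pos == len(text):
        # LB3: ! eot  (always break at the end of text).
        return Break.MANDATORY
    return Break.UNDECIDED


if __name__ == "__main__":
    for sample in ("", "a", "ab"):
        results = [edge_rule(sample, p).value for p in range(len(sample) + 1)]
        print(repr(sample), results)
```

Under this ordering the empty string yields no break at all, because its single boundary position is consumed by LB2 before LB3 is ever consulted; an implementation that tested LB3 first would report a hard break instead.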

As to mandating a hard break at the end of text (LB3), I'm not at all sure this was a good idea. It seems like the breaking behavior would depend on the external context of the text, about which the LB algorithm knows nothing. It's different from having text that ends with a LF or other hard-break character.

But I'm also disinclined to suggest changes in this area; the possibility of breaking applications that have come to expect the existing behavior seems real, and it's all edge cases.

-- Andy

On Sun, Jun 19, 2016 at 9:34 AM, Daniel Bünzli wrote:

> On Sunday, 19 June 2016 at 16:57, Karl Williamson wrote:
> > Yes. Use http://www.unicode.org/reporting.html to make an error report.
>
> Thanks, did that.
>
> Best,
>
> Daniel
>
>
>
>

From daniel.buenzli at erratique.ch Mon Jun 20 17:49:12 2016
From: daniel.buenzli at erratique.ch (Daniel Bünzli)
Date: Mon, 20 Jun 2016 23:49:12 +0100
Subject: 9.0.0 segmentation and line breaks on the empty string
In-Reply-To:
References: <46731C3271204F11AEE86273A1BC9922@erratique.ch> <5766C0E8.3050700@khwilliamson.com>
Message-ID: <85B3F496F90E4D8E91A1F132C29493BA@erratique.ch>

On Monday, 20 June 2016 at 23:32, Andy Heninger wrote:
> My reading of UAX 14 is that an empty string would not produce a break. Both "sot" and "eot" would be true, so LB2, sot ×, would match and apply, and that would be the end of the story.

Uh. I just checked my own implementation and that's actually what happens (I actually even have a test for this…). I guess I read the clarifications of UAX29 and wrongly remembered the rules were the same on the empty string in UAX 14. So maybe take my report as a request for clarification?

Thanks for the answer and sorry for the noise,

Daniel

From doug at ewellic.org Tue Jun 21 09:43:34 2016
From: doug at ewellic.org (Doug Ewell)
Date: Tue, 21 Jun 2016 07:43:34 -0700
Subject: Release date?
Message-ID: <20160621074334.665a7a7059d7ee80bb4d670165c8327d.5132cfa6a7.wbe@email03.godaddy.com>

http://opiniojuris.org/2016/06/20/emojis-and-international-law

"And tomorrow, June 21, we will have 71 new emojis to play with."

Do only bloggers and the press get notified in advance of the release date of Unicode 9.0?

--
Doug Ewell | http://ewellic.org | Thornton, CO

From kenwhistler at att.net Tue Jun 21 10:03:40 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Tue, 21 Jun 2016 08:03:40 -0700
Subject: Release date?
In-Reply-To: <20160621074334.665a7a7059d7ee80bb4d670165c8327d.5132cfa6a7.wbe@email03.godaddy.com>
References: <20160621074334.665a7a7059d7ee80bb4d670165c8327d.5132cfa6a7.wbe@email03.godaddy.com>
Message-ID:

Doug,

On 6/21/2016 7:43 AM, Doug Ewell wrote:
> "And tomorrow, June 21, we will have 71 new emojis to play with."
>
> Do only bloggers and the press get notified in advance of the release
> date of Unicode 9.0?

They are getting it from the same place all of the members and anybody else could have been seeing it, the draft 9.0.0 landing page we've had up for months as part of the beta review:

http://www.unicode.org/versions/Unicode9.0.0/

That used to just say, June 2016, but then we got more explicit when June got closer, and we could plan to an exact date.

The page was also easily accessible until a couple of days ago, when we took down the link to the old 9.0.0 beta review page in preparation for the actual release.

Oh, and for several days now, we've been tweeting that the release is imminent.

In fact... wait a few hours, and it will be here...
;-) --Ken From jaruga at redhat.com Tue Jun 21 08:54:34 2016 From: jaruga at redhat.com (Jun Aruga) Date: Tue, 21 Jun 2016 09:54:34 -0400 (EDT) Subject: The license for Unihan v1.1 In-Reply-To: <975764717.1628778.1466514955212.JavaMail.zimbra@redhat.com> Message-ID: <1169826287.1673685.1466517274485.JavaMail.zimbra@redhat.com> Hello, I would like to ask about the license of the old Unicode mapping data, version 1.1. I am working on the Ruby package, which contains files with Unihan GB12345-90 (GB12345-80) and GB2312-80 mappings from Unicode version 1.1. [1][2] I would like to ask which license should be used for these files. When I checked the latest Unihan.zip [3] of the Unicode mapping and its ReadMe.txt [4], I found that "Unicode Character Database (UCD)" was used. However, when I checked the directory for version 1.1 [1], it seems that the Unihan data has disappeared from the web site directory. [5] It seems that the "Unicode license" had been used for it. So, do you have any idea which license I should use? Thanks. [1] https://github.com/ruby/ruby/tree/ruby_2_3/enc/trans/GB 4 files in this directory. [2] http://unicode.org/reports/tr38/ [3] http://www.unicode.org/Public/9.0.0/ucd/Unihan.zip [4] http://www.unicode.org/Public/9.0.0/ucd/ReadMe.txt > # Unicode Character Database [5] http://www.unicode.org/Public/1.1-Update/ Jun Aruga From doug at ewellic.org Tue Jun 21 10:18:30 2016 From: doug at ewellic.org (Doug Ewell) Date: Tue, 21 Jun 2016 08:18:30 -0700 Subject: Release date? Message-ID: <20160621081830.665a7a7059d7ee80bb4d670165c8327d.b4c0de5e60.wbe@email03.godaddy.com> Ken Whistler wrote: >> Do only bloggers and the press get notified in advance of the release >> date of Unicode 9.0?
> > They are getting it from the same place all of the members and anybody > else could have been seeing it, the draft 9.0.0 landing page we've had > up for months as part of the beta review: > > http://www.unicode.org/versions/Unicode9.0.0/ > > That used to just say, June 2016, but then we got more explicit when > June got closer, and we could plan to an exact date. Sorry, I wasn't aware that page was being updated with new date information. My apologies. -- Doug Ewell | http://ewellic.org | Thornton, CO From public at khwilliamson.com Tue Jun 21 10:14:55 2016 From: public at khwilliamson.com (Karl Williamson) Date: Tue, 21 Jun 2016 09:14:55 -0600 Subject: Release date? In-Reply-To: <20160621074334.665a7a7059d7ee80bb4d670165c8327d.5132cfa6a7.wbe@email03.godaddy.com> References: <20160621074334.665a7a7059d7ee80bb4d670165c8327d.5132cfa6a7.wbe@email03.godaddy.com> Message-ID: <576959EF.1000402@khwilliamson.com> On 06/21/2016 08:43 AM, Doug Ewell wrote: > http://opiniojuris.org/2016/06/20/emojis-and-international-law > > "And tomorrow, June 21, we will have 71 new emojis to play with." > > Do only bloggers and the press get notified in advance of the release > date of Unicode 9.0? http://www.unicode.org/versions/Unicode9.0.0/ From daniel.buenzli at erratique.ch Tue Jun 21 11:02:15 2016 From: daniel.buenzli at erratique.ch (=?utf-8?Q?Daniel_B=C3=BCnzli?=) Date: Tue, 21 Jun 2016 17:02:15 +0100 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments Message-ID: I have a few questions/comments about the new emoji segmentation rules in 9.0.0 1. I have trouble understanding what the ^ symbol means in these rules: http://www.unicode.org/reports/tr29/proposed.html#GB8a http://www.unicode.org/reports/tr29/proposed.html#WB15 does it correspond to the regexp SOL symbol ? If that is the case SOL is a bit ambiguous in that context it could also mean that you need to match start of lines which is a whole different business. Couldn't that simply be replaced by sot ? 2. 
Besides, given that with GB8* rules you need to be able to count an odd number of RI, it seems to me that the sentence "Grapheme cluster boundaries can be easily tested by looking at immediately adjacent characters." is no longer accurate. 3. There are two rules named GB8c. 4. In §1.1 the link to UTS18 is broken (#RegEx does not exist in UAX 41). Best, Daniel From daniel.buenzli at erratique.ch Tue Jun 21 16:19:31 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Tue, 21 Jun 2016 22:19:31 +0100 Subject: UAX29 9.0.0 Grapheme cluster spec & test discrepancy Message-ID: <3E23D9AA9EF7402BA96D1D8D14579CEF@erratique.ch> Hello, It seems there's a discrepancy between the tests and the spec for grapheme clusters. In http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.txt we have: ÷ 261D × 0308 × 1F3FB ÷ # ÷ [0.2] WHITE UP POINTING INDEX (E_Base) # × [9.0] COMBINING DIAERESIS (Extend) # × [10.0] EMOJI MODIFIER FITZPATRICK TYPE-1-2 (E_Modifier) ÷ [0.3] which is http://www.unicode.org/Public/9.0.0/ucd/auxiliary/GraphemeBreakTest.html#r10.0 but the spec doesn't talk about interleaved Extend*: http://www.unicode.org/reports/tr29/proposed.html#GB10 It seems following the spec this would be: ÷ 261D × 0308 ÷ 1F3FB ÷ Which one is right? Best, Daniel From liancu at microsoft.com Tue Jun 21 19:32:08 2016 From: liancu at microsoft.com (Laurentiu Iancu) Date: Wed, 22 Jun 2016 00:32:08 +0000 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments In-Reply-To: References: Message-ID: Hello, Re #1, the ^ symbol indeed denotes a start-of-line anchor, in usual regex notation, and the corresponding rules could use sot instead. Re #2, that was an oversight, and will be addressed in the Proposed Update of UAX #29 for Unicode 10.0. Re #3 and #4, both were addressed before the release of Version 9.0. For suggestions such as #1, which require review by the UTC, please remember to use the feedback reporting form. Thank you, L.
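The point raised in #2 can be made concrete. The sketch below is an illustration under stated assumptions, not the reference implementation: it looks only at regional indicators (taken here to be U+1F1E6..U+1F1FF), ignores every other grapheme cluster rule, and uses hypothetical helper names. It shows that deciding whether to break between two RI requires counting the RI that precede the offset, so the adjacent pair alone is not enough:

```python
def is_ri(ch):
    """True for REGIONAL INDICATOR SYMBOL LETTER A..Z (assumed range)."""
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def breaks_between_ri(text, offset):
    """Boundary decision between text[offset-1] and text[offset], both RI.

    No break when an odd number of RI precede the offset (the offset sits
    inside a pair); break when the count is even (a pair just ended).
    """
    preceding = 0
    i = offset - 1
    while i >= 0 and is_ri(text[i]):   # scan back to sot or the first non-RI
        preceding += 1
        i -= 1
    return preceding % 2 == 0


def split_ri_run(run):
    """Split a string made only of RI into flag-sized clusters."""
    clusters, start = [], 0
    for offset in range(1, len(run)):
        if breaks_between_ri(run, offset):
            clusters.append(run[start:offset])
            start = offset
    clusters.append(run[start:])
    return clusters


FR = "\U0001F1EB\U0001F1F7"   # two RI
DE = "\U0001F1E9\U0001F1EA"   # two RI
J = "\U0001F1EF"              # one unpaired RI
print([len(c) for c in split_ri_run(FR + DE + J)])   # prints [2, 2, 1]
```

The backward scan is what makes the cost depend on the length of the RI run rather than on the two neighbouring characters.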
-------------- next part -------------- An HTML attachment was scrubbed... URL: From liancu at microsoft.com Tue Jun 21 19:32:46 2016 From: liancu at microsoft.com (Laurentiu Iancu) Date: Wed, 22 Jun 2016 00:32:46 +0000 Subject: UAX29 9.0.0 Grapheme cluster spec & test discrepancy In-Reply-To: <3E23D9AA9EF7402BA96D1D8D14579CEF@erratique.ch> References: <3E23D9AA9EF7402BA96D1D8D14579CEF@erratique.ch> Message-ID: Hello, This discrepancy was addressed during the release process. Please refer to the published Version 9.0 of UAX #29 and the UCD files. Regards, L.
-------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Jun 22 04:22:21 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 22 Jun 2016 10:22:21 +0100 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments In-Reply-To: References: Message-ID: <1945B5EB463B4B62ABDBFF876C3FB169@erratique.ch> Thanks for the answers Laurentiu. > For suggestions such as #1, which require review by the UTC, please remember to use the feedback reporting form. Will do. I always prefer to first check my understanding with the list to avoid making bogus reports. Best, Daniel From verdy_p at wanadoo.fr Wed Jun 22 05:33:57 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Wed, 22 Jun 2016 12:33:57 +0200 Subject: Announcing The Unicode(R) Standard, Version 9.0 Message-ID: 2016-06-22 0:02 GMT+02:00 : > Important symbol additions include: > > - 19 symbols for the new 4K TV standard > > We were told that this standard is not named "4K" but "UltraHD" (UHD)... "4K" is just a popular informal term in English media, or used in commercial announcements, here also in English.
It is not correctly understood everywhere, or would lead to confusion about the required conformance level [Basically, this does not just include a minimum resolution but also a set of encoding technologies, support for encryption, support for several protocols -- including support for UTF-8 as this standard is now based on web standards -- and no longer requires the MPEG envelope, but will rather use streaming over IP. For broadcasting, it also includes a new signal format requiring a new hardware tuner and demultiplexer and channels will transport more than just audio and videos, and will also have dynamically changing parameters (resolution, color planes, supplementary planes for stereoscopic 3D, supplementary streams for 5.1 sound, possibility of reducing the bandwidth usage dynamically for some programs, so that channel producers can negotiate their mutual bandwidth need on the multiplex support, and add/remove supplementary streams, including for advertising, or for renewing usage rights to the authorized subscribers with conforming devices...). All this is also supported on the new DVB-T2 standard for broadcasting, but the format is designed to be transportable as well over various networking media, including fiber, DSL, mobile internet, or relayed over VLANs. For "4K" resolution, the requirement on devices is not just on the tuner or demuxer, but also in terms of minimum performance level for the codec which will also support secondary streams for error corrections, possibly via other connections, such as correcting a received broadcast using a separate Internet access, which may also be used to negotiate and renew decryption keys for paid programs.]
The UltraHD logo (for use on sold products) is set accordingly (and already there's another DVB-T2 logo for hardware decoders that are still not ready for UltraHD, but may be eligible later via firmware updates, because existing DVB-T tuners will not be able to decode the signal even if they support the necessary codecs and are able to display the 4K resolution). For cable decoders or "boxes" proposed by ISPs, there are separate specifications, but they are controlled by the ISP. However they will support the UltraHD streams and will implement the necessary virtual networking interfaces in their router. For mobile devices, this will be supported as long as you have support for the 4G/5G network; the rest will be a driver update or will be supported by installable apps, but the rendering capabilities will be limited by the GPU and screen hardware. Anyway, aren't all these logos (not "4K", but "UltraHD" and "DVB-T"/"DVB-T2") protected by IP rights (with specific rules about their conforming usage, and a design for the shapes)? -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Jun 22 06:10:30 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 22 Jun 2016 12:10:30 +0100 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments In-Reply-To: References: Message-ID: On Wednesday, 22 June 2016 at 01:32, Laurentiu Iancu wrote: > Re #1, the ^ symbol indeed denotes a start-of-line anchor, in usual regex notation, and the corresponding rules could use sot instead. By the way it seems to me that an equivalent formulation of GB12/GB13 and WB15/WB16 would be to have the sequence of rules: (1) RI RI ÷ RI RI (2) RI x RI This fits particularly well in the case of word breaking since you already need as much context as this because of the rules WB{6,7,11,12}. It also avoids regexps and negation.
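A formulation like this can be checked mechanically. The sketch below is only an illustration under assumptions: it considers a run consisting solely of RI, evaluates each offset independently with the first matching rule winning, and uses hypothetical function names. It prints the boundaries given by the two proposed rules next to those given by odd/even pairing, so the two can be compared directly:

```python
def proposed_break(run_length, offset):
    """First-match evaluation of the two proposed rules inside a run of RI.

    Rule 1: RI RI ÷ RI RI  (at least two RI on each side -> break)
    Rule 2: RI × RI        (otherwise -> no break)
    """
    left, right = offset, run_length - offset
    if left >= 2 and right >= 2:
        return True
    return False


def pairing_break(run_length, offset):
    """Pairing rule: no break when an odd number of RI precede the offset."""
    return offset % 2 == 0


def render(run_length, rule):
    """Show a run of RI with ÷ (break) or × (no break) at each inner offset."""
    parts = ["RI"]
    for offset in range(1, run_length):
        parts.append("÷" if rule(run_length, offset) else "×")
        parts.append("RI")
    return " ".join(parts)


print(render(6, proposed_break))   # prints RI × RI ÷ RI ÷ RI ÷ RI × RI
print(render(6, pairing_break))    # prints RI × RI ÷ RI × RI ÷ RI × RI
```

For six RI the two outputs differ at the third offset, where the pairing rule keeps the second flag together and the two-rule sequence splits it.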
Best, Daniel From mark at macchiato.com Wed Jun 22 06:32:43 2016 From: mark at macchiato.com (Mark Davis ☕️) Date: Wed, 22 Jun 2016 13:32:43 +0200 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments In-Reply-To: References: Message-ID: That wouldn't work. The process works by taking each offset, and walking through all the rules, using the first one that matches. So with your rules and the following input: RI RI RI RI RI RI You'd get that any offset with at least 2 RI on the right and on the left would have a break, and everything else would have no break, thus: RI x RI ÷ RI ÷ RI ÷ RI x RI Mark On Wed, Jun 22, 2016 at 1:10 PM, Daniel Bünzli wrote: > > On Wednesday, 22 June 2016 at 01:32, Laurentiu Iancu wrote: > > Re #1, the ^ symbol indeed denotes a start-of-line anchor, in usual > regex notation, and the corresponding rules could use sot instead. > > By the way it seems to me that an equivalent formulation of GB12/GB13 and > WB15/WB16 would be to have the sequence of rules: > > RI RI ÷ RI RI > RI x RI > > This fits particularly well in the case of word breaking since you already > need as much context as this because of the rules WB{6,7,11,12}. It also > avoids regexps and negation. > > Best, > > Daniel > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.buenzli at erratique.ch Wed Jun 22 06:54:28 2016 From: daniel.buenzli at erratique.ch (Daniel Bünzli) Date: Wed, 22 Jun 2016 12:54:28 +0100 Subject: UAX 29 9.0.0 new emoji flag rules questions and comments In-Reply-To: References: Message-ID: On Wednesday, 22 June 2016 at 12:32, Mark Davis ☕️ wrote: > That wouldn't work. Ah yes indeed. You'd need to be able to remember which previous boundary decisions were taken. I.e. have rules of the form: RI x RI ÷
RI RI Thanks, Daniel From kenwhistler at att.net Wed Jun 22 10:06:00 2016 From: kenwhistler at att.net (Ken Whistler) Date: Wed, 22 Jun 2016 08:06:00 -0700 Subject: Announcing The Unicode(R) Standard, Version 9.0 In-Reply-To: References: Message-ID: <3839938e-397b-1920-88e6-f14fee5783e1@att.net> On 6/22/2016 3:33 AM, Philippe Verdy wrote: > > > 2016-06-22 0:02 GMT+02:00 >: > > Important symbol additions include: > > * 19 symbols for the new 4K TV standard > > We were told that this standad is not named "4K" but "UltraHD" > (UHD)... "4K" is just a popular informal term in English medias, or > used in commercial announcements, here also in English. It is not > correctly understood everywhere, or would lead to confusion about the > required conformance level > > ... [verbose explanation of the standard] ... > > Anyway, aren't all these logos (not "4K", but "UltraHD" and > "DVB-T"/"DVB-T2") protected by IP rights (with specific rules about > their conforming usage, and a design for the shapes) ? > The characters in question are correctly identified as from the "ARIB STD B62" in the 9.0 code charts. We recognize that "4K" is just a shorthand term for the standard in question. And while there may be specific rules for conforming usage in actual television implementation, the symbols in question were urgently requested by the Japanese National Body for inclusion in 10646 and the Unicode Standard, precisely to ensure Unicode *character-based* interchange and interoperability. See: http://www.unicode.org/L2/L2015/15238-n4671.pdf From that document: "Therefore, it is highly expected that the additional symbols in ARIB STD-B62 are safely interchanged via UCS." These 19 symbols were accelerated for publication in Unicode 9.0 to ensure their availability for implementations as of 2016. --Ken -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From c933103 at gmail.com Wed Jun 22 22:54:16 2016 From: c933103 at gmail.com (gfb hjjhjh) Date: Thu, 23 Jun 2016 11:54:16 +0800 Subject: Announcing The Unicode(R) Standard, Version 9.0 In-Reply-To: References: Message-ID: From what I understand, these symbols are from Japanese Broadcasting Standard and I do see Japanese government use 4K in their official documents which probably explained the naming. https://www.google.co.jp/search?q=4k+site%3A.go.jp&oq=4k+site%3A.go.jp -------------- next part -------------- An HTML attachment was scrubbed...
URL: From otto.stolz at uni-konstanz.de Thu Jun 23 06:50:38 2016 From: otto.stolz at uni-konstanz.de (Otto Stolz) Date: Thu, 23 Jun 2016 13:50:38 +0200 Subject: Re: Announcing The Unicode® Standard, Version 9.0 In-Reply-To: <5769B96E.5040804@unicode.org> References: <5769B96E.5040804@unicode.org> Message-ID: <576BCD0E.7030605@uni-konstanz.de> Ciao, on 2016-06-22 at 00:02 announcements at unicode.org wrote: > Version 9.0 of the Unicode Standard is now available. ... > MOTOR SCOOTER Almost exactly 70 years after its invention, "la vespa" has found her way into Unicode. I have related that important news, immediately, to the members of my Italian language class ;-) Auguri, Otto From ken.shirriff at gmail.com Thu Jun 23 14:53:17 2016 From: ken.shirriff at gmail.com (Ken Shirriff) Date: Thu, 23 Jun 2016 12:53:17 -0700 Subject: Adding half-star to Unicode? Message-ID: Half-stars are used all over the place for reviews and many people have expressed interest in a Unicode half star. I propose two new Unicode characters: half a BLACK STAR (★) and a half-filled WHITE STAR (☆), i.e. a half star without and with an outline. What do you think? Is there any reason Unicode doesn't have a half star? Ken -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Thu Jun 23 16:34:40 2016 From: gwalla at gmail.com (Garth Wallace) Date: Thu, 23 Jun 2016 14:34:40 -0700 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: On Thu, Jun 23, 2016 at 12:53 PM, Ken Shirriff wrote: > Half-stars are used all over the place for reviews and many people have > expressed interest in a Unicode half star. I propose two new Unicode > characters: half a BLACK STAR (★) and a half-filled WHITE STAR (☆), i.e. a > half star without and with an outline. What do you think? Is there any > reason Unicode doesn't have a half star?
> > Ken > Ratings are usually sequences of stars, with any half star coming at the end, like ★★★(half), AIUI, so it's usually the left side that's black. But what about in right-to-left contexts? Would they be bidi-mirrored? -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 23 16:44:10 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 23 Jun 2016 23:44:10 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: Only one of the two would be enough: the existing **full** white star (☆), but only half filled. However a second one may be needed for RTL: half filling may be done on the left or right side of the white star. These would then be WHITE STAR WITH LEFT HALF BLACK and WHITE STAR WITH RIGHT HALF BLACK. Possibly we may also need top and bottom half filling (for vertically written scripts, top to bottom, or bottom to top) 2016-06-23 21:53 GMT+02:00 Ken Shirriff : > Half-stars are used all over the place for reviews and many people have > expressed interest in a Unicode half star. I propose two new Unicode > characters: half a BLACK STAR (★) and a half-filled WHITE STAR (☆), i.e. a > half star without and with an outline. What do you think? Is there any > reason Unicode doesn't have a half star? > > Ken > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 23 16:46:36 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 23 Jun 2016 23:46:36 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: You're right, mirroring for RTL, and vertical presentation may avoid creating 4 characters, only one would then be needed: HALF-BLACK WHITE STAR ... 2016-06-23 23:34 GMT+02:00 Garth Wallace : > On Thu, Jun 23, 2016 at 12:53 PM, Ken Shirriff > wrote: > >> Half-stars are used all over the place for reviews and many people have >> expressed interest in a Unicode half star.
I propose two new Unicode >> characters: half a BLACK STAR (★) and a half-filled WHITE STAR (☆), i.e. a >> half star without and with an outline. What do you think? Is there any >> reason Unicode doesn't have a half star? >> >> Ken >> > > Ratings are usually sequences of stars, with any half star coming at the > end, like ★★★(half), AIUI, so it's usually the left side that's black. > But what about in right-to-left contexts? Would they be bidi-mirrored? > -------------- next part -------------- An HTML attachment was scrubbed... URL: From gwalla at gmail.com Thu Jun 23 17:01:18 2016 From: gwalla at gmail.com (Garth Wallace) Date: Thu, 23 Jun 2016 15:01:18 -0700 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: But precedent is for separate WITH LEFT HALF BLACK and WITH RIGHT HALF BLACK geometric shapes. Also, I'm not sure if the BLACK HALF STAR and STAR WITH LEFT HALF BLACK are entirely interchangeable. I usually see the former in situations using a variable number of glyphs, where the number of glyphs shows the rating, as in: ? ??? ????? while I see the latter in ratings with a fixed number of glyphs, where the number of *filled* glyphs shows the rating, as in: ????? ????? ????? It seems like either would work in the first case, but the LEFT HALF STAR would be awkward in the second. On Thu, Jun 23, 2016 at 2:46 PM, Philippe Verdy wrote: > You're right, mirroring for RTL, and vertical presentation may avoid > creating 4 characters, only one would then be needed: HALF-BLACK WHITE STAR > ... > > 2016-06-23 23:34 GMT+02:00 Garth Wallace : > >> On Thu, Jun 23, 2016 at 12:53 PM, Ken Shirriff >> wrote: >> >>> Half-stars are used all over the place for reviews and many people have >>> expressed interest in a Unicode half star. I propose two new Unicode >>> characters: half a BLACK STAR (★) and a half-filled WHITE STAR (☆), i.e. a >>> half star without and with an outline. What do you think? Is there any >>> reason Unicode doesn't have a half star?
>>> >>> Ken >>> >> >> Ratings are usually sequences of stars, with any half star coming at the >> end, like ★★★(half), AIUI, so it's usually the left side that's black. >> But what about in right-to-left contexts? Would they be bidi-mirrored? >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Thu Jun 23 17:21:46 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Jun 2016 00:21:46 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: There are also cases where these ratings are using a fixed number of stars, but ALL of them are filled. Only the fill color changes: the rating shows, for example, the main rating stars in a plain contrasting blue, while the other stars are soft grey shades (less contrasting on the background). And in this case, there's no WHITE STAR used! 2016-06-24 0:01 GMT+02:00 Garth Wallace : > But precedent is for separate WITH LEFT HALF BLACK and WITH RIGHT HALF > BLACK geometric shapes. > > Also, I'm not sure if the BLACK HALF STAR and STAR WITH LEFT HALF BLACK > are entirely interchangeable. I usually see the former in situations using > a variable number of glyphs, where the number of glyphs shows the rating, > as in: > > ? > ??? > ????? > > while I see the latter in ratings with a fixed number of glyphs, where the > number of *filled* glyphs shows the rating, as in: > > ????? > ????? > ????? > > It seems like either would work in the first case, but the LEFT HALF STAR > would be awkward in the second. > > On Thu, Jun 23, 2016 at 2:46 PM, Philippe Verdy > wrote: > >> You're right, mirroring for RTL, and vertical presentation may avoid >> creating 4 characters, only one would then be needed: HALF-BLACK WHITE STAR >> ...
>> >> 2016-06-23 23:34 GMT+02:00 Garth Wallace : >> >>> On Thu, Jun 23, 2016 at 12:53 PM, Ken Shirriff >>> wrote: >>> >>>> Half-stars are used all over the place for reviews and many people have >>>> expressed interest in a Unicode half star. I propose two new Unicode >>>> characters: half a BLACK STAR (?) and a half-filled WHITE STAR (?), i.e. a >>>> half star without and with an outline. What do you think? Is there any >>>> reason Unicode doesn't have a half star? >>>> >>>> Ken >>>> >>> >>> Ratings are usually sequences of stars, with any half star coming at the >>> end, like ???(half), AIUI, so it's usually the left side that's black. >>> But what about in right-to-left contexts? Would they be bidi-mirrored? >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Thu Jun 23 17:27:04 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Thu, 23 Jun 2016 15:27:04 -0700 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: <05f51bb0-b140-863d-decd-55d02d975e42@ix.netcom.com> An HTML attachment was scrubbed... URL: From unicode at acjs.net Thu Jun 23 17:30:27 2016 From: unicode at acjs.net (ACJ Unicode) Date: Fri, 24 Jun 2016 00:30:27 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: <203da7a5-653c-d098-75e0-001b7b589277@acjs.net> Op 23-6-2016 om 21:53 schreef Ken Shirriff: > Half-stars are used all over the place for reviews and many people > have expressed interest in a Unicode half star. I propose two new > Unicode characters: half a BLACK STAR (?) and a half-filled WHITE STAR > (?), i.e. a half star without and with an outline. What do you think? > Is there any reason Unicode doesn't have a half star? +1 I was actually planning to write a proposal for this. Alexander From kenwhistler at att.net Thu Jun 23 17:35:08 2016 From: kenwhistler at att.net (Ken Whistler) Date: Thu, 23 Jun 2016 15:35:08 -0700 Subject: Adding half-star to Unicode? 
In-Reply-To: References: Message-ID: <4987b2cc-3a4f-ca0b-c638-3f534bfbbfb2@att.net> On 6/23/2016 3:01 PM, Garth Wallace wrote: > But precedent is for separate WITH LEFT HALF BLACK and WITH RIGHT HALF > BLACK geometric shapes. > > Also, I'm not sure if the BLACK HALF STAR and STAR WITH LEFT HALF > BLACK are entirely interchangeable. I agree. If we are going to do this, a set of 4 geometric symbols makes sense: the half black star, left and right, and the half and half black/white star, left and right. We aren't likely to start down the road of making this kind of symbol tally display automatic bidi-mirroring. These aren't math operators -- but the half stars just as more symbols would be useful. And if somebody can turn up convincing evidence of use of star symbols cut in half on a horizontal axis, those would be interesting, too. Oh, and please don't come around next asking for a green rotten splat and a red certified fresh tomato!! --Ken From leob at mailcom.com Thu Jun 23 17:37:56 2016 From: leob at mailcom.com (Leo Broukhis) Date: Thu, 23 Jun 2016 15:37:56 -0700 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: For a previous discussion on the topic, please see the thread "Missing geometric shapes" around 11/12/12 Leo On Thu, Jun 23, 2016 at 12:53 PM, Ken Shirriff wrote: > Half-stars are used all over the place for reviews and many people have > expressed interest in a Unicode half star. I propose two new Unicode > characters: half a BLACK STAR (?) and a half-filled WHITE STAR (?), i.e. a > half star without and with an outline. What do you think? Is there any > reason Unicode doesn't have a half star? > > Ken > -------------- next part -------------- An HTML attachment was scrubbed... URL: From textexin at xencraft.com Thu Jun 23 18:06:04 2016 From: textexin at xencraft.com (Tex Texin) Date: Thu, 23 Jun 2016 16:06:04 -0700 Subject: Adding half-star to Unicode? 
In-Reply-To: <4987b2cc-3a4f-ca0b-c638-3f534bfbbfb2@att.net> References: <4987b2cc-3a4f-ca0b-c638-3f534bfbbfb2@att.net> Message-ID: <002101d1cda3$d1d2c6f0$757854d0$@xencraft.com> I would have to check to see whether they are actually used, but I suspect using stars in RTL markets is not the best choice of symbols... When you look at the number of symbols used for ratings or more general valuations, rather than adding horizontal and vertical shading for each of them, it might be better to have modifier characters for the four half-filled colorings. Or to generalize further, they can just be half modifiers, and the presentation can decide if the coloring is halved, or the image itself is only half-shown, or the other ways of indicating halved (eg half-eaten tomato...) tex -----Original Message----- From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Ken Whistler Sent: Thursday, June 23, 2016 3:35 PM To: Garth Wallace Cc: unicode at unicode.org Subject: Re: Adding half-star to Unicode? On 6/23/2016 3:01 PM, Garth Wallace wrote: > But precedent is for separate WITH LEFT HALF BLACK and WITH RIGHT HALF > BLACK geometric shapes. > > Also, I'm not sure if the BLACK HALF STAR and STAR WITH LEFT HALF > BLACK are entirely interchangeable. I agree. If we are going to do this, a set of 4 geometric symbols makes sense: the half black star, left and right, and the half and half black/white star, left and right. We aren't likely to start down the road of making this kind of symbol tally display automatic bidi-mirroring. These aren't math operators -- but the half stars just as more symbols would be useful. And if somebody can turn up convincing evidence of use of star symbols cut in half on a horizontal axis, those would be interesting, too. Oh, and please don't come around next asking for a green rotten splat and a red certified fresh tomato!! 
--Ken From frederic.grosshans at gmail.com Fri Jun 24 07:12:31 2016 From: frederic.grosshans at gmail.com (Frédéric Grosshans) Date: Fri, 24 Jun 2016 14:12:31 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: Message-ID: <576D23AF.2050003@gmail.com> On 24/06/2016 00:37, Leo Broukhis wrote: > For a previous discussion on the topic, please see > the thread "Missing geometric shapes" around 11/12/12 The thread starts here: http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0008.html It contains an example of a half-filled star used in an RTL (Hebrew) context, in an advertisement in Haaretz, here: http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0024.html From jknappen at web.de Fri Jun 24 09:04:18 2016 From: jknappen at web.de ("Jörg Knappen") Date: Fri, 24 Jun 2016 16:04:18 +0200 Subject: Aw: Re: Adding half-star to Unicode? In-Reply-To: <576D23AF.2050003@gmail.com> References: , <576D23AF.2050003@gmail.com> Message-ID: An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Fri Jun 24 00:21:29 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Fri, 24 Jun 2016 14:21:29 +0900 Subject: Emoji and Annotation data Message-ID: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> Hi, I'm working on IBus - the input method framework for Linux. I parse http://unicode.org/emoji/charts/emoji-list.html and create a dictionary between the annotations and the Emoji characters. Since the file is large and often updated, I'm wondering how best to maintain it. I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for the build at the moment. I have two questions: - Does unicode.org provide a tarball of the stable HTML files or other data? - What is the license of the HTML files? Do you have any ideas?
Thanks, Fujiwara From gwalla at gmail.com Fri Jun 24 10:55:45 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 24 Jun 2016 08:55:45 -0700 Subject: Adding half-star to Unicode? In-Reply-To: References: <576D23AF.2050003@gmail.com> Message-ID: But would anarchists even want their symbol to be encoded? On Fri, Jun 24, 2016 at 7:04 AM, "Jörg Knappen" wrote: > Talking about fancy five-pointed stars, besides the vertically split ones there is > the "Anarchist star" (a symbol for anarcho-syndicalism) > with a diagonal split into an upper left red half and a lower right black > half. Since there are political and ideological symbols encoded > in Unicode, maybe this one is worth encoding as well (probably twice, once > as a black and white plain symbol and once as a colourful Emoji). > > See here: > https://commons.wikimedia.org/wiki/Category:Anarcho-Syndicalism#/media/File:Anarchist_star.svg > > FIVE POINTED STAR WITH BLACK LOWER RIGHT HALF = anarchist star > ANARCHIST STAR EMOJI > > --Jörg Knappen > > *Sent:* Friday, 24 June 2016 at 14:12 > *From:* "Frédéric Grosshans" > *To:* unicode at unicode.org > *Subject:* Re: Adding half-star to Unicode? > On 24/06/2016 00:37, Leo Broukhis wrote: > > For a previous discussion on the topic, please see > > the thread "Missing geometric shapes" around 11/12/12 > The thread starts here: > http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0008.html > > It contains an example of a half-filled star used in an RTL (Hebrew) context, > in an advertisement in Haaretz, here: > http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0024.html > > > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From mark at macchiato.com Fri Jun 24 11:04:40 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Fri, 24 Jun 2016 18:04:40 +0200 Subject: Emoji and Annotation data In-Reply-To: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> Message-ID: You should never be scraping *any* Unicode HTML files. They are not made for that, and there is no guarantee of stability. The emoji files are built from data which is described in http://www.unicode.org/reports/tr51/ (plus CLDR annotations and collation) Mark On Fri, Jun 24, 2016 at 7:21 AM, Takao Fujiwara wrote: > Hi, > > I'm working on IBus - the input method framework for Linux. > I parse http://unicode.org/emoji/charts/emoji-list.html and create a > dictionary between the annotations and the Emoji characters. > Since the file size is large and it's often updated, I'm thinking how to > maintain the file. > > I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for > the build at the moment. > > I have questions: > - if unicode.org provides the tarball of the stable html files or other > data. > - what is the license of the html files. > > Do you have any ideas? > > Thanks, > Fujiwara > -------------- next part -------------- An HTML attachment was scrubbed... URL: From verdy_p at wanadoo.fr Fri Jun 24 12:10:06 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Jun 2016 19:10:06 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: <576D23AF.2050003@gmail.com> Message-ID: My bet is that they'll prefer using whatever code they want, hacking fonts as necessary to overtake another political symbol when they'll want. They could do that easily with Webfonts today (by designing a tiny webfont with just one glyph mapped to any code point, including some ASCII symbol such as the DOLLAR sign). 
They would even refuse any normalization and would not even use the code point proposed for them, or would remap some ASCII-art string (the classic emoticons of Usenet; if we even attempt to define standard colors, or glyph design, they'll invent another incompatible design, will change colors, will rotate it, will change it into an exploding star...). However, the historic anarchist symbol that was seen on walls and painted banners in Europe in the 19th and early 20th century was only black. And it was not really a star, but derived from the letter A in a circle, with the horizontal bar frequently replaced by some firearm, or slanted and looking more like a thin arrowhead pointing slightly upward (various decorations could be added on top: a striker throwing a Molotov... or flowers; a plus sign; a "V" on top to mean "victory"). The strokes were most often very irregular, as if they were brushed very rapidly on a wall. More polished forms have been used where it is a standard A in a circle open at the bottom and a small curved leg. Not all of them want flags with colors. Other groups just use a red-filled standard 5-pointed star, over a plain black background. In London still today, there's most often no star, just a red and black flag (color cut on the diagonal). The red side or black side may be attached on the hanging stem, but generally a black side is below the right side. The red color varies also (green, dark purple, pink, orange, white...) but the black color seems to be always there (even if it's just the classic circle A, that black may be used to fill the glyph, or the background). There's no dedicated support; the symbols may be used everywhere, integrated in all sorts of graphics, made with various materials. The flag may be raised in all positions. In Australia, this is a vertical rainbow over a black area. Other symbols of anarchism include a closed hand (fist) raised upward (in a sign of protest) with a venomous snake.
The anarchist movements have always been inventive and protesting against all sorts of political regimes, democratic or not; in fact they protest against all forms of state government, and their official symbols. 2016-06-24 17:55 GMT+02:00 Garth Wallace : > But would anarchists even want their symbol to be encoded? From leoboiko at namakajiri.net Fri Jun 24 12:20:51 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Fri, 24 Jun 2016 14:20:51 -0300 Subject: Adding half-star to Unicode?
In-Reply-To: References: <576D23AF.2050003@gmail.com> Message-ID: > My bet is that they'll prefer using whatever code they want, hacking fonts as necessary to overtake another political symbol when they'll want. They could liberate a code point from the private use area. From verdy_p at wanadoo.fr Fri Jun 24 12:23:19 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Fri, 24 Jun 2016 19:23:19 +0200 Subject: Adding half-star to Unicode? In-Reply-To: References: <576D23AF.2050003@gmail.com> Message-ID: Or just reuse the code already assigned to the circled A (the most common basic symbol), ignoring the many variants of shapes and colors. 2016-06-24 19:20 GMT+02:00 Leonardo Boiko : > > My bet is that they'll prefer using whatever code they want, hacking > fonts as necessary to overtake another political symbol when they'll want. > > > They could liberate a code point from the private use area. From costello at mitre.org Sun Jun 26 03:37:32 2016 From: costello at mitre.org (Costello, Roger L.) Date: Sun, 26 Jun 2016 08:37:32 +0000 Subject: Are there Unicode symbols for parenthesis generator symbols?
Message-ID: Hi Folks, In the book Parsing Techniques the authors use a less-than sign with a dot tucked inside for the open parenthesis and a greater-than sign with a dot tucked inside for the close parenthesis. Also, they use an equal sign with a dot over it. You can see the 3 symbols here: https://books.google.com/books?id=05xA_d5dSwAC&pg=PA267&lpg=PA267&dq=parenthesis+generator+symbols&source=bl&ots=3OwyeBndO8&sig=ZhwoeYRJjm3GTzNNP1vgsAVRisc&hl=en&sa=X&sqi=2&ved=0ahUKEwi577X-o8XNAhWBaz4KHc0QA_EQ6AEIIzAB#v=onepage&q=parenthesis%20generator%20symbols&f=false Are there Unicode symbols for the 3 symbols? /Roger From ori at avtalion.name Sun Jun 26 04:12:27 2016 From: ori at avtalion.name (Ori Avtalion) Date: Sun, 26 Jun 2016 12:12:27 +0300 Subject: Emoji and Annotation data In-Reply-To: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> Message-ID: Hey, I maintain an IBus module(?) that allows inputting emojis [1] (I think I mentioned it before on IRC). I use the data provided by EmojiOne, which also includes aliases and the popular (but unofficial) "shortnames". You might find it useful [2]. [1] https://github.com/salty-horse/ibus-uniemoji [2] https://github.com/Ranks/emojione/emoji.json On Fri, Jun 24, 2016 at 8:21 AM, Takao Fujiwara wrote: > Hi, > > I'm working on IBus - the input method framework for Linux. > I parse http://unicode.org/emoji/charts/emoji-list.html and create a > dictionary between the annotations and the Emoji characters. > Since the file size is large and it's often updated, I'm thinking how to > maintain the file. > > I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for > the build at the moment. > > I have questions: > - if unicode.org provides the tarball of the stable html files or other > data. > - what is the license of the html files. > > Do you have any ideas?
> > Thanks, > Fujiwara From andrewcwest at gmail.com Sun Jun 26 04:38:28 2016 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 26 Jun 2016 10:38:28 +0100 Subject: Are there Unicode symbols for parenthesis generator symbols? In-Reply-To: References: Message-ID: On 26 June 2016 at 09:37, Costello, Roger L. wrote: > > In the book Parsing Techniques the authors use a less than symbol with a dot tucked inside for the open parenthesis and a greater than symbol with a dot tucked inside for the close parenthesis. Also, they use an equal sign with a dot over it. You can see the 3 symbols here: > > https://books.google.com/books?id=05xA_d5dSwAC&pg=PA267&lpg=PA267&dq=parenthesis+generator+symbols&source=bl&ots=3OwyeBndO8&sig=ZhwoeYRJjm3GTzNNP1vgsAVRisc&hl=en&sa=X&sqi=2&ved=0ahUKEwi577X-o8XNAhWBaz4KHc0QA_EQ6AEIIzAB#v=onepage&q=parenthesis%20generator%20symbols&f=false > > Are there Unicode symbols for the 3 symbols? Yes, and they have all been around since Unicode 1.0: U+22D6 ⋖ U+22D7 ⋗ U+2250 ≐ (named APPROACHES THE LIMIT) Andrew From verdy_p at wanadoo.fr Sun Jun 26 08:00:56 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 26 Jun 2016 15:00:56 +0200 Subject: Are there Unicode symbols for parenthesis generator symbols? In-Reply-To: References: Message-ID: But there are also variants of U+2264 (≤) and U+2265 (≥) with dots within the bracket (starting page 973 in the same book) for "weak precedence" of operators... These variants (used to combine ⋖ or ⋗ with ≐) don't seem to be encoded. 2016-06-26 11:38 GMT+02:00 Andrew West : > On 26 June 2016 at 09:37, Costello, Roger L. wrote: > > > > In the book Parsing Techniques the authors use a less than symbol with a > dot tucked inside for the open parenthesis and a greater than symbol with a > dot tucked inside for the close parenthesis. Also, they use an equal sign > with a dot over it.
You can see the 3 symbols here: > > > > > https://books.google.com/books?id=05xA_d5dSwAC&pg=PA267&lpg=PA267&dq=parenthesis+generator+symbols&source=bl&ots=3OwyeBndO8&sig=ZhwoeYRJjm3GTzNNP1vgsAVRisc&hl=en&sa=X&sqi=2&ved=0ahUKEwi577X-o8XNAhWBaz4KHc0QA_EQ6AEIIzAB#v=onepage&q=parenthesis%20generator%20symbols&f=false > > > > Are there Unicode symbols for the 3 symbols? > > Yes, and they have all been around since Unicode 1.0: > > U+22D6 ⋖ > U+22D7 ⋗ > U+2250 ≐ (named APPROACHES THE LIMIT) > > Andrew > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andrewcwest at gmail.com Sun Jun 26 08:12:56 2016 From: andrewcwest at gmail.com (Andrew West) Date: Sun, 26 Jun 2016 14:12:56 +0100 Subject: Are there Unicode symbols for parenthesis generator symbols? In-Reply-To: References: Message-ID: On 26 June 2016 at 14:00, Philippe Verdy wrote: > But there are also variants of U+2264 (≤) and U+2265 (≥) with dots within > the bracket (starting page 973 in the same book) for "weak precedence" of > operators... starting page 273 > These variants (used to combine ⋖ or ⋗ with ≐) don't seem to be encoded. No, but there are U+2A7F ⩿ and U+2A80 ⪀ with slanted equals which might suffice. Andrew From verdy_p at wanadoo.fr Sun Jun 26 08:13:11 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sun, 26 Jun 2016 15:13:11 +0200 Subject: Are there Unicode symbols for parenthesis generator symbols? In-Reply-To: References: Message-ID: The encoded variants are U+2A7E (?) and U+2A7F (?) but with the lower bar slanted rather than horizontal. Maybe we could encode them with variation selectors (like for the two known variants of ? and ?) ? 2016-06-26 15:00 GMT+02:00 Philippe Verdy : > But there are also variants of U+2264 (≤) and U+2265 (≥) with dots within > the bracket (starting page 973 in the same book) for "weak precedence" of > operators... > > These variants (used to combine ⋖ or ⋗ with ≐) don't seem to be encoded.
> > > 2016-06-26 11:38 GMT+02:00 Andrew West : > >> On 26 June 2016 at 09:37, Costello, Roger L. wrote: >> > >> > In the book Parsing Techniques the authors use a less than symbol with >> a dot tucked inside for the open parenthesis and a greater than symbol with >> a dot tucked inside for the close parenthesis. Also, they use an equal >> sign with a dot over it. You can see the 3 symbols here: >> > >> > >> https://books.google.com/books?id=05xA_d5dSwAC&pg=PA267&lpg=PA267&dq=parenthesis+generator+symbols&source=bl&ots=3OwyeBndO8&sig=ZhwoeYRJjm3GTzNNP1vgsAVRisc&hl=en&sa=X&sqi=2&ved=0ahUKEwi577X-o8XNAhWBaz4KHc0QA_EQ6AEIIzAB#v=onepage&q=parenthesis%20generator%20symbols&f=false >> > >> > Are there Unicode symbols for the 3 symbols? >> >> Yes, and they have all been around since Unicode 1.0: >> >> U+22D6 ⋖ >> U+22D7 ⋗ >> U+2250 ≐ (named APPROACHES THE LIMIT) >> >> Andrew >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From tfujiwar at redhat.com Sun Jun 26 23:09:47 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Jun 2016 13:09:47 +0900 Subject: Emoji and Annotation data In-Reply-To: References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> Message-ID: <2985c941-9c12-7b9f-7ec3-1fcf49ee5ea5@redhat.com> On 06/25/16 01:04, Mark Davis ☕️-san wrote: > You should never be scraping /any/ Unicode HTML files. They are not made for that, and there is no guarantee of stability. I cannot find a license or any description for the HTML files. > > The emoji files are built from data which is described in http://www.unicode.org/reports/tr51/ > (plus CLDR annotations and collation) OK, I need data that pairs each emoji code point with its annotations. It would be great if that data could be provided in a form other than the HTML files. Thanks, Fujiwara > > Mark > ////// > > On Fri, Jun 24, 2016 at 7:21 AM, Takao Fujiwara > wrote: > > Hi, > > I'm working on IBus - the input method framework for Linux.
> I parse http://unicode.org/emoji/charts/emoji-list.html and create a dictionary between the annotations and the Emoji characters. > Since the file size is large and it's often updated, I'm thinking how to maintain the file. > > I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for the build at the moment. > > I have questions: > - if unicode.org provides the tarball of the stable html files or other data. > - what is the license of the html files. > > Do you have any ideas? > > Thanks, > Fujiwara > > From tfujiwar at redhat.com Sun Jun 26 23:13:55 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Jun 2016 13:13:55 +0900 Subject: Emoji and Annotation data In-Reply-To: References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> Message-ID: <2e621021-44bd-12f0-932a-a1d6b50c361b@redhat.com> Thanks for that info and contribution. Probably I will package the emojione for Fedora to use emoji.json. Why you don't use only annotations? E.g. "us" hits too many Emoji. Fujiwara On 06/26/16 18:12, Ori Avtalion-san wrote: > Hey, > > I maintain an IBus module(?) that allows inputting emojis [1] (I think > I mentioned it before on IRC). > I use the data provided by EmojiOne, which also includes aliases and > the popular (but unofficial) "shortnames". You might find it useful > [2]. > > [1] https://github.com/salty-horse/ibus-uniemoji > [2] https://github.com/Ranks/emojione/emoji.json > > On Fri, Jun 24, 2016 at 8:21 AM, Takao Fujiwara wrote: >> Hi, >> >> I'm working on IBus - the input method framework for Linux. >> I parse http://unicode.org/emoji/charts/emoji-list.html and create a >> dictionary between the annotations and the Emoji characters. >> Since the file size is large and it's often updated, I'm thinking how to >> maintain the file. >> >> I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for >> the build at the moment. 
>> >> I have questions: >> - if unicode.org provides the tarball of the stable html files or other >> data. >> - what is the license of the html files. >> >> Do you have any ideas? >> >> Thanks, >> Fujiwara > From tfujiwar at redhat.com Mon Jun 27 00:34:59 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Jun 2016 14:34:59 +0900 Subject: Emoji and Annotation data In-Reply-To: References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> <2985c941-9c12-7b9f-7ec3-1fcf49ee5ea5@redhat.com> Message-ID: <6ad6f653-7d3d-412d-f914-51eade9ebe9a@redhat.com> Hi, For example, in http://unicode.org/emoji/charts/emoji-list.html "😀" has the annotations "face" and "grin". That data is available only in the HTML files. Fujiwara On 06/27/16 14:16, Peter Edberg-san wrote: > Fujiwara-san, > If you follow the information indicated by UTR 51 (as Mark had suggested), you will see that: > > 1. The annotations data is available in CLDR here, in English: > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/en.xml > (or in many other languages, such as Japanese:) > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml > > The description of the format for those xml files is here: > http://www.unicode.org/reports/tr35/tr35-general.html#Annotations > > 2. Other emoji data files are here: > http://www.unicode.org/Public/emoji/latest/ > > These data files are what drive the generation of the charts. > > Best regards, > Peter Edberg > > > >> On Jun 26, 2016, at 9:09 PM, Takao Fujiwara wrote: >> >> On 06/25/16 01:04, Mark Davis ☕️-san wrote: >>> You should never be scraping /any/ Unicode HTML files. They are not made for that, and there is no guarantee of stability. >> >> I cannot find the license or descriptions about the HTML files. >> >>> >>> The emoji files are built from data which is described in http://www.unicode.org/reports/tr51/ >>> (plus CLDR annotations and collation) >> >> OK, I need the data which packages Emoji unicode and the annotation.
>> It would be great if the data could be provided besides the html files. >> >> Thanks, >> Fujiwara >> >>> >>> Mark >>> ////// >>> >>> On Fri, Jun 24, 2016 at 7:21 AM, Takao Fujiwara > wrote: >>> >>> Hi, >>> >>> I'm working on IBus - the input method framework for Linux. >>> I parse http://unicode.org/emoji/charts/emoji-list.html and create a dictionary between the annotations and the Emoji characters. >>> Since the file size is large and it's often updated, I'm thinking how to maintain the file. >>> >>> I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for the build at the moment. >>> >>> I have questions: >>> - if unicode.org provides the tarball of the stable html files or other data. >>> - what is the license of the html files. >>> >>> Do you have any ideas? >>> >>> Thanks, >>> Fujiwara >>> >>> >> > > From ori at avtalion.name Mon Jun 27 01:58:45 2016 From: ori at avtalion.name (Ori Avtalion) Date: Mon, 27 Jun 2016 09:58:45 +0300 Subject: Emoji and Annotation data In-Reply-To: <2e621021-44bd-12f0-932a-a1d6b50c361b@redhat.com> References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> <2e621021-44bd-12f0-932a-a1d6b50c361b@redhat.com> Message-ID: On Mon, Jun 27, 2016 at 7:13 AM, Takao Fujiwara wrote: > Why you don't use only annotations? E.g. "us" hits too many Emoji. It's for all kinds of Unicode symbols, not just those that have emoji representation. Sometimes I find myself searching by the "real" Unicode name, and sometimes by keyword, if I don't know what I'm looking for. I keep tweaking it to provide better results, and I'm pretty pleased with its current state. It currently has a ranking algorithm based on what it matched on (name, annotation/emojione keyword), and how successfully. 
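The ranking described above is not spelled out in the thread. Purely as an illustration, here is a minimal Python sketch of one way a symbol search could score hits by what matched (name or keyword) and how well (exact, whole word, substring). The Symbol class, the weights, and the sample data are assumptions made for this example; they are not taken from Ori's tool, IBus, or CLDR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Symbol:
    """Hypothetical record: one character with its name and search keywords."""
    char: str
    name: str
    keywords: List[str] = field(default_factory=list)


def match_quality(query: str, text: str) -> int:
    """Return 3 for an exact match, 2 for a whole-word match,
    1 for a substring match, and 0 for no match."""
    q, t = query.lower(), text.lower()
    if q == t:
        return 3
    if q in t.split():
        return 2
    if q in t:
        return 1
    return 0


def score(query: str, sym: Symbol) -> int:
    """Combine match quality with what was matched.
    Assumed weighting: a keyword hit outranks a name hit of equal quality."""
    name_q = match_quality(query, sym.name)
    kw_q = max((match_quality(query, k) for k in sym.keywords), default=0)
    return max(2 * kw_q, 2 * name_q - 1, 0)


def search(query: str, symbols: List[Symbol]) -> List[Symbol]:
    """Return matching symbols, best score first (ties keep input order)."""
    scored = [(score(query, s), s) for s in symbols]
    hits = [(sc, s) for sc, s in scored if sc > 0]
    hits.sort(key=lambda pair: -pair[0])  # list.sort is stable
    return [s for _, s in hits]


# Made-up sample data, for illustration only.
SAMPLE = [
    Symbol("\U0001F600", "GRINNING FACE", ["face", "grin"]),
    Symbol("\U0001F68C", "BUS", ["bus", "vehicle"]),
    Symbol("\U0001F4AA", "FLEXED BICEPS", ["biceps", "flex", "muscle"]),
    Symbol("\U0001F1FA\U0001F1F8", "FLAG: UNITED STATES", ["us", "flag"]),
]

results = search("us", SAMPLE)
```

With these assumed weights, a query of "us" still returns the bus and muscle entries, but ranks them after the entry whose keyword is exactly "us". Keeping only scores of 3 or more would drop the substring-only hits altogether.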
From tfujiwar at redhat.com Mon Jun 27 02:48:20 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Jun 2016 16:48:20 +0900 Subject: Emoji and Annotation data In-Reply-To: <017A6AB6-EB2B-4126-9866-C6FD13E17149@me.com> References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> <2985c941-9c12-7b9f-7ec3-1fcf49ee5ea5@redhat.com> <6ad6f653-7d3d-412d-f914-51eade9ebe9a@redhat.com> <017A6AB6-EB2B-4126-9866-C6FD13E17149@me.com> Message-ID: <8281cb72-c09b-1a8f-0257-4af462d1c9c7@redhat.com> On 06/27/16 16:01, Peter Edberg-san wrote: > I had suggested that you check > http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/en.xml > which has the line > face; grin > > Is that not what you want? I'm sorry. I missed that. OK, it seems emoji-list.html is a combination of en.xml and /Public/emoji/3.0/emoji-*.txt. However, I cannot find some annotations, e.g. "america". BTW, I think more categories, such as "animal" and "country", would be useful for the annotations. Fujiwara > > - Peter > > > On Jun 26, 2016, at 10:34 PM, Takao Fujiwara wrote: >> >> Hi, >> >> E.g. http://unicode.org/emoji/charts/emoji-list.html >> "??" has the annotations of "face" and "grin". >> >> The data is available in only the html files. >> >> Fujiwara >> >> On 06/27/16 14:16, Peter Edberg-san wrote: >>> Fujiwara-san, >>> If you follow the information indicated by UTR 51 (as Mark had suggested), you will see that: >>> >>> 1. The annotations data is available in CLDR here, in English: >>> http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/en.xml >>> (or in many other languages, such as Japanese:) >>> http://unicode.org/cldr/trac/browser/tags/latest/common/annotations/ja.xml >>> >>> The description of the format for those xml files is here: >>> http://www.unicode.org/reports/tr35/tr35-general.html#Annotations >>> >>> 2. Other emoji data files are here: >>> http://www.unicode.org/Public/emoji/latest/ >>> >>> These data files are what drive the generation of the charts.
>>> >>> Best regards, >>> Peter Edberg >>> >>> >>> >>>> On Jun 26, 2016, at 9:09 PM, Takao Fujiwara wrote: >>>> >>>> On 06/25/16 01:04, Mark Davis ??-san wrote: >>>>> You should never be scraping /any/ Unicode HTML files. They are not made for that, and there is no guarantee of stability. >>>> >>>> I cannot find the license or descriptions about the HTML files. >>>> >>>>> >>>>> The emoji files are built from data which is described in http://www.unicode.org/reports/tr51/ >>>>> (plus CLDR annotations and collation) >>>> >>>> OK, I need the data which packages Emoji unicode and the annotation. >>>> It would be great if the data could be provided besides the html files. >>>> >>>> Thanks, >>>> Fujiwara >>>> >>>>> >>>>> Mark >>>>> ////// >>>>> >>>>> On Fri, Jun 24, 2016 at 7:21 AM, Takao Fujiwara > wrote: >>>>> >>>>> Hi, >>>>> >>>>> I'm working on IBus - the input method framework for Linux. >>>>> I parse http://unicode.org/emoji/charts/emoji-list.html and create a dictionary between the annotations and the Emoji characters. >>>>> Since the file size is large and it's often updated, I'm thinking how to maintain the file. >>>>> >>>>> I copied the file as http://ibus.github.io/files/ibus/emoji-list.html for the build at the moment. >>>>> >>>>> I have questions: >>>>> - if unicode.org provides the tarball of the stable html files or other data. >>>>> - what is the license of the html files. >>>>> >>>>> Do you have any ideas? 
>>>>> >>>>> Thanks, >>>>> Fujiwara >>>>> >>>>> >>>> >>> >>> >> > > From tfujiwar at redhat.com Mon Jun 27 03:00:51 2016 From: tfujiwar at redhat.com (Takao Fujiwara) Date: Mon, 27 Jun 2016 17:00:51 +0900 Subject: Emoji and Annotation data In-Reply-To: References: <07d3a922-b9a3-e8cb-df09-746796c8e0d4@redhat.com> <2e621021-44bd-12f0-932a-a1d6b50c361b@redhat.com> Message-ID: <412dcbbd-f803-df98-7172-489bec525335@redhat.com> On 06/27/16 15:58, Ori Avtalion-san wrote: > On Mon, Jun 27, 2016 at 7:13 AM, Takao Fujiwara wrote: >> Why you don't use only annotations? E.g. "us" hits too many Emoji. > > It's for all kinds of Unicode symbols, not just those that have emoji > representation. > Sometimes I find myself searching by the "real" Unicode name, and > sometimes by keyword, if I don't know what I'm looking for. It seems a bit strange to me that typing "us" hits "bus" and "muscle". The following is the current implementation in IBus core: https://github.com/ibus/ibus/commit/160d3c975a Fujiwara > > I keep tweaking it to provide better results, and I'm pretty pleased > with its current state. > It currently has a ranking algorithm based on what it matched on > (name, annotation/emojione keyword), and how successfully. > From drmccreedy at gmail.com Tue Jun 28 00:09:38 2016 From: drmccreedy at gmail.com (drmccreedy .) Date: Mon, 27 Jun 2016 23:09:38 -0600 Subject: USAT value in the kIRG_USource property Message-ID: I see one codepoint now has the kIRG_USource property value of "USAT" in the Unihan_IRGSources.txt file from Unihan.zip: U+20991 kIRG_USource USAT-00061 UAX #45 (U-source Ideographs, http://www.unicode.org/reports/tr45/index.html) mentions UTC and UCI but not USAT. UAX #38 (Unicode Han Database, http://www.unicode.org/reports/tr38/) updated the syntax for the kIRG_USource property (but not the description) to U(TC|CI|SAT)-[0-9]{5} so I'm pretty sure it's not a typo. Where can I find a description of the USAT value?
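For anyone wanting to check values against the syntax quoted above, a small Python sketch follows. It uses only the pattern cited from UAX #38 and the data line shown in this message; the function names are made up for this illustration, and the three-field, whitespace-separated layout of the data line is an assumption based on that one example.

```python
import re

# Pattern as quoted from UAX #38 in this message.
USOURCE_RE = re.compile(r"U(TC|CI|SAT)-[0-9]{5}")


def usource_prefix(value):
    """Return the source prefix ("UTC", "UCI" or "USAT") of a
    kIRG_USource value, or None if the value does not fit the syntax."""
    m = USOURCE_RE.fullmatch(value)
    if m is None:
        return None
    return "U" + m.group(1)


def parse_unihan_line(line):
    """Split a data line such as 'U+20991 kIRG_USource USAT-00061'
    into (code point as int, field name, value).
    Assumes three whitespace-separated fields (hypothetical helper)."""
    cp, field_name, value = line.split(None, 2)
    if not cp.startswith("U+"):
        raise ValueError("expected a U+XXXX code point, got %r" % cp)
    return int(cp[2:], 16), field_name, value.strip()


code_point, field_name, value = parse_unihan_line(
    "U+20991\tkIRG_USource\tUSAT-00061"
)
prefix = usource_prefix(value)
```

The sketch only confirms that USAT-00061 is well-formed under the updated syntax; it says nothing about what the USAT source is.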
Thanks, David McCreedy From mpsuzuki at hiroshima-u.ac.jp Tue Jun 28 01:17:45 2016 From: mpsuzuki at hiroshima-u.ac.jp (suzuki toshiya) Date: Tue, 28 Jun 2016 15:17:45 +0900 Subject: [Unicode] USAT value in the kIRG_USource property In-Reply-To: <6da30a5ddd554ad2b1709155972d8b0d@KL1PR04MB1637.apcprd04.prod.outlook.com> References: <6da30a5ddd554ad2b1709155972d8b0d@KL1PR04MB1637.apcprd04.prod.outlook.com> Message-ID: <57721689.3030305@hiroshima-u.ac.jp> Dear David, Your confusion is understandable. USAT source characters were submitted by the Taisho Tripitaka digitization project named "SAT" ( http://21dzk.l.u-tokyo.ac.jp/SAT/index_en.html ), not by the UTC. Therefore, the current UAX #45 lacks the information. # Originally, IRG experts recommended using a "Z" # prefix for SAT characters, but WG2 experts decided to # use the "U" prefix. Anyway, UAX #38 is expected to be updated for the syntax and an appropriate reference. I will ask the SAT experts for comments. I don't think there is a requirement to update UAX #45 to include all USAT characters, but the title of the UAX, "U-Source Ideographs", might be arguable. Regards, mpsuzuki drmccreedy . wrote: > I see one codepoint now has the kIRG_USource property value of "USAT" in > the Unihan_IRGSources.txt file from Unihan.zip: > U+20991 kIRG_USource USAT-00061 > > UAX #45 (U-source Ideographs, > http://www.unicode.org/reports/tr45/index.html) mentions UTC and UCI but > not USAT. > > UAX #38 (Unicode Han Database, http://www.unicode.org/reports/tr38/) > updated the syntax for the kIRG_USource property (but not the > description) to U(TC|CI|SAT)-[0-9]{5} so I'm pretty sure it's not a typo. > > Where can I find a description of the USAT value?
> > Thanks, > > David McCreedy From andrewcwest at gmail.com Tue Jun 28 06:26:44 2016 From: andrewcwest at gmail.com (Andrew West) Date: Tue, 28 Jun 2016 12:26:44 +0100 Subject: USAT value in the kIRG_USource property In-Reply-To: References: Message-ID: David, As Mr Suzuki says, despite the U prefix, USAT is not a Unicode source character. The reason why a solitary USAT source reference has suddenly popped up in Ext. B is that several thousand ideographs were proposed for encoding by SAT in what will be CJK Ext. F in Unicode 10.0 next year (there are currently 2,884 USAT characters in Ext. F). At the WG2 meeting in Matsue Japan last year, in response to UK ballot comments, USAT-00061 was unified with U+20991 in Ext. B (see http://www.unicode.org/wg2/docs/n4701-M64-Recommendations.pdf Recommendation M64.05c). I suppose that the Unicode Standard will be updated with a description of SAT when Ext. F is included in v. 10 next year. Andrew On 28 June 2016 at 06:09, drmccreedy . wrote: > I see one codepoint now has the kIRG_USource property value of "USAT" in the > Unihan_IRGSources.txt file from Unihan.zip: > U+20991 kIRG_USource USAT-00061 > > UAX #45 (U-source Ideographs, > http://www.unicode.org/reports/tr45/index.html) mentions UTC and UCI but not > USAT. > > UAX #38 (Unicode Han Database, http://www.unicode.org/reports/tr38/) updated > the syntax for the kIRG_USource property (but not the description) to > U(TC|CI|SAT)-[0-9]{5} so I'm pretty sure it's not a typo. > > Where can I find a description of the USAT value? > > Thanks, > > David McCreedy From drmccreedy at gmail.com Tue Jun 28 21:41:51 2016 From: drmccreedy at gmail.com (drmccreedy .) Date: Tue, 28 Jun 2016 20:41:51 -0600 Subject: USAT value in the kIRG_USource property In-Reply-To: References: Message-ID: Thank you both for the background. David McCreedy On Tue, Jun 28, 2016 at 5:26 AM, Andrew West wrote: > David, > > As Mr Suzuki says, despite the U prefix, USAT is not a Unicode source > character. 
The reason why a solitary USAT source reference has > suddenly popped up in Ext. B is that several thousand ideographs were > proposed for encoding by SAT in what will be CJK Ext. F in Unicode > 10.0 next year (there are currently 2,884 USAT characters in Ext. F). > At the WG2 meeting in Matsue Japan last year, in response to UK ballot > comments, USAT-00061 was unified with U+20991 in Ext. B (see > http://www.unicode.org/wg2/docs/n4701-M64-Recommendations.pdf > Recommendation M64.05c). I suppose that the Unicode Standard will be > updated with a description of SAT when Ext. F is included in v. 10 > next year. > > Andrew > > > > On 28 June 2016 at 06:09, drmccreedy . wrote: > > I see one codepoint now has the kIRG_USource property value of "USAT" in > the > > Unihan_IRGSources.txt file from Unihan.zip: > > U+20991 kIRG_USource USAT-00061 > > > > UAX #45 (U-source Ideographs, > > http://www.unicode.org/reports/tr45/index.html) mentions UTC and UCI > but not > > USAT. > > > > UAX #38 (Unicode Han Database, http://www.unicode.org/reports/tr38/) > updated > > the syntax for the kIRG_USource property (but not the description) to > > U(TC|CI|SAT)-[0-9]{5} so I'm pretty sure it's not a typo. > > > > Where can I find a description of the USAT value? > > > > Thanks, > > > > David McCreedy >