From kimslawson at gmail.com Wed Aug 3 14:26:59 2016
From: kimslawson at gmail.com (Kim Slawson)
Date: Wed, 3 Aug 2016 15:26:59 -0400
Subject: combining marks for currency characters? general combining character?
Message-ID:

It's nice to see a good selection of currency symbols defined in Unicode, but I wonder if it might be useful to add a few combining marks for the purpose of constructing currency symbols.

For example, many currency symbols use single or double horizontal lines, vertical lines or solidi ( |, -, /, ||, =, // ). Having these available as combining marks would simplify the creation of new currency symbols, as many are simply overstruck letters.

Would these be good candidates for proposed combining characters?

Alternately (and I have no clue if this has been addressed), why not allow arbitrary combining characters? ZWJ does not currently work for this, but it could be amended to, or another joining character introduced.

Kim Slawson
Kernel Panic Consulting
kim at slawson.org
207-370-7401

From leob at mailcom.com Wed Aug 3 16:17:14 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Wed, 3 Aug 2016 14:17:14 -0700
Subject: combining marks for currency characters? general combining character?
In-Reply-To:
References:
Message-ID:

Hi Kim,

While it can be argued that the "NON-DESTRUCTIVE BACKSPACE" capability of a typewriter, allowing arbitrary overstruck characters, belongs to plain text, it is more akin to creating subscripts and superscripts by rotating the platen knob up or down by a half-interval, which Unicode considers to be within the domain of markup rather than plain text.

Regards,
Leo

From c933103 at gmail.com Wed Aug 3 17:57:49 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Thu, 4 Aug 2016 06:57:49 +0800
Subject: New olympic sport emoji
Message-ID:

In https://twitter.com/Tokyo2020/status/760930003760492544 , the Tokyo Organising Committee of the Olympic and Paralympic Games suggests that Twitter should add five new emoji, one for each of the new sports that the IOC has just approved for the 2020 Olympic Games in four years' time ( https://www.olympic.org/news/ioc-approves-five-new-sports-for-olympic-games-tokyo-2020 ). Has any proposal been submitted to Unicode yet to add symbols for those sports?

From mark at macchiato.com Wed Aug 3 18:11:14 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 3 Aug 2016 16:11:14 -0700
Subject: New olympic sport emoji
In-Reply-To:
References:
Message-ID:

On Wed, Aug 3, 2016 at 3:57 PM, gfb hjjhjh wrote:
> https://twitter.com/Tokyo2020/status/760930003760492544

No proposal has been received for these 5 items.

FYI: any proposal for emoji for inclusion in 2017 must be received by Oct 1 and must follow the guidelines in http://www.unicode.org/emoji/selection.html

Mark

From jameskasskrv at gmail.com Wed Aug 3 20:40:20 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 3 Aug 2016 17:40:20 -0800
Subject: combining marks for currency characters? general combining character?
In-Reply-To:
References:
Message-ID:

Unicode encodes what is or what will be rather than what might/should/could be.

The ZWJ character is a way to indicate a request for a more joined form of the two characters surrounding it, at the encoding level. As such, it's already in place in the standard. The ability to reasonably display arbitrary combinations depends upon computer software, but such combinations can already be entered, stored, and exchanged as data.

Best regards,

James Kass

From gwalla at gmail.com Thu Aug 4 01:30:43 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 3 Aug 2016 23:30:43 -0700
Subject: New olympic sport emoji
In-Reply-To:
References:
Message-ID:

Judging by the attached gif, it looks like they actually mean hashflags, not Unicode emoji.
From philip_chastney at yahoo.com Thu Aug 4 03:27:25 2016
From: philip_chastney at yahoo.com (philip chastney)
Date: Thu, 4 Aug 2016 08:27:25 +0000 (UTC)
Subject: combining marks for currency characters? general combining character?
References: <1054061351.5517451.1470299245568.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1054061351.5517451.1470299245568.JavaMail.yahoo@mail.yahoo.com>

FontLab provides facilities for combining two outlines which correspond to the set operations of union, intersection and set difference

they take no discernible time to execute, and could therefore be made available at print-time, via the rendering engine

this suggests that the specification of which outlines to combine should be done within HTML (or similar)

this approach
(i) would require no additional characters within Unicode,
(ii) would allow greater generality (the symbols you mention are often used in mathematics to denote negation, while other symbols are combined in other contexts),
(iii) would need the combined outline to be generated before rasterization, but
(iv) would involve maths that poses no problem to the clever people who wrote the routines to rasterize outlines in the first place (though hinting would obviously no longer be possible)

all the best . . . /phil

From verdy_p at wanadoo.fr Thu Aug 4 09:33:28 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 4 Aug 2016 16:33:28 +0200
Subject: combining marks for currency characters? general combining character?
In-Reply-To:
References:
Message-ID:

Maybe, but using such sequences will not work in many cases:
- the display will almost always be wrong, due to lack of font support for some unspecified combinations or because the usage is too recent
- parsers will not recognize the sequence as a currency symbol but as a random "word"
- the presence of ZWJ could violate expected data formats (currency amounts largely need to be parsed and processed automatically; they are not just standard text)
- these symbols do not belong to any script (even if they are most often derived from actual letters of a local script)
- users will just prefer the 3-letter ISO currency code, the name of the currency, or known abbreviations, using more conventional notations for abbreviations that you can detect in text; input with sequences is just a horror

Anyway, these symbols are not created very often. There are not a lot of currencies in the world. If a country decides to change its currency or assign it a symbol, this will be announced well in advance (before it becomes legal tender), and the Unicode standard can track this in its yearly updates. Once it is announced, its usage will explode and users will want a simple symbol that can be used in lots of contexts.
So these sequences will typically see only temporary use, early in adoption, during the interim period when fonts have not yet been updated and made available in OSes, and in contexts where using images or rich text formats that allow web fonts or embedded fonts will not work. But they will not be used in short messaging systems (chat, SMS, tweets...), where abbreviations and ISO currency codes will largely be preferred.

From tiemevanveen at hotmail.com Thu Aug 4 02:08:53 2016
From: tiemevanveen at hotmail.com (Tieme van Veen)
Date: Thu, 4 Aug 2016 09:08:53 +0200
Subject: New olympic sport emoji
In-Reply-To:
References:
Message-ID:

Nice!

I think you're right, they mean the Rio-style emoji that Twitter appends after Olympic hashtags.

Still, it would be cool if those 5 new sports could be expressed in emoji, right? People will need them a lot in 2020!

I'm working on a proposal for a 'Climbing' icon. That's one of the 5. I chose Climbing instead of SportClimbing to make the icon more generic and useful for all kinds of climbers instead of just 'SportClimbing'.

The proposal will be ready by the end of the month; the draft is here: https://docs.google.com/document/d/1t8-Lva7Rb9gpautHMn-SuIfwN0TD6i3RrkMQorCRY6g/edit#

Surfing is already in, and so are a baseball and martial arts. That leaves Skateboarding.
Tieme

From verdy_p at wanadoo.fr Thu Aug 4 10:06:59 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 4 Aug 2016 17:06:59 +0200
Subject: New olympic sport emoji
In-Reply-To:
References:
Message-ID:

For softball I would expect a better icon such as https://pixabay.com/static/uploads/photo/2014/04/02/14/13/softball-306540_960_720.png

If you use only a ball, that ball should be yellow, not white, but then it could be confused with a tennis ball.

From gwalla at gmail.com Thu Aug 4 11:19:49 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Thu, 4 Aug 2016 09:19:49 -0700
Subject: New olympic sport emoji
In-Reply-To:
References:
Message-ID:

Personally, I think Unicode should just encode a set of sports pictograms of the Olympic type (stylized figures engaged in activity, rather than pieces of equipment) and be done with it, but the Consortium clearly disagrees.

From asmusf at ix.netcom.com Thu Aug 4 12:44:29 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Thu, 4 Aug 2016 10:44:29 -0700
Subject: combining marks for currency characters? general combining character?
In-Reply-To:
References:
Message-ID:

On 8/3/2016 12:26 PM, Kim Slawson wrote:
> It's nice to see a good selection of currency symbols defined in
> unicode, but I wonder if it might be useful to add a few combining
> marks for the purpose of constructing currency symbols.
>
> For example, many currency symbols use single or double horizontal
> lines, vertical lines or solidi ( |, -, /, ||, =, // ). Having these
> available as combining marks would simplify the creation of new
> currency symbols, as many are simply overstruck letters.

Unicode's policy is to disregard combining marks for overlays (as opposed to other categories of combining marks) and code the relevant combined glyph anyway. That goes for letters that are members of alphabets, and it is done for a number of reasons that all apply equally well to currency symbols.

So, the short answer is that even with many overlay marks already defined, these would be disregarded, as would any additional ones. They are generically useful in some cases, such as to indicate negation for arbitrary mathematical symbols and the like, but not to compose letterlike glyphs.

A./

From c933103 at gmail.com Thu Aug 4 13:32:14 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Fri, 5 Aug 2016 02:32:14 +0800
Subject: Implementation of ideographic description characters
Message-ID:

Hello,

I have read that an implementation of Unicode may render a sequence of ideographic description characters as the kanji it describes. Is there any known or existing system, font, or implementation that does exactly this?

From leoboiko at gmail.com Thu Aug 4 13:49:35 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Thu, 4 Aug 2016 15:49:35 -0300
Subject: Implementation of ideographic description characters
In-Reply-To:
References:
Message-ID:

Hi, the IDS provide too little information for rendering kanji properly. Take a look at https://en.m.wikipedia.org/wiki/Chinese_character_description_languages .
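
As a rough illustration of why an IDS is a description rather than an encoding of the ideograph, the Python sketch below takes one example sequence, ⿰氵每 (which describes 海, U+6D77), and shows that it is just three ordinary code points that no normalization form collapses into the single character. The example sequence and the names IDS, TARGET and describe are chosen here for illustration and do not come from the thread.

```python
import unicodedata

# Illustrative example only: an IDS describing 海 (U+6D77) as
# "left-to-right: 氵 then 每". These names are not from the thread.
IDS = "\u2FF0\u6C35\u6BCF"   # ⿰ 氵 每
TARGET = "\u6D77"            # 海


def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for ch in text]


for code_point, name in describe(IDS):
    print(code_point, name)

# No normalization form maps the description onto the ideograph itself,
# so any rendering of the IDS as 海 would be up to fonts and layout engines.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, unicodedata.normalize(form, IDS) == TARGET)
```

The sequence says only which components appear and how they are arranged, not their proportions or stroke adjustments, which is consistent with the point that an IDS carries too little information for proper rendering.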
From lists+unicode at seantek.com Thu Aug 4 14:37:14 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Thu, 4 Aug 2016 12:37:14 -0700
Subject: Whitespace characters in Unicode
Message-ID:

Hi Unicode Folks:

I am trying to come up with a sensible set of characters that are considered whitespace or newlines in Unicode, and to understand the relative stability policy with respect to them. (This is for a formal syntax where the definition of "whitespace" matters, e.g., to separate identifiers, and I want to be as conservative as possible.) Please let me know if the stuff below is correct, or needs work.

The following characters / sequences are considered line breaking characters, per UAX #14 and Section 5.8 of the Unicode Standard:

CRLF CR LF FF VT NEL LS PS

So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination U+000D U+000A (treated as one line break). These characters / sequences are called "newlines".

There will not be any additional code points that are assigned to be line breaks. (Correct?)

CRLF, CR, LF, and NEL are also considered "newline functions" or NLF. These are distinguished from the other codes (above) that also mean line breaks, mainly because of their historical and widespread use.

There are several formatting characters that affect word wrapping and line breaking, as discussed in those documents...but they are not line breaking characters.

****

The following characters are whitespace: characters (code points) with the property WSpace=Y (or White_Space). This is:

newlines
U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000

Assigned characters that are not listed above can never be whitespace (according to Unicode). However, the set is not closed, so unassigned code points *could* be assigned to whitespace. It is (unlikely? very unlikely? Pretty much never going to happen?) that additional code points will be assigned to whitespace.

****

There are some other characters that Unicode does not consider whitespace, but that deserve discussion:

U+180E MONGOLIAN VOWEL SEPARATOR
U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+200E LEFT-TO-RIGHT MARK*
U+200F RIGHT-TO-LEFT MARK*
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NON-BREAKING SPACE

*These appear in Pattern_White_Space, but Pattern_White_Space excludes the U+2000-200A characters, which are obviously spaces. This is confusing and I would appreciate clarification /why/ Pattern_White_Space is significantly disjoint from White_Space.

********

The borderline characters above are not considered WSpace=Y, but sometimes might have space-like properties. ZWSP and ZWNBSP are obviously "space" characters, but they never generate whitespace. I suppose that conversely LRM and RLM are obviously "not space" characters, but they could generate whitespace under certain circumstances. Ditto for other formatting characters in general (for which the class is much larger).

Therefore I guess a Unicode definition of "whitespace" (or "space characters") is: an assigned code point that *always* (is supposed to) generates white space (empty space between graphemes).

********

Are there other standards that Unicode people recommend, that have addressed whether certain borderline characters are considered whitespace vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax component)?

Regards,

Sean
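
A minimal Python sketch of the newline list above, assuming exactly the eight line-breaking characters and sequences named in the message (CRLF, CR, LF, FF, VT, NEL, LS, PS). The names NEWLINE_RE and split_lines are hypothetical, chosen for illustration. CRLF comes first in the alternation so that U+000D U+000A counts as one line break:

```python
import re

# Line breaks as listed in the message: CRLF (one break), then
# U+000A-U+000D (LF, VT, FF, CR), U+0085 (NEL), U+2028 (LS), U+2029 (PS).
# NEWLINE_RE and split_lines are illustrative names, not from any standard.
NEWLINE_RE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text):
    """Split text at each newline, treating CR LF as a single break."""
    return NEWLINE_RE.split(text)


sample = "a\r\nb\rc\nd\x85e\u2028f\u2029g\x0bh\x0ci"
print(split_lines(sample))
```

Python's built-in str.splitlines() also breaks on U+001C through U+001E, which are not in the list above, so an explicit pattern like this one keeps the set under the syntax author's control.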
URL: From mark at macchiato.com Thu Aug 4 14:51:06 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 4 Aug 2016 12:51:06 -0700 Subject: Whitespace characters in Unicode In-Reply-To: References: Message-ID: There are 25 Whitespace characters. Here they are grouped by LineBreak property: http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%3Awhitespace%3A&g=Lb&i= Don't have time to respond more now. Mark On Thu, Aug 4, 2016 at 12:37 PM, Sean Leonard wrote: > Hi Unicode Folks: > > I am trying to come up with a sensible sets of characters that are > considered whitespace or newlines in Unicode, and to understand the > relative stability policy with respect to them. (This is for a formal > syntax where the definition of "whitespace" matters, e.g., to separate > identifiers, and I want to be as conservative as possible.) Please let me > know if the stuff below is correct, or needs work. > > The following characters / sequences are considered line breaking > characters, per UAX #14 and Section 5.8 of UNICODE: > > CRLF CR LF FF VT NEL LS PS > > So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination > U+000D U+000A (treated as one line break). These characters / sequences are > called "newlines". > > There will not be any additional code points that are assigned to be line > breaks. (Correct?) > > CRLF, CR, LF, and NEL are also considered "newline functions" or NLF. > These are distinguished from other codes (above) that also mean line > breaks, mainly because of historical and widespread use of them. > > There are several formatting characters that affect word wrapping and line > breaking, as discussed in those documents...but they are not line breaking > characters. > > **** > > The following characters are whitespaces: characters (code points) with > the property WSpace=Y (or White_Space). 
This is: > > newlines > U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000 > > Assigned characters that are not listed above, can never be whitespace > (according to Unicode). However, the set is not closed, so unassigned code > points *could* be assigned to whitespace. It is (unlikely? very unlikely? > Pretty much never going to happen?) that additional code points will be > assigned to whitespace. > > **** > > There are some other characters that Unicode does not consider whitespace, > but deserve discussion: > U+180E MONGOLIAN VOWEL SEPARATOR: 2014/12/01/when-is-an-identifier-not-an-identifier- > attack-of-the-mongolian-vowel-separator/> > > U+200B ZERO WIDTH SPACE > U+200C ZERO WIDTH NON-JOINER > U+200D ZERO WIDTH JOINER > U+200E LEFT-TO-RIGHT MARK* > U+200F RIGHT-TO-LEFT MARK* > U+2060 WORD JOINER > U+FEFF ZERO WIDTH NON-BREAKING SPACE > > *These appear in Pattern_White_Space, but Pattern_White_Space excludes > U+2000-200A characters, which are obviously spaces. This is confusing and I > would appreciate clarification *why* Pattern_White_Space is significantly > disjoint from White_Space. > > ******** > The borderline characters above are not considered WSpace=Y, but sometimes > might have space-like properties. ZWP and ZWNBP are obviously "space" > characters, but they never generate whitespace. I suppose that conversely > LTRM and RTLM are obviously "not space" characters, but they could generate > whitespace under certain circumstances. Ditto for other formatting > characters in general (for which the class is much larger). > > Therefore I guess a Unicode definition of "whitespace" (or "space > characters") is: an assigned code point that *always* (is supposed to) > generates white space (empty space between graphemes). > > ******** > > Are there other standards that Unicode people recommend, that have > addressed whether certain borderline characters are considered whitespace > vs. 
non-whitespace (e.g., possibly acceptable as an identifier or syntax > component)? > > Regards, > > Sean > -------------- next part -------------- An HTML attachment was scrubbed... URL: From leoboiko at namakajiri.net Thu Aug 4 15:17:04 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Thu, 4 Aug 2016 17:17:04 -0300 Subject: Whitespace characters in Unicode In-Reply-To: References: Message-ID: What Mark Davis said; also, depending on what you need, consider taking a look at the definitions used by Unicode regexpes, at http://unicode.org/reports/tr18/ . 2016-08-04 16:37 GMT-03:00 Sean Leonard : > Hi Unicode Folks: > > I am trying to come up with a sensible sets of characters that are > considered whitespace or newlines in Unicode, and to understand the > relative stability policy with respect to them. (This is for a formal > syntax where the definition of "whitespace" matters, e.g., to separate > identifiers, and I want to be as conservative as possible.) Please let me > know if the stuff below is correct, or needs work. > > The following characters / sequences are considered line breaking > characters, per UAX #14 and Section 5.8 of UNICODE: > > CRLF CR LF FF VT NEL LS PS > > So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination > U+000D U+000A (treated as one line break). These characters / sequences are > called "newlines". > > There will not be any additional code points that are assigned to be line > breaks. (Correct?) > > CRLF, CR, LF, and NEL are also considered "newline functions" or NLF. > These are distinguished from other codes (above) that also mean line > breaks, mainly because of historical and widespread use of them. > > There are several formatting characters that affect word wrapping and line > breaking, as discussed in those documents...but they are not line breaking > characters. > > **** > > The following characters are whitespaces: characters (code points) with > the property WSpace=Y (or White_Space). 
This is: > > newlines > U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000 > > Assigned characters that are not listed above, can never be whitespace > (according to Unicode). However, the set is not closed, so unassigned code > points *could* be assigned to whitespace. It is (unlikely? very unlikely? > Pretty much never going to happen?) that additional code points will be > assigned to whitespace. > > **** > > There are some other characters that Unicode does not consider whitespace, > but deserve discussion: > U+180E MONGOLIAN VOWEL SEPARATOR: 2014/12/01/when-is-an-identifier-not-an-identifier- > attack-of-the-mongolian-vowel-separator/> > > U+200B ZERO WIDTH SPACE > U+200C ZERO WIDTH NON-JOINER > U+200D ZERO WIDTH JOINER > U+200E LEFT-TO-RIGHT MARK* > U+200F RIGHT-TO-LEFT MARK* > U+2060 WORD JOINER > U+FEFF ZERO WIDTH NON-BREAKING SPACE > > *These appear in Pattern_White_Space, but Pattern_White_Space excludes > U+2000-200A characters, which are obviously spaces. This is confusing and I > would appreciate clarification *why* Pattern_White_Space is significantly > disjoint from White_Space. > > ******** > The borderline characters above are not considered WSpace=Y, but sometimes > might have space-like properties. ZWP and ZWNBP are obviously "space" > characters, but they never generate whitespace. I suppose that conversely > LTRM and RTLM are obviously "not space" characters, but they could generate > whitespace under certain circumstances. Ditto for other formatting > characters in general (for which the class is much larger). > > Therefore I guess a Unicode definition of "whitespace" (or "space > characters") is: an assigned code point that *always* (is supposed to) > generates white space (empty space between graphemes). > > ******** > > Are there other standards that Unicode people recommend, that have > addressed whether certain borderline characters are considered whitespace > vs. 
non-whitespace (e.g., possibly acceptable as an identifier or syntax
> component)?
>
> Regards,
>
> Sean

From lists+unicode at seantek.com Thu Aug 4 15:44:46 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Thu, 4 Aug 2016 13:44:46 -0700
Subject: Whitespace characters in Unicode
In-Reply-To:
References:
Message-ID: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>

I read through TR18...it mainly says that == \s == \p{Whitespace} == property White_Space is true. Does it say anything else or more significant than that, that I'm missing?

Sean

On 8/4/2016 1:17 PM, Leonardo Boiko wrote:
> What Mark Davis said; also, depending on what you need, consider
> taking a look at the definitions used by Unicode regexpes, at
> http://unicode.org/reports/tr18/ .

From leoboiko at namakajiri.net Thu Aug 4 16:28:55 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Thu, 4 Aug 2016 18:28:55 -0300
Subject: Whitespace characters in Unicode
In-Reply-To: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

I'm sorry; I thought that, when you wanted to separate identifiers, it might be interesting to follow existing regexps definitions; this way your syntax would play along with already-existing tools (e.g. you'd be making it easy for someone to pipe your language into grep -P "\p{Whitespace}").

But I was talking out of my depth; I've never worked with defining Unicode identifiers, so I'm not really qualified to answer. I'm sure Davis and the others can give better answers to your questions. Meanwhile, I see that UAX #31 goes further into Unicode identifiers. It says that Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended for use in regexp-like "patterns" which mix literal characters, whitespace, and syntax (special characters), where the latter two would e.g. require quoting. For example, Perl has a "/x" flag which makes unquoted Pattern_White_Space characters be ignored in regexps (so that you can make them less illegible).
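To make the relationship between the two properties concrete, the sketch below lists both sets and prints their differences. It is a minimal Python illustration, assuming the code points as transcribed by hand from PropList.txt for Unicode 9.0 (worth re-checking against the current data file); the names WHITE_SPACE, PATTERN_WHITE_SPACE, expand, and fmt are illustrative only and do not come from any library.

```python
# Code points transcribed by hand from PropList.txt (Unicode 9.0).
# This is an assumption; verify against the current data file.

def expand(*ranges):
    """Build a set of code points from inclusive (first, last) pairs."""
    return {cp for first, last in ranges for cp in range(first, last + 1)}

WHITE_SPACE = expand(
    (0x0009, 0x000D),  # HT, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x00A0, 0x00A0),  # NO-BREAK SPACE
    (0x1680, 0x1680),  # OGHAM SPACE MARK
    (0x2000, 0x200A),  # EN QUAD .. HAIR SPACE
    (0x2028, 0x2029),  # LS, PS
    (0x202F, 0x202F),  # NARROW NO-BREAK SPACE
    (0x205F, 0x205F),  # MEDIUM MATHEMATICAL SPACE
    (0x3000, 0x3000),  # IDEOGRAPHIC SPACE
)

PATTERN_WHITE_SPACE = expand(
    (0x0009, 0x000D),  # HT, LF, VT, FF, CR
    (0x0020, 0x0020),  # SPACE
    (0x0085, 0x0085),  # NEL
    (0x200E, 0x200F),  # LRM, RLM
    (0x2028, 0x2029),  # LS, PS
)

def fmt(code_points):
    """Format a set of code points as sorted U+XXXX labels."""
    return " ".join(f"U+{cp:04X}" for cp in sorted(code_points))

only_pattern = PATTERN_WHITE_SPACE - WHITE_SPACE
only_white = WHITE_SPACE - PATTERN_WHITE_SPACE

print("White_Space size:", len(WHITE_SPACE))
print("Pattern_White_Space size:", len(PATTERN_WHITE_SPACE))
print("In Pattern_White_Space only:", fmt(only_pattern))
print("In White_Space only:", fmt(only_white))
```

Under these assumptions the two sets share the ASCII controls, SPACE, NEL, LS and PS; only the two directional marks are unique to Pattern_White_Space, and ZERO WIDTH SPACE (U+200B) is in neither set.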
However, UAX #31 also gives a Default Identifier Syntax, which bounds identifiers not by Whitespace but by their start characters, identified by ID_Start, defined like this:

> ID_Start characters are derived from the Unicode General_Category of uppercase letters, lowercase letters, titlecase letters, modifier letters, other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax and Pattern_White_Space code points.

So it makes reference only to Pattern_White_Space and not Whitespace. On the other hand, I guess the listing above will exclude Whitespace characters, since they don't count as any of letters, numbers, or Other_ID_Start?

None of that is guaranteed to be stable, though. UAX #31 includes a separate definition for "Immutable identifiers", which are stable, and suggests various compromises between them.

2016-08-04 17:44 GMT-03:00 Sean Leonard :
> I read through TR18...it mainly says that == \s == \p{Whitespace}
> == property White_Space is true. Does it say anything else or more
> significant than that, that I'm missing?
>
> Sean

From andrea.giammarchi at gmail.com Thu Aug 4 17:19:31 2016
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Thu, 4 Aug 2016 23:19:31 +0100
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

I'm not a Unicode expert, but I couldn't stop thinking about the following comic after reading "I am trying to come up with a sensible sets of characters that are considered whitespace" https://xkcd.com/927/

Apologies for bringing pretty much nothing to this discussion but I'm pretty sure there's much more to discuss in this ML than another whitespace set on top of 25 characters already.

Thanks for your patience and your understanding.

Have a great weekend everyone!
Best Regards

From andrea.giammarchi at gmail.com Thu Aug 4 17:36:32 2016
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Thu, 4 Aug 2016 23:36:32 +0100
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

Actually, my apologies for my instinctive and quite rude answer; I misunderstood the initial email, thinking Sean was proposing extra whitespace for clarifications.

I won't react so quickly in the future. Go on with your question, Sean, and I hope you'll get it right.

Best Regards

On Thu, Aug 4, 2016 at 11:19 PM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
> I'm not a Unicode expert, but I couldn't stop thinking about the following
> comic after reading "I am trying to come up with a sensible sets of
> characters that are considered whitespace" https://xkcd.com/927/
>
> Apologies for bringing pretty much nothing to this discussion but I'm
> pretty sure there's much more to discuss in this ML than another whitespace
> set on top of 25 characters already.
>
> Thanks for your patience and your understanding.
>
> Have a great weekend everyone!
> Best Regards
>
> On Thu, Aug 4, 2016 at 10:28 PM, Leonardo Boiko wrote:
>
>> I'm sorry; I thought that, when you wanted to separate identifiers, it
>> might be interesting to follow existing regexps definitions; this way your
>> syntax would play along with already-existing tools (e.g. you'd be making
>> it easy for someone to pipe your language into grep -P "\p{Whitespace}").
From lists+unicode at seantek.com Fri Aug 5 10:52:56 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Fri, 5 Aug 2016 08:52:56 -0700
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID: <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>

Here are specific questions (perhaps Mark Davis, but anyone really with experience, can respond):

As Mark said, there are 25 whitespace characters. (I forgot to include HT, so that makes 25 from my original post.)

What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?

What are "Unicode-y" ways to compute word boundaries?

Related to prior question--I suppose ZWSP is not "whitespace", but like whitespace, it separates words. I suppose that since it is not printable, it is "confusing", and therefore should be avoided in contexts where the printed representation of Unicode code points matters.

Why is Pattern_White_Space significantly disjoint from White_Space, namely, why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS) yet omit the spaces U+1680 and in the U+2000 range?

Any implementation experience from other standards authors/implementers who have run into problems with shifty whitespace definitions?

Regards,

Sean

On 8/4/2016 2:28 PM, Leonardo Boiko wrote:
> I'm sorry; I thought that, when you wanted to separate identifiers, it
> might be interesting to follow existing regexps definitions; this way
> your syntax would play along with already-existing tools (e.g. you'd
> be making it easy for someone to pipe your language into grep -P
> "\p{Whitespace}").
>
> But I was talking out of my depth; I've never worked with defining
> Unicode identifiers, so I'm not really qualified to answer.
From markus.icu at gmail.com Fri Aug 5 12:07:17 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Aug 2016 10:07:17 -0700
Subject: Whitespace characters in Unicode
In-Reply-To: <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>
Message-ID:

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:

> What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and
> ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?

I think "white space" basically wants to have an advance width (occupy space) but no ink (all white, no black) :-)

ZWSP and ZWNBSP affect word and line breaking but have no advance width.

Note that character names can be misleading, plain wrong, or even just misspelled, but they cannot be changed. Best to read the documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf

In particular, don't build sets of Unicode characters just based on character name patterns. Use character properties as much as possible.

> What are "Unicode-y" ways to compute word boundaries?

http://www.unicode.org/reports/tr29/#Word_Boundaries

> Related to prior question--I suppose ZWSP is not "whitespace", but like
> whitespace, it separates words. I suppose that since it is not printable,
> it is "confusing", and therefore should be avoided in contexts where the
> printed representation of Unicode code points matters.

Depends on what you do. Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and line breaking in a browser or text field/editor. They are not allowed in identifiers, and removed from domain names (UTS #46).

> Why is Pattern_White_Space significantly disjoint from White_Space, namely,
> why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS)
> yet omit the spaces U+1680 and in the U+2000 range?

We wanted a simple, immutable definition for rule and pattern strings that programmers write and maintain. We included LRM and RLM so that they can be used (and will be ignored) in rules, for example collation rule strings, to keep them moderately readable when they contain RTL characters. Typographic spaces are unnecessary in this context, and could be confusing.

In hindsight, LS and PS are probably mistakes. When we came up with Pattern_White_Space, we still liked the idea of unambiguous end-of-line controls, but in practice it looks like no one really uses them. Anyone who cares uses markup or rich-text formats. (Markup was not common when Unicode was "born".)

> Any implementation experience from other standards authors/implementers who
> have run into problems with shifty whitespace definitions?

Use properties, not character name patterns. If you have strong reasons not to use a property as-is, then still use it, just with inclusion & exclusion overrides.

Best regards,
markus

From doug at ewellic.org Sat Aug 6 13:30:31 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Aug 2016 12:30:31 -0600
Subject: LS and RS (was: Re: Whitespace characters in Unicode)
In-Reply-To:
References:
Message-ID:

Markus Scherer wrote:

> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)

I've often felt that the rise of UTF-8 spelled the end for LS and PS.

Unicode was originally a completely new text format, exactly 16 bits per character. Conversion to ASCII and other byte-based encodings was an explicit process. Dedicated characters for LS and PS were a simplification, removing the notorious confusion over CR versus LF versus CRLF.
UTF-8 brought ASCII backward compatibility to Unicode, removing early objections that "Unicode will double my text size" but requiring continued use of ASCII controls to maintain that compatibility. Implementers saw the existing CR/LF/CRLF muddle as a problem already solved, and LS and PS as new complications with no historical justification. Additionally, in UTF-8, either LS or PS actually takes more bytes than CR plus LF, so the "increased text size" argument also discouraged use of the new controls. -- Doug Ewell | Thornton, CO, US | ewellic.org From lists+unicode at seantek.com Sun Aug 7 18:08:58 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sun, 7 Aug 2016 16:08:58 -0700 Subject: Whitespace characters in Unicode In-Reply-To: References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com> Message-ID: <44112777-6d5b-de46-a504-b435049248a2@seantek.com> On 8/5/2016 10:07 AM, Markus Scherer wrote: > On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard > > wrote: > > What makes a character a "whitespace" in Unicode, e.g., why are > ZWSP and ZWNBSP not "whitespace" even though they clearly say > "SPACE" in them? > > > I think "white space" basically wants to have an advance width (occupy > space) but no ink (all white, no black) :-) Yes, that is the thought that I had as well: whitespace characters always generate blank space between graphemes, whether horizontal or vertical. > > ZWSP and ZWNBSP affect word and line breaking but have no advance width. I suppose that these are "SPACE" characters, but not "WHITE space" characters, since there is no white in them. :) > > Note that character names can be misleading, plain wrong, or even just > misspelled, but they cannot be changed. Best to read the > documentation. 
The charts are a good start: > http://www.unicode.org/charts/PDF/U2000.pdf > http://www.unicode.org/charts/PDF/UFE70.pdf > > In particular, don't build sets of Unicode characters just based on > character name patterns. Use character properties as much as possible. > > What are "Unicode-y" ways to compute word boundaries? > > > http://www.unicode.org/reports/tr29/#Word_Boundaries > > Related to prior question--I suppose ZWSP is not "whitespace", but > like whitespace, it separates words. I suppose that since it is > not printable, it is "confusing", and therefore should be avoided > in contexts where the printed representation of Unicode code > points matters. > > > Depends on what you do. > > Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping > and line breaking in a browser or text field/editor. > > They are not allowed in identifiers, and removed from domain names > (UTS #46). > > Why is Pattern_White_Space significantly disjoint from > White_Space, namely, why does Pattern_White_Space include LTRM and > RTLM (and notably LS and PS) yet omit the spaces U+1680 and in the > U+2000 range? > > > We wanted a simple, immutable definition for rule and pattern strings > that programmers write and maintain. We included LRM and RLM so that > they can be used (and will be ignored) in rules, for example collation > rule strings, to keep them moderately readable when they contain RTL > characters. Typographic spaces are unnecessary in this context, and > could be confusing. > > In hindsight, LS and PS are probably mistakes. When we came up > with Pattern_White_Space, we still liked the idea of unambiguous > end-of-line controls, but in practice it looks like no one really uses > them. Anyone who cares uses markup or rich-text formats. (Markup was > not common when Unicode was "born".) I like the premise of LS and PS: one (well, two) unambiguous characters to rule them all. But the execution was lacking, to put it mildly. 
And there aren't two keys on a common keyboard to distinguish between line and paragraph separation. On 8/6/2016 11:30 AM, Doug Ewell wrote: > Additionally, in UTF-8, either LS or PS actually takes more bytes than > CR plus LF, so the "increased text size" argument also discouraged use > of the new controls. That is true, it takes 3 bytes. However, the original UTF-8 proposal encoded U+0080 - U+207F in two octets: https://en.wikipedia.org/wiki/UTF-8 : |10xxxxxx| |1xxxxxxx| So, the space block /just barely makes it/. Was this intentional during the original design of UTF-8, or just a coincidence? I think it was more than a coincidence. It is regrettable that the space block was too high to work in the final version of UTF-8...maybe it should have gone below U+07FF. (More motivation for my whitespace question in following message...) Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists+unicode at seantek.com Sun Aug 7 18:46:27 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sun, 7 Aug 2016 16:46:27 -0700 Subject: Whitespace characters in Unicode In-Reply-To: References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com> Message-ID: <58c83966-a9ba-8c97-dcfb-0fc9dbd5bef3@seantek.com> On 8/5/2016 10:07 AM, Markus Scherer wrote: > On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard > > wrote: > > What makes a character a "whitespace" in Unicode, e.g., why are > ZWSP and ZWNBSP not "whitespace" even though they clearly say > "SPACE" in them? > > > Any implementation experience from other standards > authors/implementers who have run into problems with shifty > whitespace definitions? > > > Use properties, not character name patterns. If you have strong > reasons not to use a property as-is, then still use it, just with > inclusion & exclusion overrides. Short answer: I cannot use character properties, and cannot use exclusion overrides. 
As I have posted publicly, I am proposing some experimental Unicode-friendly extensions to IETF ABNF (currently in https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05 , going to change that around a bit). There is (some) renewed interest in that part of the work since RFCs will permit UTF-8 in certain places, and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198. Being a BNF, ABNF does not have exclusion, only incremental alternatives. Character properties would require a runtime library, which significantly goes against the purpose of (A)BNF. The current proposed core rules have (scalar values = doughnut hole for surrogates) and (scalar values without the ASCII range). While these are technically accurate, they will not be particularly useful for protocol designers as they are over-inclusive. One of the rules I am working on is , which is like except for Unicode. That eliminates the noncharacter code points (which, technically, are characters...that are defined as "not characters") as well as NULL, which is already eliminated by . I was going to avoid extending (which is U+0021-U+007E, i.e., no spaces and no control characters) because it's a bit too complicated. However, there are actual protocols, including a protocol that I am working on, that define parts of the repertoire as "graphic symbols and spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces and no control characters). So the space characters are relevant at a level beneath requiring a full Unicode runtime to get at the character properties. The newline issue is related but separate, and since IETF continues to use CRLF as the standard for interchange, I don't see a reason to touch it further. 
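One way to reconcile "use properties" with a grammar that has neither exclusion nor a runtime library is to expand a property into explicit ranges ahead of time. The following is a minimal Python sketch of that idea, not anything taken from the draft: the range table is the White_Space list from PropList.txt as of recent Unicode versions, and the rule name UNICODE-WSP is a hypothetical placeholder.

```python
# White_Space code point ranges, copied by hand from PropList.txt.
# Assumption: a recent Unicode version (U+180E is no longer included).
WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085),
    (0x00A0, 0x00A0), (0x1680, 0x1680), (0x2000, 0x200A),
    (0x2028, 0x2029), (0x202F, 0x202F), (0x205F, 0x205F),
    (0x3000, 0x3000),
]


def abnf_alternatives(ranges):
    """Render (first, last) code point pairs as ABNF %x terms joined by '/'."""
    terms = []
    for first, last in ranges:
        if first == last:
            terms.append(f"%x{first:02X}")
        else:
            terms.append(f"%x{first:02X}-{last:02X}")
    return " / ".join(terms)


# UNICODE-WSP is a made-up rule name used only for this illustration.
rule = "UNICODE-WSP = " + abnf_alternatives(WHITE_SPACE)
print(rule)
```

The output is a plain list of alternatives, so the grammar itself needs no exclusion operator; the cost is that the table has to be regenerated if the property ever changes.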
Best regards,

Sean

From duerst at it.aoyama.ac.jp Mon Aug 8 02:07:59 2016
From: duerst at it.aoyama.ac.jp (Martin J. Dürst)
Date: Mon, 8 Aug 2016 16:07:59 +0900
Subject: Whitespace characters in Unicode
In-Reply-To: <44112777-6d5b-de46-a504-b435049248a2@seantek.com>
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
 <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>
 <44112777-6d5b-de46-a504-b435049248a2@seantek.com>
Message-ID: <59cd0feb-ac86-c520-c25b-19c2aa7f90fc@it.aoyama.ac.jp>

On 2016/08/08 08:08, Sean Leonard wrote:
> On 8/6/2016 11:30 AM, Doug Ewell wrote:
>> Additionally, in UTF-8, either LS or PS actually takes more bytes than
>> CR plus LF, so the "increased text size" argument also discouraged use
>> of the new controls.
>
> That is true, it takes 3 bytes. However, the original UTF-8 proposal

The term "original UTF-8 proposal" is quite misleading, because that
proposal was never labeled as UTF-8. "FSS-UTF draft version" would be
much better.

> encoded U+0080 - U+207F in two octets:
> https://en.wikipedia.org/wiki/UTF-8 :
> |10xxxxxx| |1xxxxxxx|
>
>
> So, the space block /just barely makes it/. Was this intentional during
> the original design of UTF-8, or just a coincidence? I think it was more
> than a coincidence.

Just a coincidence, I'd say. When designing such schemes, trying to be
compact is obviously one of the goals. But "how can I design it so that
these two characters still make it as two bytes" isn't.

> It is regrettable that the space block was too high
> to work in the final version of UTF-8...maybe it should have gone below
> U+07FF.

There aren't too many line breaks (and usually even fewer paragraph
breaks) in a text, so the overall effect of the encoding length for LS
or PS was really not that much of an issue.
The main reason they didn't spread was that everybody was already
dealing with several variants of line breaks and didn't want more of
these, even at the prospect of (potentially, eventually, in the very,
very long run maybe) having only a single one.

Regards, Martin.

From doug at ewellic.org Mon Aug 8 11:30:04 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 08 Aug 2016 09:30:04 -0700
Subject: Whitespace characters in Unicode
Message-ID: <20160808093004.665a7a7059d7ee80bb4d670165c8327d.da7b3527fd.wbe@email03.godaddy.com>

Martin J. Dürst wrote:

>> encoded U+0080 - U+207F in two octets:
>> https://en.wikipedia.org/wiki/UTF-8 :
>> |10xxxxxx| |1xxxxxxx|
>>
>> So, the space block /just barely makes it/. Was this intentional
>> during the original design of UTF-8, or just a coincidence? I think
>> it was more than a coincidence.
>
> Just a coincidence, I'd say. When designing such schemes, trying to be
> compact is obviously one of the goals. But "how can I design it so
> that these two characters still make it as two bytes" isn't.

For actual Unicode compression schemes (SCSU and BOCU-1), certain design
elements do exist to allow certain character blocks "in widespread use"
to fit in minimal space.

For byte-based UTFs, that wasn't a goal at all. ASCII in one byte was a
given -- for compatibility with existing software, not favoritism toward
English as was sometimes claimed -- but otherwise, algorithmic
simplicity and reasonable overall efficiency were more important than
optimizing for certain blocks.

Replacing one encoding with ranges like "U+2080 through U+8207F" with
another which architecturally allows non-shortest sequences, and then
disallowing them, is simply a matter of different engineering solutions
to the same problem. Each adds simplicity in one place and complexity in
another. UTF-8 happened to tick more additional boxes (e.g.
self-synchronization) than the others.
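The byte counts and the self-synchronization property discussed above can be checked with a short Python sketch. The function names are made up for this illustration, and the FSS-UTF arithmetic assumes the offset layout implied by the ranges quoted in the thread (13 payload bits starting at U+0080, 19 starting at U+2080).

```python
LS, PS = "\u2028", "\u2029"

# LS and PS each need three bytes in UTF-8; CR LF needs two.
for name, text in [("LS", LS), ("PS", PS), ("CRLF", "\r\n")]:
    encoded = text.encode("utf-8")
    print(name, len(encoded), encoded.hex())

# Upper limits of the two-byte forms, from the bit patterns in the thread.
fss_utf_two_byte_max = 0x80 + 2**13 - 1   # 10xxxxxx 1xxxxxxx, offset by 0x80
utf8_two_byte_max = 2**11 - 1             # 110xxxxx 10xxxxxx, no offset
print(hex(fss_utf_two_byte_max), hex(utf8_two_byte_max))


def is_continuation(byte):
    """True for UTF-8 continuation bytes, which all match 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def resync(data, pos):
    """Back up from an arbitrary byte offset to the start of its character."""
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


data = ("a" + LS + "b").encode("utf-8")   # bytes: 61 e2 80 a8 62
print(resync(data, 2))                    # offset 2 falls inside LS
```

Because lead bytes and continuation bytes never share a bit pattern, a reader dropped at any offset can find the nearest character boundary by looking at no more than a few bytes.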
--
Doug Ewell | Thornton, CO, US | ewellic.org

From costello at mitre.org Wed Aug 10 03:45:08 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Wed, 10 Aug 2016 08:45:08 +0000
Subject: less-than or equal to with dot in the less-than part?
Message-ID:

Hi Folks,

Here is the "less-than with dot" symbol: ⋖
Here is the "less-than or equal to" symbol: ≤

I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.

/Roger

From andrewcwest at gmail.com Wed Aug 10 04:08:22 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 10 Aug 2016 10:08:22 +0100
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:
Message-ID:

On 10 August 2016 at 09:45, Costello, Roger L. wrote:
>
> Here is the "less-than with dot" symbol: ⋖
> Here is the "less-than or equal to" symbol: ≤
>
> I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.

http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html

Andrew

From costello at mitre.org Wed Aug 10 06:21:53 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Wed, 10 Aug 2016 11:21:53 +0000
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:
Message-ID:

Andrew West graciously pointed me to this symbol: U+2A7F ⩿

Thank you Andrew!

Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign)

/Roger

-----Original Message-----
From: Andrew West [mailto:andrewcwest at gmail.com]
Sent: Wednesday, August 10, 2016 5:08 AM
To: Costello, Roger L.
Cc: unicode at unicode.org
Subject: Re: less-than or equal to with dot in the less-than part?
On 10 August 2016 at 09:45, Costello, Roger L. wrote: > > Here is the "less-than with dot" symbol: ? Here is the "less-than or > equal to" symbol: ? > > I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273. http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html Andrew From andrewcwest at gmail.com Wed Aug 10 07:06:38 2016 From: andrewcwest at gmail.com (Andrew West) Date: Wed, 10 Aug 2016 13:06:38 +0100 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: References: Message-ID: On 10 August 2016 at 12:21, Costello, Roger L. wrote: > > Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign) No, but there are lots of standardized variants for mathematical glyph variants of this sort (see first section of http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so you could ask the UTC to define two more mathematical standardized variants: 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO WITH DOT INSIDE Then all you would need is to get someone to support the new standardized variants in a math font. Andrew From asmusf at ix.netcom.com Wed Aug 10 11:14:44 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Wed, 10 Aug 2016 09:14:44 -0700 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: References: Message-ID: On 8/10/2016 2:08 AM, Andrew West wrote: > On 10 August 2016 at 09:45, Costello, Roger L. wrote: >> Here is the "less-than with dot" symbol: ? >> Here is the "less-than or equal to" symbol: ? >> >> I need a symbol that is a combination: less-than or equal to with dot in the less-than part. 
>> Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html

The one sentence you need in following that link is:

"No, but there are U+2A7F ⩿ and U+2A80 ⪀ with slanted equals which
might suffice."

The principle seems to be that Unicode separately encodes slanted from
non-slanted less-than-or-equal (and similar symbols), but has not done
so for the ones with dot.

The question would be whether the reason for making the distinction for
the non-dotted code points also holds for the dotted ones. If it does,
this might be an omission; if not, as Andrew said, the existing forms
might suffice.

A./

From asmusf at ix.netcom.com Wed Aug 10 11:16:45 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Wed, 10 Aug 2016 09:16:45 -0700
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:
Message-ID: <4ed179c8-5535-d0cd-3543-0d55d6312825@ix.netcom.com>

On 8/10/2016 5:06 AM, Andrew West wrote:
> On 10 August 2016 at 12:21, Costello, Roger L. wrote:
>> Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign)
> No, but there are lots of standardized variants for mathematical glyph
> variants of this sort (see first section of
> http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so
> you could ask the UTC to define two more mathematical standardized
> variants:
>
> 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
> 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO
> WITH DOT INSIDE
>
> Then all you would need is to get someone to support the new
> standardized variants in a math font.
>

Unicode does not use standardized variants for that particular
distinction in the undotted part of that family of symbols.
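For readers unfamiliar with the line format being quoted, here is a minimal Python sketch that parses entries written in the StandardizedVariants.txt style and builds the corresponding character sequences. The two sample entries are the ones suggested in this thread; they are a proposal only and are not in the real data file. The function name is made up for illustration.

```python
# Two entries in the StandardizedVariants.txt line format. These are the
# variants suggested in this thread; they are NOT in the real data file.
PROPOSED = """\
2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
"""


def parse_variants(text):
    """Yield (sequence string, description, base character name) per entry."""
    for line in text.splitlines():
        data, _, comment = line.partition("#")
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or not fields[0]:
            continue  # blank or comment-only line
        # The first field is a space-separated list of hex code points.
        sequence = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        yield sequence, fields[1], comment.strip()


for sequence, description, name in parse_variants(PROPOSED):
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in sequence)
    print(f"{codepoints}: {name} ({description})")
```

A font or renderer that did not know these sequences would simply show the base character and ignore the variation selector, which is why font support is the remaining step mentioned in the quoted message.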
A./ From philip_chastney at yahoo.com Thu Aug 11 02:33:46 2016 From: philip_chastney at yahoo.com (philip chastney) Date: Thu, 11 Aug 2016 07:33:46 +0000 (UTC) Subject: less-than or equal to with dot in the less-than part? References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> there is another issue with these symbols -- they appear among the mathematical symbols but, in the reference given, they are used as delimiters I know of no other application for these symbols other than as delimiters -- are they used as mathematical operators? and how, in general, would one define the properties for characters which may sometimes be operators, and sometimes be delimiters? /phil -------------------------------------------- On Wed, 10/8/16, Asmus Freytag (c) wrote: Subject: Re: less-than or equal to with dot in the less-than part? To: unicode at unicode.org Date: Wednesday, 10 August, 2016, 4:16 PM On 8/10/2016 5:06 AM, Andrew West wrote: > On 10 August 2016 at 12:21, Costello, Roger L. wrote: >> Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign) > No, but there are lots of standardized variants for mathematical glyph > variants of this sort (see first section of > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so > you could ask the UTC to define two more mathematical standardized > variants: > > 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE > 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO > WITH DOT INSIDE > > Then all you would need is to get someone to support the new > standardized variants in a math font. > Unicode does not use standardized variants for that particular distinctions in the undotted part of that family of symbols. 
A./

From asmusf at ix.netcom.com Thu Aug 11 03:24:30 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Thu, 11 Aug 2016 01:24:30 -0700
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To: <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com>
References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com>
 <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com>
Message-ID: <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com>

On 8/11/2016 12:33 AM, philip chastney wrote:
> there is another issue with these symbols -- they appear among the mathematical symbols but, in the reference given, they are used as delimiters
>
> I know of no other application for these symbols other than as delimiters -- are they used as mathematical operators?
>
> and how, in general, would one define the properties for characters which may sometimes be operators, and sometimes be delimiters?

First and foremost: if the precise form of these (straight equals, but
dotted) corresponds to a delimiter, whereas the other form (slanted
equals) is an operator, then that would be even more reason to not unify
these (whether with or without a variation sequence).

Are the already encoded ones given the property of relational operators?

Nothing prevents anyone from using an integral sign as a funny-looking
fence. I would find it acceptable if the informative properties were
based on majority or customary use (in the hopes that that would allow
some picking of a preferred preference).

A./

> /phil
>
> --------------------------------------------
> On Wed, 10/8/16, Asmus Freytag (c) wrote:
>
> Subject: Re: less-than or equal to with dot in the less-than part?
> To: unicode at unicode.org
> Date: Wednesday, 10 August, 2016, 4:16 PM
>
> On 8/10/2016 5:06 AM,
> Andrew West wrote:
> > On 10 August 2016 at
> 12:21, Costello, Roger L.
> wrote: > >> Do you know if there is > another version of the symbol, but with a straight equals > sign rather than a slanted equals sign? (The book that I > referred to uses a straight equals sign not a slanted equals > sign) > > No, but there are lots of > standardized variants for mathematical glyph > > variants of this sort (see first section > of > > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), > so > > you could ask the UTC to define two > more mathematical standardized > > > variants: > > > > 2A7F > FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO > WITH DOT INSIDE > > 2A80 FE00; with > straight equal; # GREATER-THAN OR SLANTED EQUAL TO > > WITH DOT INSIDE > > > > Then all you would need is to get someone > to support the new > > standardized > variants in a math font. > > > > Unicode does not use > standardized variants for that particular > distinctions in the undotted part of that > family of symbols. > > A./ > > From verdy_p at wanadoo.fr Thu Aug 11 10:20:43 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Aug 2016 17:20:43 +0200 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com> References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com> <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com> Message-ID: the =equal sign= is also used as a delimiter (fancy quotation marks and brackets), this is also the case for < and > (see XML, also used as quotation marks in some contexts that want more). I don't see why these simple math operators would be restriced to math. Same remark about ++plus++ signs (emphasis marks). In those usages however, I do not think that there's a significant difference between the slanted or straight variants, fonts could choose one variant or the other. 
In maths, there's normally no difference, but possibly in some cases these could be distinctive (mathematicians love creating distinctive but simple symbols that are easily recognized because they need many distinctions when they work on various kinds of generalizations or extensions to wider topologies exhibiting some differences). 2016-08-11 10:24 GMT+02:00 Asmus Freytag (c) : > On 8/11/2016 12:33 AM, philip chastney wrote: > >> there is another issue with these symbols -- they appear among the >> mathematical symbols but, in the reference given, they are used as >> delimiters >> >> I know of no other application for these symbols other than as >> delimiters -- are they used as mathematical operators? >> >> and how, in general, would one define the properties for characters which >> may sometimes be operators, and sometimes be delimiters? >> > > First and foremost. If the precise form of these (straight equals, but > dotted) corresponds to a delimiter, whereas the other form (slanted equals) > is an operator, then that would be even more reason to not unify these > (whether with or without a variation sequence). > > Are the already encoded ones given the property of relational operators? > > Nothing prevents anyone from using an integral sing as a funny-looking > fence. I would find it acceptable if the informative properties were based > on majority or customary use (in the hopes that that would allow some > picking of a preferred preference). > > A./ > > /phil >> >> -------------------------------------------- >> On Wed, 10/8/16, Asmus Freytag (c) wrote: >> >> Subject: Re: less-than or equal to with dot in the less-than part? >> To: unicode at unicode.org >> Date: Wednesday, 10 August, 2016, 4:16 PM >> On 8/10/2016 5:06 AM, >> Andrew West wrote: >> > On 10 August 2016 at >> 12:21, Costello, Roger L. >> wrote: >> >> Do you know if there is >> another version of the symbol, but with a straight equals >> sign rather than a slanted equals sign? 
(The book that I >> referred to uses a straight equals sign not a slanted equals >> sign) >> > No, but there are lots of >> standardized variants for mathematical glyph >> > variants of this sort (see first section >> of >> > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), >> so >> > you could ask the UTC to define two >> more mathematical standardized >> > >> variants: >> > >> > 2A7F >> FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO >> WITH DOT INSIDE >> > 2A80 FE00; with >> straight equal; # GREATER-THAN OR SLANTED EQUAL TO >> > WITH DOT INSIDE >> > >> > Then all you would need is to get someone >> to support the new >> > standardized >> variants in a math font. >> > >> Unicode does not use >> standardized variants for that particular >> distinctions in the undotted part of that >> family of symbols. >> A./ >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Thu Aug 11 13:29:21 2016 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 11 Aug 2016 12:29:21 -0600 Subject: Where are the tools to generate posix and json from cldr? Message-ID: I can't find these that are mentioned in http://cldr.unicode.org/ "For those interested in the source CLDR data, it is available for each release in the XML format specified by LDML. There are also tools that will convert to JSON and POSIX format. For more information, see CLDR Releases/Downloads." If you follow that link, the page contains this text: "POSIX Data "Note: Beginning with CLDR v21, the CLDR project will no longer publish POSIX-format locale sources as part of its distribution. The POSIX locale generation tools will continue to be made available as a part of the release. Developers who require POSIX compliant locales can generate them using these tools." But I can't find those tools. 
From mark at macchiato.com Thu Aug 11 13:59:35 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 11 Aug 2016 20:59:35 +0200
Subject: Where are the tools to generate posix and json from cldr?
In-Reply-To:
References:
Message-ID:

That is a bit obscure! We stopped generating the source for POSIX
because essentially every user customized it in some way, so it was
better to do it with a tool. We need to add a pointer to where to get
the tools and how to use them.

http://cldr.unicode.org/index/downloads#Repository_Organization shows
where they are.
Above that are the details for SVN access.
But we really need a page that describes the specific tools and how to
use them. Filed as http://unicode.org/cldr/trac/ticket/9695

Mark

On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson wrote:

> I can't find these that are mentioned in http://cldr.unicode.org/
>
> "For those interested in the source CLDR data, it is available for each
> release in the XML format specified by LDML. There are also tools that will
> convert to JSON and POSIX format. For more information, see CLDR
> Releases/Downloads."
>
> If you follow that link, the page contains this text:
>
> "POSIX Data
>
> "Note: Beginning with CLDR v21, the CLDR project will no longer publish
> POSIX-format locale sources as part of its distribution. The POSIX locale
> generation tools will continue to be made available as a part of the
> release. Developers who require POSIX compliant locales can generate them
> using these tools."
>
> But I can't find those tools.
>

From public at khwilliamson.com Thu Aug 11 14:19:11 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 11 Aug 2016 13:19:11 -0600
Subject: Where are the tools to generate posix and json from cldr?
In-Reply-To:
References:
Message-ID: <1d2adb32-9c64-610c-924f-5dda05bd2184@khwilliamson.com>

On 08/11/2016 12:59 PM, Mark Davis ☕️
wrote: > ?That is a bit obscure! We stopped generating the source for POSIX > because essentially every user customized it in some way, so was better > to do with a tool. We need to add a pointer to where to get the tools > and how to use them. > > http://cldr.unicode.org/index/downloads#Repository_Organization shows > where they are. > Above that are the details for SVN access.? > But we really need a page that describes the specific tools and how to > use them. Filed as http://unicode.org/cldr/trac/ticket/9695 > > Mark I had looked at that, and downloaded the latest data, and still could not find the tools in it. One would think that the tools directory contains it, and I did not look in every sub-directory in it, but none looked likely. I then tried transforms, but came up empty there too. > ////// > > On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson > > wrote: > > I can't find these that are mentioned in http://cldr.unicode.org/ > > "For those interested in the source CLDR data, it is available for > each release in the XML format specified by LDML. There are also > tools that will convert to JSON and POSIX format. For more > information, see CLDR Releases/Downloads." > > If you follow that link, the page contains this text: > > "POSIX Data > > "Note: Beginning with CLDR v21, the CLDR project will no longer > publish POSIX-format locale sources as part of its distribution. > The POSIX locale generation tools will continue to be made available > as a part of the release. Developers who require POSIX compliant > locales can generate them using these tools." > > But I can't find those tools. > > From taylorcanning at outlook.com Thu Aug 11 20:32:49 2016 From: taylorcanning at outlook.com (Taylor Canning) Date: Fri, 12 Aug 2016 01:32:49 +0000 Subject: Myanmar character set Message-ID: Hi there, has anyone had any issues with the Myanmar character set ? i have raised an issue recently where the combination ? and ? does not combine correctly to make ?? 
on my Windows devices. It used to work just fine. It is an extremely common tonal marker and is a big issue for anyone who types the S’Gaw Karen language, which is a lot of people!

Thanks, Taylor

Sent from my Windows 10 phone

From lang.support at gmail.com Thu Aug 11 22:50:37 2016 From: lang.support at gmail.com (Andrew Cunningham) Date: Fri, 12 Aug 2016 13:50:37 +1000 Subject: Myanmar character set In-Reply-To: References: Message-ID:

Hi Taylor,

This should work fine in theory. Are you using a mymr or mym2 style OpenType font? What applications, operating system and fonts are you using?

Andrew

On 12 Aug 2016 12:55 pm, "Taylor Canning" wrote:

> Hi there, has anyone had any issues with the Myanmar character set? I
> have raised an issue recently where the combination ? and ? does not
> combine correctly to make ?? on my Windows devices. It used to work just
> fine. It is an extremely common tonal marker and is a big issue for anyone
> who types the S’Gaw Karen language, which is a lot of people!
>
> Thanks, Taylor
>
> Sent from my Windows 10 phone

From christoph.paeper at crissov.de Fri Aug 12 06:41:48 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Fri, 12 Aug 2016 13:41:48 +0200 Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs Message-ID:

> 2640 FE0E; text style; # FEMALE SIGN
> 2640 FE0F; emoji style; # FEMALE SIGN
> 2642 FE0E; text style; # MALE SIGN
> 2642 FE0F; emoji style; # MALE SIGN

Since U+2640 and U+2642 double as symbols for the planets (and ancient gods) Venus and Mars, respectively, users will rightfully expect VS-16 to have an effect on the other planet symbols as well (probably including U+2647 Pluto).
Both symbols are also sometimes used to represent Friday and Tuesday, respectively, so some users may expect the symbols for the other 5 days of the week to react to U+FE0E/F as well.

1. Monday ☽ U+263D Moon or ☾ U+263E
2. Tuesday ♂ U+2642 Mars
3. Wednesday ☿ U+263F Mercury
4. Thursday ♃ U+2643 Jupiter
5. Friday ♀ U+2640 Venus
6. Saturday ♄ U+2644 Saturn
7. Sunday ☉ U+2609 Sun or ☼ U+263C

U+2640/2 are also part of common sets of gender, sex and sexuality symbols which, again, some users will expect to have emoji forms now and (be prepared for the ??????) also work in ZWJ or OpenType ligature sequences. (I’m not sure how lesbian or gay versions of emojis, as proposed before in L2/15-013 for instance, could become anything other than stereotypical to offensive.) The real-world use may be a bit different from what the annotations in the standard say, e.g. distinction of transgender and intersex or sexuality and gender identity:

> * ⚢ U+26A2 Doubled Female Sign
>   = lesbianism
> * ⚣ U+26A3 Doubled Male Sign
>   • a glyph variant has the two circles on the same line
>   = male homosexuality
> * ⚤ U+26A4 Interlocked Female and Male Sign
>   • a glyph variant has the two circles on the same line
>   = bisexuality
> * ⚥ U+26A5 Male and Female Sign
>   = transgendered sexuality
>   = hermaphrodite (in entomology)
> * ⚦ U+26A6 Male with Stroke Sign
>   = transgendered sexuality
> * ⚧ U+26A7 Male with Stroke and Male and Female Sign
>   = transgendered sexuality
> * ⚲ U+26B2 Neuter

Lastly, the 2 signs are also recognized by Unicode to be alchemical symbols of copper and iron, respectively, but since that set is much larger and even more esoteric I expect not much demand for emoji versions of all of them.

In conclusion, I see no good way other than to add a lot of additional codepoints from the Miscellaneous Symbols block to StandardizedVariants.txt.
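To make the conclusion above concrete, here is a minimal Python sketch of what "adding more code points" would amount to for the planetary weekday symbols. It assumes only the standard `unicodedata` module and the line format quoted at the top of this message; the names `WEEKDAY_SYMBOLS`, `variation_entries` and `with_presentation` are hypothetical helpers for illustration, not part of any Unicode data file or tool.

```python
import unicodedata

VS15 = "\uFE0E"  # VARIATION SELECTOR-15, requests text presentation
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation

# Primary planetary symbol for each weekday, code points as listed above.
# (Hypothetical table for illustration only.)
WEEKDAY_SYMBOLS = {
    "Monday": 0x263D,
    "Tuesday": 0x2642,
    "Wednesday": 0x263F,
    "Thursday": 0x2643,
    "Friday": 0x2640,
    "Saturday": 0x2644,
    "Sunday": 0x2609,
}


def variation_entries(code_point):
    """Build the two entries (text style, emoji style) for one code point,
    in the same shape as the lines quoted at the top of the message."""
    name = unicodedata.name(chr(code_point))
    return [
        f"{code_point:04X} FE0E; text style; # {name}",
        f"{code_point:04X} FE0F; emoji style; # {name}",
    ]


def with_presentation(code_point, emoji):
    """Return the base character followed by VS-16 (emoji) or VS-15 (text)."""
    return chr(code_point) + (VS16 if emoji else VS15)


if __name__ == "__main__":
    for day, cp in WEEKDAY_SYMBOLS.items():
        for line in variation_entries(cp):
            print(day, "|", line)
```

Whether a font actually renders the two sequences differently is up to the implementation; the sketch only builds the strings and the candidate data lines.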
Cheers Christoph From christoph.paeper at crissov.de Fri Aug 12 07:09:09 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Fri, 12 Aug 2016 14:09:09 +0200 Subject: [UTR#51-8] 2.4 Emoji Implementation Notes Message-ID: > ? including the user of ? Should be just ?use?. > * emoji zwj sequence > - may have an emoji variation selector. > - should be displayed with an emoji presentation by default, even when an emoji zwj element is a singleton with Emoji_Presentation=No. ?zwj? should be ?ZWJ? in all instances, also found elsewhere. If I don?t misread, this seems to be saying nothing about a (hypothetical) emoji ZWJ sequence consisting of 2 or more elements with `Emoji_Presentation=No` without any VS-16. What?s the actual intention? 1. If there?s any VS-16 or any character with `Emoji_Presentation=Yes` in a ZWJ sequence, the whole sequence SHOULD be treated as emoji(s). 2. A ZWJ sequence SHOULD be treated as emoji(s) if it contains only characters that either have `Emoji_Presentation=Yes` or whose glyph *can* be affected by VS-16. Only #2 would cover a ZWJ sequence of `Emoji_Presentation=No` characters without any VS-16 stuck on them. From zelpahd at gmail.com Fri Aug 12 02:44:10 2016 From: zelpahd at gmail.com (zelpa) Date: Fri, 12 Aug 2016 17:44:10 +1000 Subject: ZWJ sequences in UTR #51 v4 Message-ID: Some of the ZWJ sequences in the latest revision seem sort of arbitrary, why is male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female of male modifiers? The current proposition also doesn't seem to allow for a gender-neutral doctor(?) -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From davidj_faulks at yahoo.ca Fri Aug 12 11:54:25 2016 From: davidj_faulks at yahoo.ca (David Faulks) Date: Fri, 12 Aug 2016 16:54:25 +0000 (UTC) Subject: ZWJ sequences in UTR #51 v4 References: <1378418133.13608237.1471020865650.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <1378418133.13608237.1471020865650.JavaMail.yahoo@mail.yahoo.com>

The problem with a “Doctor Emoji” is that new characters need to be approved by the ISO (International Organization for Standardization) in a long process, which means that any new characters will not be available (officially) until Unicode 10 in June of next year. Vendors have decided they want these gendered emoji ASAP (see the latest news). The work-around is to treat **sequences** of existing characters as a new emoji, like some sort of very weird ligature.

Unicode is scrambling to catch up with what vendors have suddenly decided they want (although in my opinion, this could have been predicted last year).

David

--------------------------------------------
On Fri, 8/12/16, zelpa wrote:

Subject: ZWJ sequences in UTR #51 v4
To: unicode at unicode.org
Received: Friday, August 12, 2016, 3:44 AM

Some of the ZWJ sequences in the latest revision seem sort of arbitrary, why is male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female of male modifiers? The current proposition also doesn't seem to allow for a gender-neutral doctor(?)

From Andrew.Glass at microsoft.com Fri Aug 12 14:02:23 2016 From: Andrew.Glass at microsoft.com (Andrew Glass) Date: Fri, 12 Aug 2016 19:02:23 +0000 Subject: Myanmar character set In-Reply-To: References: Message-ID:

Hi Taylor and Andrew,

This is a known issue with the Myanmar engine on Windows. We are tracking the issue, but don’t have a date for the fix at this time.
Cheers, Andrew From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew Cunningham Sent: Thursday, August 11, 2016 8:51 PM To: Taylor Canning Cc: Unicode Mailing List Subject: Re: Myanmar character set Hi Taylor, This should work fine in theory. Are you using a mymr or mym2 style opentype font? What applications, operating system and fonts are you using? Andrew On 12 Aug 2016 12:55 pm, "Taylor Canning" > wrote: Hi there, has anyone had any issues with the Myanmar character set ? i have raised an issue recently where the combination ? and ? does not combine correctly to make ?? on my windows devices. It used to work just fine. It is am extremely common tonal marker and is a big issue for anyone who types the S?Gaw Karen language, which is a lot of people ! Thanks, Taylor Sent from my Windows 10 phone -------------- next part -------------- An HTML attachment was scrubbed... URL: From christoph.paeper at crissov.de Fri Aug 12 18:29:47 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Sat, 13 Aug 2016 01:29:47 +0200 Subject: ZWJ sequences in UTR #51 v4 In-Reply-To: References: Message-ID: zelpa : > > Some of the ZWJ sequences in the latest revision seem sort of arbitrary, ?Some?? It?s a fundamental principle of linguistics that signs connect representation and meaning arbitrarily, but this doesn?t apply to pictures and proto-writing, which are not (quite/yet) linguistic signs. > why is male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female of male modifiers? I do agree with the general approach to encode additional professions as ZWJ sequences. Ideally, people would already be using emoji sequences for professions (without ZWJ, ?emoji words?) and there was research of such compounds, so Unicode could document existing conventions. 
Otherwise, one could also go ahead and conduct a user study by letting a representative sample of people express a meaning with a restricted repertoire (i.e. emojis already in Unicode). Alas, neither seems not to have been done, instead a committee of experts chose canonic sequences based upon vendor proposals (Google and Apple). Interestingly, the result ? currently in beta state ? is not systematic in any way whatsoever: Professions are arbitrarily identified by a tool ????????????, clothing ??, accessory ??, product ??, building ????, vehicle ?????? or already conventionalized symbol ??. Often these are directly featured in the example image, but not always. Chances are high that sequences in the wild, which are intended to represent the same professions, are using different components. With family emojis, ZWJ sequences (and Fitzpatrick modifiers) are very similar to classic ligatures, because the resulting glyph is just an elaborate composition of its bases. If the example images were intuitively obvious or mandatory design recommendations, this could also be true for many of the new profession emoji sequences, but this is in fact not the case since 1) font vendors are free to design an arbitrary iconographic *picture to represent the compound meaning*, 2) the sequences are not empirically founded and 3) are culturally biased (e.g. ?????). If future emoji selection UIs offered the sequences by showing precomposed glyphs (like many do with families and flags), the problem would be hidden away for a while, but this will become unmanageable eventually. I expect IMEs to adopt a different approach soon: auto-correction. If a user successively enters two emojis that form an officially registered ZWJ sequence, the system will automatically insert U+200D and use a single glyph ? hopefully the user will be able to revert or edit that composition, e.g. ZWJ?ZWNJ. The system will also try to identify juxtaposed (e.g. ????) or synonymous sequences (e.g. ???? or ???? 
for a farmer and ???? or ???? for a health worker) and suggest to replace them by the canonic sequence or even by a single character (e.g. ????, ??? or ???? to ??). That?s basically `<3` and `:-)` TNG. To make it simpler to learn the canonic sequences I?d strongly urge the people in charge to select as few generic patterns as possible, e.g. + or , and this should be based upon actual research. > The current proposition also doesn't seem to allow for a gender-neutral doctor(?) Yes, this is a problem with the ZWJ profession sequences, but, at least in theory, not with the ZWJ sequences, because they should be neutral by default. There absolutely should be a neutral base character to accompany Man and Woman, maybe U+263A ?? or U+1F610 ??, and perhaps more: Codepoint | | | Meaning ----------|--------------|----|---------- U+263A | White Smiley | ?? | Neutral, (details unknown, unimportant, unavailable) U+1F469 | Woman | ?? | Female, woman, feminine U+1F468 | Man | ?? | Male, man, masculine U+1F475 | Older Woman | ?? | Retired female, senior woman, female expert U+1F474 | Older Man | ?? | Retired male, senior man, male expert U+1F476 | Baby | ?? | Trainee, learner, student, beginner, intern U+1F467 | Girl | ?? | Female trainee, learner, student, beginner, intern U+1F466 | Boy | ?? | Male trainee, learner, student, beginner, intern U+1F47D | Alien | ?? | Extraterrestrial, alien, foreign, out-sourced, anonymous U+1F916 | Robot | ?? | Android, robot, automated service, machine, self-service, bot U+1F63A | Cat | ?? | Furry, humanoid/anthropomorphous animal, toon From lists+unicode at seantek.com Sat Aug 13 01:22:50 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Fri, 12 Aug 2016 23:22:50 -0700 Subject: U+hhhh[h[h]] NAME syntax Message-ID: It appears that U+hhhh[h[h]] NAME syntax is a very common--one might say "standard"--way of representing a particular Unicode character or code point in text. 
It is the way that the Unicode Standard 9.0.0 refers to particular characters, and I have seen it around quite a bit. The Unicode Standard appears to put the NAME in small-caps format (but a plain text PDF search using Adobe Acrobat DC suggests that the underlying characters are lowercase), while in plain text, the name is generally all-capitalized (as it appears in the UCD). Is there a section of the Unicode Standard, or some TR, that discusses this format or gives it a formal name? (I hunted but did not find discussion in the Unicode Standard.) Is it given any kind of preference or recommendations over other forms of identifying Unicode code points or characters? Thanks, Sean From gwalla at gmail.com Sat Aug 13 01:53:52 2016 From: gwalla at gmail.com (Garth Wallace) Date: Fri, 12 Aug 2016 23:53:52 -0700 Subject: U+hhhh[h[h]] NAME syntax In-Reply-To: References: Message-ID: Appendix A: Notational Conventions On Friday, August 12, 2016, Sean Leonard wrote: > It appears that U+hhhh[h[h]] NAME syntax is a very common--one might say > "standard"--way of representing a particular Unicode character or code > point in text. > > It is the way that the Unicode Standard 9.0.0 refers to particular > characters, and I have seen it around quite a bit. The Unicode Standard > appears to put the NAME in small-caps format (but a plain text PDF search > using Adobe Acrobat DC suggests that the underlying characters are > lowercase), while in plain text, the name is generally all-capitalized (as > it appears in the UCD). > > Is there a section of the Unicode Standard, or some TR, that discusses > this format or gives it a formal name? (I hunted but did not find > discussion in the Unicode Standard.) Is it given any kind of preference or > recommendations over other forms of identifying Unicode code points or > characters? > > Thanks, > > Sean > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From charupdate at orange.fr Sat Aug 13 02:12:43 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Aug 2016 09:12:43 +0200 (CEST) Subject: =?UTF-8?Q?Re:_[UTR#51-8]_2.4_Emoji_Implementation_Notes_=C2=A0?= Message-ID: <653180830.981.1471072363408.JavaMail.www@wwinf1c20> On Fri, 12 Aug 2016 14:09:09 +0200, Christoph P?per wrote: > 1. If there?s any VS-16 or any character with `Emoji_Presentation=Yes` > in a ZWJ sequence, the whole sequence SHOULD be treated as emoji(s). > > 2. A ZWJ sequence SHOULD be treated as emoji(s) if it contains only > characters that either have `Emoji_Presentation=Yes` or whose glyph > *can* be affected by VS-16. One fine thing about discussing emoji is that we aren?t really meant to bother whether to append a plural s: http://www.theatlantic.com/technology/archive/2016/01/whats-the-plural-of-emoji-emojis/422763/ [Please, read down until where the Consortium is cited.] http://blog.emojipedia.org/emojis-on-the-rise-as-plural/ It?s all about following the ?tsunamis? or the ?sushi? pattern. I believe the latter is more appropriate for emoji, as we?re at ease acknowledging their birth country. Marcel From verdy_p at wanadoo.fr Sat Aug 13 02:29:05 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Sat, 13 Aug 2016 09:29:05 +0200 Subject: U+hhhh[h[h]] NAME syntax In-Reply-To: References: Message-ID: These are just conventions to use when there are no other context explaining why some other notation could be more useful or more readable. In plain text, we are not supposed to parse the content by automated tool but read it. And even in the standard itself there are cases where shorter notations are used, and they are explained in each of them, because this makes the overall text easier to read, or allows compressing the tables. Notable the character names are frequently abbreviated (e.g. CR, LF, CR+LF), or completely omitted if the code point is specified. 
The UCD itself contains lots of data that just reference the code points, also omitting the "U+" prefix.

In all cases this is a local convention that applies instead of the generic convention, which is only suggested for use out of context. In some programmatic contexts, the notation used is language-specific and used appropriately (such as \uNNNN in JavaScript/JSON/Java) without needing any prior explanation (these notations are already explained for those languages in their own standards).

In emails, use any convention you want: an email normally explains its own context when needed (it may be necessary to read other messages in a discussion thread to understand these personal conventions), and people then write them the way they want as long as it is clear to readers. Emails will also frequently refer to other conventions used in the standard or in programming languages.

The interest of these notations, however, may be found when performing full-text searches in collections of emails or forum messages to see where a particular character was cited and discussed. But many discussions also concern other related characters, and not all of them are cited, because the discussions relate to some of their common properties: you'll need to search for other terms (not always part of the standard or its technical annexes, as they may concern non-standardized but common usages, or proposals or changes to existing properties, notably the informative properties).

I see little interest in forcing anyone to use the U+NNNN NAME convention everywhere, as it is overlong and may instead obscure the discussions. Even when it is used, the NAME will frequently be abbreviated (such as dropping the script name prefix or common words such as LETTER or DIGIT). And given that character names are not case-significant, they will frequently be written using lowercase, or mixed case, or just by presenting the verbatim character itself.
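The two notations contrasted above (the generic U+NNNN NAME form and the language-specific \uNNNN escape) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard `unicodedata` module, and the function names `u_plus` and `js_escape` are hypothetical, not defined by the standard or by any library.

```python
import unicodedata


def u_plus(ch, with_name=True):
    """Format one character as 'U+hhhh NAME', with at least four hex digits.
    Characters without a name in the local database (e.g. controls) get
    the code point only."""
    label = f"U+{ord(ch):04X}"
    if with_name:
        name = unicodedata.name(ch, "")
        if name:
            label += " " + name
    return label


def js_escape(ch):
    """Format one character as a \\uNNNN escape in the JavaScript/JSON/Java
    style, using a surrogate pair for code points above U+FFFF."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        return f"\\u{cp:04X}"
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # lead surrogate
    low = 0xDC00 + (cp & 0x3FF)   # trail surrogate
    return f"\\u{high:04X}\\u{low:04X}"


if __name__ == "__main__":
    for ch in ("\u00DF", "\u2206", "\U0001F469"):
        print(u_plus(ch), "|", u_plus(ch, with_name=False), "|", js_escape(ch))
```

The names come from whatever Unicode version the local Python was built with, so very recent characters may come back without a name.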
2016-08-13 8:53 GMT+02:00 Garth Wallace : > Appendix A: Notational Conventions > > > On Friday, August 12, 2016, Sean Leonard > wrote: > >> It appears that U+hhhh[h[h]] NAME syntax is a very common--one might say >> "standard"--way of representing a particular Unicode character or code >> point in text. >> >> It is the way that the Unicode Standard 9.0.0 refers to particular >> characters, and I have seen it around quite a bit. The Unicode Standard >> appears to put the NAME in small-caps format (but a plain text PDF search >> using Adobe Acrobat DC suggests that the underlying characters are >> lowercase), while in plain text, the name is generally all-capitalized (as >> it appears in the UCD). >> >> Is there a section of the Unicode Standard, or some TR, that discusses >> this format or gives it a formal name? (I hunted but did not find >> discussion in the Unicode Standard.) Is it given any kind of preference or >> recommendations over other forms of identifying Unicode code points or >> characters? >> >> Thanks, >> >> Sean >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From zelpahd at gmail.com Sat Aug 13 02:37:04 2016 From: zelpahd at gmail.com (zelpa) Date: Sat, 13 Aug 2016 17:37:04 +1000 Subject: ZWJ sequences in UTR #51 v4 In-Reply-To: <1050889064.862.1471071620287.JavaMail.www@wwinf1c20> References: <1050889064.862.1471071620287.JavaMail.www@wwinf1c20> Message-ID: On Sat, Aug 13, 2016 at 5:00 PM, Marcel Schneider wrote: > On Fri, 12 Aug 2016 17:44:10 +1000, zelpa wrote: > > > Some of the ZWJ sequences in the latest revision seem sort of arbitrary, > why is > > male health worker Man + Staff of Asclepius instead of introducing a > Doctor emoji > > and simply using the female of male modifiers? The current proposition > also > > doesn't seem to allow for a gender-neutral doctor(?) > > As far as I know, the category ?health worker? 
is more general than > ?doctor?, > as it includes many professionals who are not physicians. > > Not surprisingly, the Consortium?s choice of encoding the MALE HEALTH > WORKER emoji > as a MAN associated with a STAFF OF AESCULAPIUS seems to me plain accurate. > > Marcel > MALE HEALTH WORKER was just an example, any of the ZWJ sequences that follow the PROFESSION ZWJ GENDER can be left gender neutral simply by leaving out the gender(At least in theory, god knows what vendors would actually choose to show) the sequences that follow the pattern PERSON ZWJ OBJECT can only be male or female in the current proposition. Of course health worker is more general than doctor, shouldn't have used that word. My point was it's currently not possible to show a gender-neutral health worker, student, farmer, teacher, judge, cook, mechanic, factory worker, office worker, scientist, etc. using the current proposition. Kind of seems backwards to force people to either pick female or male when using these sequences. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charupdate at orange.fr Sat Aug 13 03:32:56 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Aug 2016 10:32:56 +0200 (CEST) Subject: ZWJ sequences in UTR #51 v4 In-Reply-To: References: <1050889064.862.1471071620287.JavaMail.www@wwinf1c20> Message-ID: <22376935.1278.1471077176337.JavaMail.www@wwinf1n16> On Sat, 13 Aug 2016 17:37:04 +1000, "zelpa" wrote: > On Sat, Aug 13, 2016 at 5:00 PM, Marcel Schneider wrote: > > > On Fri, 12 Aug 2016 17:44:10 +1000, zelpa wrote: > > > > > > Some of the ZWJ sequences in the latest revision seem sort of arbitrary, why is > > > male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji > > > and simply using the female of male modifiers? The current proposition also > > > doesn't seem to allow for a gender-neutral doctor(?) > > > > As far as I know, the category ?health worker? 
is more general than ?doctor?, > > as it includes many professionals who are not physicians. > > > > Not surprisingly, the Consortium?s choice of encoding the MALE HEALTH WORKER emoji > > as a MAN associated with a STAFF OF AESCULAPIUS seems to me plain accurate. > > > > Marcel > > MALE HEALTH WORKER was just an example, any of the ZWJ sequences that follow > the PROFESSION ZWJ GENDER can be left gender neutral simply by leaving out the > gender(At least in theory, god knows what vendors would actually choose to > show) the sequences that follow the pattern PERSON ZWJ OBJECT can only be male > or female in the current proposition. Of course health worker is more general > than doctor, shouldn't have used that word. My point was it's currently not > possible to show a gender-neutral health worker, student, farmer, teacher, > judge, cook, mechanic, factory worker, office worker, scientist, etc. using > the current proposition. Kind of seems backwards to force people to either > pick female or male when using these sequences. > I see, you are right. Profession emoji should be available gender neutral throughout. One workaround while waiting for an accurate encoding could be to quickly define a WOMAN ZWJ MAN ZWJ OBJECT pattern to be rendered with a neutral emoji akin to the intended profession. From charupdate at orange.fr Sat Aug 13 03:42:18 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Aug 2016 10:42:18 +0200 (CEST) Subject: U+hhhh[h[h]] NAME syntax Message-ID: <1281409096.1329.1471077738203.JavaMail.www@wwinf1n16> On Fri, 12 Aug 2016 23:22:50 -0700, Sean Leonard wrote: [?] > > It is the way that the Unicode Standard 9.0.0 refers to particular > characters, and I have seen it around quite a bit. 
The Unicode Standard > appears to put the NAME in small-caps format (but a plain text PDF > search using Adobe Acrobat DC suggests that the underlying characters > are lowercase), while in plain text, the name is generally > all-capitalized (as it appears in the UCD). > [?] I see your concern with the casing issue. Indeed, when we copy a snippet of TUS to the clipboard, the character names are all lowercased and need an additional step to become conformant, whether case conversion if remaining in plain text, or small caps formatting again. Automating the process would be possible however by writing up a script parsing code points and matching names with UCD. BTW this was one of the issues I fed back when v8.0.0 was in beta past year: http://www.unicode.org/review/pri297/feedback.html >>> To improve quotability, I would suggest to typeset the character >>> names (which actually are in small caps) in uppercase throughout, >>> and to apply rather a reduced font size like specified in the style >>> sheet of UAX #9 (where, however, redundant formatting leads to lowercase >>> and small-cap the uppercase source text at the same time (?span.name { >>> text-transform: lowercase; font-variant: small-caps; font-size: 75%; }?). >>> The result was not convincing as it appeared in UAX #9, section 3.2. Actually I?m managing this with a dedicated CSS style that sets character names back to lowercase: .uniname { ? ? ? ? ? ? ? ? ? /* CHAR STYLES */ text-transform: lowercase; font-variant: small-caps; font-size: 110%; } /* as opposed to: */ .name { font-variant: small-caps; font-size: 110%; } Additionally, to throttle up your work speed, you might wish to have the ?U+? sequence on your keyboard, precisely on your numerical keypad along with hexadecimal digits in the Shift shift state. I?m actually using this feature, which I?ve added in my layout on Windows, but I haven?t yet documented it on line. If you or somebody else are interested, please follow up off list. 
Marcel

From charupdate at orange.fr Sat Aug 13 04:33:27 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Aug 2016 11:33:27 +0200 (CEST) Subject: U+hhhh[h[h]] NAME syntax In-Reply-To: References: Message-ID: <1716360025.1857.1471080807274.JavaMail.www@wwinf1n16>

On Sat, 13 Aug 2016 09:29:05 +0200, Philippe Verdy wrote:
[…]
> I see little interest to force anyone to use the U+NNNN NAME convention
> everywhere, as it is overlong and may instead obscure the discussions. Even
> when it is used, the NAME will be frequently abbreviated (such as dropping the
> script name prefix or common words such as LETTER or DIGIT). And given that
> character names are not case-significant, they will be frequently written
> using lowercase, or mixed case, or just by presenting the verbatim character
> itself.
>

One advantage I see in using capitalized character names is that it makes them unambiguously recognizable as identifiers, which prevents readers from mistaking them for descriptors.

However, I admit that I often unify casing pairs by dropping the CAPITAL and SMALL attributes, as in LATIN LETTER AE, although it would be more accurate to write LATIN CAPITAL/SMALL LETTER AE. By contrast, I wouldn’t do that when referring to the LATIN CAPITAL and SMALL LIGATURE OE, because the term “ligature” is an improper relic imposed by the ISO editor at the time, and was set back to “letter” in the case of the æ (as discussed last year). Here the advantage of using a translation is that one can correct the name without risking confusion.

Another advantage is in highlighting the names against the surrounding text. Avoiding uppercase (e.g. by people who hate their Caps Lock toggle key; I’ve read that they do exist but are very uncommon in the country where we live) would need workarounds like using quotation marks, which in this context are almost always misleading.
As of the U+ notational prefix for current text, I see it as extremely useful and I always apply it except, as Philippe states, in some tabular data, which is but following the pattern used in the NamesList (which I?m keeping constantly opened in my text editor). Using the U+ prefix throughout has the additional advantage of promoting Unicode in the mind of people?an urgent challenge, accordingly to the recent ?Unicode in the Curriculum?? thread: http://www.unicode.org/mail-arch/unicode-ml/y2015-m12/0073.html [Note to the attention of Archive Readers: Please don?t omit to jump manually over the year?s boundary when the ?Next in thread? link is missing. The discussion is continued here: http://www.unicode.org/mail-arch/unicode-ml/y2016-m01/0000.html ] M. From christoph.paeper at crissov.de Sat Aug 13 07:44:21 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Sat, 13 Aug 2016 14:44:21 +0200 Subject: =?utf-8?Q?Re=3A_=5BUTR=2351-8=5D_2=2E4_Emoji_Implementation_Note?= =?utf-8?Q?s_=C2=A0?= In-Reply-To: <653180830.981.1471072363408.JavaMail.www@wwinf1c20> References: <653180830.981.1471072363408.JavaMail.www@wwinf1c20> Message-ID: <8B30F0FF-816E-4EFA-BE1F-0053D33377C8@crissov.de> Marcel Schneider : > > One fine thing about discussing emoji is that we aren?t really meant to bother > whether to append a plural s: Actually I intended ?emoji(s)? to mean ?a single emoji or multiple emojis?. 
From charupdate at orange.fr Sat Aug 13 08:51:59 2016 From: charupdate at orange.fr (Marcel Schneider) Date: Sat, 13 Aug 2016 15:51:59 +0200 (CEST) Subject: =?UTF-8?Q?Re:_[UTR#51-8]_2.4_Emoji_Implementation_Notes_=C2=A0?= In-Reply-To: <8B30F0FF-816E-4EFA-BE1F-0053D33377C8@crissov.de> References: <653180830.981.1471072363408.JavaMail.www@wwinf1c20> <8B30F0FF-816E-4EFA-BE1F-0053D33377C8@crissov.de> Message-ID: <1410813234.3136.1471096319741.JavaMail.www@wwinf1h27> On Sat, 13 Aug 2016 14:44:21 +0200, Christoph P?per wrote: > Marcel Schneider : >> >> One fine thing about discussing emoji is that we aren?t really meant to bother >> whether to append a plural s: > > Actually I intended ?emoji(s)? to mean ?a single emoji or multiple emojis?. Thanks for this clarification. Indeed, grammatically undefined forms are lacking expressiveness. Using them in discussions makes people run into problems. I realize that when talking about emojis, it?s well done to follow the ?tsunami, tsunamis? pattern. Sorry for the noise. Marcel From lists+unicode at seantek.com Sat Aug 13 10:12:06 2016 From: lists+unicode at seantek.com (lists+unicode at seantek.com) Date: Sat, 13 Aug 2016 08:12:06 -0700 Subject: U+hhhh[h[h]] NAME syntax In-Reply-To: <1716360025.1857.1471080807274.JavaMail.www@wwinf1n16> References: <1716360025.1857.1471080807274.JavaMail.www@wwinf1n16> Message-ID: > On Aug 13, 2016, at 2:33 AM, Marcel Schneider wrote: > > On Sat, 13 Aug 2016 09:29:05 +0200, Philippe Verdy wrote: > [?] >> I see little interest to force anyone to use the U+NNNN NAME convention >> everywhere, as it is overlong and may instead obscure the discussions. Even >> when it is used, the NAME will be frequently abbreviated (such as dropping the >> script name prefix or common words such as LETTER or DIGIT). And given that >> character names are not case-significant, they will be frequently written >> using lowercase, or mixed case, or just by presenting the verbatim character >> itself. 
>> > One advantage I see in using capitalized character names is in making them > unambigously recognizable as identifiers, in order to prevent readers from > mistaking them as descriptors. > > However I admit that I often unify casing pairs by dropping the CAPITAL and > SMALL attributes, as in LATIN LETTER AE, but it would be more accurate to write > LATIN CAPITAL/SMALL LETTER AE. By contrast I wouldn?t do that when referring to > the LATIN CAPITAL and SMALL LIGATURE OE, because the term ?ligature? is an abusive > relict enforced by the ISO redactor at the time, and set back to ?letter? in the > case of the ? (as discussed past year). Here the advantage of using a translation > is to be able to correct without risking confusions. > > Another advantage is in highlighting the names against the surrounding text. > Avoiding uppercase?e.g. from people hating their Caps Lock toggle key, who > I?ve read they do exist but are very uncommon in the country where we live? > would need workarounds like using quotation marks, which in this context are > almost always misleading. > > As of the U+ notational prefix for current text, I see it as extremely useful > and I always apply it except, as Philippe states, in some tabular data, > which is but following the pattern used in the NamesList (which I?m keeping > constantly opened in my text editor). > > Using the U+ prefix throughout has the additional advantage of promoting > Unicode in the mind of people?an urgent challenge, [?] Thank you. I have been reviewing draft-iab-rfc-nonascii-02 , which formally opens the RFC series to UTF-8 encoded characters. (Look at the PDF version, which shows characters beyond the ASCII range.) I was surprised that Section 3.4 provides no less than *six* notational alternatives, none of which conform to Appendix A of TUS. There might be valid grammatical reasons to notate differently than Appendix A, but I would think that Appendix A style U+2206 INCREMENT would be the best choice, as in: 1. 
Temperature changes in the Temperature Control Protocol are indicated by "?" U+2206 INCREMENT. where U+ NAME replaces the part-of-speech ?the XYZ character?, the character itself is quoted directly in front of the U+, and parentheses are not needed. (I am actually in favor of curly quotes ??? in such a case, but that discussion should probably be had in the IETF.) Interestingly, TUS 9.0.0 is not internally consistent, but there is a trend that when the character is quoted, it is put in curly quotes and is placed between the U+ syntax and the NAME, as in: Section 3.13 Uppercasing of U+00DF ??? latin small letter sharp s to ? Section 5.21 U+2061 ? function application has no effect on the text display? (Note: the ? character appears in TUS as f() in a box?I am copying and pasting the text directly on my Mac from Acrobat to Mail.app. And, obviously, it?s copying and pasting the small-caps in lowercase.) In plain text, ALL-CAPS names are superior to mixed case or lowercase names. However, in stylized text, small-caps not only looks better but offers a more convenient visual and semantic way to differentiate the part-of-speech. I may have to suggest that small-caps be added as a stylistic element to the new xml2rfc format, or, that a new element be provisioned specifically to identify Unicode code points, which automatically get stylized appropriately to the output format (ALL-CAPS for plain text, stylized small-caps for marked up text). Sean From doug at ewellic.org Sat Aug 13 16:47:15 2016 From: doug at ewellic.org (Doug Ewell) Date: Sat, 13 Aug 2016 15:47:15 -0600 Subject: U+hhhh[h[h]] NAME syntax Message-ID: <5232A2A31EF64763A66F1556A0C3B29A@DougEwell> PDF is a presentation format. 
If the editorial committee sets character names in lowercase "under the hood" so that they will end up looking good in Minion smallcaps in the PDF file, and a user subsequently scrapes the PDF file for content, it doesn't mean there's anything formal or normative about setting character names in lowercase. -- Doug Ewell | Thornton, CO, US | ewellic.org From nobody_uses at outlook.com Sat Aug 13 16:06:48 2016 From: nobody_uses at outlook.com (eduardo marin) Date: Sat, 13 Aug 2016 21:06:48 +0000 Subject: Counting rods alternate forms Message-ID: It is well known that the Southern Song style of counting rods had different forms for the digits 4, 5 and 9 (https://en.wikipedia.org/wiki/Counting_rods); however, there is currently no way to represent such forms. A proposal to add them would occupy only five code points, since the number four is identical in its vertical and horizontal forms. -------------- next part -------------- An HTML attachment was scrubbed... URL: From asmusf at ix.netcom.com Sat Aug 13 19:19:15 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Sat, 13 Aug 2016 17:19:15 -0700 Subject: U+hhhh[h[h]] NAME syntax In-Reply-To: <5232A2A31EF64763A66F1556A0C3B29A@DougEwell> References: <5232A2A31EF64763A66F1556A0C3B29A@DougEwell> Message-ID: <25d87c79-8adb-d9ff-3408-8ff588116588@ix.netcom.com> On 8/13/2016 2:47 PM, Doug Ewell wrote: > PDF is a presentation format.
If the editorial committee sets > character names in lowercase "under the hood" so that they will end up > looking good in Minion smallcaps in the PDF file, and a user > subsequently scrapes the PDF file for content, it doesn't mean there's > anything formal or normative about setting character names in lowercase. > Character names, when presented in the Unicode Character Database, are uppercase. The general approach by Unicode is to define property names and values so that case distinctions are not needed to unambiguously resolve identifiers (same for space and most hyphens). That means the presentation can be flexibly adapted to the style of the document (e.g. the Core Specification has a different style than other documents), yet still retain unambiguous identification of the character. I believe that small-caps generally looks nice and distinctive. For HTML the way to do this is with a CSS style that allows the underlying text representation to be uppercase while showing lowercase small-cap letters. Marcel, I believe, gave some example, although something like this was used as early as Unicode 5.0 for the UAXs, when we printed them as part of the book. For plain text, all caps is the easiest way to make the character name stick out and prevent misinterpretation of it as part of the surrounding text. The question then becomes how much of the character name to show and in which order. I'm personally partial to U+nnnn (x) CHARACTER NAME. In some cases, this requires some edits to make the text flow, but it has the advantage of being unambiguous, and something that works well for characters of all scripts and categories, including marks and punctuation. In some instances U+nnnn (x) transliterated name works well. I like the use of ( ) instead of " " (curly or not) because the latter is hopeless in showing any combining marks above (they get lost among the "").
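The U+nnnn (x) CHARACTER NAME order described above can be produced mechanically. What follows is only a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `describe` is made up for illustration, and the names returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def describe(ch: str) -> str:
    """Format a single character as 'U+nnnn (x) CHARACTER NAME'.

    `describe` is a hypothetical helper, not part of any Unicode tooling.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    # At least four hex digits; supplementary code points get five or six.
    code = f"U+{ord(ch):04X}"
    # Code points without a name in this interpreter's database get a placeholder.
    name = unicodedata.name(ch, "<no name>")
    return f"{code} ({ch}) {name}"


if __name__ == "__main__":
    for ch in ("\u2206", "\u00DF", "\U0001F609"):
        print(describe(ch))
```

The all-caps name is kept as returned, which suits plain text; a styled rendering could leave the string uppercase and apply small caps at the presentation layer.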
However, notations like x (U+nnnn) work pretty well, also, especially when all the "x" are from a distinct-looking script. The same goes for x CHARACTER NAME (U+nnnn). In many cases, there really isn't a need to quote the glyph, and not doing so can reduce clutter. In short, this isn't a one-size-fits-all kind of situation. A./ From lang.support at gmail.com Sat Aug 13 22:43:18 2016 From: lang.support at gmail.com (Andrew Cunningham) Date: Sun, 14 Aug 2016 13:43:18 +1000 Subject: Myanmar character set In-Reply-To: References: Message-ID: Hi Andrew, I assume the issue is with mym2 shaper? Andrew C On 13 Aug 2016 5:02 am, "Andrew Glass" wrote: > > Hi Taylor and Andrew, > > > > This is a known issue with the Myanmar engine on Windows. We are tracking the issue, but don’t have a date for the fix at this time. > > > > Cheers, > > > > Andrew > > > > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew Cunningham > Sent: Thursday, August 11, 2016 8:51 PM > To: Taylor Canning > Cc: Unicode Mailing List > > Subject: Re: Myanmar character set > > > > Hi Taylor, > > This should work fine in theory. Are you using a mymr or mym2 style OpenType font? > > What applications, operating system and fonts are you using? > > Andrew > > > > On 12 Aug 2016 12:55 pm, "Taylor Canning" wrote: >> >> Hi there, has anyone had any issues with the Myanmar character set – I have raised an issue recently where the combination ? and ? does not combine correctly to make ?? on my Windows devices. It used to work just fine. It is an extremely common tonal marker and is a big issue for anyone who types the S’Gaw Karen language, which is a lot of people! >> >> Thanks, Taylor >> >> >> >> Sent from my Windows 10 phone >> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From Andrew.Glass at microsoft.com Sun Aug 14 11:21:10 2016 From: Andrew.Glass at microsoft.com (Andrew Glass) Date: Sun, 14 Aug 2016 16:21:10 +0000 Subject: Myanmar character set In-Reply-To: References: , Message-ID: Yes, this impacts mym2 only. From: Andrew Cunningham Sent: Saturday, August 13, 2016 8:43 PM To: Andrew Glass Cc: Unicode Mailing List; Taylor Canning Subject: RE: Myanmar character set Hi Andrew, I assume the issue is with mym2 shaper? Andrew C On 13 Aug 2016 5:02 am, "Andrew Glass" > wrote: > > Hi Taylor and Andrew, > > > > This is a known issue with the Myanmar engine on Windows. We are tracking the issue, but don't have a date for the fix at this time. > > > > Cheers, > > > > Andrew > > > > From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew Cunningham > Sent: Thursday, August 11, 2016 8:51 PM > To: Taylor Canning > > Cc: Unicode Mailing List > > > Subject: Re: Myanmar character set > > > > Hi Taylor, > > This should work fine in theory. Are you using a mymr or mym2 style OpenType font? > > What applications, operating system and fonts are you using? > > Andrew > > > > On 12 Aug 2016 12:55 pm, "Taylor Canning" > wrote: >> >> Hi there, has anyone had any issues with the Myanmar character set - I have raised an issue recently where the combination ? and ? does not combine correctly to make ?? on my Windows devices. It used to work just fine. It is an extremely common tonal marker and is a big issue for anyone who types the S'Gaw Karen language, which is a lot of people! >> >> Thanks, Taylor >> >> >> >> Sent from my Windows 10 phone >> >> -------------- next part -------------- An HTML attachment was scrubbed...
URL: From christoph.paeper at crissov.de Tue Aug 16 18:22:35 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 17 Aug 2016 01:22:35 +0200 Subject: The Relationship of 1-Line Character Art and Emojis Message-ID: <05ED5874-D855-4482-8EF6-3FC331E36D20@crissov.de> Is a combination of characters solely for the combined glyphic appearance, which is frequently used over a long time, commonly considered a proof of demand for encoding additional symbols or emojis? Most Western emoticons have one or more conventional ASCII-only strings representing a sideways face, e.g. ?;-)? and ?;)? for ?? U+1F609 or the infamous heart ?<3? ?? U+2764 etc. Many also (or only) have an upright Eastern emote form which often uses characters way beyond U+007F or U+00FF, e.g. ??\_(?)_/?? who brought us ?? U+1F937 and ?^_^? or ?^^? for ?? U+1F60A or the IPA-Cyrillic butterfly ?????. In many a messaging software (mail, texting, chat, forum, board, blog ?) a large (partially proprietary, partially conventionalized) set of those is supported to be converted to images or Unicode code-points. In fact, most original smiley repertoires were probably based upon prior art, i.e. already established character sequences. In the 200X years, there was quite a competition in supporting new codes and designing themes for them. Today the kids got stickers and GIF memes to supplement standard Unicode emojis. Is there any UTR or the like that tracks canonic non-emoji character sequences for emoji characters? On a related matter, is there any document issued by the Unicode Consortium which acknowledges a standard set of ?short names? as used in :colon_codes:? 
There are several more or less diverging collections: - Emoji One?s EAC: https://github.com/Ranks/emojione/blob/master/emoji.json - Github?s Gemoji: https://github.com/github/gemoji/blob/master/db/emoji.json - Muan.co?s Emojilib: https://github.com/muan/emojilib/blob/master/emojis.json - Unicodey?s Emoji Data: https://github.com/iamcal/emoji-data/blob/master/emoji.json - ? From srl at icu-project.org Tue Aug 16 20:27:45 2016 From: srl at icu-project.org (Steven R. Loomis) Date: Tue, 16 Aug 2016 18:27:45 -0700 Subject: The Relationship of 1-Line Character Art and Emojis In-Reply-To: <05ED5874-D855-4482-8EF6-3FC331E36D20@crissov.de> References: <05ED5874-D855-4482-8EF6-3FC331E36D20@crissov.de> Message-ID: <311EC31E-C0E6-4352-BFDD-98ACF4CFE9D0@icu-project.org> El 8/16/16 4:22 PM, "Unicode en nombre de Christoph P?per" escribi?: >On a related matter, is there any document issued by the Unicode Consortium which acknowledges a standard set of ?short names? as used in :colon_codes:? There are several more or less diverging collections: See Annotations under http://www.unicode.org/reports/tr51/#Input and related. -s From christoph.paeper at crissov.de Wed Aug 17 01:56:52 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 17 Aug 2016 08:56:52 +0200 Subject: The Relationship of 1-Line Character Art and Emojis In-Reply-To: <311EC31E-C0E6-4352-BFDD-98ACF4CFE9D0@icu-project.org> References: <05ED5874-D855-4482-8EF6-3FC331E36D20@crissov.de> <311EC31E-C0E6-4352-BFDD-98ACF4CFE9D0@icu-project.org> Message-ID: <791CCB5A-22F7-4FB5-9FB5-0E56307F93C4@crissov.de> Steven R. Loomis : > El 8/16/16 4:22 PM, "Unicode en nombre de Christoph P?per": > >> is there any document issued by the Unicode Consortium which acknowledges a standard set of ?short names? as used in :colon_codes:? There are several more or less diverging collections: > > See Annotations under http://www.unicode.org/reports/tr51/#Input and related. Oops, missed that part, thanks. 
So UTR51 does acknowledge the existence of non-emoji emoticons and short names inside a pair of colons, but (implicitly) puts their standardization outside the scope of Unicode. There is probably more diversity among them than there was among single-codepoint emojis used by Japanese telcos. Still, a file like *could* be produced for character sequences that map to emojis. Btw., the Input section didn’t change in the current draft. From rscook at unicode.org Wed Aug 17 13:15:02 2016 From: rscook at unicode.org (Richard Cook) Date: Wed, 17 Aug 2016 11:15:02 -0700 Subject: Counting rods alternate forms In-Reply-To: References: Message-ID: On Aug 13, 2016, at 2:06 PM, eduardo marin wrote: > > It is well known that the southern song style of counting rods, had different forms for the digits 4, 5 and 9 https://en.wikipedia.org/wiki/Counting_rods , however currently there is no way to represent such forms, 〤[U+3024] 〥[U+3025] 〩[U+3029] > a proposal to add them would only occupy five code points, since number four is identical both vertical and horizontally. > > Counting rods - Wikipedia, the free encyclopedia > en.wikipedia.org > Counting rods represent digits by the number of rods, and the perpendicular rod represents five. To avoid confusion, vertical and horizontal forms are alternately used.
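The three code points cited above can be looked up directly. A minimal sketch, assuming Python 3.9 or later and its standard `unicodedata` module; the helper name `lookup` is hypothetical, and the names and values come from whichever Unicode version the interpreter ships.

```python
import unicodedata

# Code points cited in the reply above.
CITED = (0x3024, 0x3025, 0x3029)


def lookup(cp: int) -> tuple[str, str, float]:
    """Return (U+ notation, character name, numeric value) for a code point.

    `lookup` is a hypothetical helper used only for this illustration.
    """
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.name(ch), unicodedata.numeric(ch))


if __name__ == "__main__":
    for cp in CITED:
        code, name, value = lookup(cp)
        print(f"{code} {name} = {value:g}")
```

The formal names returned for these three characters use HANGZHOU, so a lookup like this shows only what is already encoded; it does not settle whether the Southern Song rod forms are the same characters.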
From doug at ewellic.org Wed Aug 17 15:16:13 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 17 Aug 2016 13:16:13 -0700 Subject: Counting rods alternate forms Message-ID: <20160817131613.665a7a7059d7ee80bb4d670165c8327d.6e1124fa29.wbe@email03.godaddy.com> Richard Cook wrote: >> It is well known that the southern song style of counting rods, had >> different forms for the digits 4, 5 and 9 >> https://en.wikipedia.org/wiki/Counting_rods , however currently there >> is no way to represent such forms, > > ?[U+3024] ?[U+3025] ?[U+3029] The Wikipedia article (which, as such, needs to be corroborated) says: "In the 13th century, Southern Song mathematicians changed digits for 4, 5, and 9 to reduce strokes. The new horizontal forms eventually transformed into Suzhou numerals. Japanese continued to use the traditional forms." So there is a distinction between the alternate forms Eduardo is describing, and the Suzhou ("Hangzhou") numerals derived from them that Richard cited. -- Doug Ewell | Thornton, CO, US | ewellic.org From christoph.paeper at crissov.de Thu Aug 18 05:19:04 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 18 Aug 2016 12:19:04 +0200 Subject: Completing Emoji Sets: Animals and Zodiac Signs Message-ID: <08520FEF-D450-4CE8-AC82-A4910B933BF7@crissov.de> Not least the TERIS proposal has taught me that, at least for emojis, Unicode?s encoding policy has changed from ?show use, get character? to ?show system with gap, get character?. Well, then, I?ve noticed several incomplete sets or systems in the emoji ranges. This is the first of a planned series of messages describing such sets. ---- For Japanese telcos, one original purpose of animal emojis seems to have been **Eastern zodiac signs**, as documented in the annotations of U+1F400?18. Without consulting other sources, we get these (perhaps inaccurate) sets: * China: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Japan: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Thai: ?? ?? ?? ?? 
?? ?? ?? ?? ?? ?? ?? ?? * Kazakh: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Vietnam: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Persian 1: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Persian 2: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? The original animal faces closely match these, if we assume that some were unified and others have no facial variant. * ?? = ?? / ?? * ?? = ?? / ?? / ?? * ?? = ?? / ?? * ?? = ?? / ?? * ??? = ?? / ?? * ??? = ?? / ?? / ?? * China: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ? Persian 1 * Japan: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Thai: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Kazakh: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * Persian: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? = Persian 2 * Vietnam: ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? Chicken ?? is often rendered as a hen?s head, but not explicitly so. The front-facing chick ?? could be substituted. The Boar is also often just a head, but maybe that?s what the snout ?? is for? Elephant, alligator, snail, snake, sheep / ram / goat have no portrait equivalent. I wonder if they should be added. The **Western signs of the zodiac** have long been supported by Unicode as abstract symbols U+2648?58(/CE), which gained colorful (now default) variants with the initial emoji release. Most of them also represent animals or certain kinds of humans. Some emoji characters are straightforward, others have been added specifically for this purpose (L2/14-284: Cancer, Leo, Scorpio, Sagittarius, Aquarius), but at least for Virgo (???) and Gemini (???) I?m not sure which character would be the canonic one. The Goat ?? isn?t really a sea-goat either, and the Ram ?? no ibex. There should at least be annotation, if not new dedicated code-points. * ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? * ?? ?? ?? ?? ?? ?? ? ?? ?? ?? ?? ?? As seen above, there?s already a Monkey and a Monkey Face, also the **Three Wise Monkeys** who see ??, hear ?? and speak ?? no evil (U+1F648?A). 
They are often depicted with a fourth friend, though, who is most often associated with either Think-No-Evil or Do-No-Evil, who usually has its hands folded, arms crossed, hands in lap, hands covering genitals or a combination thereof. One could get by with ???????? perhaps, but a dedicated monkey would be swell. (Btw., I?ve also found an example of Smell-No-Evil.) There are different conventional **food taboos** or dietary restrictions among human cultures and cults, especially regarding meat, and sometimes they?re just temporary (fasts). Emojis should be available to state either that a meal contains a certain ?ingredient? or does/should not (using ?? for instance). Although there?s no generic emoji for predators, rodents, insects or seafood, the only ones missing I could think of now that dairy ?? and eggs ?? have been added, are Jellyfish, Nest and body parts (entrails, genitals, brain). Food and other allergies are a separate matter by the way, because there are hardly any allergies to certain kinds of meat (except when counting crustaceans), but most major allergens already have sufficient emoji representatives. More complex or combined religious rules can also be handled by the respective symbols, e.g. ?? for ?kosher? and ?? for ?halal?. From christoph.paeper at crissov.de Thu Aug 18 05:19:30 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 18 Aug 2016 12:19:30 +0200 Subject: Registry for Emoji Ligature and Metaphor Sequences Message-ID: <1440AECE-0C3F-46B4-A713-D468C76D37EC@crissov.de> Situation --------- Ever since the advent of emojis in Unicode they have been combined with each other to convey new meaning in a number of ways, similar to what has been described for other (ancient) pictographic codes and had been the base for some proper scripts. 
Many combinations can be observed in actual conversations, others are mostly restricted to contrived puzzles or riddles: * Rebus: An emoji needs to be read as a particular word in a certain language but must then be interpreted as one of its homonyms, e.g. ?? EYE ? English pronunciation /ai/ ? ?I?. * Linguistic metaphor: An emoji needs to be read as a particular synonymous word in a certain language but must then be interpreted as one of its other synonyms, e.g. ?? CAT ? English synonym ?pussy? ? ?vagina/vulva? or ?? CREATURE WITH HORNS ? English ?horny?. This is also used a lot for compounds, e.g. ???? COW AND POO ? English ?bullshit?. * Graphical metaphor: An emoji needs to be reinterpreted as a similar looking object, e.g. ?? AUBERGINE, ?? BANANA, ?? CORN COB and many other phallus symbols. There are even second-degree metaphors such as ?? CANCER ? ?69? ?69? ? mutual oral sex (existing pan-linguistic metaphor). * Ideographic metaphor: An emoji needs to be read as a (conventionalized) semantic modifier of adjacent emojis, e.g. ? SCALES ? related to the judiciary: ??? ? ?male judge? or ? VENUS SIGN ? related to the female sex or feminine gender: ???? ? ?female construction worker? and some other of the proposed gender-dependent ZWJ sequences for professions. * Metaphor sign: An emoji needs to be reinterpreted as a symbol or icon ? gotta get my Peirce terms straight ? for an existing metaphor in a particular language or culture, e.g. gestures like ???? INDEX FINGER POINTING AT RING FORMED OF THUMB AND INDEX FINGER for phallic sex or animal heads ?? LION and ?? DRAGON representing the English and Welsh national soccer teams, because the constituents of GB or the UK are not national entities (?countries?) in the sense of ISO 3166 which is used by Regional Indicator Symbol sequences like ???? and thus cannot be represented by emoji flags yet. The ?correct? interpretation (i.e. 
the one desired by the author) often relies on textual, discursive, cultural and even technical context, e.g. ?? PEACH is often used as a visual metaphor for a human body part, but it can be either of ?butt?, ?vulva? and (rarely) ?cleavage?, which may lead to misunderstandings as in ???? ?anal sex? / ?vaginal sex? / ?mammal sex?. Depending on the actual glyph and context, ?? EYES and ?? NOSE may be (mis)read as ?boobs? and ?penis?, respectively. Also consider the ongoing controversy regarding ?? being rendered as a toy gun, and the directionality of emojis, e.g. ?????? or ????, which doesn?t display as intended in Emoji One. Proposed Consequences --------------------- The conventional metaphors that are frequently used without much context and even in isolation, e.g. the eggplant penis, show an obvious demand for the encoding of distinguished emojis. The peach controversy highlights that emojis become ambiguous if too many desired ones are not available. Anyhow, a proper proposal for the encoding of emojis for genitals and other missing body parts shall follow at a later time or be written by someone else, e.g. Emojidex. The classic phonographic rebus combinations and the linguistic metaphors are language-dependent and must therefore remain out of scope for Unicode. The other types of emoji metaphors, however, often span across languages and cultures. They may rely on a certain degree of glyphic similarity among fonts and thus benefit from standardization. These sequences are, in my opinion, in scope of UTC specifications. There are two types of emoji ligatures already: Emoji Modifier Sequences and Emoji ZWJ Sequences. The former joins an emoji with indeterminate main color with an explicit modifier (currently only for human skin tones). Most deployed examples of the latter regroup a sequence of emoji characters (mostly ??????????, possibly with ?? and ??) into a single glyph, so U+1F46A ?? ? 
U+1F468-200D-1F469-200D-1F466 (although the precomposed character doesn?t strictly specify the gender of the child or the parents). Both kinds may be combined. Not least with Google?s recent gendered profession emojis proposal, the concept of ZWJ sequences has been expanded from visual ligatures to semantic and metaphoric compounds. That means, the can of worms has already been opened. Yet, the same concept seems to have been shunned for Regional Indicators for sub-regions (i.e. flags for Scotland, Wales, England etc.) in favor of a generic system without proof of demand (TERIS etc.). I firmly believe that the best solution to these related problems is a less formal Registry for Emoji Character Sequences (RECS, to give it a catchy acronym). It should be hosted and its rules be defined by UTC, but mostly run on itself. Proposed RECS process: - Anyone may claim a conventional meaning for a sequence of emojis free of charge via a web form. - This includes decompositions of existing emojis (e.g. ???? = ???? = ??? ? ??). - If the reading depends on a particular language (or meets other exclusion criteria, like codifying a name or title) it is rejected by the RECS Board, otherwise it is open for approval. - While multiple sequences may map to the same meaning redundantly, every character sequence is listed only once until either rejected or approved. - All variants resulting from applicable modifiers are included automatically by default. - An emoji sequence is approved when a non-trivial font implemented an appropriate composite glyph (routine). - The implementation may rely upon the presence of ZWJs or upon users activating features specific to the font technology (e.g. `liga` and `clig` in Open Type). - Every emoji sequence shall be formally registered for inclusion in StandardizedVariants.txt if two independent vendors support it in an interoperable manner. - The RECS is also used to record non-PUA characters ? perhaps restricted to `So` ?other symbols? ? 
with property `Emoji=No` that at least one publicly available font represents (only) as emoji glyphs and hence may become proper candidates. VS15/16 variation sequences should be automatically registered for all of them and the UTC needs to decide on the value of the `Emoji_Representation` property individually. The current equivalent of the RECS would be which is currently outdated or incomplete as, for instance, the Windows 10 Anniversary Update added multiple emoji character sequences to represent more types of families. PS: The RECS may also be used to record universal named character entities `&foo;` and short codes `:foo:`, although both usually rely on English, just like UTC character names. Finally, it may also be used to collect classic Western and Eastern emoticons like `:-)` and `?\_(?)_/?`. From christoph.paeper at crissov.de Thu Aug 18 05:20:11 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 18 Aug 2016 12:20:11 +0200 Subject: Emoji Variation Sequences: relaxing VS15/16 Message-ID: While VS1?VS14 (and VS17?256) are used for more or less arbitrary variant selection (VS4?VS14 are actually still unused), VS15 and VS16 have conventional de-facto semantics: select text style or emoji style. StandardizedVariants.txt explicitly claims, referring to Section 23.4 of Unicode 9, that implementations must ignore VSs that don?t form a standardized or ideographic variation sequence : > Standardized variation sequences are defined in this file. > Ideographic variation sequences are defined according to the registration > process specified in UTS #37, and are listed in the Ideographic > Variation Database. Only those two types of variation sequences > are sanctioned for use by conformant implementations. > In all other cases, use of a variation selector character does > not change the visual appearance of the preceding base character > from what it would have had in the absence of the variation selector. 
> > For more information on standardized variation sequences, > see Section 23.4, Variation Selectors, > in The Unicode Standard, Version 9.0. Can this be relaxed for VS15 and VS16? Unlike VS1?VS3 they don?t operate on arbitrary glyph differences but on the actual Unicode property `Emoji_Presentation`. From christoph.paeper at crissov.de Thu Aug 18 05:21:35 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 18 Aug 2016 12:21:35 +0200 Subject: Fwd: Text vs. Emoji: Default Text Style; no VS References: Message-ID: The first column of the table on that page is completely empty. I assume it was needed in a previous version, but now it should be removed or am I missing something? Total counts of characters for cells, rows and columns would be nice. From wjgo_10009 at btinternet.com Fri Aug 19 05:44:06 2016 From: wjgo_10009 at btinternet.com (William_J_G Overington) Date: Fri, 19 Aug 2016 11:44:06 +0100 (BST) Subject: Could there be UTR #53 (one of TTS Names, Read-out labels, Localization labels) and their application please? Message-ID: <19125945.23989.1471603446630.JavaMail.defaultUser@defaultHost> Unicode Technical Report #51 includes the following, in section 7. > There is one further kind of annotation, called a TTS name, for text-to-speech processing. and > TTS names are also outside the current scope of this document. What are now each named as a TTS name were once, by implication, each named as a read-out label. Quote from http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0083.html which is quoting from an early draft of UTR #51. > There is one further kind of label, called a "read-out", for text-to-speech. There was also a separate thread at the time. 
http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0086.html http://www.unicode.org/mail-arch/unicode-ml/y2014-m04/0096.html Now that emoji ZWJ sequences are under discussion, a great many images are being considered for encoding, each either as a regular character or as an emoji ZWJ sequence, and the trend is to encode more and more emoji. The differences between some of these images are often not great, and the intended meanings are not always immediately obvious when they are displayed in running text, particularly at small sizes. I therefore suggest that the following be considered. 1. Rename TTS Name as Localization label. This change widens the scope of what is presently named a TTS name from just text-to-speech to both text-to-speech and image-to-written-natural-language. An application program, maybe an email application, maybe a PDF document display program, maybe something else, could, at the request of an end user displaying the page, use either or both of text-to-speech and image-to-written-natural-language as desired. Image-to-written-natural-language could be either inline or on a tooltip-type label shown on mouse-over of the particular image. When image-to-written-natural-language is used inline, it could either replace the image or be in addition to the image as desired. 2. Produce Unicode Technical Report #53 Localization labels and their application UTR #51 states as follows. > TTS names are also outside the current scope of this document. So it seems reasonable to have a separate Unicode Technical Report that is about that topic.
Certainly, there is nothing in principle, as far as I am currently aware, to stop any manufacturer adding such functionality as I have mentioned in the previous section of this post, yet I opine that it would be best if a standardized way of doing so were in a Unicode Technical Report and that there are standardized localization files available, preferably from the Unicode webspace. I fully appreciate that if there is a file format put forward in a Unicode Technical Report that it might well not be the format that I suggested in 2014. I appreciate that there are experts in localization who may well have better ideas for the particular format that would be specified if Unicode Technical Report #53 Localization labels and their application is produced. William Overington Friday 19 August 2016 From christoph.paeper at crissov.de Sun Aug 21 08:51:50 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Sun, 21 Aug 2016 15:51:50 +0200 Subject: "Emojis" in Reading Texts for Beginners Message-ID: <07F331BE-B021-4966-AB8E-6F8951F35DC4@crissov.de> Are in-line pictures in reading instruction books, standing in mostly for nouns, considered supporting proof of existing use of proposed symbols or emojis? I recently realized, reading a children?s book to/with my sons, that a lot of the pictograms ? I estimated 80% in my sample ? could actually be represented reasonably well by existing emojis. Most of the ones that were missing were either very specific to the story (like the *?? ?tower? of a ?? and the *?? ?cannon ball? attached to the ? of a ??) or were closely related to the everyday life of a European toddler (e.g. a tricycle and a bike helmet). The glyphs are usually individual and specific to each book, especially if there are also full-page pictures in it, but I wouldn?t be the least surprised if a study found that the things ? and it?s mostly things indeed ? 
depicted in such books from different authors, publishers and languages came from a quite limited common vocabulary (for the most frequent parts at least). Different readings of the same pictogram, e.g. "truck" vs. "lorry" for ??, are usually not a problem in this application. Has such research been conducted and been presented to the UTC already? From mark at macchiato.com Sun Aug 21 09:48:36 2016 From: mark at macchiato.com (Mark Davis ☕️) Date: Sun, 21 Aug 2016 16:48:36 +0200 Subject: "Emojis" in Reading Texts for Beginners In-Reply-To: <07F331BE-B021-4966-AB8E-6F8951F35DC4@crissov.de> References: <07F331BE-B021-4966-AB8E-6F8951F35DC4@crissov.de> Message-ID: The selection criteria for emoji are unlike those of other characters, because their primary usage is different. If there is a particular set of emoji characters that you would like to propose, see information at http://unicode.org/emoji/selection.html for how to do so, and what the selection factors are. There is a link to that page at the top of most of the charts, such as http://unicode.org/emoji/charts-beta/full-emoji-list.html. Is there a way we can make that link more prominent, so that readers like you will notice it more easily? Mark On Sun, Aug 21, 2016 at 3:51 PM, Christoph Päper < christoph.paeper at crissov.de> wrote: > Are in-line pictures in reading instruction books, standing in mostly for > nouns, considered supporting proof of existing use of proposed symbols or > emojis? > > I recently realized, reading a children's book to/with my sons, that a lot > of the pictograms - I estimated 80% in my sample - could actually be > represented reasonably well by existing emojis. Most of the ones that were > missing were either very specific to the story (like the *?? "tower" of a > ?? and the *?? "cannon ball" attached to the ? of a ??) or were closely > related to the everyday life of a European toddler (e.g. a tricycle and a > bike helmet).
The glyphs are usually individual and specific to each book, > especially if there are also full-page pictures in it, but I wouldn't be > the least surprised if a study found that the things - and it's mostly > things indeed - depicted in such books from different authors, publishers > and languages came from a quite limited common vocabulary (for the most > frequent parts at least). Different readings of the same pictogram, e.g. > "truck" vs. "lorry" for ??, are usually not a problem in this application. > > Has such research been conducted and been presented to the UTC already? > From mark at macchiato.com Sun Aug 21 09:49:27 2016 From: mark at macchiato.com (Mark Davis ☕️) Date: Sun, 21 Aug 2016 16:49:27 +0200 Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs In-Reply-To: References: Message-ID: Similarity or containment in the same block as current emoji characters is not sufficient grounds for changing characters to have the Emoji property (and thus being eligible for the text/emoji VS). If there is a particular set of existing characters that you would like to propose to become Emoji (eg, to have the Emoji binary property), see information at http://unicode.org/emoji/selection.html#existing for how to do so. Mark On Fri, Aug 12, 2016 at 1:41 PM, Christoph Päper < christoph.paeper at crissov.de> wrote: > emoji_variation_sequence> > > > 2640 FE0E; text style; # FEMALE SIGN > > 2640 FE0F; emoji style; # FEMALE SIGN > > 2642 FE0E; text style; # MALE SIGN > > 2642 FE0F; emoji style; # MALE SIGN > > Since U+2640 and U+2642 double as symbols for the planets (and ancient > gods) Venus and Mars, respectively, users will rightfully expect VS-16 to > have an effect on the other planet symbols as well (probably including > U+2647 Pluto).
> > Both symbols are also sometimes used to represent Friday and Tuesday, > respectively, so some users may expect the symbols for the other 5 days of > the week to also react to U+FE0E/F. > > 1. Monday ☽ U+263D Moon or ☾ U+263E > 2. Tuesday ♂ U+2642 Mars > 3. Wednesday ☿ U+263F Mercury > 4. Thursday ♃ U+2643 Jupiter > 5. Friday ♀ U+2640 Venus > 6. Saturday ♄ U+2644 Saturn > 7. Sunday ☉ U+2609 Sun or ☼ U+263C > > U+2640/2 are also part of common sets of gender, sex and sexuality symbols > which, again, some users will expect to have emoji forms now and - be > prepared for the ?????? - also work in ZWJ or OpenType ligature sequences. > (I'm not sure how lesbian or gay versions of emojis, as proposed before in > L2/15-013 for instance, could become anything other than stereotypical > through offensive.) The real-world use may be a bit different from what the > annotations in the standard say, e.g. distinction of transgender and > intersex or sexuality and gender identity: > > > * ⚢ U+26A2 Doubled Female Sign > > = lesbianism > > * ⚣ U+26A3 Doubled Male Sign > > • a glyph variant has the two circles on the same line > > = male homosexuality > > * ⚤ U+26A4 Interlocked Female and Male Sign > > • a glyph variant has the two circles on the same line > > = bisexuality > > * ⚥ U+26A5 Male and Female Sign > > = transgendered sexuality > > = hermaphrodite (in entomology) > > * ⚦ U+26A6 Male with Stroke Sign > > = transgendered sexuality > > * ⚧ U+26A7 Male with Stroke and Male and Female Sign > > = transgendered sexuality > > * ⚲ U+26B2 Neuter > > Lastly, the 2 signs are also recognized by Unicode to be alchemical > symbols of copper and iron, respectively, but since that set is much larger > and even more esoteric I expect not much demand for emoji versions of all > of them. > > In conclusion, I see no good way other than to add a lot of additional > codepoints from the Miscellaneous Symbols block to StandardizedVariants.txt.
> > Cheers > > Christoph > From mark at macchiato.com Sun Aug 21 10:27:20 2016 From: mark at macchiato.com (Mark Davis ☕️) Date: Sun, 21 Aug 2016 17:27:20 +0200 Subject: ZWJ sequences in UTR #51 v4 In-Reply-To: References: <1050889064.862.1471071620287.JavaMail.www@wwinf1c20> Message-ID: There have been discussions of how an "unmarked" (neutral, ungendered) form could be represented. Here are just some thoughts. There are currently three types of gender representation. 1. Intrinsic (eg, FATHER CHRISTMAS) 2. +ZWJ+ (eg, male vs female health worker) 3. +ZWJ+ (eg, woman running vs man running; female vs male police officer) For #1, it might be possible to use ⚲ U+26B2 NEUTER (a character similar to ) in a ZWJ sequence to indicate a neutral form. Where we have paired forms, there'd need to be a consistent principle as to which to use. For #2, as Christoph points out, one could use a neutral base: a smiley, or perhaps a Unicode 10.0 ADULT emoji. For #3, the simplest mechanism would probably be to have the unmarked form be a neutral image. But for backwards compatibility, some might want to have a specific marker, eg ⚲ U+26B2 NEUTER. However, any proposal needs to be fully fleshed out, and have a representative range of clear examples of how graphic designs for the neutral characters would work. That is, they need to be clearly interpreted as non-gendered, even at small sizes: an average person, when shown the design in isolation, would say that the person depicted is of neither or either gender. That's easy to do with smiley-styles, but surprisingly difficult to achieve with the realistic styles that are used for the other people emoji.
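As an illustration of the second and third types listed above, here is a minimal Python sketch of how such sequences are assembled from code points. The helper names are made up for this example. The two gendered sequences use the code points as they appear in the published emoji ZWJ sequence data, which is an assumption relative to the draft under discussion. The sequence with U+26B2 NEUTER is purely hypothetical: it mirrors the marker idea floated above and is not a standardized sequence.

```python
import unicodedata

ZWJ = "\u200D"   # ZERO WIDTH JOINER
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation


def zwj_sequence(*parts):
    """Join emoji components with ZERO WIDTH JOINER (hypothetical helper)."""
    return ZWJ.join(parts)


def describe(sequence):
    """List every code point in the sequence together with its character name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in sequence]


# Type 2: gendered person + ZWJ + object (man health worker).
man_health_worker = zwj_sequence("\U0001F468", "\u2695" + VS16)

# Type 3: person + ZWJ + gender sign (woman running).
woman_running = zwj_sequence("\U0001F3C3", "\u2640" + VS16)

# Hypothetical only: an explicit neutral marker using U+26B2 NEUTER.
neutral_running = zwj_sequence("\U0001F3C3", "\u26B2" + VS16)

if __name__ == "__main__":
    for label, seq in [
        ("man health worker", man_health_worker),
        ("woman running", woman_running),
        ("neutral running (hypothetical)", neutral_running),
    ]:
        print(label)
        for line in describe(seq):
            print("   ", line)
```

Whether a font shows a single glyph for any of these depends entirely on the font; a renderer without support falls back to showing the components separately.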
Mark On Sat, Aug 13, 2016 at 9:37 AM, zelpa wrote: > On Sat, Aug 13, 2016 at 5:00 PM, Marcel Schneider > wrote: > >> On Fri, 12 Aug 2016 17:44:10 +1000, zelpa wrote: >> >> > Some of the ZWJ sequences in the latest revision seem sort of >> arbitrary, why is >> > male health worker Man + Staff of Asclepius instead of introducing a >> Doctor emoji >> > and simply using the female of male modifiers? The current proposition >> also >> > doesn't seem to allow for a gender-neutral doctor(?) >> >> As far as I know, the category ?health worker? is more general than >> ?doctor?, >> as it includes many professionals who are not physicians. >> >> Not surprisingly, the Consortium?s choice of encoding the MALE HEALTH >> WORKER emoji >> as a MAN associated with a STAFF OF AESCULAPIUS seems to me plain >> accurate. >> >> Marcel >> > > MALE HEALTH WORKER was just an example, any of the ZWJ sequences that > follow the PROFESSION ZWJ GENDER can be left gender neutral simply by > leaving out the gender(At least in theory, god knows what vendors would > actually choose to show) the sequences that follow the pattern PERSON ZWJ > OBJECT can only be male or female in the current proposition. Of course > health worker is more general than doctor, shouldn't have used that word. > My point was it's currently not possible to show a gender-neutral health > worker, student, farmer, teacher, judge, cook, mechanic, factory worker, > office worker, scientist, etc. using the current proposition. Kind of seems > backwards to force people to either pick female or male when using these > sequences. > -------------- next part -------------- An HTML attachment was scrubbed... 
From zelpahd at gmail.com Sun Aug 21 10:34:34 2016 From: zelpahd at gmail.com (zelpa) Date: Mon, 22 Aug 2016 01:34:34 +1000 Subject: ZWJ sequences in UTR #51 v4 In-Reply-To: References: <1050889064.862.1471071620287.JavaMail.www@wwinf1c20> Message-ID: > That's easy to do with smiley-styles, but surprisingly difficult to achieve with the realistic styles that are used for the other people emoji. Do they have to be realistic? Google's previous profession emoji were similar to the blob-like smileys; why not use that kind of representation unless a gender is given? I think a profession without a specified gender should be just that: a display of the profession, not of an actual human. On Mon, Aug 22, 2016 at 1:27 AM, Mark Davis ☕️ wrote: > There have been discussions of how an "unmarked" (neutral, ungendered) > form could be represented. Here are just some thoughts. > > There are currently three types of gender representation. > > 1. Intrinsic (eg, FATHER CHRISTMAS) > 2. +ZWJ+ (eg, male vs female health > worker) > 3. +ZWJ+ (eg, woman running vs man > running; female vs male police officer) > > For #1, it might be possible to use ⚲ U+26B2 NEUTER (a character similar > to ) in a zwj sequence to indicate a neutral form. Where > we have paired forms, there'd need to be a consistent principle as to which > to use. > For #2, as Christoph points out, one could use a neutral base: a smiley, > or perhaps a Unicode 10.0 ADULT emoji. > For #3, the simplest mechanism would probably be to have the unmarked form > be a neutral image. But for backwards compatibility, some might want to > have a specific marker, eg ⚲ U+26B2 NEUTER. > > However, any proposal needs to be fully fleshed out, and have a > representative range of clear examples of how graphic designs for the > neutral characters would work.
That is, they need to be clearly interpreted > as non-gendered, even at small sizes: an average person, when shown the > design in isolation, would say that the person depicted is of neither or > either gender. > > That's easy to do with smiley-styles, but surprising difficult to achieve > with the realistic styles that are used for the other people emoji. > > Mark > > On Sat, Aug 13, 2016 at 9:37 AM, zelpa wrote: > >> On Sat, Aug 13, 2016 at 5:00 PM, Marcel Schneider >> wrote: >> >>> On Fri, 12 Aug 2016 17:44:10 +1000, zelpa wrote: >>> >>> > Some of the ZWJ sequences in the latest revision seem sort of >>> arbitrary, why is >>> > male health worker Man + Staff of Asclepius instead of introducing a >>> Doctor emoji >>> > and simply using the female of male modifiers? The current proposition >>> also >>> > doesn't seem to allow for a gender-neutral doctor(?) >>> >>> As far as I know, the category ?health worker? is more general than >>> ?doctor?, >>> as it includes many professionals who are not physicians. >>> >>> Not surprisingly, the Consortium?s choice of encoding the MALE HEALTH >>> WORKER emoji >>> as a MAN associated with a STAFF OF AESCULAPIUS seems to me plain >>> accurate. >>> >>> Marcel >>> >> >> MALE HEALTH WORKER was just an example, any of the ZWJ sequences that >> follow the PROFESSION ZWJ GENDER can be left gender neutral simply by >> leaving out the gender(At least in theory, god knows what vendors would >> actually choose to show) the sequences that follow the pattern PERSON ZWJ >> OBJECT can only be male or female in the current proposition. Of course >> health worker is more general than doctor, shouldn't have used that word. >> My point was it's currently not possible to show a gender-neutral health >> worker, student, farmer, teacher, judge, cook, mechanic, factory worker, >> office worker, scientist, etc. using the current proposition. Kind of seems >> backwards to force people to either pick female or male when using these >> sequences. 
>> > > From christoph.paeper at crissov.de Mon Aug 22 16:26:27 2016 From: christoph.paeper at crissov.de (Christoph Päper) Date: Mon, 22 Aug 2016 23:26:27 +0200 Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs In-Reply-To: References: Message-ID: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> JFTR, in the meantime, I had already resubmitted my original post through the proper feedback channel as a comment on the current draft of TR51. Mark Davis ☕️: > > Similarity or containment in the same block as current emoji characters is not sufficient grounds for changing characters to have the Emoji property (...). I'm not arguing on either of those grounds. I'm saying that U+2640 and U+2642 are integral parts of several sets and it will confuse and annoy users of all kinds if some of them can be emojis and some cannot. If these two symbols had been used as emojis before, the case would be slightly different, but they're becoming emojis just because they're *intended* to be used in certain ZWJ sequences. That's why, for instance, U+1F170/1 🅰/🅱 (as in blood types) don't necessarily constitute a reason to emojify U+1F172-89, too. If Female Sign and Male Sign shall be emojis, but - U+2609 ☉ Sun - U+263C ☼ White Sun with Rays - U+263F ☿ Mercury - U+2641 ♁ Earth - U+2643 ♃ Jupiter - U+2644 ♄ Saturn - U+2645 ♅ Uranus - U+26E2 ⛢ Astronomical Symbol for Uranus - U+2646 ♆ Neptune - U+2647 ♇ Pluto - U+263D ☽ First Quarter Moon - U+263E ☾ Last Quarter Moon shall not, you had better disunify them from the symbols for Venus and Mars first. You'd also need separate code-points for Male Heterosexuality and Female Heterosexuality to go with Homosexuality and Bisexuality signs ("sexuality" = preference) - ⚢ U+26A2 Doubled Female Sign - ⚣ U+26A3 Doubled Male Sign - ⚤
U+26A4 Interlocked Female and Male Sign as well as Cisgendered Sexuality (maybe in male and female variants, ?sexuality? = identity) to go with any of the Transgendered Signs - ? U+26A5 Male and Female Sign - ? U+26A6 Male with Stroke Sign - ? U+26A7 Male with Stroke and Male and Female Sign and finally Feminine and Masculine to go with - ? U+26B2 Neuter. > If there is a particular set of existing characters that you would like to propose to become Emoji (eg, to have the Emoji binary property), see information at http://unicode.org/emoji/selection.html#existing for how to do so. I was only pointing out that the draft I was commenting on was making a mistake, in my humble opinion, which could be fixed easily by emojifying the mentioned sets of characters instead of just an arbitrary two-piece subset. If my objection was rejected I could and probably would then open a separate proposal, but it would seem ridiculously redundant doing that right now, because whether I proposed astro_o_ic or gender and sexuality emoji or all of them at once, it would have to include U+2640/2. >>> Proposing to change the emoji properties to include existing characters or sequences as emoji is a much simpler process than submitting a proposal for a new character. The proposal need only provide evidence that an emoji presentation of those characters or sequences would be supported by a reasonably broad set of vendors. That?s really only a simpler process if you are working for an important vendor and can convince another one ? whose representative-in-charge you may know from a random standardization committee ? to deploy fonts with an emoji glyph for a certain character. Even if someone like me would compile a number of fonts from FOSS image emoji collections (that go beyond the current canon) or could convince Emoji One and Twemoji to emojify a certain code-point, one couldn?t be sure at all that this was considered a ?reasonably broad set of vendors?. Take U+1F946 ?? 
as a counter-example: it used to have preliminary support as an emoji by Google/Android and I think still has in Emoji One, but other big names nevertheless successfully lobbied for it not to become an emoji character. When did character properties become vendor-driven? They should be based upon actual writing habits, i.e. expectations of writers and readers. The `Emoji` property is no exception to that, despite the history of the original set of emoji characters. I really can't see how you could justify the emojification of U+2640 and U+2642 (and not related characters) except by saying "Google and Apple want it so, because they need them for some new thing". Did anyone actually bother to gather empirical data of emoji/symbol sequences in existing content to denote gendered professions? It's a really bad move and that's what I'm criticizing about the eighth draft of UTR #51. I don't disagree with ♀ and ♂ being used as the second part in ZWJ emoji sequences, but 1. it's incomplete without an explicit neutral/ambiguous alternative and 2. if they need `Emoji=yes` as a result, this must also be applied to a bunch of related characters. From roozbeh at unicode.org Mon Aug 22 18:24:51 2016 From: roozbeh at unicode.org (Roozbeh Pournader) Date: Mon, 22 Aug 2016 16:24:51 -0700 Subject: Emoji Variation Sequences: relaxing VS15/16 In-Reply-To: References: Message-ID: I agree that this should be relaxed for VS15 and VS16. For example, the current draft version of UTR #51 even suggests that this be done for three sequences that cannot get variation sequences defined until Unicode 10.0. Would you please write a proposal document or send a note through http://www.unicode.org/reporting.html?
Thanks, Roozbeh On Thu, Aug 18, 2016 at 3:20 AM, Christoph P?per < christoph.paeper at crissov.de> wrote: > While VS1?VS14 (and VS17?256) are used for more or less arbitrary variant > selection (VS4?VS14 are actually still unused), VS15 and VS16 have > conventional de-facto semantics: select text style or emoji style. > StandardizedVariants.txt explicitly claims, referring to Section 23.4 of > Unicode 9, that implementations must ignore VSs that don?t form a > standardized or ideographic variation sequence UCD/latest/ucd/StandardizedVariants.txt>: > > > Standardized variation sequences are defined in this file. > > Ideographic variation sequences are defined according to the registration > > process specified in UTS #37, and are listed in the Ideographic > > Variation Database. Only those two types of variation sequences > > are sanctioned for use by conformant implementations. > > In all other cases, use of a variation selector character does > > not change the visual appearance of the preceding base character > > from what it would have had in the absence of the variation selector. > > > > For more information on standardized variation sequences, > > see Section 23.4, Variation Selectors, > > in The Unicode Standard, Version 9.0. > > > Can this be relaxed for VS15 and VS16? > Unlike VS1?VS3 they don?t operate on arbitrary glyph differences but on > the actual Unicode property `Emoji_Presentation`. Public/emoji/4.0/emoji-data.txt> > -------------- next part -------------- An HTML attachment was scrubbed... 
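To make the rule quoted above concrete: a variation selector only has a defined effect when the pair (base, selector) is listed as a sanctioned sequence. The following is a minimal Python sketch under stated assumptions. The lookup table holds only the four FEMALE SIGN / MALE SIGN entries from the draft discussed in this thread (a real implementation would load the full data files), both function names are hypothetical, and `relaxed_style` is just one possible reading of the proposed relaxation, not anything the standard defines.

```python
VS15 = 0xFE0E  # VARIATION SELECTOR-15: text presentation
VS16 = 0xFE0F  # VARIATION SELECTOR-16: emoji presentation

# Assumption: only the four draft entries quoted in this thread.
SANCTIONED = {
    (0x2640, VS15): "text style",   # FEMALE SIGN
    (0x2640, VS16): "emoji style",  # FEMALE SIGN
    (0x2642, VS15): "text style",   # MALE SIGN
    (0x2642, VS16): "emoji style",  # MALE SIGN
}


def requested_style(text):
    """Current rule: return the style only for a listed (base, selector) pair.

    None means the selector has no defined effect on the base character.
    """
    if len(text) != 2:
        return None
    return SANCTIONED.get((ord(text[0]), ord(text[1])))


def relaxed_style(text):
    """One reading of the relaxation: VS15/VS16 express a preference on any base."""
    if len(text) != 2:
        return None
    return {VS15: "text style", VS16: "emoji style"}.get(ord(text[1]))


if __name__ == "__main__":
    for base in ("\u2640", "\u2643"):  # FEMALE SIGN, JUPITER
        seq = base + chr(VS16)
        print(f"U+{ord(base):04X}: strict={requested_style(seq)}, relaxed={relaxed_style(seq)}")
```

Under the strict lookup, U+2643 JUPITER followed by VS16 yields no style at all, which is exactly the asymmetry between the two signs and the other planet symbols raised elsewhere in the thread.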
URL: From mark at macchiato.com Wed Aug 24 07:09:18 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Wed, 24 Aug 2016 14:09:18 +0200 Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs In-Reply-To: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> Message-ID: On Mon, Aug 22, 2016 at 11:26 PM, Christoph P?per < christoph.paeper at crissov.de> wrote: > 1. it?s incomplete without an explicit neutral/ambiguous alternative and > ?As I said, people are actively investigating what to do about such cases. It may be that the solution is to add ? U+26B2 Neuter, but maybe not. We'll see as they develop further. 2. if they need `Emoji=yes` as a result, this must also be applied to a > bunch of related characters. > ?As I said, ?that is absolutely not a criterion. If one were to apply that principle ("must be applied to a bunch of related characters"), then because we have one playing-card emoji, we should make all of the playing cards be emoji; because of one Mahjong tile, one would add all of them. And then add all the chess pieces, and other game pieces. And because we have a few circled or squared ideograph and katakana emoji, make all the others emoji. And there are squared or negative ASCII emoji, so add all of the others as emoji. And alchemical symbols, and ... I suspect the transitive closure of this process could end up marking essentially all Unicode characters with the Emoji property. > commenting on was making a mistake, in my humble opinion, which could be fixed easily by emojifying the mentioned sets of characters The committee has and does consider related characters when looking at properties. But this case was not an oversight. Those particular characters were deliberately chosen. It is always possible to add other characters in the future; it will depend on whether they are deemed to be necessary. 
> When did character properties become vendor-driven? They should be based upon actual writing habits, i.e. expectations of writers and readers. An implementation can conform to Unicode and display all characters with a colorful presentation, or all characters with a text presentation. So there is nothing preventing you or anyone else from having an emoji font that displays any particular character with a colorful glyph. That is not the purpose of the Emoji property. The purpose of character properties is to promote interoperability. That has always been the case. By having a property for Line_Break, for example, Unicode gives implementations a common mechanism for producing interoperable results (and for customization for particular environments). The goal of the emoji properties is to have structure that promotes the highest degree of interoperability among the major implementations supporting emoji. It doesn't do any good for Unicode to mark a character as being emoji unless that would result in it being widely deployed as such. So the committee has to consider carefully what implementations will do. That is nothing new; we have to consider carefully what impact any change in a property (such as Line_Break) will have on implementations. You can certainly propose (via the reporting form) that any particular set of additional characters should get the Emoji property, and try to make a case for it. But I'd advise you to make a convincing case for your proposal - *without using grounds that would apply to hundreds or thousands of other characters*. In particular, you should address the question - for each of those characters - of whether there is a strong expectation that it would be frequently used. Mark
URL: From christoph.paeper at crissov.de Thu Aug 25 09:52:11 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Thu, 25 Aug 2016 16:52:11 +0200 Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs In-Reply-To: References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> Message-ID: <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> TL;DR: Unicode properties should reflect user expectations, not vendor choices. Mark Davis ?? : > On Mon, Aug 22, 2016 at 11:26 PM, Christoph P?per wrote: >> 1. it?s incomplete without an explicit neutral/ambiguous alternative and > > ?As I said, people are actively investigating what to do about such cases. It may be that the solution is to add ? U+26B2 Neuter, but maybe not. We'll see as they develop further. Natively speaking a language which can explicitly mark any actor noun with a morpheme as female/feminine, but neither as neutral nor as male/masculine ? a generic version of English ?actor/actress?, ?waiter/waitress?, ?prince/princess? ? and having intensely dealt with guidelines for corporate languages and public speech, I?ll assure you that a feminism/LGBT shitstorm will be heading for UTC and vendors if binary gender became mandatory for profession emojis. You should not approve Google?s and Apple?s ZWJ sequences without a neutral option. JFTR, I know that ? U+263F Mercury is also being proposed to denote androgynous/asexual emoji sequences. >> 2. if they need `Emoji=yes` as a result, this must also be applied to a bunch of related characters. > > ?As I said, ?that is absolutely not a criterion. As I said, it absolutely should be to honor user expectations. > If one were to apply that principle (?), then because we have one playing-card emoji, we should make all of the playing cards be emoji; because of one Mahjong tile, one would add all of them. And then add all the chess pieces, and other game pieces. 
It?s an open secret that all characters for game notations will have to become emojis sooner or later, regardless if one of them already had the emoji property. (I?m not sure I would have supported them being encoded in the first place, though, especially as lots of precomposed characters.) One big problem at the moment is, I think, that another user demand as anticipated by vendors is that every emoji font and UI should cover all of them. > And because we have a few circled or squared ideograph and katakana emoji, make all the others emoji. And there are squared or negative ASCII emoji, so add all of the others as emoji. I already addressed that strawman argument in my previous mail, regarding blood types. Precomposed characters with enclosing shapes are just there for compatibility reasons, so their Emoji property reflects compatibility needs. > And alchemical symbols, and ... I suspect the transitive closure of this process could end up marking essentially all Unicode characters with the Emoji property. No, but many, perhaps most of ?General Category = Other_Symbol (So), Script = Common, Bidirectional Category = Other_Neutral (ON)? probably and few others (e.g. with ?Bidirectional Category = L?). That?s little more than 3000 characters as of Unicode 9.0, which includes most existing emojis. Some of them, like reversed or rotated glyphs, would be simple to support for font designers, others could use identical emoji glyphs, e.g. lots of the Light/Medium/Bold/Heavy compatibility dingbat arrows, asterisks etc. Overall, the number of emojis (not counting Fitzpatrick and ZWJ variants) would less than double. > The committee has and does consider related characters when looking at properties. But this case was not an oversight. Those particular characters were deliberately chosen. It is always possible to add other characters in the future; it will depend on whether they are deemed to be necessary. The problem lies within the ?deemed to be necessary?. 
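The size of that candidate set can be roughly checked with Python's standard `unicodedata` module. This is only a sketch under stated assumptions: `unicodedata` does not expose the Script property, so the "Script = Common" filter is omitted and the count will come out higher than the figure given above; the result also depends on the Unicode version bundled with the Python interpreter, not on Unicode 9.0 specifically. The function name is made up for this example.

```python
import sys
import unicodedata


def count_symbol_candidates():
    """Count code points with General_Category=So and Bidi_Class=ON.

    The Script=Common condition from the message is NOT applied here,
    because unicodedata has no script lookup.
    """
    count = 0
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "So" and unicodedata.bidirectional(ch) == "ON":
            count += 1
    return count


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    print("So + ON code points:", count_symbol_candidates())
```

Applying the Script filter as well would need the Scripts.txt data file or a third-party library.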
> The purpose for character properties is to promote interoperability. That has always been the case. Sure, but for almost all characters and properties this has mostly been a descriptive approach, based upon existing texts. Whether a certain character will be included in emoji fonts and IMEs very strongly depends on whether it has the Emoji property (and how it reacts on VS-15/16). Unicode is hence wandering into prescriptive territory here. In the Rifle case, for instance, vendors have even removed emoji glyphs after the character, which was specifically proposed for emoji purposes like similar ones, became non-emoji late in the standardization process. On the other side, there are lots of legacy emojis that noone uses (or at least not with the originally intended meaning), but every emoji font supports. Since emojis are often input on mobile devices with some OSes being quite restrictive on installing alternative fonts or keyboards, this problem becomes even more serious. > The goal of the emoji properties is to have structure that promotes the highest degree of interoperability among the major implementations supporting emoji. What?s that, a ?major implementation[] supporting emoji?? Is it a font, an OS component, a GUI picker, a soft keyboard, a text/input prediction algorithm, a text substitution feature ?? You seem to be talking about the default setup on stock iOS (and Mac OS) and Android, maybe Windows (Phone). This effectively means that few US-based multi-billion-dollar companies ? Apple, Google, Microsoft and Facebook basically ? decide which character can be used as an emoji and which one cannot (while making money on ?stickers? at the same time) and unlike Japanese telcos Docomo, KDDI and Softbank they increasingly do so with an agenda. This is a problem. The UTC could be the voice of the global multi-billion-head user base here, but, alas, it?s largely funded and staffed by the aforementioned companies and others like them. 
You see, if I was an ancient Egyptian chiseling an ejaculating/peeing penis ?? or a 19th-century typographer drawing a heart-shaped exclamation mark ? or a late 20th-century Japanese engineer encoding brothels ?? as POIs in my mobile map application, these would be considered characters and become part of the Unicode standard in the 21st century. If there are millions or even billions of people who use pictograms for human genitalia in electronic textual communication today (as their ancestors had been doing in analog media for millenia), they have to rely on conventionalized linguistic ?? or graphical ?? metaphors or they must abuse punctuation marks, digits and letters to ?draw? body parts inline, ({|}) 3==D (.Y.) (_!_) (and *many* variants thereof), if they don?t want to resort to actual pictures, which most users are bad at drawing and thus must acquire elsewhere which means additional efforts, costs and legal issues. The chance of these pictographs being encoded as single, unambiguous (see ??) characters is basically nil due to the mentioned gatekeepers. Even if they ever made it into the standard, there would still be font vendors who would either not ship any glyph for such characters (see U+130BA etc.), only an inferior one (see ??) or, perhaps worst, a misleading/wrong one (see ??) and OS vendors may exclude them from input methods (see ??) or search engines would ignore them (see #??) on religious, political or other non-technical grounds. And yes, I?m preparing a proper proposal for missing body part emojis nevertheless, but maybe someone beats me to it. > It doesn't do any good for Unicode to mark a character as being emoji unless that would result in it being widely deployed as such. Sorry, but you got that backwards. There are some characters that have non-intuitive or unsystematic properties in Unicode, due to mistakes in the standardization process or bugs in widespread implementations. 
This may apply as well to some existing emojis (or all of them, for some people), which shouldn?t have been in i-mode phones in the first place. It does not apply, however, to future emojis, whether made from existing characters or new ones. If a character is a pictogram that is less abstracted than sinograms and other signs used for writing proper, people will want to use it as an emoji (or at least find a use for it if it was available). They can only do so if fonts and software treat them as such. Most vendors will not make those do so unless the standard says they should, because only then they can expect the competition (i.e. potential partners in communication interchange) to do so, too. A major part of standardization is to document existing (best) practice, but another is to synthesize general concepts from this and to develop new solutions based there upon for better interoperability and user experience in the future. It is failing the latter to deny some characters the Emoji property on arbitrary grounds (incl. demands of high-profile stakeholders) or not including tabooish characters. > So the committee has to consider carefully what implementations will do. That is nothing new; we have to consider carefully what the impact of any change in property (such as Line_Break) will do in implementations. ?? What major implementers want. ?? Effect of change on (existing) implementations. > You can certainly propose (?), that any particular set of additional characters should get the Emoji property, and try to make a case for it. Will do, but I?m trying to find out here beforehand whether I?m just wasting my time and everyone else?s, because I?m afraid that could indeed be the case. > But I'd advise you to make a convincing case for your proposal ? without using grounds that would apply to hundreds or thousands of other characters. In particular, you should address the question ? for each of those characters ? 
of whether there is a strong expectation that it would be frequently used.

That’s trying to scare away useful input from small and independent parties. The Unicode process is good at that, but at least it allows for it, unlike many other standardization bodies.

Sorry, this got long.

From verdy_p at wanadoo.fr Thu Aug 25 17:01:10 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Fri, 26 Aug 2016 00:01:10 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
Message-ID:

2016-08-25 16:52 GMT+02:00 Christoph Päper :

> TL;DR: Unicode properties should reflect user expectations, not vendor > choices. > > Mark Davis ☕️ : > > On Mon, Aug 22, 2016 at 11:26 PM, Christoph Päper < > christoph.paeper at crissov.de> wrote: > >> 1. it’s incomplete without an explicit neutral/ambiguous alternative and > > > > As I said, people are actively investigating what to do about such > cases. It may be that the solution is to add ⚲ U+26B2 Neuter, but maybe > not. We'll see as they develop further. > > Natively speaking a language which can explicitly mark any actor noun with > a morpheme as female/feminine, but neither as neutral nor as male/masculine > - a generic version of English “actor/actress”, “waiter/waitress”, > “prince/princess” - and having intensely dealt with guidelines for > corporate languages and public speech, I’ll assure you that a feminism/LGBT > shitstorm will be heading for UTC and vendors if binary gender became > mandatory for profession emojis. You should not approve Google’s and > Apple’s ZWJ sequences without a neutral option. >

In my opinion such a sequence is not even needed.
Unless sequences are annotated with a gender, or skin color or similar extension, they are neutral and may be represented using any available option (even if it's not really neutral). Joining a "neutral" character will not add any meaning, so it is overkill to just standardize it in sequences where it will be simply ignored/discarded to use the same glyph as the default glyph for the initial character outside any ligature (it should be the same glyph in all cases: if there's a really neutral glyph, it will be used by default for the base character).

From karl-pentzlin at acssoft.de Fri Aug 26 05:49:04 2016
From: karl-pentzlin at acssoft.de (Karl Pentzlin)
Date: Fri, 26 Aug 2016 12:49:04 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
Message-ID: <572703342.20160826124904@acssoft.de>

Today in the "Frankfurter Allgemeine Zeitung", one of the leading German newspapers: The comment regards a UTC decision to refuse the acceptance of emojis for Olympic rifles, as well as the fact that Apple's iOS 10 displays U+1F52B as a toy water pistol, as an attack on Free Speech:

http://www.faz.net/aktuell/feuilleton/debatten/apple-emojis-die-zensur-der-symbole-14404026.html?printPagedArticle=true#pageIndex_2

"Das Unicode-Konsortium wirkt wie eine Neuauflage des Orwellschen Wahrheitsministeriums, das die englische Sprache durch eine um schädliche Begriffe gereinigte, neue Sprache ersetzte und die übriggebliebenen Worte „unorthodoxer“ Nebenbedeutungen entkleidete."
("The Unicode Consortium appears like a reissue of Orwell's Ministry of Truth, which replaced the English language by a new one, swept clean of harmful terms, and which removed "unorthodox" connotations from the rest of the words.")

- Karl Pentzlin

From christoph.paeper at crissov.de Fri Aug 26 06:07:12 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 26 Aug 2016 13:07:12 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de>
Message-ID: <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>

Christoph Päper :
>
> No, but many, perhaps most of “General Category = Other_Symbol (So), Script = Common, Bidirectional Category = Other_Neutral (ON)” probably and few others (e.g. with “Bidirectional Category = L”). That’s little more than 3000 characters as of Unicode 9.0, which includes most existing emojis.

I just learned that recent Samsung phones already contain emoji representations for many of these symbols.

From gwalla at gmail.com Fri Aug 26 11:16:02 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Fri, 26 Aug 2016 09:16:02 -0700
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To: <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
Message-ID:

On Fri, Aug 26, 2016 at 4:07 AM, Christoph Päper < christoph.paeper at crissov.de> wrote: > Christoph Päper : > > > > No, but many, perhaps most of “General Category = Other_Symbol (So), > Script = Common, Bidirectional Category = Other_Neutral (ON)” probably and > few others (e.g. with “Bidirectional Category = L”).
That’s little more > than 3000 characters as of Unicode 9.0, which includes most existing emojis. > > I just learned that recent Samsung phones already contain emoji > representations for many of these symbols. >

Samsung's emoji support is idiosyncratic, to say the least. They make the Orthodox typikon symbols, BLACK SNOWMAN, MUSIC SHARP SIGN, and the I Ching symbols into emoji for no apparent reason. It's especially baffling because the "emoji" versions are still black and white, just with a gradient applied to make them look shiny. WHITE CIRCLE WITH TWO DOTS is emoji on Samsung...why? The chess symbols get turned into emoji, breaking figurine notation. REVERSED ROTATED FLORAL HEART BULLET, which is an old typographical dingbat (really a stylized ivy leaf, neither a heart nor, strictly speaking, floral), gets put on a pink square. And that's not even getting into the bizarre and misleading design decisions on the emoticon emoji.

From jsoconner at gmail.com Fri Aug 26 12:01:18 2016
From: jsoconner at gmail.com (John O'Conner)
Date: Fri, 26 Aug 2016 17:01:18 +0000
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References: <572703342.20160826124904@acssoft.de>
Message-ID:

What I find more interesting is how emoji (a small digital image or icon) was ever interpreted as encodable text for the Unicode Standard. If our German newspaper friends have made a mistake in interpreting emoji as speech, I think the Unicode consortium has made an even bigger mistake.

Regards,
John

On Fri, Aug 26, 2016 at 5:26 AM Helena S Chapman wrote: > This is an interesting way of interpreting "speech". To understand that, > we need to look at what an emoji is: "A small digital image or icon used to > express an idea, emotion, etc., in electronic communication." In no way we > can agree Emoji "replaced the English language".
The first Emoji was > designed by Shigetaka Kurita on NTT's docomo, there is no indication it is > replacing Japanese language either. > > There isn't anything in Unicode that prevents people from expressing the > words "Rifles", "Guns", "Fire Arms", etc in various languages (real > languages such as German I meant). > > Best regards, > > Helena Shih Chapman > Director, IBM Globalization Executive *CISM* > > +1-720-396-6323 > www.ibm.com/globalization > > From: Karl Pentzlin > To: unicode at unicode.org > Cc: "unicore at unicode.org" > Date: 08/26/2016 06:58 AM > Subject: Comment in a leading German newspaper regarding the way > UTC and Apple handle Emoji as an attack on Free Speech > Sent by: "Unicore" > ------------------------------ > > Today in the "Frankfurter Allgemeine Zeitung", one of the leading > German newspapers: The comment regards a UTC decision to refuse the > acceptance > of emojis for Olympic rifles, as well as the fact that Apple's iOS 10 > displays > U+1F52B as a toy water pistol, as an attack on Free Speech: > > http://www.faz.net/aktuell/feuilleton/debatten/apple-emojis-die-zensur-der-symbole-14404026.html?printPagedArticle=true#pageIndex_2 > > "Das Unicode-Konsortium wirkt wie eine Neuauflage des Orwellschen > Wahrheitsministeriums, > das die englische Sprache durch eine um schädliche Begriffe gereinigte, > neue Sprache > ersetzte und die übriggebliebenen Worte „unorthodoxer“ Nebenbedeutungen > entkleidete." > > ("The Unicode Consortium appears like a reissue of Orwell's Ministry > of Truth, which replaced the English language by a new one, swept clean > of harmful terms, and which removed "unorthodox" connotations from > the rest of the words.") > > - Karl Pentzlin >
From kenwhistler at att.net Fri Aug 26 12:26:44 2016
From: kenwhistler at att.net (Ken Whistler)
Date: Fri, 26 Aug 2016 10:26:44 -0700
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References: <572703342.20160826124904@acssoft.de>
Message-ID: <5d3d750e-8c80-599c-9462-6b20dd3064f6@att.net>

On 8/26/2016 10:01 AM, John O'Conner wrote:
> What I find more interesting is how emoji (a small digital image or > icon) was ever interpreted as encodable text for the Unicode Standard. > If our German newspaper friends have made a mistake in interpreting > emoji as speech, I think the Unicode consortium has made an even > bigger mistake. >

That particular horse left the barn over a decade ago, when Japanese telecom companies started extending Shift-JIS with emoji on various phones, and then connected those phones to the internet and started exchanging email with Unicode-based systems. The emoji were *already* *encoded* text by that point -- not merely some prospective, uncertain set of entities which *might* be *encodable*.

You might not like that. It certainly is problematical in many regards and creates some erroneous expectations. But this is far from the first time that less-than-ideal characters have been encoded as characters in the Unicode Standard. Exhibit 1: box drawing characters:

http://www.unicode.org/charts/PDF/U2500.pdf

I would contend that encoding wildly popular and extensively used little pictographs as characters makes a whole lot more sense in the abstract than encoding box-drawing graphic pieces for completely obsolete screen technology ever did.

And would people discussing this topic please pick *one* list to discuss it, and stop cross-posting to two lists?
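
For readers who have not met the box drawing block cited above as Exhibit 1, here is a minimal Python sketch of what those characters are for. It is an illustration only, not part of the original message: the helper name `draw_box` is made up for this example, and only a handful of the code points in the U+2500 chart are used.

```python
import unicodedata

def draw_box(width, height):
    """Return a text-mode rectangle built from light box drawing characters."""
    # U+250C/U+2510/U+2514/U+2518 are the four light corners;
    # U+2500 is the horizontal stroke and U+2502 the vertical stroke.
    top = "\u250C" + "\u2500" * width + "\u2510"
    middle = "\u2502" + " " * width + "\u2502"
    bottom = "\u2514" + "\u2500" * width + "\u2518"
    return "\n".join([top] + [middle] * height + [bottom])

if __name__ == "__main__":
    print(draw_box(10, 2))
    # Each piece is an ordinary encoded character with a formal name.
    for ch in "\u250C\u2500\u2502":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

In a monospaced font the pieces join into a frame, which is how "text mode" dialogs were drawn on character-cell displays.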
--Ken

From markus.icu at gmail.com Fri Aug 26 13:34:51 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 26 Aug 2016 11:34:51 -0700
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <5d3d750e-8c80-599c-9462-6b20dd3064f6@att.net>
References: <572703342.20160826124904@acssoft.de> <5d3d750e-8c80-599c-9462-6b20dd3064f6@att.net>
Message-ID:

On Fri, Aug 26, 2016 at 10:26 AM, Ken Whistler wrote: > On 8/26/2016 10:01 AM, John O'Conner wrote: > >> What I find more interesting is how emoji (a small digital image or icon) >> was ever interpreted as encodable text for the Unicode Standard. If our >> German newspaper friends have made a mistake in interpreting emoji as >> speech, I think the Unicode consortium has made an even bigger mistake. >> >> > That particular horse left the barn over a decade ago, when Japanese > telcom companies started extending Shift-JIS with emoji on various phones, > and then connected those phones to the internet and started exchanging > email with Unicode-based systems. The emoji were *already* *encoded* text > by that point -- not merely some prospective, uncertain set of entities > which *might* be *encodable*. >

Several people over time have also pointed out that "small images or icons" already got a foot in the door with Dingbats in Unicode 1.0.

markus
From charupdate at orange.fr Fri Aug 26 19:33:46 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 27 Aug 2016 02:33:46 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <572703342.20160826124904@acssoft.de>
References: <572703342.20160826124904@acssoft.de>
Message-ID: <2005376640.12735.1472258026328.JavaMail.www@wwinf1k24>

This FAZ article adds to the examples of today’s poor IT journalism in European newspapers, especially the most renowned ones. Despite an occasionally misused literary background, authors are deliberately ignoring more up-to-date sources, among which:

http://www.dezeen.com/2016/08/02/apple-swaps-revolver-emoji-water-pistol-ios-gun-violence/
https://www.buzzfeed.com/charliewarzel/thanks-to-apples-influence-youre-not-getting-a-rifle-emoji

I’m glad of Apple’s courageous initiative. It was only about removing the emoji property. Finally it’s up to the vendors to endorse what their emoji keyboards will be looking like. More generally, it isn’t as if Unicode and big tech companies were good to wrap up in colorful emoji all and everything people daren’t write out with words.

Marcel

On 26/08/16 12:57, Karl Pentzlin wrote: > Today in the "Frankfurter Allgemeine Zeitung", one of the leading > German newspapers: The comment regards a UTC decision to refuse the acceptance > of emojis for Olympic rifles, as well as the fact that Apple's iOS 10 displays > U+1F52B as a toy water pistol, as an attack on Free Speech: > > http://www.faz.net/aktuell/feuilleton/debatten/apple-emojis-die-zensur-der-symbole-14404026.html?printPagedArticle=true#pageIndex_2 > > "Das Unicode-Konsortium wirkt wie eine Neuauflage des Orwellschen Wahrheitsministeriums, > das die englische Sprache durch eine um schädliche Begriffe gereinigte, neue Sprache > ersetzte und die übriggebliebenen Worte „unorthodoxer“ Nebenbedeutungen entkleidete."
> > ("The Unicode Consortium appears like a reissue of Orwell's Ministry > of Truth, which replaced the English language by a new one, swept clean > of harmful terms, and which removed "unorthodox" connotations from > the rest of the words.") > > - Karl Pentzlin >

From zelpahd at gmail.com Fri Aug 26 19:57:24 2016
From: zelpahd at gmail.com (zelpa)
Date: Sat, 27 Aug 2016 10:57:24 +1000
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <2005376640.12735.1472258026328.JavaMail.www@wwinf1k24>
References: <572703342.20160826124904@acssoft.de> <2005376640.12735.1472258026328.JavaMail.www@wwinf1k24>
Message-ID:

> I’m glad of Apple’s courageous initiative.

If you're talking about the rifle thing I can understand that, but if you're also talking about the water pistol I have no clue what you're talking about. That decision by Apple is just absurd.

From christoph.paeper at crissov.de Sat Aug 27 02:38:24 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sat, 27 Aug 2016 09:38:24 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
In-Reply-To:
References: <618E0497-8611-488D-AD57-6FABDCCB750C@crissov.de> <6F9FF9CB-1F25-454C-B002-8A606AEA109C@crissov.de> <821E905A-CCF2-4482-8976-614D9A8B6EA5@crissov.de>
Message-ID:

Garth Wallace :
> On Fri, Aug 26, 2016 at 4:07 AM, Christoph Päper wrote:
>> I just learned that recent Samsung phones already contain emoji representations for many of these symbols.
>
> Samsung's emoji support is idiosyncratic, to say the least.

That may be so, but it is an example of a major vendor extending emoji rendering beyond code points which have Emoji=Yes.

> It's especially baffling because the "emoji" versions are still black and white, just with a gradient applied to make them look shiny.
It’s not perfect, for sure, but they probably work better with other emojis this way.

> WHITE CIRCLE WITH TWO DOTS is emoji on Samsung…why?

Electric socket or Go piece?

> The chess symbols get turned into emoji, breaking figurine notation.

That’s why there are (or shall be) VS15/16 and higher-level controls like the ones proposed for CSS.

From doug at ewellic.org Sat Aug 27 12:15:11 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 27 Aug 2016 11:15:11 -0600
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References:
Message-ID: <0AE94CC08ED24A169E811E2EA1B81B8E@DougEwell>

Ken Whistler wrote:

> I would contend that encoding wildly popular and extensively used
> little pictographs as characters makes a whole lot more sense in the
> abstract than encoding box-drawing graphic pieces for completely
> obsolete screen technology ever did.

Though to be fair, the screen technology was a lot less "completely obsolete" in 1991, when the box drawing characters were encoded (Unicode 1.0), than it is today.

--
Doug Ewell | Thornton, CO, US | ewellic.org

From charupdate at orange.fr Sat Aug 27 14:45:38 2016
From: charupdate at orange.fr (Marcel Schneider)
Date: Sat, 27 Aug 2016 21:45:38 +0200 (CEST)
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References: <572703342.20160826124904@acssoft.de> <2005376640.12735.1472258026328.JavaMail.www@wwinf1k24>
Message-ID: <1759415419.10864.1472327138817.JavaMail.www@wwinf1m21>

Sat, 27 Aug 2016 10:57:24 +1000, zelpa wrote:
>> I’m glad of Apple’s courageous initiative.
>
> If you're talking about the rifle thing I can understand that, but if
> you're also talking about the water pistol I have no clue what you're
> talking about. That decision by Apple is just absurd.
>

Well, this precise point is fundamentally out of the scope of Unicode.
Now since we are on it, let’s (re-)read on p. 90 of TUS 9.0:

“D7 Abstract character: A unit of information used for the organization, control, or representation of textual data.
• When representing data, the nature of that data is generally symbolic as opposed to some other kind of data (for example, aural or visual). Examples of such symbolic data include letters, ideographs, digits, punctuation, technical symbols, and dingbats.
• An abstract character has no concrete form and should not be confused with a /glyph/.
• An abstract character does not necessarily correspond to what a user thinks of as a “character” and should not be confused with a /grapheme/.
[…]”

So any vendor is free to choose for a given character the glyph that is most appropriate with respect to the business he’s running. iPhones being typically /given/ to children (among other people), it seems to me that they should be in the first place when it’s up to design emoji keyboards.

Kind regards,
Marcel

From jameskasskrv at gmail.com Sat Aug 27 19:08:21 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Sat, 27 Aug 2016 16:08:21 -0800
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <1759415419.10864.1472327138817.JavaMail.www@wwinf1m21>
References: <572703342.20160826124904@acssoft.de> <2005376640.12735.1472258026328.JavaMail.www@wwinf1k24> <1759415419.10864.1472327138817.JavaMail.www@wwinf1m21>
Message-ID:

Comparing icons to box drawing characters is a non-starter. BDCs were included at the inception of Unicode because of the very encoding principles which were scuttled in order to encode the icons.

??????????????????????????

Any vendor is free to provide any glyph for any character regardless of propriety. For example, a swastika could be shown whenever the star of David is encoded. Any vendor providing a glyph which doesn't represent the meaning of the character is doing a disservice to its users.
Only a fool would bring a squirt gun to a gun fight.

Best regards,

James Kass

From asmusf at ix.netcom.com Sat Aug 27 19:34:05 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Sat, 27 Aug 2016 17:34:05 -0700
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To: <0AE94CC08ED24A169E811E2EA1B81B8E@DougEwell>
References: <0AE94CC08ED24A169E811E2EA1B81B8E@DougEwell>
Message-ID:

On 8/27/2016 10:15 AM, Doug Ewell wrote: > Ken Whistler wrote: > >> I would contend that encoding wildly popular and extensively used >> little pictographs as characters makes a whole lot more sense in the >> abstract than encoding box-drawing graphic pieces for completely >> obsolete screen technology ever did. > > Though to be fair, the screen technology was a lot less "completely > obsolete" in 1991, when the box drawing characters were encoded > (Unicode 1.0), than it is today.

They came into the draft in the period from 1988 to 1990; during that period, dialogs using "text mode" displays were common for many applications, not just pure terminal emulation.

To demonstrate that it was "universal" Unicode had to show that it could be used to replace the entire range of actively used character encodings. Just as the same universality argument is what drove the initial acceptance of emoji. And will drive acceptance of a whole host of other symbols and characters, no matter how well they stack up against the purity of principle.
A./

From verdy_p at wanadoo.fr Sat Aug 27 22:10:14 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 28 Aug 2016 05:10:14 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References: <0AE94CC08ED24A169E811E2EA1B81B8E@DougEwell>
Message-ID:

Well it is still not so universal as there are wide ranges of glyphs excluded for now to encoding as characters:

- many icons used in cartography (there is progress now, but in their emoji form for use in talks/instant messaging/SMS, where they are colorful but do not match the simple glyph used in maps that will use them in multiple distinctive colors and sizes; often the 3D effects are undesirable, most of the time they should be monochromatic, and color/sizes and other styles applied conditionally by some stylesheet)
- country flags have been included but many regional emblems are excluded (as they don't match any ISO 3166-1 code)
- common road signs/street signs and signs for indoor facilities & services
- various symbols used in software UIs: many OSes have to provide an additional font encoding them as PUAs or using some encoding specific to the font containing them (much like it was with most dingbats in older Adobe Postscript fonts)
- various box drawing characters used in legacy terminals (notably in Teletext and on older 8-bit systems): a few of them were added from DOS/OEM codepages.

Of course corporate logos used in proprietary fonts for specific OSes cannot be encoded for legal reasons (not as long as there's no licencing permitting its inclusion in other fonts for other OSes): e.g.
logos from Apple and Microsoft for MacOS and Windows, but as well other logos for various Unix editions and even Linux distributions, including the green bot for Android, and other logos registered as trademarks, and logos used to identify some national technical standards and indicating a conformance (usage is restricted by the standard defining these logos, many of them being supported by private organizations selling their licences). All these logos have to be encoded and transported as embedded or linked images carrying their own copyright (which must be also transported along with their graphic definition).

As well we cannot encode glyphs representing physical persons (e.g. based on a photo of Barack Obama), or containing biometric data (e.g. fingerprints, DNA sequences, personal handwritten signatures), or some protected architectural designs (even if these are old historic designs such as Greco-Roman designs), or logos representing some coin faces. As well we cannot represent precise taxons (animalia or flora are very roughly represented, but we don't go up to the species level, or even just the genus)

2016-08-28 2:34 GMT+02:00 Asmus Freytag (c) : > On 8/27/2016 10:15 AM, Doug Ewell wrote: > >> Ken Whistler wrote: >> >> I would contend that encoding wildly popular and extensively used >>> little pictographs as characters makes a whole lot more sense in the >>> abstract than encoding box-drawing graphic pieces for completely >>> obsolete screen technology ever did. >>> >> >> Though to be fair, the screen technology was a lot less "completely >> obsolete" in 1991, when the box drawing characters were encoded (Unicode >> 1.0), than it is today. >> > > They came into the draft in the period from 1988 to 1990; during that > period, dialogs using "text mode" displays were common for many > applications, not just pure terminal emulation.
> > To demonstrate that it was "universal" Unicode had to show that it could > be used to replace the entire range of actively used character encodings. > Just as the same universality argument is what drove the initial acceptance > of emoji. And will drive acceptance of a whole host of other symbols and > characters, no matter how well they stack up against the purity of > principle. > > A./ >

From doug at ewellic.org Sun Aug 28 12:22:06 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sun, 28 Aug 2016 11:22:06 -0600
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References:
Message-ID:

Philippe Verdy wrote:

> Well it is still not so universal as there are wide ranges of glyphs
> excluded for now to encoding as characters:
> [...]
> - country flags have been included but many regional emblems are
> excluded (as they don't match any ISO 3166-1 code)

There are tentative plans (again) to provide a composite encoding for flags corresponding to country subdivisions encoded in ISO 3166-2.

Unicode and 10646 have done well so far to avoid judging for themselves which regions or groups deserve encoding over others, and sticking with the decisions of ISO 3166/MA instead.

> - common road signs/street signs and signs for indoor facilities &
> services

I wouldn't doubt those are coming soon.

> - various box drawing characters used in legacy terminals (notably in
> Teletext and on older 8-bit systems): a few of them were added from
> DOS/OEM codepages.

I thought that set had been pretty much completed by now. I wonder which one are supposedly still missing.
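
As background to the flag discussion above, the following Python sketch shows how a country flag is currently spelled in text: each letter of an ISO 3166-1 alpha-2 code maps to one Regional Indicator Symbol in the range U+1F1E6..U+1F1FF. It is a minimal illustration, not part of the original message; the function names are invented for this example, the code does not check that the two letters form an assigned ISO 3166-1 code, and it says nothing about how the subdivision flags mentioned above might eventually be encoded.

```python
# REGIONAL INDICATOR SYMBOL LETTER A is U+1F1E6; B..Z follow consecutively.
REGIONAL_INDICATOR_A = 0x1F1E6

def flag_sequence(alpha2):
    """Map a two-letter code such as "DE" to its pair of regional indicators."""
    code = alpha2.upper()
    if len(code) != 2 or not all("A" <= ch <= "Z" for ch in code):
        raise ValueError(f"expected two ASCII letters, got {alpha2!r}")
    return "".join(chr(REGIONAL_INDICATOR_A + ord(ch) - ord("A")) for ch in code)

def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

if __name__ == "__main__":
    for code in ("DE", "JP", "US"):
        seq = flag_sequence(code)
        print(code, " ".join(code_points(seq)), seq)
```

Whether the pair is displayed as a flag or as two boxed letters depends on the font and platform, which is why the standard itself never has to decide what any particular flag looks like.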
--
Doug Ewell | Thornton, CO, US | ewellic.org

From pandey at umich.edu Sun Aug 28 13:04:46 2016
From: pandey at umich.edu (Anshuman Pandey)
Date: Sun, 28 Aug 2016 14:04:46 -0400
Subject: Offlist -- Re: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References:
Message-ID:

Hi Doug,

Do you know who represents the US on ISO 3166?

Anshu

> On Aug 28, 2016, at 1:22 PM, Doug Ewell wrote: > > Philippe Verdy wrote: > >> Well it is still not so universal as there are wide ranges of glyphs >> excluded for now to encoding as characters: >> [...] >> - country flags have been included but many regional emblems are >> excluded (as they don't match any ISO 3166-1 code) > > There are tentative plans (again) to provide a composite encoding for flags corresponding to country subdivisions encoded in ISO 3166-2. > > Unicode and 10646 have done well so far to avoid judging for themselves which regions or groups deserve encoding over others, and sticking with the decisions of ISO 3166/MA instead. > >> - common road signs/street signs and signs for indoor facilities & >> services > > I wouldn't doubt those are coming soon. > >> - various box drawing characters used in legacy terminals (notably in >> Teletext and on older 8-bit systems): a few of them were added from >> DOS/OEM codepages. > > I thought that set had been pretty much completed by now. I wonder which one are supposedly still missing. > > -- > Doug Ewell | Thornton, CO, US | ewellic.org

From pandey at umich.edu Sun Aug 28 13:05:32 2016
From: pandey at umich.edu (Anshuman Pandey)
Date: Sun, 28 Aug 2016 14:05:32 -0400
Subject: Offlist -- Re: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References:
Message-ID: <9A8601D1-8D52-4D71-8534-E6B819DF04F6@umich.edu>

That should've been offlist...
:)

> On Aug 28, 2016, at 2:04 PM, Anshuman Pandey wrote: > > Hi Doug, > > Do you know who represents the US on ISO 3166? > > Anshu > > >> On Aug 28, 2016, at 1:22 PM, Doug Ewell wrote: >> >> Philippe Verdy wrote: >> >>> Well it is still not so universal as there are wide ranges of glyphs >>> excluded for now to encoding as characters: >>> [...] >>> - country flags have been included but many regional emblems are >>> excluded (as they don't match any ISO 3166-1 code) >> >> There are tentative plans (again) to provide a composite encoding for flags corresponding to country subdivisions encoded in ISO 3166-2. >> >> Unicode and 10646 have done well so far to avoid judging for themselves which regions or groups deserve encoding over others, and sticking with the decisions of ISO 3166/MA instead. >> >>> - common road signs/street signs and signs for indoor facilities & >>> services >> >> I wouldn't doubt those are coming soon. >> >>> - various box drawing characters used in legacy terminals (notably in >>> Teletext and on older 8-bit systems): a few of them were added from >>> DOS/OEM codepages. >> >> I thought that set had been pretty much completed by now. I wonder which one are supposedly still missing. >> >> -- >> Doug Ewell | Thornton, CO, US | ewellic.org

From verdy_p at wanadoo.fr Sun Aug 28 13:34:11 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Sun, 28 Aug 2016 20:34:11 +0200
Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech
In-Reply-To:
References:
Message-ID:

2016-08-28 19:22 GMT+02:00 Doug Ewell : > Philippe Verdy wrote: > > - various box drawing characters used in legacy terminals (notably in >> > Teletext and on older 8-bit systems): a few of them were added from >> DOS/OEM codepages. >> > > I thought that set had been pretty much completed by now. I wonder which > one are supposedly still missing.

Look at European Teletext standards.
Yes they are now being outdated by web-based services or applets running in settop boxes (running Java, or Android) and they support now graphics and not just text with styling attributes.

Try also creating tables with symbols lining up correctly in a grid. This is extremely complicated and tricky in HTML for most basic shapes (which are unfortunately only available in various fonts using incompatible metrics): once again we have to create tables embedding multiple references to external images (this is overlong), or create custom scripts to create a complex HTML layout (and this does not work in word processors). Diagrams made with monospaced shapes are really difficult to layout. Common basic shapes (such as crossword grids) do not work as well: they were extremely easy to write using Teletext and BBS-like technologies.

Maybe this should merit some extensions to CSS to offer better support for monospaced layouts. There is work to do also for connecting lines with arrow heads. Box drawing characters also do not have versions with rounded corners.

From public at khwilliamson.com Mon Aug 29 14:08:53 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Mon, 29 Aug 2016 13:08:53 -0600
Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint
Message-ID: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com>

"I'm excited about the proposal to add a brontosaurus emoji codepoint because it has the potential to bring together a half-dozen different groups of pedantic people together"

From http://xkcd.com/1726/

I don't know if this is new, or I just never saw it before.
From mark at macchiato.com Mon Aug 29 14:18:52 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 29 Aug 2016 21:18:52 +0200 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: On Mon, Aug 29, 2016 at 9:08 PM, Karl Williamson wrote: > http://xkcd.com/1726/ That's the newest one.... Mark

From leob at mailcom.com Mon Aug 29 14:22:36 2016 From: leob at mailcom.com (Leo Broukhis) Date: Mon, 29 Aug 2016 12:22:36 -0700 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: It's new. Let's not tell Randall about the "completing the set" argument. Leo On Mon, Aug 29, 2016 at 12:08 PM, Karl Williamson wrote: > "I'm excited about the proposal to add a brontosaurus emoji codepoint > because it has the potential to bring together a half-dozen different > groups of pedantic people together" > > From http://xkcd.com/1726/ > > I don't know if this is new, or I just never saw it before. > >

From mark at macchiato.com Mon Aug 29 14:33:57 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Mon, 29 Aug 2016 21:33:57 +0200 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: I mean the comic is the newest. There have been dinosaur proposals; the emoji subcommittee is still looking at the priorities among animals. Mark On Mon, Aug 29, 2016 at 9:18 PM, Mark Davis ☕️
wrote: > > On Mon, Aug 29, 2016 at 9:08 PM, Karl Williamson > wrote: > >> http://xkcd.com/1726/ > > > That's the newest one.... > > > Mark >

From zelpahd at gmail.com Mon Aug 29 14:51:26 2016 From: zelpahd at gmail.com (zelpa) Date: Tue, 30 Aug 2016 05:51:26 +1000 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: Apparently there are 700 named dinosaur species; we may need to make a new block specifically for dinosaur emoji if we want to complete the set. On Tue, Aug 30, 2016 at 5:22 AM, Leo Broukhis wrote: > It's new. Let's not tell Randall about the "completing the set" argument. > > Leo > > On Mon, Aug 29, 2016 at 12:08 PM, Karl Williamson > wrote: > >> "I'm excited about the proposal to add a brontosaurus emoji codepoint >> because it has the potential to bring together a half-dozen different >> groups of pedantic people together" >> >> From http://xkcd.com/1726/ >> >> I don't know if this is new, or I just never saw it before. >> >> >

From leoboiko at namakajiri.net Mon Aug 29 14:56:53 2016 From: leoboiko at namakajiri.net (Leonardo Boiko) Date: Mon, 29 Aug 2016 16:56:53 -0300 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: We obviously need an emoji for every species name listed within The Official Registry of Zoological Nomenclature. I propose a new set of Basic Latin characters, the Zoological Nomenclature Indicator Symbols, to be used for spelling scientific names, which are then rendered as cutesy colorful icons used as mood indicators.
A Zoological Nomenclature Indicator Symbol Space must be included to separate name components; sequences including one such separator are assumed to be binomens, and two, trinomens. For example, a cat emoji can be encoded with the Zoological Nomenclature Indicator Symbols corresponding to [FELIS?CATUS] or, following modern practice, [FELIS?SILVESTRIS?CATUS] (biological homonyms are to be treated as alternative encodings of the same abstract emoji). Notice that the current emoji set includes such characters as CRYING CAT FACE (U+1F63F) and KISSING CAT FACE WITH CLOSED EYES (U+1F63D), in addition to the default human (or, in a certain vendor, disgusting yellow amoebæ) faces; but no such equivalents for, say, dogs or bunnies, which can be a very dangerous political slight towards dog-people and bunny-people. With some adjustment, Zoological Nomenclature Indicator Symbols can solve the issue once and for all, with perfect neutrality. All of the current face expression emoji are to be decomposed as FACE plus abstract combining characters; for example, U+1F642 SLIGHTLY SMILING FACE will be considered a compatibility variant of FACE + COMBINING SMILE + COMBINING SLIGHT FACIAL EXPRESSION. This would allow a dog version of U+1F63D encoded as: [CANIS?LUPUS?FAMILIARIS] + COMBINING FACE + COMBINING KISSING FACIAL EXPRESSION + COMBINING CLOSED EYES, and similarly for any species and expression combination, like, say, a ring-tailed lemur rolling on the floor laughing, or an okapi with tears of joy. (Drawing all possible glyphs is of course not Unicode's problem.) 2016-08-29 16:22 GMT-03:00 Leo Broukhis : > It's new. Let's not tell Randall about the "completing the set" argument.
> > Leo > > On Mon, Aug 29, 2016 at 12:08 PM, Karl Williamson > wrote: > >> "I'm excited about the proposal to add a brontosaurus emoji codepoint >> because it has the potential to bring together a half-dozen different >> groups of pedantic people together" >> >> From http://xkcd.com/1726/ >> >> I don't know if this is new, or I just never saw it before. >> >> >

From steffen at sdaoden.eu Mon Aug 29 15:20:03 2016 From: steffen at sdaoden.eu (Steffen Nurpmeso) Date: Mon, 29 Aug 2016 22:20:03 +0200 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: <20160829202003.C_wV55LaM%steffen@sdaoden.eu> Leonardo Boiko wrote: |We obviously need an emoji for every species name listed within The \ |Official Registry of Zoological Nomenclature. Ride it out. Ride it out. Oh, it shouldn't take that much longer if we all go for it. --steffen

From everson at evertype.com Mon Aug 29 16:16:55 2016 From: everson at evertype.com (Michael Everson) Date: Mon, 29 Aug 2016 22:16:55 +0100 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: <896DACA9-DAA3-4934-A012-A347A8093FD6@evertype.com> > On 29 Aug 2016, at 20:33, Mark Davis ☕️ wrote: > > There have been dinosaur proposals; the emoji subcommittee is still looking at the priorities among animals. Andrew West's dinosaur proposal was spot-on in its scope and prediction for popularity and usage. Having only one dinosaur emoji doesn't make any sense taxonomically or in fact being realistic about user preferences.
M

From gwalla at gmail.com Mon Aug 29 17:05:05 2016 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 29 Aug 2016 15:05:05 -0700 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: <896DACA9-DAA3-4934-A012-A347A8093FD6@evertype.com> References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> <896DACA9-DAA3-4934-A012-A347A8093FD6@evertype.com> Message-ID: On Mon, Aug 29, 2016 at 2:16 PM, Michael Everson wrote: > > > On 29 Aug 2016, at 20:33, Mark Davis ☕️ wrote: > > > > There have been dinosaur proposals; the emoji subcommittee is still > looking at the priorities among animals. > > Andrew West's dinosaur proposal was spot-on in its scope and prediction > for popularity and usage. Having only one dinosaur emoji doesn't make any > sense taxonomically or in fact being realistic about user preferences. > And the distinctions could reasonably be made at typical emoji sizes.

From gwalla at gmail.com Mon Aug 29 17:24:28 2016 From: gwalla at gmail.com (Garth Wallace) Date: Mon, 29 Aug 2016 15:24:28 -0700 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: On Mon, Aug 29, 2016 at 12:56 PM, Leonardo Boiko wrote: > We obviously need an emoji for every species name listed within The > Official Registry of Zoological Nomenclature. > > I propose a new set of Basic Latin characters, the Zoological Nomenclature > Indicator Symbols, to be used for spelling scientific names, which are then > rendered as cutesy colorful icons used as mood indicators. A Zoological > Nomenclature Indicator Symbol Space must be included to separate name > components; sequences including one such separator are assumed to be > binomens, and two, trinomens.
For example, a cat emoji can be encoded with > the Zoological Nomenclature Indicator Symbols corresponding to > [FELIS?CATUS] or, following modern practice, [FELIS?SILVESTRIS?CATUS] > (biological homonyms are to be treated as alternative encodings of the same > abstract emoji). >
I disagree; the set of dinosaur emoji needed for communication is very limited, and consists of:
TYRANNOSAURUS REX
TYRANNOSAURUS REX FACE
TYRANNOSAURUS REX WITH LEFT FOOT RAISED
DROMICEIOMIMUS
TYRANNOSAURUS REX WITH RIGHT FOOT RAISED
UTAHRAPTOR WITH LEFT FOOT RAISED
TYRANNOSAURUS REX WITH HEAD TURNED
UTAHRAPTOR WITH FEET PLANTED
TYRANNOSAURUS REX SHOUTING
This set is sufficient to communicate any message one would wish, cf. qwantz.com

From christoph.paeper at crissov.de Wed Aug 31 10:49:36 2016 From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=) Date: Wed, 31 Aug 2016 17:49:36 +0200 Subject: I'm excited about the proposal to add a brontosaurus emoji codepoint In-Reply-To: References: <9a1f38fa-40c2-3564-b790-addde3727760@khwilliamson.com> Message-ID: Leonardo Boiko : > > All of the current face expression emoji are to be decomposed as FACE plus abstract combining characters; for example, U+1F642 SLIGHTLY SMILING FACE will be considered a compatibility variant of FACE + COMBINING SMILE + COMBINING SLIGHT FACIAL EXPRESSION This is actually not as absurd as you may want it to sound. There would probably have been less glyph ambiguity if the emoticons part of emoji had been encoded as character sequences (combining or not) using the building blocks of existing emoticons (sideways Western and upright Eastern style) as a base, e.g. Winking Eye ";" ????, Smiling Eyes "^^" ????????, Laughing Mouth "D" ????????, Open Mouth "o"/"O"/"0" ??, Halo "o"/"O"/"0" ????, Clown Nose "o", Drop "'" ??/?? (sweat ??????, tear ??????, snot, drool). Your example 🙂 has the default face, i.e. :-) or =) or (?_?)
or U+263A, with a special mouth, so it would rather be either Face + Combining Slight Smile or Face + ZWJ + Slight Smile. U+263B would still be available as a neutral base, but you could actually try to combine U+1F636 with existing combining diacritics: ???. From doug at ewellic.org Wed Aug 31 11:25:05 2016 From: doug at ewellic.org (Doug Ewell) Date: Wed, 31 Aug 2016 09:25:05 -0700 Subject: Comment in a leading German newspaper regarding the way UTC and Apple handle Emoji as an attack on Free Speech Message-ID: <20160831092504.665a7a7059d7ee80bb4d670165c8327d.2c28aa3363.wbe@email03.godaddy.com> > ("The Unicode Consortium appears like a reissue of Orwell's Ministry > of Truth, which replaced the English language by a new one, sweeped > clean from harmful terms, and which removed "unorthodox" connotations > from the rest of the words.") So I took another look and saw that: (1) U+1F946 RIFLE has the following cross-reference in NamesList.txt: = marksmanship, shooting, hunting which does not include any mention of squirt guns or water pistols, or generally bowdlerizing the image or changing the intent of this code point; (2) Section 22.9 "Miscellaneous Symbols" in TUS 9.0 does not make any mention of modifying the RIFLE glyph, or symbol glyphs in general, so as to alter their meaning; (3) the code chart at http://www.unicode.org/charts/PDF/U1F900.pdf clearly shows a rifle, and not any other type of gun or non-gun. I can imagine people with time on their hands criticizing Apple for changing the glyph, but how did the Unicode Consortium itself get dragged into this? What obvious thing am I missing? -- Doug Ewell | Thornton, CO, US | ewellic.org
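Point (1) above can be spot-checked from a script. The sketch below is only an illustration and not part of the original message: it uses Python's standard unicodedata module to look up the formal name of U+1F946. It assumes a Python build whose Unicode database is at version 9.0 or later (older databases do not know this code point), and the helper name describe is hypothetical. The module does not expose the NamesList.txt cross-references, so those still have to be read from the file itself.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a code point, using the local Unicode database.

    'describe' is a hypothetical helper name chosen for this illustration.
    """
    ch = chr(code_point)
    # The second argument is returned when the local database has no name
    # for this code point (for example, a database older than Unicode 9.0).
    name = unicodedata.name(ch, "<no name in this Unicode version>")
    return f"U+{code_point:04X} {name}"


if __name__ == "__main__":
    # Show which Unicode version the interpreter was built against.
    print("Unicode database version:", unicodedata.unidata_version)
    print(describe(0x1F946))
```

The formal name is a property of the code point and is independent of whatever glyph a vendor's font draws for it, which is the distinction the message is making.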