From kimslawson at gmail.com Wed Aug 3 14:26:59 2016
From: kimslawson at gmail.com (Kim Slawson)
Date: Wed, 3 Aug 2016 15:26:59 -0400
Subject: combining marks for currency characters? general combining character?

It's nice to see a good selection of currency symbols defined in Unicode, but I wonder if it might be useful to add a few combining marks for the purpose of constructing currency symbols.

For example, many currency symbols use single or double horizontal lines, vertical lines or solidi ( |, -, /, ||, =, // ). Having these available as combining marks would simplify the creation of new currency symbols, as many are simply overstruck letters.

Would these be good candidates for proposed combining characters?

Alternately (and I have no clue if this has been addressed), why not allow arbitrary combining characters? ZWJ does not currently work for this, but it could be amended to, or another joining character introduced.

Kim Slawson
Kernel Panic Consulting
kim at slawson.org
207-370-7401

From leob at mailcom.com Wed Aug 3 16:17:14 2016
From: leob at mailcom.com (Leo Broukhis)
Date: Wed, 3 Aug 2016 14:17:14 -0700
Subject: combining marks for currency characters? general combining character?

Hi Kim,

While it can be argued that the "NON-DESTRUCTIVE BACKSPACE" capability of a typewriter, allowing arbitrary overstruck characters, belongs to plain text, it is more akin to creating subscripts and superscripts by rotating the platen knob up or down by a half-interval, which Unicode considers to be within the domain of markup rather than plain text.

Regards,
Leo

From c933103 at gmail.com Wed Aug 3 17:57:49 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Thu, 4 Aug 2016 06:57:49 +0800
Subject: New olympic sport emoji

In https://twitter.com/Tokyo2020/status/760930003760492544 , the Tokyo Organising Committee of the Olympic and Paralympic Games suggests that Twitter should add five new emoji, one for each of the new sports that the IOC has just approved for the 2020 Olympic Games in four years' time ( https://www.olympic.org/news/ioc-approves-five-new-sports-for-olympic-games-tokyo-2020 ), but has any proposal been submitted to Unicode yet for adding symbols for those sports?

From mark at macchiato.com Wed Aug 3 18:11:14 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Wed, 3 Aug 2016 16:11:14 -0700
Subject: New olympic sport emoji

On Wed, Aug 3, 2016 at 3:57 PM, gfb hjjhjh wrote:
> https://twitter.com/Tokyo2020/status/760930003760492544

No proposal has been received for these 5 items.
FYI: any proposal for emoji for inclusion in 2017 needs to be received by Oct 1, and follow the guidelines in http://www.unicode.org/emoji/selection.html

Mark

From jameskasskrv at gmail.com Wed Aug 3 20:40:20 2016
From: jameskasskrv at gmail.com (James Kass)
Date: Wed, 3 Aug 2016 17:40:20 -0800
Subject: combining marks for currency characters? general combining character?

Unicode encodes what is or what will be rather than what might/should/could be.

The ZWJ character is a way to indicate a request for a more joined form of the two characters surrounding it, at the encoding level. As such, it's already in place in the standard. The ability to reasonably display arbitrary combinations depends upon computer software, but such combinations can already be entered, stored, and exchanged as data.

Best regards,

James Kass

From gwalla at gmail.com Thu Aug 4 01:30:43 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Wed, 3 Aug 2016 23:30:43 -0700
Subject: New olympic sport emoji

Judging by the attached gif, it looks like they actually mean hashflags, not Unicode emoji.

From philip_chastney at yahoo.com Thu Aug 4 03:27:25 2016
From: philip_chastney at yahoo.com (philip chastney)
Date: Thu, 4 Aug 2016 08:27:25 +0000 (UTC)
Subject: combining marks for currency characters? general combining character?
References: <1054061351.5517451.1470299245568.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1054061351.5517451.1470299245568.JavaMail.yahoo@mail.yahoo.com>

FontLab provides facilities for combining two outlines which correspond to the set operations of union, intersection and set difference

they take no discernible time to execute, and could therefore be made available at print-time, via the rendering engine

this suggests that the specification of which outlines to combine should be done within HTML (or similar)

this approach (i) would require no additional characters within Unicode, (ii) would allow greater generality (the symbols you mention are often used in mathematics to denote negation, while other symbols are combined in other contexts), (iii) the combined outline needs to be generated before rasterization, but (iv) the maths involved would pose no problem to the clever people who wrote the routines to rasterize outlines in the first place (though hinting would obviously no longer be possible)

all the best . . . /phil

From verdy_p at wanadoo.fr Thu Aug 4 09:33:28 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 4 Aug 2016 16:33:28 +0200
Subject: combining marks for currency characters? general combining character?

Maybe, but using such a sequence will not work in many cases:
- the display will almost always be wrong due to lack of font support for some unspecified combinations, or because the usage is too recent
- the parsing will not recognize the sequence as a currency symbol but as a random "word"
- the presence of ZWJ could violate expected data formats (currency amounts largely need to be parsed and processed automatically, they are not just standard text)
- these symbols do not belong to any script (even if they are most often derived from actual letters from a local script)
- users will just prefer using the 3-letter ISO currency code or the name of the currency, or known abbreviations, using more conventional notations for abbreviations that you can detect in text: input with sequences is just a horror

Anyway, these symbols are not created very often. There are not a lot of currencies in the world. If one country decides to change its currency or assign it a symbol, it will be announced well in advance (before it becomes legal tender) and the Unicode standard can track this in its yearly updates. Once it is announced, its usage will explode and users will want a simple symbol to be used in lots of contexts.

So these sequences will typically have a temporary usage, in the early period of adoption when fonts are not yet updated and available in OSes, in contexts where using images or rich text formats allowing the inclusion of web fonts or embedded fonts will not work. But they will not be used in short messaging systems (chat, SMS, tweets...) where abbreviations and ISO currency codes will largely be preferred.

From tiemevanveen at hotmail.com Thu Aug 4 02:08:53 2016
From: tiemevanveen at hotmail.com (Tieme van Veen)
Date: Thu, 4 Aug 2016 09:08:53 +0200
Subject: New olympic sport emoji

Nice!

I think you're right, they mean the Rio-style emoji that Twitter appends after Olympic hashtags.

Still, it would be cool if those 5 new sports could be expressed in emoji, right? People will need them a lot in 2020!

I'm working on a proposal for a 'Climbing' icon. That's one of the 5. I chose Climbing instead of SportClimbing to make the icon more generic and useful for all kinds of climbers instead of just 'SportClimbing'.

Proposal will be ready by the end of the month, draft is here: https://docs.google.com/document/d/1t8-Lva7Rb9gpautHMn-SuIfwN0TD6i3RrkMQorCRY6g/edit#

Surfing is already in 🏄, so is a baseball ⚾ and Martial arts. That leaves Skateboarding.

Tieme

From verdy_p at wanadoo.fr Thu Aug 4 10:06:59 2016
From: verdy_p at wanadoo.fr (Philippe Verdy)
Date: Thu, 4 Aug 2016 17:06:59 +0200
Subject: New olympic sport emoji

For softball I would expect a better icon such as https://pixabay.com/static/uploads/photo/2014/04/02/14/13/softball-306540_960_720.png

If you use only a ball, that ball should be yellow, not white, but then it could be confused with a tennis ball.

From gwalla at gmail.com Thu Aug 4 11:19:49 2016
From: gwalla at gmail.com (Garth Wallace)
Date: Thu, 4 Aug 2016 09:19:49 -0700
Subject: New olympic sport emoji

Personally, I think Unicode should just encode a set of sports pictograms of the Olympic type (stylized figures engaged in activity, rather than pieces of equipment) and be done with it, but the Consortium clearly disagrees.

From asmusf at ix.netcom.com Thu Aug 4 12:44:29 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Thu, 4 Aug 2016 10:44:29 -0700
Subject: combining marks for currency characters? general combining character?

On 8/3/2016 12:26 PM, Kim Slawson wrote:
> It's nice to see a good selection of currency symbols defined in
> unicode, but I wonder if it might be useful to add a few combining
> marks for the purpose of constructing currency symbols.
>
> For example, many currency symbols use single or double horizontal
> lines, vertical lines or solidi ( |, -, /, ||, =, // ). Having these
> available as combining marks would simplify the creation of new
> currency symbols, as many are simply overstruck letters.

Unicode's policy is to disregard combining marks for overlays (as opposed to other categories of combining marks) and code the relevant combined glyph anyway. That goes for letters that are members of alphabets and is done for a number of reasons that all apply equally well to currency symbols.

So, the short answer is that even with many overlay marks already defined, these would be disregarded, as would any additional ones. They are generically useful in some cases, such as to indicate negation for arbitrary mathematical symbols and the like, but not to compose letterlike glyphs.
A./

From c933103 at gmail.com Thu Aug 4 13:32:14 2016
From: c933103 at gmail.com (gfb hjjhjh)
Date: Fri, 5 Aug 2016 02:32:14 +0800
Subject: Implementation of ideographic description characters

Hello,

I have read that an implementation of Unicode may render ideographic description characters as the kanji they describe, but is there any known/existing system, font, or implementation that does exactly this?

From leoboiko at gmail.com Thu Aug 4 13:49:35 2016
From: leoboiko at gmail.com (Leonardo Boiko)
Date: Thu, 4 Aug 2016 15:49:35 -0300
Subject: Implementation of ideographic description characters

Hi, the IDS provide too little information for rendering kanji properly. Take a look at https://en.m.wikipedia.org/wiki/Chinese_character_description_languages .

From lists+unicode at seantek.com Thu Aug 4 14:37:14 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Thu, 4 Aug 2016 12:37:14 -0700
Subject: Whitespace characters in Unicode

Hi Unicode Folks:

I am trying to come up with sensible sets of characters that are considered whitespace or newlines in Unicode, and to understand the relative stability policy with respect to them. (This is for a formal syntax where the definition of "whitespace" matters, e.g., to separate identifiers, and I want to be as conservative as possible.) Please let me know if the stuff below is correct, or needs work.

The following characters / sequences are considered line breaking characters, per UAX #14 and Section 5.8 of UNICODE:

CRLF CR LF FF VT NEL LS PS

So, basically: U+000A-U+000D, U+0085, U+2028, U+2029, and the combination U+000D U+000A (treated as one line break). These characters / sequences are called "newlines".

There will not be any additional code points that are assigned to be line breaks. (Correct?)

CRLF, CR, LF, and NEL are also considered "newline functions" or NLF. These are distinguished from other codes (above) that also mean line breaks, mainly because of historical and widespread use of them.

There are several formatting characters that affect word wrapping and line breaking, as discussed in those documents...but they are not line breaking characters.

****

The following characters are whitespaces: characters (code points) with the property WSpace=Y (or White_Space). This is:

newlines
U+0020 U+00A0 U+1680 U+2000-200A U+202F U+205F U+3000

Assigned characters that are not listed above can never be whitespace (according to Unicode). However, the set is not closed, so unassigned code points *could* be assigned to whitespace. It is (unlikely? very unlikely? Pretty much never going to happen?) that additional code points will be assigned to whitespace.

****

There are some other characters that Unicode does not consider whitespace, but deserve discussion:
U+180E MONGOLIAN VOWEL SEPARATOR:

U+200B ZERO WIDTH SPACE
U+200C ZERO WIDTH NON-JOINER
U+200D ZERO WIDTH JOINER
U+200E LEFT-TO-RIGHT MARK*
U+200F RIGHT-TO-LEFT MARK*
U+2060 WORD JOINER
U+FEFF ZERO WIDTH NO-BREAK SPACE

*These appear in Pattern_White_Space, but Pattern_White_Space excludes U+2000-200A characters, which are obviously spaces. This is confusing and I would appreciate clarification /why/ Pattern_White_Space is significantly disjoint from White_Space.

********

The borderline characters above are not considered WSpace=Y, but sometimes might have space-like properties.
ZWSP and ZWNBSP are obviously "space" characters, but they never generate whitespace. I suppose that conversely LRM and RLM are obviously "not space" characters, but they could generate whitespace under certain circumstances. Ditto for other formatting characters in general (for which the class is much larger).

Therefore I guess a Unicode definition of "whitespace" (or "space characters") is: an assigned code point that *always* (is supposed to) generates white space (empty space between graphemes).

********

Are there other standards that Unicode people recommend, that have addressed whether certain borderline characters are considered whitespace vs. non-whitespace (e.g., possibly acceptable as an identifier or syntax component)?

Regards,

Sean

From mark at macchiato.com Thu Aug 4 14:51:06 2016
From: mark at macchiato.com (Mark Davis ☕️)
Date: Thu, 4 Aug 2016 12:51:06 -0700
Subject: Whitespace characters in Unicode

There are 25 Whitespace characters. Here they are grouped by LineBreak property:

http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%3Awhitespace%3A&g=Lb&i=

Don't have time to respond more now.

Mark

From leoboiko at namakajiri.net Thu Aug 4 15:17:04 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Thu, 4 Aug 2016 17:17:04 -0300
Subject: Whitespace characters in Unicode

What Mark Davis said; also, depending on what you need, consider taking a look at the definitions used by Unicode regexes, at http://unicode.org/reports/tr18/ .

From lists+unicode at seantek.com Thu Aug 4 15:44:46 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Thu, 4 Aug 2016 13:44:46 -0700
Subject: Whitespace characters in Unicode
Message-ID: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>

I read through TR18...it mainly says that == \s == \p{Whitespace} == property White_Space is true.

Does it say anything else or more significant than that, that I'm missing?

Sean
From leoboiko at namakajiri.net Thu Aug 4 16:28:55 2016
From: leoboiko at namakajiri.net (Leonardo Boiko)
Date: Thu, 4 Aug 2016 18:28:55 -0300
Subject: Whitespace characters in Unicode
In-Reply-To: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

I'm sorry; I thought that, when you wanted to separate identifiers, it might be interesting to follow existing regexp definitions; this way your syntax would play along with already-existing tools (e.g. you'd be making it easy for someone to pipe your language into grep -P "\p{Whitespace}").

But I was talking out of my depth; I've never worked with defining Unicode identifiers, so I'm not really qualified to answer. I'm sure Davis and the others can give better answers to your questions. Meanwhile, I see that UAX #31 goes further into Unicode identifiers. It says that Pattern_White_Space is stable (unlike Whitespace, perhaps?), and intended for use in regexp-like "patterns" which mix literal characters, whitespace, and syntax (special characters), where the latter two would e.g. require quoting. For example, Perl has a "/x" flag which makes unquoted Pattern_White_Space characters be ignored in regexps (so that you can make them less illegible).

However, UAX #31 also gives a Default Identifier Syntax, which bounds identifiers not by Whitespace but by their start characters, identified by ID_Start, defined like this:

> ID_Start characters are derived from the Unicode General_Category of
> uppercase letters, lowercase letters, titlecase letters, modifier letters,
> other letters, letter numbers, plus Other_ID_Start, minus Pattern_Syntax
> and Pattern_White_Space code points.

So it makes reference only to Pattern_White_Space and not Whitespace. On the other hand, I guess the listing above will exclude Whitespace characters, since they don't count as any of letters, numbers, or Other_ID_Start?

None of that is guaranteed to be stable, though. UAX #31 includes a separate definition for "Immutable identifiers", which are stable, and suggests various compromises between them.
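As a rough illustration of the identifier rules discussed above: Python's `str.isidentifier()` follows the XID_Start / XID_Continue flavour of the UAX #31 default syntax (plus the underscore), so it can serve as a quick probe. This is only a sketch; the sample strings are made up for illustration, and other languages may apply a different UAX #31 profile.

```python
# Probe which characters can appear inside an identifier, using
# Python's own UAX #31-based rule (XID_Start followed by XID_Continue*).
samples = {
    "plain letters": "caf\u00e9",       # e-acute is a lowercase letter
    "ASCII space": "a b",
    "NO-BREAK SPACE": "a\u00a0b",       # White_Space, category Zs
    "EM SPACE": "a\u2003b",             # White_Space, category Zs
    "ZERO WIDTH SPACE": "a\u200bb",     # not White_Space, category Cf
    "LEFT-TO-RIGHT MARK": "a\u200eb",   # Pattern_White_Space, category Cf
}

# Map each label to whether the sample string is a valid identifier.
results = {label: text.isidentifier() for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label:20} -> {'identifier' if ok else 'not an identifier'}")
```

In this sketch only the first sample is accepted; the White_Space characters and the format characters all break the identifier, even though the identifier rule never names White_Space directly.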
From andrea.giammarchi at gmail.com Thu Aug 4 17:19:31 2016
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Thu, 4 Aug 2016 23:19:31 +0100
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

I'm not a Unicode expert, but I couldn't stop thinking about the following comic after reading "I am trying to come up with a sensible sets of characters that are considered whitespace": https://xkcd.com/927/

Apologies for bringing pretty much nothing to this discussion, but I'm pretty sure there's much more to discuss in this ML than another whitespace set on top of the 25 characters already defined.

Thanks for your patience and your understanding.

Have a great weekend everyone!
Best Regards
From andrea.giammarchi at gmail.com Thu Aug 4 17:36:32 2016
From: andrea.giammarchi at gmail.com (Andrea Giammarchi)
Date: Thu, 4 Aug 2016 23:36:32 +0100
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID:

Actually, my apologies for my instinctive and quite rude answer; I misunderstood the initial email, thinking Sean was proposing extra whitespace for clarifications.

I won't react so quickly in the future. Go on with your question, Sean, and I hope you'll get it right.

Best Regards
From lists+unicode at seantek.com Fri Aug 5 10:52:56 2016
From: lists+unicode at seantek.com (Sean Leonard)
Date: Fri, 5 Aug 2016 08:52:56 -0700
Subject: Whitespace characters in Unicode
In-Reply-To:
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com>
Message-ID: <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>

Here are specific questions (perhaps Mark Davis, but anyone really with experience, can respond):

As Mark said, there are 25 whitespace characters. (I forgot to include HT, so that makes 25 from my original post.)

What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?

What are "Unicode-y" ways to compute word boundaries?

Related to prior question--I suppose ZWSP is not "whitespace", but like whitespace, it separates words. I suppose that since it is not printable, it is "confusing", and therefore should be avoided in contexts where the printed representation of Unicode code points matters.

Why is Pattern_White_Space significantly disjoint from White_Space, namely, why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS) yet omit the spaces U+1680 and in the U+2000 range?

Any implementation experience from other standards authors/implementers who have run into problems with shifty whitespace definitions?

Regards,

Sean
From markus.icu at gmail.com Fri Aug 5 12:07:17 2016
From: markus.icu at gmail.com (Markus Scherer)
Date: Fri, 5 Aug 2016 10:07:17 -0700
Subject: Whitespace characters in Unicode
In-Reply-To: <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>
References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com>
Message-ID:

On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard wrote:

> What makes a character a "whitespace" in Unicode, e.g., why are ZWSP and
> ZWNBSP not "whitespace" even though they clearly say "SPACE" in them?

I think "white space" basically wants to have an advance width (occupy space) but no ink (all white, no black) :-)

ZWSP and ZWNBSP affect word and line breaking but have no advance width.

Note that character names can be misleading, plain wrong, or even just misspelled, but they cannot be changed. Best to read the documentation. The charts are a good start:
http://www.unicode.org/charts/PDF/U2000.pdf
http://www.unicode.org/charts/PDF/UFE70.pdf

In particular, don't build sets of Unicode characters just based on character name patterns. Use character properties as much as possible.

> What are "Unicode-y" ways to compute word boundaries?

http://www.unicode.org/reports/tr29/#Word_Boundaries

> Related to prior question--I suppose ZWSP is not "whitespace", but like
> whitespace, it separates words. I suppose that since it is not printable,
> it is "confusing", and therefore should be avoided in contexts where the
> printed representation of Unicode code points matters.

Depends on what you do.

Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping and line breaking in a browser or text field/editor.

They are not allowed in identifiers, and removed from domain names (UTS #46).

> Why is Pattern_White_Space significantly disjoint from White_Space, namely,
> why does Pattern_White_Space include LTRM and RTLM (and notably LS and PS)
> yet omit the spaces U+1680 and in the U+2000 range?
We wanted a simple, immutable definition for rule and pattern strings that programmers write and maintain. We included LRM and RLM so that they can be used (and will be ignored) in rules, for example collation rule strings, to keep them moderately readable when they contain RTL characters. Typographic spaces are unnecessary in this context, and could be confusing.

In hindsight, LS and PS are probably mistakes. When we came up with Pattern_White_Space, we still liked the idea of unambiguous end-of-line controls, but in practice it looks like no one really uses them. Anyone who cares uses markup or rich-text formats. (Markup was not common when Unicode was "born".)

> Any implementation experience from other standards authors/implementers who
> have run into problems with shifty whitespace definitions?

Use properties, not character name patterns.
If you have strong reasons not to use a property as-is, then still use it, just with inclusion & exclusion overrides.

Best regards,
markus

From doug at ewellic.org Sat Aug 6 13:30:31 2016
From: doug at ewellic.org (Doug Ewell)
Date: Sat, 6 Aug 2016 12:30:31 -0600
Subject: LS and RS (was: Re: Whitespace characters in Unicode)
In-Reply-To:
References:
Message-ID:

Markus Scherer wrote:

> In hindsight, LS and PS are probably mistakes. When we came up
> with Pattern_White_Space, we still liked the idea of unambiguous
> end-of-line controls, but in practice it looks like no one really uses
> them. Anyone who cares uses markup or rich-text formats. (Markup was
> not common when Unicode was "born".)

I've often felt that the rise of UTF-8 spelled the end for LS and PS.

Unicode was originally a completely new text format, exactly 16 bits per character. Conversion to ASCII and other byte-based encodings was an explicit process. Dedicated characters for LS and PS were a simplification, removing the notorious confusion over CR versus LF versus CRLF.
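For readers who want to see the newline conventions side by side, here is a small Python sketch. It relies on `str.splitlines()`, which treats CR, LF, CRLF, LS and PS (among others) as line boundaries, and it also reports how many bytes each convention occupies in UTF-16 and in UTF-8. The sample text is invented for illustration.

```python
# Compare the legacy newline conventions with LS (U+2028) and PS (U+2029).
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

text = "one\r\ntwo\rthree\nfour" + LS + "five" + PS + "six"

# splitlines() recognises all five conventions used in the sample text.
lines = text.splitlines()
print(lines)

# Encoded size of each convention as (UTF-16 bytes, UTF-8 bytes).
conventions = {"CR": "\r", "LF": "\n", "CRLF": "\r\n", "LS": LS, "PS": PS}
sizes = {
    name: (len(seq.encode("utf-16-le")), len(seq.encode("utf-8")))
    for name, seq in conventions.items()
}
for name, (utf16, utf8) in sizes.items():
    print(f"{name:4} UTF-16: {utf16} bytes, UTF-8: {utf8} bytes")
```

In a 16-bit encoding LS or PS is half the size of CRLF, whereas in UTF-8 each of them takes three bytes against two for CRLF.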
UTF-8 brought ASCII backward compatibility to Unicode, removing early objections that "Unicode will double my text size" but requiring continued use of ASCII controls to maintain that compatibility. Implementers saw the existing CR/LF/CRLF muddle as a problem already solved, and LS and PS as new complications with no historical justification. Additionally, in UTF-8, either LS or PS actually takes more bytes than CR plus LF, so the "increased text size" argument also discouraged use of the new controls. -- Doug Ewell | Thornton, CO, US | ewellic.org From lists+unicode at seantek.com Sun Aug 7 18:08:58 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sun, 7 Aug 2016 16:08:58 -0700 Subject: Whitespace characters in Unicode In-Reply-To: References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com> Message-ID: <44112777-6d5b-de46-a504-b435049248a2@seantek.com> On 8/5/2016 10:07 AM, Markus Scherer wrote: > On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard > > wrote: > > What makes a character a "whitespace" in Unicode, e.g., why are > ZWSP and ZWNBSP not "whitespace" even though they clearly say > "SPACE" in them? > > > I think "white space" basically wants to have an advance width (occupy > space) but no ink (all white, no black) :-) Yes, that is the thought that I had as well: whitespace characters always generate blank space between graphemes, whether horizontal or vertical. > > ZWSP and ZWNBSP affect word and line breaking but have no advance width. I suppose that these are "SPACE" characters, but not "WHITE space" characters, since there is no white in them. :) > > Note that character names can be misleading, plain wrong, or even just > misspelled, but they cannot be changed. Best to read the > documentation. 
The charts are a good start: > http://www.unicode.org/charts/PDF/U2000.pdf > http://www.unicode.org/charts/PDF/UFE70.pdf > > In particular, don't build sets of Unicode characters just based on > character name patterns. Use character properties as much as possible. > > What are "Unicode-y" ways to compute word boundaries? > > > http://www.unicode.org/reports/tr29/#Word_Boundaries > > Related to prior question--I suppose ZWSP is not "whitespace", but > like whitespace, it separates words. I suppose that since it is > not printable, it is "confusing", and therefore should be avoided > in contexts where the printed representation of Unicode code > points matters. > > > Depends on what you do. > > Normal text needs ZWSP & ZWNBSP, for example for proper word wrapping > and line breaking in a browser or text field/editor. > > They are not allowed in identifiers, and removed from domain names > (UTS #46). > > Why is Pattern_White_Space significantly disjoint from > White_Space, namely, why does Pattern_White_Space include LTRM and > RTLM (and notably LS and PS) yet omit the spaces U+1680 and in the > U+2000 range? > > > We wanted a simple, immutable definition for rule and pattern strings > that programmers write and maintain. We included LRM and RLM so that > they can be used (and will be ignored) in rules, for example collation > rule strings, to keep them moderately readable when they contain RTL > characters. Typographic spaces are unnecessary in this context, and > could be confusing. > > In hindsight, LS and PS are probably mistakes. When we came up > with Pattern_White_Space, we still liked the idea of unambiguous > end-of-line controls, but in practice it looks like no one really uses > them. Anyone who cares uses markup or rich-text formats. (Markup was > not common when Unicode was "born".) I like the premise of LS and PS: one (well, two) unambiguous characters to rule them all. But the execution was lacking, to put it mildly. 
And there aren't two keys on a common keyboard to distinguish between line and paragraph separation. On 8/6/2016 11:30 AM, Doug Ewell wrote: > Additionally, in UTF-8, either LS or PS actually takes more bytes than > CR plus LF, so the "increased text size" argument also discouraged use > of the new controls. That is true, it takes 3 bytes. However, the original UTF-8 proposal encoded U+0080 - U+207F in two octets: https://en.wikipedia.org/wiki/UTF-8 : |10xxxxxx| |1xxxxxxx| So, the space block /just barely makes it/. Was this intentional during the original design of UTF-8, or just a coincidence? I think it was more than a coincidence. It is regrettable that the space block was too high to work in the final version of UTF-8...maybe it should have gone below U+07FF. (More motivation for my whitespace question in following message...) Sean -------------- next part -------------- An HTML attachment was scrubbed... URL: From lists+unicode at seantek.com Sun Aug 7 18:46:27 2016 From: lists+unicode at seantek.com (Sean Leonard) Date: Sun, 7 Aug 2016 16:46:27 -0700 Subject: Whitespace characters in Unicode In-Reply-To: References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com> Message-ID: <58c83966-a9ba-8c97-dcfb-0fc9dbd5bef3@seantek.com> On 8/5/2016 10:07 AM, Markus Scherer wrote: > On Fri, Aug 5, 2016 at 8:52 AM, Sean Leonard > > wrote: > > What makes a character a "whitespace" in Unicode, e.g., why are > ZWSP and ZWNBSP not "whitespace" even though they clearly say > "SPACE" in them? > > > Any implementation experience from other standards > authors/implementers who have run into problems with shifty > whitespace definitions? > > > Use properties, not character name patterns. If you have strong > reasons not to use a property as-is, then still use it, just with > inclusion & exclusion overrides. Short answer: I cannot use character properties, and cannot use exclusion overrides. 
As I have posted publicly, I am proposing some experimental Unicode-friendly extensions to IETF ABNF (currently in https://tools.ietf.org/html/draft-seantek-abnf-more-core-rules-05 , going to change that around a bit). There is (some) renewed interest in that part of the work since RFCs will permit UTF-8 in certain places, and IETF protocols are supposed to march towards "Net-Unicode" per RFC 5198. Being a BNF, ABNF does not have exclusion, only incremental alternatives. Character properties would require a runtime library, which significantly goes against the purpose of (A)BNF. The current proposed core rules have (scalar values = doughnut hole for surrogates) and (scalar values without the ASCII range). While these are technically accurate, they will not be particularly useful for protocol designers as they are over-inclusive. One of the rules I am working on is , which is like except for Unicode. That eliminates the noncharacter code points (which, technically, are characters...that are defined as "not characters") as well as NULL, which is already eliminated by . I was going to avoid extending (which is U+0021-U+007E, i.e., no spaces and no control characters) because it's a bit too complicated. However, there are actual protocols, including a protocol that I am working on, that define parts of the repertoire as "graphic symbols and spacing characters", and elsewhere, "graphic symbols" (i.e., no spaces and no control characters). So the space characters are relevant at a level beneath requiring a full Unicode runtime to get at the character properties. The newline issue is related but separate, and since IETF continues to use CRLF as the standard for interchange, I don't see a reason to touch it further. 
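One way to reconcile "use properties" with a grammar that cannot query properties at run time is to expand a property into literal ranges ahead of time. The following is only a minimal sketch in Python (not something proposed in the thread). It assumes the White_Space ranges below, copied by hand from PropList.txt and to be re-checked against the UCD before any real use. The rule name UWSP and the helper to_abnf are hypothetical and come from no draft.

```python
# White_Space code point ranges, transcribed by hand from PropList.txt.
# Assumption: this matches the Unicode version you target; re-check before use.
WHITE_SPACE = [
    (0x0009, 0x000D), (0x0020, 0x0020), (0x0085, 0x0085), (0x00A0, 0x00A0),
    (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029), (0x202F, 0x202F),
    (0x205F, 0x205F), (0x3000, 0x3000),
]


def to_abnf(rule_name, ranges):
    """Render inclusive (lo, hi) code point ranges as one ABNF rule of %x alternatives."""
    alternatives = []
    for lo, hi in ranges:
        if lo == hi:
            alternatives.append("%x{:02X}".format(lo))
        else:
            alternatives.append("%x{:02X}-{:02X}".format(lo, hi))
    return rule_name + " = " + " / ".join(alternatives)


# UWSP is a made-up rule name used only for this illustration.
rule = to_abnf("UWSP", WHITE_SPACE)
count = sum(hi - lo + 1 for lo, hi in WHITE_SPACE)
print(rule)
print(count, "code points")
```

The values here are code points, not octets; how a given grammar maps them to UTF-8 on the wire is left to that grammar's own conventions.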
Best regards, Sean From duerst at it.aoyama.ac.jp Mon Aug 8 02:07:59 2016 From: duerst at it.aoyama.ac.jp (=?UTF-8?Q?Martin_J._D=c3=bcrst?=) Date: Mon, 8 Aug 2016 16:07:59 +0900 Subject: Whitespace characters in Unicode In-Reply-To: <44112777-6d5b-de46-a504-b435049248a2@seantek.com> References: <60c6e05b-990f-285a-891d-bcff5bfb3e04@seantek.com> <22aafe8b-4e21-8582-945e-d79ab62a975f@seantek.com> <44112777-6d5b-de46-a504-b435049248a2@seantek.com> Message-ID: <59cd0feb-ac86-c520-c25b-19c2aa7f90fc@it.aoyama.ac.jp> On 2016/08/08 08:08, Sean Leonard wrote: > On 8/6/2016 11:30 AM, Doug Ewell wrote: >> Additionally, in UTF-8, either LS or PS actually takes more bytes than >> CR plus LF, so the "increased text size" argument also discouraged use >> of the new controls. > > That is true, it takes 3 bytes. However, the original UTF-8 proposal The term "original UTF-8 proposal" is quite misleading, because that proposal was never labeled as UTF-8. "FSS-UTF draft version" would be much better. > encoded U+0080 - U+207F in two octets: > https://en.wikipedia.org/wiki/UTF-8 : > |10xxxxxx| |1xxxxxxx| > > > So, the space block /just barely makes it/. Was this intentional during > the original design of UTF-8, or just a coincidence? I think it was more > than a coincidence. Just a coincidence, I'd say. When designing such schemes, trying to be compact is obviously one of the goals. But "how can I design it so that these two characters still make it as two bytes" isn't. > It is regrettable that the space block was too high > to work in the final version of UTF-8...maybe it should have gone below > U+07FF. There aren't too many line breaks (and usually even less paragraph breaks) in a text, so the overall effect of the encoding length for LS or PS were really not that much of an issue. 
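The encoding lengths discussed above can be checked directly. A minimal sketch in Python, chosen only for illustration; utf8_len is a hypothetical helper name:

```python
# Compare the encoded size of LS and PS with the legacy CR+LF pair.
LS = "\u2028"    # LINE SEPARATOR
PS = "\u2029"    # PARAGRAPH SEPARATOR
CRLF = "\r\n"    # CARRIAGE RETURN + LINE FEED


def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


utf8_sizes = {"LS": utf8_len(LS), "PS": utf8_len(PS), "CRLF": utf8_len(CRLF)}
# In UTF-16 the comparison goes the other way: one code unit versus two.
utf16_sizes = {"LS": len(LS.encode("utf-16-be")), "CRLF": len(CRLF.encode("utf-16-be"))}

print(utf8_sizes)
print(utf16_sizes)
```

LS and PS each encode as three bytes in UTF-8 (E2 80 A8 and E2 80 A9), one more than CR plus LF, while in UTF-16 a single LS is half the size of CRLF.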
The main reason why they didn't spread was that everybody was already dealing with several variants of line breaks and didn't want more of these, even at the prospect of (potentially, eventually, in the very, very long run maybe) having only a single one.

Regards, Martin.

From doug at ewellic.org Mon Aug 8 11:30:04 2016
From: doug at ewellic.org (Doug Ewell)
Date: Mon, 08 Aug 2016 09:30:04 -0700
Subject: Whitespace characters in Unicode
Message-ID: <20160808093004.665a7a7059d7ee80bb4d670165c8327d.da7b3527fd.wbe@email03.godaddy.com>

Martin J. Dürst wrote:

>> encoded U+0080 - U+207F in two octets:
>> https://en.wikipedia.org/wiki/UTF-8 :
>> |10xxxxxx| |1xxxxxxx|
>>
>> So, the space block /just barely makes it/. Was this intentional
>> during the original design of UTF-8, or just a coincidence? I think
>> it was more than a coincidence.
>
> Just a coincidence, I'd say. When designing such schemes, trying to be
> compact is obviously one of the goals. But "how can I design it so
> that these two characters still make it as two bytes" isn't.

For actual Unicode compression schemes (SCSU and BOCU-1), certain design elements do exist to allow certain character blocks "in widespread use" to fit in minimal space.

For byte-based UTFs, that wasn't a goal at all. ASCII in one byte was a given -- for compatibility with existing software, not favoritism toward English as was sometimes claimed -- but otherwise, algorithmic simplicity and reasonable overall efficiency were more important than optimizing for certain blocks.

Replacing one encoding with ranges like "U+2080 through U+8207F" with another which architecturally allows non-shortest sequences, and then disallowing them, is simply a matter of different engineering solutions to the same problem. Each adds simplicity in one place and complexity in another. UTF-8 happened to tick more additional boxes (e.g. self-synchronization) than the others.
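Both properties mentioned above (non-shortest sequences being disallowed, and self-synchronization) can be seen in a few lines. This is a hedged sketch in Python, relying only on the standard UTF-8 codec; is_valid_utf8 and next_boundary are hypothetical helper names, not from any library.

```python
def is_valid_utf8(data):
    """True if the bytes decode as well-formed UTF-8 under Python's strict codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def next_boundary(data, index):
    """Return the first position at or after index that is not a continuation byte.

    Continuation bytes have the bit pattern 10xxxxxx, so a reader dropped into
    the middle of a character can resynchronize by skipping them.
    """
    while index < len(data) and (data[index] & 0xC0) == 0x80:
        index += 1
    return index


# C0 8A would be a two-byte, non-shortest encoding of U+000A; decoders must reject it.
overlong_lf = b"\xc0\x8a"
sample = "a\u2028b".encode("utf-8")   # 61 E2 80 A8 62

print(is_valid_utf8(b"\n"), is_valid_utf8(overlong_lf))
print(next_boundary(sample, 2))
```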
-- Doug Ewell | Thornton, CO, US | ewellic.org

From costello at mitre.org Wed Aug 10 03:45:08 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Wed, 10 Aug 2016 08:45:08 +0000
Subject: less-than or equal to with dot in the less-than part?
Message-ID:

Hi Folks,

Here is the "less-than with dot" symbol: ⋖
Here is the "less-than or equal to" symbol: ≤

I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.

/Roger

From andrewcwest at gmail.com Wed Aug 10 04:08:22 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 10 Aug 2016 10:08:22 +0100
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:
Message-ID:

On 10 August 2016 at 09:45, Costello, Roger L. wrote:
>
> Here is the "less-than with dot" symbol: ⋖
> Here is the "less-than or equal to" symbol: ≤
>
> I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.

http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html

Andrew

From costello at mitre.org Wed Aug 10 06:21:53 2016
From: costello at mitre.org (Costello, Roger L.)
Date: Wed, 10 Aug 2016 11:21:53 +0000
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:

Message-ID:

Andrew West graciously pointed me to this symbol: U+2A7F ⩿

Thank you Andrew!

Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign)

/Roger

-----Original Message-----
From: Andrew West [mailto:andrewcwest at gmail.com]
Sent: Wednesday, August 10, 2016 5:08 AM
To: Costello, Roger L.
Cc: unicode at unicode.org
Subject: Re: less-than or equal to with dot in the less-than part?

On 10 August 2016 at 09:45, Costello, Roger L. wrote:
>
> Here is the "less-than with dot" symbol: ⋖ Here is the "less-than or
> equal to" symbol: ≤
>
> I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.

http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html

Andrew

From andrewcwest at gmail.com Wed Aug 10 07:06:38 2016
From: andrewcwest at gmail.com (Andrew West)
Date: Wed, 10 Aug 2016 13:06:38 +0100
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:

Message-ID: On 10 August 2016 at 12:21, Costello, Roger L. wrote: > > Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign) No, but there are lots of standardized variants for mathematical glyph variants of this sort (see first section of http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so you could ask the UTC to define two more mathematical standardized variants: 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO WITH DOT INSIDE Then all you would need is to get someone to support the new standardized variants in a math font. Andrew From asmusf at ix.netcom.com Wed Aug 10 11:14:44 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Wed, 10 Aug 2016 09:14:44 -0700 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: References:

Message-ID:

On 8/10/2016 2:08 AM, Andrew West wrote:
> On 10 August 2016 at 09:45, Costello, Roger L. wrote:
>> Here is the "less-than with dot" symbol: ⋖
>> Here is the "less-than or equal to" symbol: ≤
>>
>> I need a symbol that is a combination: less-than or equal to with dot in the less-than part. Is there such a symbol in Unicode? The book "Parsing Techniques" uses this symbol on the bottom of page 273.
> http://www.unicode.org/mail-arch/unicode-ml/y2016-m06/0117.html

The one sentence you need in following that link is:

"No, but there are U+2A7F ⩿ and U+2A80 ⪀ with slanted equals which might suffice."

The principle seems to be that Unicode separately encodes slanted from non-slanted less-than-or-equal (and similar symbols), but has not done so for the ones with dot.

The question would be whether the reason for making the distinction for the non-dotted code points also holds for the dotted ones. If it does, this might be an omission, if not, as Andrew said, the existing forms might suffice.

A./

From asmusf at ix.netcom.com Wed Aug 10 11:16:45 2016
From: asmusf at ix.netcom.com (Asmus Freytag (c))
Date: Wed, 10 Aug 2016 09:16:45 -0700
Subject: less-than or equal to with dot in the less-than part?
In-Reply-To:
References:

Message-ID: <4ed179c8-5535-d0cd-3543-0d55d6312825@ix.netcom.com> On 8/10/2016 5:06 AM, Andrew West wrote: > On 10 August 2016 at 12:21, Costello, Roger L. wrote: >> Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign) > No, but there are lots of standardized variants for mathematical glyph > variants of this sort (see first section of > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so > you could ask the UTC to define two more mathematical standardized > variants: > > 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE > 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO > WITH DOT INSIDE > > Then all you would need is to get someone to support the new > standardized variants in a math font. > Unicode does not use standardized variants for that particular distinctions in the undotted part of that family of symbols. A./ From philip_chastney at yahoo.com Thu Aug 11 02:33:46 2016 From: philip_chastney at yahoo.com (philip chastney) Date: Thu, 11 Aug 2016 07:33:46 +0000 (UTC) Subject: less-than or equal to with dot in the less-than part? References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com> Message-ID: <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> there is another issue with these symbols -- they appear among the mathematical symbols but, in the reference given, they are used as delimiters I know of no other application for these symbols other than as delimiters -- are they used as mathematical operators? and how, in general, would one define the properties for characters which may sometimes be operators, and sometimes be delimiters? /phil -------------------------------------------- On Wed, 10/8/16, Asmus Freytag (c) wrote: Subject: Re: less-than or equal to with dot in the less-than part? 
To: unicode at unicode.org Date: Wednesday, 10 August, 2016, 4:16 PM On 8/10/2016 5:06 AM, Andrew West wrote: > On 10 August 2016 at 12:21, Costello, Roger L. wrote: >> Do you know if there is another version of the symbol, but with a straight equals sign rather than a slanted equals sign? (The book that I referred to uses a straight equals sign not a slanted equals sign) > No, but there are lots of standardized variants for mathematical glyph > variants of this sort (see first section of > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so > you could ask the UTC to define two more mathematical standardized > variants: > > 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE > 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO > WITH DOT INSIDE > > Then all you would need is to get someone to support the new > standardized variants in a math font. > Unicode does not use standardized variants for that particular distinctions in the undotted part of that family of symbols. A./ From asmusf at ix.netcom.com Thu Aug 11 03:24:30 2016 From: asmusf at ix.netcom.com (Asmus Freytag (c)) Date: Thu, 11 Aug 2016 01:24:30 -0700 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com> <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> Message-ID: <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com> On 8/11/2016 12:33 AM, philip chastney wrote: > there is another issue with these symbols -- they appear among the mathematical symbols but, in the reference given, they are used as delimiters > > I know of no other application for these symbols other than as delimiters -- are they used as mathematical operators? > > and how, in general, would one define the properties for characters which may sometimes be operators, and sometimes be delimiters? 
First and foremost: if the precise form of these (straight equals, but dotted) corresponds to a delimiter, whereas the other form (slanted equals) is an operator, then that would be even more reason to not unify these (whether with or without a variation sequence).

Are the already encoded ones given the property of relational operators?

Nothing prevents anyone from using an integral sign as a funny-looking fence. I would find it acceptable if the informative properties were based on majority or customary use (in the hopes that that would allow some picking of a preferred preference).

A./

> /phil
>
> --------------------------------------------
> On Wed, 10/8/16, Asmus Freytag (c) wrote:
>
> Subject: Re: less-than or equal to with dot in the less-than part?
> To: unicode at unicode.org
> Date: Wednesday, 10 August, 2016, 4:16 PM
>
> On 8/10/2016 5:06 AM, Andrew West wrote:
> > On 10 August 2016 at 12:21, Costello, Roger L. wrote:
> >> Do you know if there is another version of the symbol, but with a
> >> straight equals sign rather than a slanted equals sign? (The book that I
> >> referred to uses a straight equals sign not a slanted equals sign)
> > No, but there are lots of standardized variants for mathematical glyph
> > variants of this sort (see first section of
> > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), so
> > you could ask the UTC to define two more mathematical standardized
> > variants:
> >
> > 2A7F FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
> > 2A80 FE00; with straight equal; # GREATER-THAN OR SLANTED EQUAL TO WITH DOT INSIDE
> >
> > Then all you would need is to get someone to support the new
> > standardized variants in a math font.
>
> Unicode does not use standardized variants for that particular
> distinctions in the undotted part of that family of symbols.
> > A./ > > From verdy_p at wanadoo.fr Thu Aug 11 10:20:43 2016 From: verdy_p at wanadoo.fr (Philippe Verdy) Date: Thu, 11 Aug 2016 17:20:43 +0200 Subject: less-than or equal to with dot in the less-than part? In-Reply-To: <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com> References: <212348262.12842317.1470900826294.JavaMail.yahoo.ref@mail.yahoo.com> <212348262.12842317.1470900826294.JavaMail.yahoo@mail.yahoo.com> <6525596f-a1f0-08c8-a4a5-4d34ce469c3d@ix.netcom.com> Message-ID: the =equal sign= is also used as a delimiter (fancy quotation marks and brackets), this is also the case for < and > (see XML, also used as quotation marks in some contexts that want more). I don't see why these simple math operators would be restriced to math. Same remark about ++plus++ signs (emphasis marks). In those usages however, I do not think that there's a significant difference between the slanted or straight variants, fonts could choose one variant or the other. In maths, there's normally no difference, but possibly in some cases these could be distinctive (mathematicians love creating distinctive but simple symbols that are easily recognized because they need many distinctions when they work on various kinds of generalizations or extensions to wider topologies exhibiting some differences). 2016-08-11 10:24 GMT+02:00 Asmus Freytag (c) : > On 8/11/2016 12:33 AM, philip chastney wrote: > >> there is another issue with these symbols -- they appear among the >> mathematical symbols but, in the reference given, they are used as >> delimiters >> >> I know of no other application for these symbols other than as >> delimiters -- are they used as mathematical operators? >> >> and how, in general, would one define the properties for characters which >> may sometimes be operators, and sometimes be delimiters? >> > > First and foremost. 
If the precise form of these (straight equals, but > dotted) corresponds to a delimiter, whereas the other form (slanted equals) > is an operator, then that would be even more reason to not unify these > (whether with or without a variation sequence). > > Are the already encoded ones given the property of relational operators? > > Nothing prevents anyone from using an integral sing as a funny-looking > fence. I would find it acceptable if the informative properties were based > on majority or customary use (in the hopes that that would allow some > picking of a preferred preference). > > A./ > > /phil >> >> -------------------------------------------- >> On Wed, 10/8/16, Asmus Freytag (c) wrote: >> >> Subject: Re: less-than or equal to with dot in the less-than part? >> To: unicode at unicode.org >> Date: Wednesday, 10 August, 2016, 4:16 PM >> On 8/10/2016 5:06 AM, >> Andrew West wrote: >> > On 10 August 2016 at >> 12:21, Costello, Roger L. >> wrote: >> >> Do you know if there is >> another version of the symbol, but with a straight equals >> sign rather than a slanted equals sign? (The book that I >> referred to uses a straight equals sign not a slanted equals >> sign) >> > No, but there are lots of >> standardized variants for mathematical glyph >> > variants of this sort (see first section >> of >> > http://www.unicode.org/Public/UNIDATA/StandardizedVariants.txt), >> so >> > you could ask the UTC to define two >> more mathematical standardized >> > >> variants: >> > >> > 2A7F >> FE00; with straight equal; # LESS-THAN OR SLANTED EQUAL TO >> WITH DOT INSIDE >> > 2A80 FE00; with >> straight equal; # GREATER-THAN OR SLANTED EQUAL TO >> > WITH DOT INSIDE >> > >> > Then all you would need is to get someone >> to support the new >> > standardized >> variants in a math font. >> > >> Unicode does not use >> standardized variants for that particular >> distinctions in the undotted part of that >> family of symbols. 
>> A./ >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From public at khwilliamson.com Thu Aug 11 13:29:21 2016 From: public at khwilliamson.com (Karl Williamson) Date: Thu, 11 Aug 2016 12:29:21 -0600 Subject: Where are the tools to generate posix and json from cldr? Message-ID: I can't find these that are mentioned in http://cldr.unicode.org/ "For those interested in the source CLDR data, it is available for each release in the XML format specified by LDML. There are also tools that will convert to JSON and POSIX format. For more information, see CLDR Releases/Downloads." If you follow that link, the page contains this text: "POSIX Data "Note: Beginning with CLDR v21, the CLDR project will no longer publish POSIX-format locale sources as part of its distribution. The POSIX locale generation tools will continue to be made available as a part of the release. Developers who require POSIX compliant locales can generate them using these tools." But I can't find those tools. From mark at macchiato.com Thu Aug 11 13:59:35 2016 From: mark at macchiato.com (=?UTF-8?B?TWFyayBEYXZpcyDimJXvuI8=?=) Date: Thu, 11 Aug 2016 20:59:35 +0200 Subject: Where are the tools to generate posix and json from cldr? In-Reply-To: References: Message-ID: ?That is a bit obscure! We stopped generating the source for POSIX because essentially every user customized it in some way, so was better to do with a tool. We need to add a pointer to where to get the tools and how to use them. http://cldr.unicode.org/index/downloads#Repository_Organization shows where they are. Above that are the details for SVN access.? But we really need a page that describes the specific tools and how to use them. 
Filed as http://unicode.org/cldr/trac/ticket/9695

Mark

On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson wrote:

> I can't find these that are mentioned in http://cldr.unicode.org/
>
> "For those interested in the source CLDR data, it is available for each
> release in the XML format specified by LDML. There are also tools that will
> convert to JSON and POSIX format. For more information, see CLDR
> Releases/Downloads."
>
> If you follow that link, the page contains this text:
>
> "POSIX Data
>
> "Note: Beginning with CLDR v21, the CLDR project will no longer publish
> POSIX-format locale sources as part of its distribution. The POSIX locale
> generation tools will continue to be made available as a part of the
> release. Developers who require POSIX compliant locales can generate them
> using these tools."
>
> But I can't find those tools.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From public at khwilliamson.com Thu Aug 11 14:19:11 2016
From: public at khwilliamson.com (Karl Williamson)
Date: Thu, 11 Aug 2016 13:19:11 -0600
Subject: Where are the tools to generate posix and json from cldr?
In-Reply-To:
References:
Message-ID: <1d2adb32-9c64-610c-924f-5dda05bd2184@khwilliamson.com>

On 08/11/2016 12:59 PM, Mark Davis ☕️ wrote:
> That is a bit obscure! We stopped generating the source for POSIX
> because essentially every user customized it in some way, so was better
> to do with a tool. We need to add a pointer to where to get the tools
> and how to use them.
>
> http://cldr.unicode.org/index/downloads#Repository_Organization shows
> where they are.
> Above that are the details for SVN access.
> But we really need a page that describes the specific tools and how to
> use them. Filed as http://unicode.org/cldr/trac/ticket/9695
>
> Mark

I had looked at that, and downloaded the latest data, and still could not find the tools in it.
One would think that the tools directory contains it, and I did not look in every sub-directory in it, but none looked likely. I then tried transforms, but came up empty there too.

> //////
>
> On Thu, Aug 11, 2016 at 8:29 PM, Karl Williamson wrote:
>
>     I can't find these that are mentioned in http://cldr.unicode.org/
>
>     "For those interested in the source CLDR data, it is available for
>     each release in the XML format specified by LDML. There are also
>     tools that will convert to JSON and POSIX format. For more
>     information, see CLDR Releases/Downloads."
>
>     If you follow that link, the page contains this text:
>
>     "POSIX Data
>
>     "Note: Beginning with CLDR v21, the CLDR project will no longer
>     publish POSIX-format locale sources as part of its distribution.
>     The POSIX locale generation tools will continue to be made available
>     as a part of the release. Developers who require POSIX compliant
>     locales can generate them using these tools."
>
>     But I can't find those tools.
>
>

From taylorcanning at outlook.com Thu Aug 11 20:32:49 2016
From: taylorcanning at outlook.com (Taylor Canning)
Date: Fri, 12 Aug 2016 01:32:49 +0000
Subject: Myanmar character set
Message-ID:

Hi there, has anyone had any issues with the Myanmar character set? I have raised an issue recently where the combination ? and ? does not combine correctly to make ?? on my Windows devices. It used to work just fine. It is an extremely common tonal marker and is a big issue for anyone who types the S'Gaw Karen language, which is a lot of people!

Thanks, Taylor

Sent from my Windows 10 phone

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From lang.support at gmail.com Thu Aug 11 22:50:37 2016
From: lang.support at gmail.com (Andrew Cunningham)
Date: Fri, 12 Aug 2016 13:50:37 +1000
Subject: Myanmar character set
In-Reply-To:
References:
Message-ID:

Hi Taylor,

This should work fine in theory.

Are you using a mymr or mym2 style OpenType font?
What applications, operating system and fonts are you using?

Andrew

On 12 Aug 2016 12:55 pm, "Taylor Canning" wrote:

> Hi there, has anyone had any issues with the Myanmar character set ? i
> have raised an issue recently where the combination ? and ? does not
> combine correctly to make ?? on my windows devices. It used to work just
> fine. It is am extremely common tonal marker and is a big issue for anyone
> who types the S'Gaw Karen language, which is a lot of people !
>
> Thanks, Taylor
>
>
>
> Sent from my Windows 10 phone
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From christoph.paeper at crissov.de Fri Aug 12 06:41:48 2016
From: christoph.paeper at crissov.de (=?utf-8?Q?Christoph_P=C3=A4per?=)
Date: Fri, 12 Aug 2016 13:41:48 +0200
Subject: [UTR#51-8] 1.4.3 Emoji Variation Sequences: Female/Venus and Male/Mars Signs
Message-ID:

> 2640 FE0E; text style; # FEMALE SIGN
> 2640 FE0F; emoji style; # FEMALE SIGN
> 2642 FE0E; text style; # MALE SIGN
> 2642 FE0F; emoji style; # MALE SIGN

Since U+2640 and U+2642 double as symbols for the planets (and ancient gods) Venus and Mars, respectively, users will rightfully expect VS-16 to have an effect on the other planet symbols as well (probably including U+2647 Pluto).

Both symbols are also sometimes used to represent Friday and Tuesday, respectively, so some users may expect the symbols for the other 5 days of the week to also react to U+FE0E/F.

1. Monday ☽ U+263D Moon or ☾ U+263E
2. Tuesday ♂ U+2642 Mars
3. Wednesday ☿ U+263F Mercury
4. Thursday ♃ U+2643 Jupiter
5. Friday ♀ U+2640 Venus
6. Saturday ♄ U+2644 Saturn
7. Sunday ☉ U+2609 Sun or ☼ U+263C

U+2640/2 are also part of common sets of gender, sex and sexuality symbols which, again, some users will expect to have emoji forms now and - be prepared for the ?????? - also work in ZWJ or OpenType ligature sequences.
(I'm not sure how lesbian or gay versions of emojis, as proposed before in L2/15-013 for instance, could become anything other than stereotypical through to offensive.)

The real-world use may be a bit different from what the annotations in the standard say, e.g. in the distinction between transgender and intersex, or between sexuality and gender identity:

> * ⚢ U+26A2 Doubled Female Sign
>   = lesbianism
> * ⚣ U+26A3 Doubled Male Sign
>   • a glyph variant has the two circles on the same line
>   = male homosexuality
> * ⚤ U+26A4 Interlocked Female and Male Sign
>   • a glyph variant has the two circles on the same line
>   = bisexuality
> * ⚥ U+26A5 Male and Female Sign
>   = transgendered sexuality
>   = hermaphrodite (in entomology)
> * ⚦ U+26A6 Male with Stroke Sign
>   = transgendered sexuality
> * ⚧ U+26A7 Male with Stroke and Male and Female Sign
>   = transgendered sexuality
> * ⚲ U+26B2 Neuter

Lastly, the 2 signs are also recognized by Unicode to be alchemical symbols of copper and iron, respectively, but since that set is much larger and even more esoteric I do not expect much demand for emoji versions of all of them.

In conclusion, I see no good way other than to add a lot of additional code points from the Miscellaneous Symbols block to StandardizedVariants.txt.

Cheers
Christoph

From christoph.paeper at crissov.de Fri Aug 12 07:09:09 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Fri, 12 Aug 2016 14:09:09 +0200
Subject: [UTR#51-8] 2.4 Emoji Implementation Notes
Message-ID:

> ... including the user of ...

Should be just "use".

> * emoji zwj sequence
>   - may have an emoji variation selector.
>   - should be displayed with an emoji presentation by default, even when an emoji zwj element is a singleton with Emoji_Presentation=No.

"zwj" should be "ZWJ" in all instances, also found elsewhere.

If I don't misread, this seems to be saying nothing about a (hypothetical) emoji ZWJ sequence consisting of 2 or more elements with `Emoji_Presentation=No` without any VS-16.
What's the actual intention?

1. If there's any VS-16 or any character with `Emoji_Presentation=Yes` in a ZWJ sequence, the whole sequence SHOULD be treated as emoji(s).
2. A ZWJ sequence SHOULD be treated as emoji(s) if it contains only characters that either have `Emoji_Presentation=Yes` or whose glyph *can* be affected by VS-16.

Only #2 would cover a ZWJ sequence of `Emoji_Presentation=No` characters without any VS-16 stuck on them.

From zelpahd at gmail.com Fri Aug 12 02:44:10 2016
From: zelpahd at gmail.com (zelpa)
Date: Fri, 12 Aug 2016 17:44:10 +1000
Subject: ZWJ sequences in UTR #51 v4
Message-ID:

Some of the ZWJ sequences in the latest revision seem sort of arbitrary. Why is the male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female or male modifiers? The current proposition also doesn't seem to allow for a gender-neutral doctor(?)

From davidj_faulks at yahoo.ca Fri Aug 12 11:54:25 2016
From: davidj_faulks at yahoo.ca (David Faulks)
Date: Fri, 12 Aug 2016 16:54:25 +0000 (UTC)
Subject: ZWJ sequences in UTR #51 v4
References: <1378418133.13608237.1471020865650.JavaMail.yahoo.ref@mail.yahoo.com>
Message-ID: <1378418133.13608237.1471020865650.JavaMail.yahoo@mail.yahoo.com>

The problem with a "Doctor Emoji" is that new characters need to be approved by ISO (the International Organization for Standardization), in a long process, which means that any new characters will not be available (officially) until Unicode 10 in June of next year. Vendors have decided they want these gendered emoji ASAP (see the latest news). The workaround is to treat **sequences** of existing characters as a new emoji, like some sort of very weird ligature.

Unicode is scrambling to catch up to what vendors have suddenly decided they want (although in my opinion, this could have been predicted last year).
David

--------------------------------------------
On Fri, 8/12/16, zelpa wrote:

Subject: ZWJ sequences in UTR #51 v4
To: unicode at unicode.org
Received: Friday, August 12, 2016, 3:44 AM

Some of the ZWJ sequences in the latest revision seem sort of arbitrary. Why is the male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female or male modifiers? The current proposition also doesn't seem to allow for a gender-neutral doctor(?)

From Andrew.Glass at microsoft.com Fri Aug 12 14:02:23 2016
From: Andrew.Glass at microsoft.com (Andrew Glass)
Date: Fri, 12 Aug 2016 19:02:23 +0000
Subject: Myanmar character set
In-Reply-To:
References:

Message-ID:

Hi Taylor and Andrew,

This is a known issue with the Myanmar engine on Windows. We are tracking the issue, but don't have a date for the fix at this time.

Cheers,

Andrew

From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Andrew Cunningham
Sent: Thursday, August 11, 2016 8:51 PM
To: Taylor Canning
Cc: Unicode Mailing List
Subject: Re: Myanmar character set

Hi Taylor,

This should work fine in theory. Are you using a mymr or mym2 style OpenType font?

What applications, operating system and fonts are you using?

Andrew

On 12 Aug 2016 12:55 pm, "Taylor Canning" wrote:

Hi there, has anyone had any issues with the Myanmar character set? I have raised an issue recently where the combination ? and ? does not combine correctly to make ?? on my Windows devices. It used to work just fine. It is an extremely common tonal marker and is a big issue for anyone who types the S'Gaw Karen language, which is a lot of people!

Thanks, Taylor

Sent from my Windows 10 phone

From christoph.paeper at crissov.de Fri Aug 12 18:29:47 2016
From: christoph.paeper at crissov.de (Christoph Päper)
Date: Sat, 13 Aug 2016 01:29:47 +0200
Subject: ZWJ sequences in UTR #51 v4
In-Reply-To:
References:
Message-ID:

zelpa:
>
> Some of the ZWJ sequences in the latest revision seem sort of arbitrary,

"Some"? It's a fundamental principle of linguistics that signs connect representation and meaning arbitrarily, but this doesn't apply to pictures and proto-writing, which are not (quite/yet) linguistic signs.

> why is male health worker Man + Staff of Asclepius instead of introducing a Doctor emoji and simply using the female or male modifiers?

I do agree with the general approach of encoding additional professions as ZWJ sequences. Ideally, people would already be using emoji sequences for professions (without ZWJ, "emoji words")
and there would be research on such compounds, so Unicode could document existing conventions. Otherwise, one could also go ahead and conduct a user study by letting a representative sample of people express a meaning with a restricted repertoire (i.e. emojis already in Unicode). Alas, neither seems to have been done; instead, a committee of experts chose canonic sequences based upon vendor proposals (Google and Apple).

Interestingly, the result, currently in beta state, is not systematic in any way whatsoever: Professions are arbitrarily identified by a tool ????????????, clothing ??, accessory ??, product ??, building ????, vehicle ?????? or already conventionalized symbol ??. Often these are directly featured in the example image, but not always. Chances are high that sequences in the wild, which are intended to represent the same professions, are using different components.

With family emojis, ZWJ sequences (and Fitzpatrick modifiers) are very similar to classic ligatures, because the resulting glyph is just an elaborate composition of its bases. If the example images were intuitively obvious or mandatory design recommendations, this could also be true for many of the new profession emoji sequences, but this is in fact not the case since 1) font vendors are free to design an arbitrary iconographic *picture to represent the compound meaning*, 2) the sequences are not empirically founded and 3) they are culturally biased (e.g. ?????).

If future emoji selection UIs offered the sequences by showing precomposed glyphs (like many do with families and flags), the problem would be hidden away for a while, but this will become unmanageable eventually. I expect IMEs to adopt a different approach soon: auto-correction. If a user successively enters two emojis that form an officially registered ZWJ sequence, the system will automatically insert U+200D and use a single glyph; hopefully the user will be able to revert or edit that composition, e.g. ZWJ → ZWNJ.
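[Editorial sketch, not part of the original message.] The auto-correction behaviour predicted above could look roughly like the following Python sketch. Everything here is an assumption for illustration: the names `REGISTERED_PAIRS`, `autocorrect` and `split_again` are hypothetical, the registry holds only the Man/Woman + Staff of Aesculapius pairs mentioned in this thread, and variation selectors and skin-tone modifiers are ignored. Replacing ZWJ with ZWNJ is one reading of how a user might undo the composition.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER

# Hypothetical registry of adjacent pairs that form a registered ZWJ sequence.
# Only the health-worker example from the thread is included.
REGISTERED_PAIRS = {
    ("\U0001F468", "\u2695"),  # MAN + STAFF OF AESCULAPIUS
    ("\U0001F469", "\u2695"),  # WOMAN + STAFF OF AESCULAPIUS
}


def autocorrect(text):
    """Insert ZWJ between two adjacent characters that form a registered pair."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in REGISTERED_PAIRS:
            out.append(ZWJ)
        out.append(ch)
    return "".join(out)


def split_again(text):
    """Undo the composition by swapping every ZWJ for ZWNJ (a crude global swap)."""
    return text.replace(ZWJ, ZWNJ)


if __name__ == "__main__":
    typed = "\U0001F468\u2695"
    joined = autocorrect(typed)
    print([f"U+{ord(c):04X}" for c in joined])
    print([f"U+{ord(c):04X}" for c in split_again(joined)])
```

Text that already contains the joiner is left unchanged, because the pair lookup only fires when the two registered characters are directly adjacent.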
The system will also try to identify juxtaposed (e.g. ????) or synonymous sequences (e.g. ???? or ???? for a farmer and ???? or ???? for a health worker) and suggest replacing them with the canonic sequence or even with a single character (e.g. ????, ??? or ???? to ??). That's basically `<3` and `:-)` TNG.

To make it simpler to learn the canonic sequences I'd strongly urge the people in charge to select as few generic patterns as possible, e.g. + or , and this should be based upon actual research.

> The current proposition also doesn't seem to allow for a gender-neutral doctor(?)

Yes, this is a problem with the ZWJ